Monday, May 6, 2019

2019-05-06: Twitter broke my scrapers

Fig. 1: The old tweet DIV, showing four attributes with meaningful names: data-tweet-id, data-conversation-id, data-screen-name, and tweet-text. These attributes are absent in the new tweet DIV (Fig. 2).
On April 23, 2019, my Twitter desktop layout changed. I initially thought a glitch had caused me to see the mobile layout on my desktop instead of the standard desktop layout, but I soon learned this was no accident. I was part of a subset of Twitter users who were switched to the new layout without being given the option to opt in.
New desktop look 
While others might have focused on the cosmetic or functional changes, my immediate concern was to understand the extent of the structural changes to the Twitter DOM. I opened Google Chrome Developer Tools to inspect the Twitter DOM and was displeased to learn that the changes went beyond the cosmetic (such as the new look of the icons) and reached into the DOM itself. This meant that I would have to rewrite all my research applications built to scrape data from the old Twitter layout.
Old Twitter desktop look
At the moment, I am unsure if it will be possible to extract all the data previously accessible from the old layout. It is important to note that scraping goes against Twitter's Terms of Service, and that Twitter offers an API that fulfills some requests, obviating the need for scraping in those cases. However, the Twitter API's search is limited and, most importantly, the API does not offer a method for extracting all tweets from a conversation. Extracting tweets from a conversation is a task fundamental to my PhD research, so I scrape Twitter privately for research. In this blog post, I will use the tweet below to highlight some of the major changes to the Twitter DOM, specifically the tweet DIV, by comparing the old and new layouts.
Fig. 2: In the new tweet DIV, semantic items (e.g., the four semantic items in Fig. 1) are absent or obscured.
Old Tweet DIV vs New Tweet DIV
The most consequential (to me) structural difference between the old and new tweet DIVs is that the old tweet DIV includes many attributes with meaningful names while the new tweet DIV does not. In fact, in the old layout, the fundamental unit, the tweet, was explicitly labeled by a DIV with the class name "tweet"; the new layout has no such label. Let us consider the difference between the old and new tweet DIVs from the perspective of the four important attributes marked in Fig. 1:
  1. data-tweet-id: In the old layout, the data-tweet-id attribute (which contains the tweet ID, a string that uniquely identifies a tweet) was explicitly marked. In the new layout, the data-tweet-id attribute is absent.
  2. data-conversation-id: This attribute, present in the old layout and absent in the new one, is responsible for chaining tweets and is thus required for identifying the tweets in a reply or conversation thread. A tweet that is a reply includes the tweet ID of its parent tweet as the value of its data-conversation-id attribute.
  3. data-screen-name: The data-screen-name attribute labels the Twitter handle of the tweet author. This attribute is marked explicitly in the old tweet DIV, but not in the new tweet DIV.
  4. tweet-text: Within the old tweet DIV, the DIV with the class name "tweet-text" marks the text of the tweet, but in the new tweet DIV, there is no such semantic label for the tweet text.
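To illustrate why these four labels mattered, here is a minimal sketch of how they could be read from old-layout markup with Python's standard html.parser module and then used to chain tweets into conversations. This is not the scraper used for this research: the SAMPLE markup, its IDs and handles, the extra class name, and the names OldTweetParser and group_by_conversation are all hypothetical, and the sample assumes that an original tweet carries its own ID as its data-conversation-id. Only the attribute and class names listed above come from the old layout.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical old-layout markup (IDs, handles, and extra class are made up).
SAMPLE = """
<div class="tweet some-other-class" data-tweet-id="1001"
     data-conversation-id="1001" data-screen-name="alice">
  <div class="tweet-text">Original <b>tweet</b></div>
</div>
<div class="tweet" data-tweet-id="1002"
     data-conversation-id="1001" data-screen-name="bob">
  <div class="tweet-text">A reply</div>
</div>
"""


class OldTweetParser(HTMLParser):
    """Collect the four semantic fields from old-layout tweet markup."""

    def __init__(self):
        super().__init__()
        self.tweets = []       # one dict per tweet element found
        self._text_depth = 0   # > 0 while inside a "tweet-text" element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._text_depth:
            self._text_depth += 1   # nested tag inside the tweet text
        elif "tweet" in classes and "data-tweet-id" in attrs:
            self.tweets.append({
                "tweet_id": attrs["data-tweet-id"],
                "conversation_id": attrs.get("data-conversation-id"),
                "screen_name": attrs.get("data-screen-name"),
                "text": "",
            })
        elif "tweet-text" in classes and self.tweets:
            self._text_depth = 1    # start collecting this tweet's text

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._text_depth:
            self._text_depth -= 1

    def handle_data(self, data):
        if self._text_depth:
            self.tweets[-1]["text"] += data


def group_by_conversation(tweets):
    """Map each conversation ID to the IDs of the tweets that carry it."""
    threads = defaultdict(list)
    for tweet in tweets:
        threads[tweet["conversation_id"]].append(tweet["tweet_id"])
    return dict(threads)


parser = OldTweetParser()
parser.feed(SAMPLE)
threads = group_by_conversation(parser.tweets)
```

With the new layout, none of the lookups in handle_starttag would match, since neither the data-* attributes nor the "tweet" and "tweet-text" class names are present; a scraper would have to infer the same fields from the position of elements or from other cues.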
The new Twitter layout is still under development, so it comes as no surprise that I discovered a glitch. I noticed that reloading my timeline caused Twitter to load sponsored tweets and then quickly remove them from my timeline. This happens too fast to capture with a screenshot, so I recorded my screen to capture the glitch (Fig. 3).
Fig. 3: New Twitter layout glitch showing the loading and subsequent removal of sponsored tweets
It is not clear if the structural changes to the Twitter DOM are merely coincidental with the rollout of the new layout or if the removal of semantic attributes is part of an intentional effort to discourage scraping. Whatever the actual reason, the consequence is obvious - scraping Twitter has just gotten harder.

-- Alexander C. Nwala (@acnwala)
