2019-05-06: Twitter broke my scrapers

Fig. 1: The old tweet DIV, showing four attributes with meaningful names (data-tweet-id, data-conversation-id, data-screen-name, and tweet-text). These attributes are absent from the new tweet DIV (Fig. 2).

On April 23, 2019, my Twitter desktop layout changed. I initially thought a glitch caused me to see the mobile layout on my desktop instead of the standard desktop layout, but I soon learned this was no accident. I was part of a subset of Twitter users who received the new layout without the option to opt in.

New desktop look

While others might have focused on the cosmetic or functional changes, my immediate concern was to understand the extent of the structural changes to the Twitter DOM. So I opened Google Chrome Developer Tools to inspect the Twitter DOM, and I was displeased to learn that the layout changes went beyond the new look of the icons and reached into the DOM itself. This meant that I would have to rewrite all my research applications built to scrape data from the old Twitter layout.

Old Twitter desktop look

At the moment, I am unsure if it will be possible to extract all the data previously accessible from the old layout. It is important to note that scraping goes against Twitter's Terms of Service, and that Twitter offers an API that fulfills some requests, removing the need for scraping in those cases. However, the Twitter API's search is limited and, most importantly, the API does not offer a method for extracting all tweets from a conversation. Extracting tweets from a conversation is a task fundamental to my PhD research, so I scrape Twitter privately for research. In this blogpost, I will use the tweet below to highlight some of the major changes to the Twitter DOM, specifically the tweet DIV, by comparing the old and new layouts.

.@WebSciDL 365 dots in 2018 highlights the top news stories everyday in 2018,
and the top 10 news stories of 2018.

See our blogpost about the transparent graph-theoretic method @storygraphbot uses to quantify the magnitude of news stories:https://t.co/hjF6EsyaWI pic.twitter.com/ILqn91lVLT
— Alexander C. Nwala (@acnwala) March 6, 2019

Fig. 2: In the new tweet DIV, semantic items (e.g., the four semantic items in Fig. 1) are absent or obscured.

Old Tweet DIV vs New Tweet DIV

The most consequential (to me) structural difference between the old and new tweet DIVs is that the old tweet DIV includes many attributes with meaningful names while the new tweet DIV does not. In fact, in the old layout, the fundamental unit, the tweet, was explicitly labeled as such by a DIV with the class name "tweet"; the new layout has no such label. Let us consider the difference between the old and new tweet DIVs from the perspective of the four important attributes marked in Fig. 1:

data-tweet-id: In the old layout, the data-tweet-id attribute, which contains the tweet ID (a string that uniquely identifies a tweet), was explicitly marked. In the new layout, the data-tweet-id attribute is absent.
data-conversation-id: This attribute, present in the old layout and absent in the new layout, is responsible for chaining tweets, and is thus required for identifying tweets in a reply or conversation thread. A tweet that is a reply includes the tweet ID of its parent tweet as the value of the data-conversation-id attribute.
data-screen-name: The data-screen-name attribute labels the Twitter handle of the tweet author. This attribute is marked explicitly in the old tweet DIV, but not in the new tweet DIV.
tweet-text: Within the old tweet DIV, the DIV with the class name "tweet-text" marks the text of the tweet, but in the new tweet DIV, there is no such semantic label for the tweet text.
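
To illustrate why these four labels mattered, here is a minimal Python sketch of how they could be read from old-layout markup and used to chain tweets into conversations. It is not my actual scraper. The markup in SAMPLE_HTML is a simplified, hypothetical stand-in for the old tweet DIV that uses only the attribute and class names discussed above; the IDs, screen names, function names, and dictionary keys are all made up for illustration. It relies only on the standard library's html.parser.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical, simplified old-layout markup (IDs and handles are made up).
SAMPLE_HTML = """
<div class="tweet" data-tweet-id="1001" data-conversation-id="1001" data-screen-name="alice">
  <div class="tweet-text">Original tweet</div>
</div>
<div class="tweet" data-tweet-id="1002" data-conversation-id="1001" data-screen-name="bob">
  <div class="tweet-text">A reply to <b>alice</b><br>second line</div>
</div>
"""

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class OldTweetParser(HTMLParser):
    """Collects the four semantic items from each old-layout tweet DIV."""

    def __init__(self):
        super().__init__()
        self.tweets = []
        self._current = None   # tweet dict currently being filled
        self._text_depth = 0   # > 0 while inside the tweet-text element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "tweet" in classes and "data-tweet-id" in attrs:
            # The explicitly labeled tweet DIV starts a new record.
            self._current = {
                "tweet_id": attrs["data-tweet-id"],
                "conversation_id": attrs.get("data-conversation-id"),
                "screen_name": attrs.get("data-screen-name"),
                "text": "",
            }
            self.tweets.append(self._current)
        elif self._text_depth:
            self._text_depth += 1          # nested tag inside the tweet text
        elif self._current is not None and "tweet-text" in classes:
            self._text_depth = 1           # entering the tweet text

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._text_depth:
            self._text_depth -= 1

    def handle_data(self, data):
        if self._text_depth:
            self._current["text"] += data


def parse_old_tweets(html):
    """Return one dict per tweet DIV found in old-layout HTML."""
    parser = OldTweetParser()
    parser.feed(html)
    return parser.tweets


def group_by_conversation(tweets):
    """Chain tweets by mapping each data-conversation-id to its tweet IDs."""
    threads = defaultdict(list)
    for tweet in tweets:
        threads[tweet["conversation_id"]].append(tweet["tweet_id"])
    return dict(threads)


if __name__ == "__main__":
    tweets = parse_old_tweets(SAMPLE_HTML)
    for tweet in tweets:
        print(tweet["screen_name"], tweet["tweet_id"], repr(tweet["text"]))
    print(group_by_conversation(tweets))
```

With the semantic labels gone, none of these lookups has an obvious counterpart in the new tweet DIV, which is why each scraper has to be rewritten rather than patched.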

It is not clear whether the structural changes to the Twitter DOM are merely coincidental with the rollout of the new layout or whether the removal of semantic attributes is part of an intentional effort to discourage scraping. Whatever the actual reason, the consequence is obvious: scraping Twitter has just gotten harder.

Update (2019-06-29)

I previously had a paragraph in this blogpost discussing what I thought was a glitch:

I noticed that reloading my timeline caused Twitter to load and subsequently quickly remove sponsored tweets from my timeline.

During a discussion with Sawood about Twitter Ads, he raised the issue of Adblock, which quickly made me realize that "the glitch" might not be a glitch, but Adblock in action. I have had Adblock on for so long that I failed to realize that it interfered with my conclusion. Further investigation confirmed that I had incorrectly attributed Adblock's removal of promoted tweets from my timeline to a glitch.

-- Alexander C. Nwala (@acnwala)

Web Science and Digital Libraries Research Group
