2019-05-06: Twitter broke my scrapers
I soon learned this was no accident. I was part of a subset of Twitter users who did not have the option to opt in to try the new layout.
New desktop look
Old Twitter desktop look
.@WebSciDL 365 dots in 2018 highlights the top news stories everyday in 2018,
and the top 10 news stories of 2018.
See our blogpost about the transparent graph-theoretic method @storygraphbot uses to quantify the magnitude of news stories: https://t.co/hjF6EsyaWI pic.twitter.com/ILqn91lVLT
-- Alexander C. Nwala (@acnwala) March 6, 2019
Fig. 2: In the new tweet DIV, semantic items (e.g., the four semantic items in Fig. 1) are absent or obscured.
The most consequential (to me) structural difference between the old and new tweet DIVs is that the old tweet DIV includes many attributes with meaningful names, while the new tweet DIV does not. In fact, in the old layout the fundamental unit, the tweet, was explicitly labeled by a DIV with the class name "tweet"; the new layout has no such label. Let us consider the difference between the old and new tweet DIVs from the perspective of the four important semantic items marked in Fig. 1:
- data-tweet-id: In the old layout, the data-tweet-id attribute, which holds the tweet ID (a string that uniquely identifies a tweet), was explicitly marked. In the new layout, the data-tweet-id attribute is absent.
- data-conversation-id: This attribute, present in the old layout and absent in the new one, is responsible for chaining tweets and is thus required for identifying tweets in a reply or conversation thread. A tweet that is a reply includes the tweet ID of its parent tweet as the value of its data-conversation-id attribute.
- data-screen-name: The data-screen-name attribute labels the Twitter handle of the tweet author. This attribute is marked explicitly in the old tweet DIV, but not in the new tweet DIV.
- tweet-text: Within the old tweet DIV, the DIV with the class name "tweet-text" marks the text of the tweet, but in the new tweet DIV there is no such semantic label for the tweet text.
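
To make the four items above concrete, here is a minimal sketch of how a scraper could have pulled them out of an old-layout tweet DIV using only Python's standard library. It is not the scraper I actually ran: the sample markup, the class name OldTweetParser, and the helpers parse_old_tweets and is_reply are all hypothetical and written only to mirror the attributes described above. The reply check additionally assumes that a non-reply tweet either lacks data-conversation-id or carries its own tweet ID there.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the old-layout attributes described in the post.
SAMPLE_OLD_LAYOUT = """
<div class="tweet" data-tweet-id="1002" data-conversation-id="1001"
     data-screen-name="example_user">
  <div class="tweet-text">Hello <a href="#">world</a></div>
</div>
"""

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class OldTweetParser(HTMLParser):
    """Collects the four semantic items from old-layout tweet DIVs."""

    def __init__(self):
        super().__init__()
        self.tweets = []       # one dict per tweet DIV found
        self._text_depth = 0   # > 0 while inside a tweet-text element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._text_depth:
            # Nested element (e.g. a link) inside the tweet text.
            self._text_depth += 1
        elif tag == "div" and "tweet" in classes:
            self.tweets.append({
                "tweet_id": attrs.get("data-tweet-id"),
                "conversation_id": attrs.get("data-conversation-id"),
                "screen_name": attrs.get("data-screen-name"),
                "text": "",
            })
        elif "tweet-text" in classes and self.tweets:
            self._text_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._text_depth:
            self._text_depth -= 1

    def handle_data(self, data):
        if self._text_depth:
            self.tweets[-1]["text"] += data


def parse_old_tweets(html):
    """Return a list of dicts, one per DIV with class "tweet"."""
    parser = OldTweetParser()
    parser.feed(html)
    parser.close()
    return parser.tweets


def is_reply(tweet):
    # Assumption (not stated in the post): a non-reply tweet has no
    # data-conversation-id or carries its own tweet ID in it.
    return tweet["conversation_id"] not in (None, tweet["tweet_id"])
```

Feeding the same parser a DIV from the new layout, where none of these names appear, simply yields an empty list, which is exactly how a scraper built on these labels breaks.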
It is not clear whether the structural changes to the Twitter DOM are merely coincidental with the rollout of the new layout or whether the removal of semantic attributes is part of an intentional effort to discourage scraping. Whatever the actual reason, the consequence is obvious: scraping Twitter has just gotten harder.
Update (2019-06-29)
I previously had a paragraph in this blogpost discussing what I thought was a glitch:
I noticed that reloading my timeline caused Twitter to load and subsequently quickly remove sponsored tweets from my timeline.
During a discussion with Sawood about Twitter Ads, he raised the issue of Adblock, which quickly made me realize that "the glitch" might not be a glitch at all, but Adblock in action. I have had Adblock on for so long that I failed to realize that it interfered with my conclusion. Further investigation confirmed that I had incorrectly attributed Adblock's removal of promoted tweets from my timeline to a glitch.
-- Alexander C. Nwala (@acnwala)