Posts

Showing posts with the label Heritrix

2019-03-18: Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages

Figure 1: Mixed-language blocks on a memento of a Twitter timeline. Blue boxes highlight Portuguese, orange boxes English, and red boxes Urdu. Dotted borders indicate the template present in the original HTML response, while solid borders indicate lazily loaded content.

Would you be surprised if I told you that Twitter is a multilingual website, supporting 47 different languages? How about if I told you that a typical Twitter timeline page can contain tweets in whatever languages the owner of the handle chooses to tweet in, but can also show the navigation bar and various sidebar blocks in many different languages simultaneously? Surprised now? Well, while it makes no sense, it can actually happen in web archives when a memento of a timeline is accessed, as shown in Figure 1. Spoiler alert! Cookies are to blame, once again. Last month, I was investigating a real-life version of "Ron Burgundy will read anything on the

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English

Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages whose default HTML language was expected to be English, but whose template appears in a different language when the archived page is retrieved. For example, the tweet content of former US President Barack Obama's archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. You may notice that some of the information, such as "followers", "following", and "log in", is displayed in Urdu rather than English. A similar observation was made by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications whe

2016-12-20: Archiving Pages with Embedded Tweets

I'm from Louisiana and used Archive-It to build a collection of webpages about the September flood there (https://www.archive-it.org/collections/7760/). One of the pages I came across, "Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy'", highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, a YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page:

Live page, screenshot generated on Sep 9, 2016

To me, the most important resources here were the tweets and their pictures, so I'll focus on how well they were archived. First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on T