Posts

Showing posts from March, 2018

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English

Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages whose default HTML language setting was expected to be English, but whose template appears in a foreign language when the archived page is retrieved. For example, the tweet content of former US President Barack Obama's archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. Some of the information, such as "followers", "following", and "log in", is displayed in Urdu rather than English. Justin Littman made a similar observation in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications whe

2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertiser

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscriber-based sites, such as The Financial Times or Wall Street Journal, because these sites provide only snippets of an article and then confront users with a "Subscribe Now" prompt before they can view the remaining content. The New York Times, like some other news sites, also has subscriber-based content, but access is limited only once a user has exceeded a set number of stories. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took all of these URI-Ms from the homepage of their respect

2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Because of limitations in Twitter's API, we have limited ability to collect historical data about a user's followers. The time at which one account started following another is unavailable, and without that information we cannot track the popularity of an account or how it grows. Another pitfall is that when an account is deleted, Twitter provides no data about the account after the deletion date. It is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to be archived, then a follower count for a specific date can be collected. The previous method of determining followers over time was to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation
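As a rough illustration of the idea of collecting dated captures of a Twitter page from the Internet Archive, the sketch below builds a query for the Wayback Machine's CDX server and turns its JSON reply into (timestamp, URI-M) pairs. It is not necessarily the approach used in the post: the helper names, the account handle `example_account`, and the sample rows are hypothetical, and extracting the follower count from each memento's HTML is left out because the page markup varies over time.

```python
import urllib.parse

# Wayback Machine CDX server endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, year_from, year_to):
    """Build a CDX query URL listing successful captures of page_url (hypothetical helper)."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "filter": "statuscode:200",
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def mementos_from_cdx(rows):
    """Convert decoded CDX JSON (first row is the header) into (timestamp, URI-M) pairs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    original = header.index("original")
    return [
        (record[ts], "http://web.archive.org/web/{}/{}".format(record[ts], record[original]))
        for record in records
    ]


# Made-up sample reply for a hypothetical account, in the CDX JSON layout.
sample_rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,twitter)/example_account", "20160801000000",
     "https://twitter.com/example_account", "text/html", "200", "ABC", "1234"],
]
pairs = mementos_from_cdx(sample_rows)
```

Each URI-M in `pairs` could then be fetched and parsed for the follower count shown on that capture date.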

2018-03-12: NEH ODH Project Directors' Meeting

Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC. We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC. The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report).

Update: 2018 Lightning Round Talk Videos

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs.

Kate Zwaard is today's #ODHatTEN keynote! She's shared a

2018-03-04: Installing Stanford CoreNLP in a Docker Container

Fig. 1: Example of Text Labeled with the CoreNLP Part-of-Speech, Named-Entity Recognizer and Dependency Annotators.

The Stanford CoreNLP suite provides a wide range of important natural language processing applications, such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java, and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working because of various environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.

How to run the CoreNLP server on localhost port 9000 from a D
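To illustrate what accessing CoreNLP services through HTTP requests can look like, here is a minimal Python sketch that assumes a CoreNLP server is already listening on localhost port 9000. The function names are hypothetical and are not taken from the post; the request shape (raw text in the POST body, annotation settings in a `properties` query parameter) follows the CoreNLP server's documented interface.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(host="localhost", port=9000, annotators="tokenize,ssplit,pos,ner"):
    """Return the server URL with the annotation properties encoded in the query string."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return "http://{}:{}/?{}".format(host, port, query)


def annotate(text, url=None, timeout=30):
    """POST raw text to a running CoreNLP server and return the parsed JSON reply."""
    request = urllib.request.Request(url or build_corenlp_url(), data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pos_tags(annotation):
    """Flatten a CoreNLP JSON annotation into (word, POS tag) pairs."""
    return [
        (token["word"], token["pos"])
        for sentence in annotation["sentences"]
        for token in sentence["tokens"]
    ]
```

With the container running, `pos_tags(annotate("Some text to label."))` would return the POS tag assigned to each token; `annotate` needs a live server, while the other two functions work offline.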