Posts

Showing posts from March, 2018

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English

Image
The ODUWSDL lab has sporadically encountered archived Twitter pages whose default HTML language setting was expected to be English, but whose template appears in a different language when the archived page is retrieved. For example, on the archived Twitter page of former US President Barack Obama, shown in the image above, the tweet content is in English but the page template is in Urdu. Labels such as "followers", "following", and "log in" are displayed in Urdu rather than English. Justin Littman made a similar observation in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. The problem may appear benign to the casual observer, but it has deep implications from a digital archivist's perspective.
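One way to spot such a mismatch is to read the language declared on the root element of a retrieved page. The following is only a minimal sketch: it assumes the template language is exposed through the `lang` attribute of the `<html>` tag, and the helper names `declared_language` and `template_matches` are hypothetical, not part of any tool described in the post.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Record the lang attribute of the first <html> start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def declared_language(html_text):
    """Return the declared page language, or None if none is declared."""
    parser = HtmlLangParser()
    parser.feed(html_text)
    return parser.lang


def template_matches(html_text, expected="en"):
    """True if the primary language subtag equals the expected one."""
    lang = declared_language(html_text)
    if lang is None:
        return False
    return lang.lower().split("-")[0] == expected.lower()


if __name__ == "__main__":
    # Hypothetical page source, standing in for a retrieved memento.
    sample = '<!DOCTYPE html><html lang="ur"><body>...</body></html>'
    print(declared_language(sample), template_matches(sample))
```

A check like this only flags the declared language; it says nothing about why the archive captured that version of the page.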

The problem became…

2018-03-15: Paywalls in the Internet Archive

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscriber-based sites, such as The Financial Times or The Wall Street Journal, because these sites provided only a snippet of each article and then confronted users with a "Subscribe Now" prompt to view the remaining content. The New York Times and other news sites also have subscriber-based content, but they limit access only after a user has exceeded a set number of stories. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took all of these URI-Ms from the homepage of their respective news sites and tried to see how the Internet A…

2018-03-14: Twitter Follower Count History via the Internet Archive

Twitter's API gives us only limited ability to collect historical data about a user's followers. The time at which one account started following another is unavailable, and without it we cannot track how an account's popularity grows. Another pitfall is that once an account is deleted, Twitter provides no data about it after the deletion date; it is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to be archived, then a follower count for a specific date can be collected.

The previous method for estimating followers over time plots the followers, in the order the API returns them, against their join dates. This relies on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation date of the follower is the lower bound for when they could have started following the account under obs…
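The lower-bound idea described above can be illustrated with a short sketch. It assumes the input list is already arranged from the earliest follow to the latest (reverse the list first if the API returns the newest follower first); the function name `follow_time_lower_bounds` and the sample dates are hypothetical. Because each follower must have followed after everyone earlier in the list, and no one can follow before their own account exists, a running maximum of creation dates gives a lower bound for each follow time.

```python
from datetime import date
from itertools import accumulate


def follow_time_lower_bounds(creation_dates):
    """Return, for each follower, the earliest date they could have followed.

    creation_dates: account creation dates of the followers, ordered from
    the earliest follow to the latest (an assumption about the input).
    """
    # Running maximum: follower i followed no earlier than the newest
    # account creation date among followers 0..i.
    return list(accumulate(creation_dates, max))


if __name__ == "__main__":
    # Made-up creation dates, for illustration only.
    created = [date(2010, 1, 1), date(2009, 5, 1), date(2012, 3, 1), date(2011, 1, 1)]
    for rank, bound in enumerate(follow_time_lower_bounds(created), start=1):
        print(rank, bound.isoformat())
```

Plotting the follower rank against these bounds gives an estimated growth curve, but it is only as good as the ordering assumption.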

2018-03-12: NEH ODH Project Directors' Meeting

Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC. We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC.

The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report). 
Update: 2018 Lightning Round Talk Videos

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs.
Kate Zwaard is today's #ODHatTEN keynote! She's shared a bit about our approaches & projec…

2018-03-04: Installing Stanford CoreNLP in a Docker Container

The Stanford CoreNLP suite provides a wide range of important natural language processing applications such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working due to different environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.

How to run the CoreNLP server on localhost port 9000 from a Docker container

1. Install Docker if not already available
2. Pull the image from the repository and run the container:
$ docker pull anwala/stanfordcorenlp
$ docker ru…
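Once the server is listening on localhost port 9000, the HTTP access mentioned above might look like the following sketch. It relies on the CoreNLP server convention of sending the text as the POST body and the annotator settings as a JSON `properties` query parameter; the helper names `build_corenlp_request` and `annotate` are hypothetical, and the annotator list is only an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_corenlp_request(text, annotators="pos,ner",
                          base_url="http://localhost:9000/"):
    """Build (but do not send) a POST request for the CoreNLP server."""
    # The server reads its settings from a JSON "properties" parameter.
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urlencode({"properties": json.dumps(properties)})
    return Request(base_url + "?" + query,
                   data=text.encode("utf-8"), method="POST")


def annotate(text, **kwargs):
    """Send the text to a running CoreNLP server and decode the JSON reply."""
    with urlopen(build_corenlp_request(text, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the container to be running and reachable on port 9000.
    result = annotate("Barack Obama was born in Hawaii.")
    for sentence in result.get("sentences", []):
        for token in sentence.get("tokens", []):
            print(token.get("word"), token.get("pos"), token.get("ner"))
```

Keeping request construction separate from sending makes it easy to inspect the URL before a server is available.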