Posts

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English

Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages whose default HTML language setting was expected to be English, but whose template appears in another language when the archived page is retrieved. For example, the tweet content of former US President Barack Obama's archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. You may notice that some of the labels, such as "followers", "following", and "log in", are displayed in Urdu instead of English. A similar observation was made by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications whe
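One way to spot a template-language mismatch like this is to read the `lang` attribute on the `<html>` element of the archived page. The following is only a minimal sketch under that assumption, not the lab's actual method; the helper names `html_lang` and `template_language_mismatch` are hypothetical, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


class _LangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def html_lang(markup):
    """Return the declared page language, or None if none is declared."""
    finder = _LangFinder()
    finder.feed(markup)
    return finder.lang


def template_language_mismatch(markup, expected="en"):
    """True when a language is declared and it does not start with `expected`."""
    lang = html_lang(markup)
    return lang is not None and not lang.lower().startswith(expected)


# Invented sample markup standing in for an archived page.
sample = '<!DOCTYPE html><html lang="ur"><body><p>Tweet text</p></body></html>'
print(html_lang(sample), template_language_mismatch(sample))
```

A check like this only inspects the declared language; it says nothing about why the archive captured that variant.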

2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertiser

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscriber-based sites, such as The Financial Times or Wall Street Journal, because these sites provided only snippets of an article, after which users were confronted with a "Subscribe Now" prompt to view the remaining content. The New York Times, as well as other news sites, also has subscriber-based content, but access is limited only once a user has exceeded a set number of stories viewed. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took all of these URI-Ms from the homepage of their respect

2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Twitter's API gives us only limited ability to collect historical data about a user's followers. The time at which one account started following another is unavailable, and without it the popularity of an account and its growth cannot be tracked. Another pitfall is that when an account is deleted, Twitter provides no data about the account after the deletion date. It is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to be archived, then a follower count for a specific date can be collected. The previous method to determine followers over time is to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation
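To attach a date to a follower count read from an archived page, the capture time can be taken from the 14-digit timestamp that Wayback Machine URI-Ms carry after `/web/`. The sketch below assumes that URI-M form and is not the code used in the post; `memento_datetime` and `follower_series` are hypothetical helper names, and the counts in the usage example are made-up placeholders, not real data.

```python
import re
from datetime import datetime

# Wayback URI-Ms look like .../web/<14-digit timestamp>[modifier]/<original URI>
_URIM = re.compile(r"/web/(\d{14})[a-z_]*/(.+)$")


def memento_datetime(urim):
    """Split a Wayback URI-M into (capture datetime, original URI)."""
    match = _URIM.search(urim)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + urim)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S"), match.group(2)


def follower_series(observations):
    """Turn (URI-M, follower count) pairs into a date-sorted list of points."""
    return sorted((memento_datetime(urim)[0], count) for urim, count in observations)


# Placeholder observations; the counts are invented for illustration only.
points = follower_series([
    ("https://web.archive.org/web/20160805120000/https://twitter.com/USAGym", 200),
    ("https://web.archive.org/web/20120801000000/https://twitter.com/USAGym", 100),
])
for when, count in points:
    print(when.date(), count)
```

Extracting the follower count itself from the archived HTML depends on the page template of each era and is left out of this sketch.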

2018-03-12: NEH ODH Project Directors' Meeting

Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC. We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC. The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report).

Update: 2018 Lightning Round Talk Videos

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs. Kate Zwaard is today's #ODHatTEN keynote! She's shared a

2018-03-04: Installing Stanford CoreNLP in a Docker Container

Fig. 1: Example of Text Labeled with the CoreNLP Part-of-Speech, Named-Entity Recognizer and Dependency Annotators.

The Stanford CoreNLP suite provides a wide range of important natural language processing applications such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java, and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working because of various environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.

How to run the CoreNLP server on localhost port 9000 from a D
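As a rough illustration of accessing CoreNLP through HTTP requests, the sketch below posts text to a server assumed to be listening on localhost port 9000, passing the annotators in the `properties` query parameter as JSON. It is not the post's own client code; the function names `build_corenlp_url` and `annotate` are hypothetical, and the annotator list is just one plausible choice.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_corenlp_url(base="http://localhost:9000",
                      annotators=("tokenize", "ssplit", "pos", "ner")):
    """Build the request URL with a JSON-encoded `properties` parameter."""
    props = {"annotators": ",".join(annotators), "outputFormat": "json"}
    return base + "/?" + urlencode({"properties": json.dumps(props)})


def annotate(text, url=None, timeout=30):
    """POST raw text to the server and return the parsed JSON response.

    Requires a running CoreNLP server at the given URL.
    """
    request = Request(url or build_corenlp_url(), data=text.encode("utf-8"))
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only the URL is printed here; call annotate("...") once a server is up.
    print(build_corenlp_url())
```

Keeping URL construction separate from the network call makes the request easy to inspect before a container is running.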

2018-02-27: Summary of Gathering Alumni Information from a Web Social Network

While researching my dissertation topic (slides 2--28) on social media profile discovery, I encountered a related paper titled Gathering Alumni Information from a Web Social Network, written by Gabriel Resende Gonçalves, Anderson Almeida Ferreira, and Guilherme Tavares de Assis, which was published in the proceedings of the 9th IEEE Latin American Web Congress (LA-WEB). In this paper, the authors describe a semi-automated method for gathering information about alumni of a given undergraduate program at Brazilian higher education institutions. Specifically, they use the Google Custom Search Engine (CSE) to identify candidate LinkedIn pages based on a comparative evaluation of similar pages in their training set. The authors contend that alumni are found efficiently through their process, which relies on focused crawling of data that the alumni themselves have posted publicly on social networks. The proposed methodology consists of three main modul