2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive in San Francisco, CA. The workshop was two days long, February 23-24, 2017, and was held in conjunction with the National Web Symposium, also hosted at the Internet Archive. Four members of the Web Science and Digital Libraries (WS-DL) group from Old Dominion University had the opportunity to attend: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and the Web Archive Hackathon 2.0.

This workshop was supported by the Internet Archive, Rutgers University, and the University of Waterloo. The workshop brought together a small group of around 20 researchers who worked together to develop new open-source tools for web archives. The three organizers of this workshop were: Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (Assistant Professor, Department of History, University of Waterloo), and Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo).
It was a big moment for me when I first saw the Internet Archive building, with an Internet Archive truck parked outside. Since 2009, the IA headquarters have been at 300 Funston Avenue in San Francisco, a former Christian Science church. Inside the building, in the main hall, there are mini statues of every archivist who has worked at the IA for over three years.
On Wednesday night, we had a welcome dinner and brief introductions among the members who had arrived.
Day 1 (February 23, 2017)
On Thursday, we started with breakfast and headed to the main hall, where several presentations took place. Matthew Weber presented “Stating the Problem, Logistical Comments”. Dr. Weber started by stating the goals, which included developing a common vision of web archiving and tool development, and learning to work with born-digital resources for humanities and social science research.
Next, Ian Milligan presented “History Crawling with Warcbase”. Dr. Milligan gave an overview of Warcbase, an open-source platform for managing web archives built on Hadoop and HBase. The tool is used to analyze web archives with Spark, and takes advantage of HBase to provide random access as well as analytics capabilities.

Next, Jefferson Bailey (Internet Archive) presented “Data Mining Web Archives”. He talked about the conceptual issues in access to web archives, which include: provenance (much data, but not all as expected), acquisition (highly technical; crawl configs; soft 404s), border issues (the web never really ends), the lure of ever-more data (more data is not better data), and attestation (higher sensitivity to elision than in traditional archives?). He also explained the different formats in which the Internet Archive can provide its data, which include CDX, the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). In addition, he presented an overview of some research projects based on collaboration with the IA. Some of the projects he mentioned were: the ALEXANDRIA project, Web Archives for Longitudinal Knowledge, Global Event and Trend Archive Research & Integrated Digital Event Archiving, and many more.
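Of the formats Jefferson mentioned, CDX is the simplest: a plain-text index with one space-delimited line per capture. Here is a minimal sketch of parsing such a line in Python; the field order follows the common 11-field layout used by the Internet Archive, and the sample line itself is invented for illustration:

```python
# Field names for the common 11-field CDX layout ("CDX N b a m s k r M S V g").
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "length", "offset", "filename",
]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field dictionary."""
    return dict(zip(CDX_FIELDS, line.split()))

# An invented sample capture record of example.org from February 24, 2017.
record = parse_cdx_line(
    "org,example)/ 20170224120000 http://example.org/ text/html 200 "
    "ABC123DEF456 - - 1043 5120 EXAMPLE-20170224-00001.warc.gz"
)
print(record["timestamp"])  # "20170224120000"
```

With the line parsed into named fields, tasks such as filtering captures by status code or grouping them by day become simple dictionary lookups.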
Next, Vinay Goel (Internet Archive) presented “API Overview”. He presented the Beta Wayback Machine, which searches the IA based on a URL or a word related to a site's home page. He mentioned that search results are ranked based on anchor text.
Justin Littman (George Washington University Libraries) presented “Social Media Collecting with Social Feed Manager”. SFM is open-source software that collects social media from the APIs of Twitter, Tumblr, Flickr, and Sina Weibo.

The final talk was by Ilya Kreymer (Rhizome), who presented an overview of the tool Webrecorder. The tool provides an integrated platform for creating high-fidelity web archives while browsing, sharing, and disseminating archived content.
After that, we had a short coffee break and started to form three groups. To form the groups, all participants were encouraged to write a few words on the topic they would like to work on; some words that appeared were: fake news, news, twitter, etc. Similar notes were grouped together, along with their associated members. The resulting groups were Local News, Fake News, and End of Term Transition.

Local News (Good News/Bad News):
- Sawood Alam, Old Dominion University
- Lulwah Alkwai, Old Dominion University
- Mark Beasley, Rhizome
- Brenda Berkelaar, University of Texas at Austin
- Frances Corry, University of Southern California
- Ilya Kreymer, Rhizome
- Nathalie Casemajor, INRS
- Lauren Ko, University of North Texas

Fake News:
- Erika Siregar, Old Dominion University
- Allison Hegel, University of California, Los Angeles
- Liuqing Li, Virginia Tech
- Dallas Pillen, University of Michigan
- Melanie Walsh, Washington University

End of Term Transition:
- Mohamed Aturban, Old Dominion University
- Justin Littman, George Washington University
- Jessica Ogden, University of Southampton
- Yu Xu, University of Southern California
- Shawn Walker, University of Washington
Every group started to work on its dataset, brainstormed different research questions to answer, and formed a plan of work. Then we worked through the rest of the day, ending the night with a working dinner at the IA.

Day 2 (February 24, 2017)
On Friday we started with breakfast, and then each team continued working on its project.
Every Friday the IA hosts a free lunch where hundreds of people come together: artists, activists, engineers, librarians, and many more. Afterward, a public tour of the IA takes place.
We had some lightning talks after lunch. The first talk was by Justin Littman, where he presented an overview of his new tool called “Fbarc”. This tool archives web pages from Facebook using the Graph API.
Nick Ruest (Digital Assets Librarian at York University) gave a talk on Twitter. Next, Shawn Walker (University of Washington) presented “We are doing it wrong!”, explaining how the current process of collecting social media does not reflect how people actually view social media.

After that, all the teams presented their projects. Starting with our team, we called our project "Good News/Bad News". We utilized historical captures (mementos) of various local news sites' homepages from Archive-It to prepare our seed dataset. To transform the data for our use, we utilized Webrecorder, a WAT converter, and some custom scripts. We extracted the headlines featured on the homepage of each site for each day. With the extracted headlines, we analyzed sentiment at various levels, including individual headlines, individual sites, and the whole nation, using the VADER-Sentiment-Analysis Python library. To leverage more machine learning capabilities for clustering and classification, we built a Latent Semantic Indexing (LSI) model using a Ruby library called Classifier Reborn. Our LSI model helped us convey the overlap of discourse across the country. We also explored the possibility of building a Word2Vec model using TensorFlow for more advanced machine learning, but despite its great potential, the limited time available did not allow us to pursue it. To distinguish between local and national discourse, we planned to utilize Term Frequency-Inverse Document Frequency (TF-IDF), but could not put it together in time.

For the visualization, we planned to show an interactive US map with a heat map of newspaper locations, where the size of each spot reflects the newspaper's ranking and the color indicates good news (green) or bad news (red). When a newspaper is selected, a list of associated headlines is revealed (color coded as good/bad), along with a pie chart showing the overall percentages of good/bad/neutral, related headlines from other news sites across the country, and a word cloud of the 200 most frequently used words. This visualization could also have a time slider that shows the change in sentiment for each newspaper over time.
We had many more interesting visualization ideas to express our findings, but the limited time only allowed us to go this far. We have made all of our code and necessary data available in a GitHub repo, and we are trying to make a live installation available for exploration soon.
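As a rough illustration of the headline-level sentiment step described above, here is a toy, self-contained stand-in for the kind of lexicon-based scoring VADER performs. Real VADER uses a large curated lexicon plus heuristics for negation, intensifiers, and punctuation; the mini-lexicon and headlines below are invented:

```python
# Tiny hand-made polarity lexicons (a stand-in for VADER's curated lexicon).
GOOD = {"wins", "success", "growth", "rescue", "record"}
BAD = {"crash", "fire", "loss", "crime", "flood"}

def headline_sentiment(headline):
    """Score a headline by net lexicon hits and bucket it as good/bad/neutral."""
    words = headline.lower().split()
    score = sum(w in GOOD for w in words) - sum(w in BAD for w in words)
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "neutral"

print(headline_sentiment("Local team wins record fifth title"))   # good
print(headline_sentiment("House fire causes major loss downtown"))  # bad
```

Aggregating these per-headline labels by site, or across all sites, gives the site-level and nation-level views mentioned above.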

Next, the team “Fake News” presented their work. The team started with the research questions: “Is it fake news to misquote a presidential candidate by just one word? What about two? Three? When exactly does fake news become fake?”. Based on these questions, they hypothesized that “Fake news doesn’t only happen from the top down, but also happens at the very first moment of interpretation, especially when shared on social media networks". With this in mind, they wanted to determine how Twitter users were recording, interpreting, and sharing the words spoken by Donald Trump and Hillary Clinton in real time. Furthermore, they also wanted to find out how the “facts” (the accurate transcription of the words) began to evolve into counter-facts, or alternate versions, of those words. They analyzed Twitter data from the second presidential debate and focused on the most prominent keywords, such as "locker room", "respect for women", and "jail". The analysis results were visualized using a word tree and a bar chart. They also conducted a sentiment analysis, which produced a surprising result: most tweets had positive sentiment toward the locker-room talk. Further analysis showed that sarcastic and insincere comments had skewed the sentiment analysis, hence the positive sentiment.
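One simple way to operationalize the team's "one word off" question is to count word-level edits between the official transcript and a tweeted quote. This sketch uses Python's standard difflib; the quote strings are invented for illustration, not taken from the team's dataset:

```python
import difflib

def words_changed(original, quoted):
    """Count word-level insertions, deletions, and replacements between two quotes."""
    a, b = original.lower().split(), quoted.lower().split()
    sm = difflib.SequenceMatcher(None, a, b)
    changed = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed

# One inserted word ("just") away from the original phrasing:
print(words_changed("it was locker room talk",
                    "it was just locker room talk"))  # 1
```

A threshold on this count (one word? two? three?) would then be a concrete, if crude, line between quote and misquote.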

After that, the team “End of Term Transition” presented their project. The group tried to use public web archives to estimate the change in the main government domains at the time of each US presidential administration transition. For each of these official websites, they planned to identify the kind and the rate of change using multiple techniques, including Simhash, TF-IDF, edit distance, and efficient thumbnail generation. They investigated each of these techniques in terms of its performance and accuracy. The datasets were collected from the Internet Archive's Wayback Machine around the 2001, 2005, 2009, 2013, and 2017 transitions. The team made their work available on GitHub.
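Of the change-detection techniques the team investigated, Simhash is perhaps the least familiar: near-duplicate documents produce fingerprints that differ in only a few bits, so the Hamming distance between fingerprints approximates page change. A minimal pure-Python sketch, where the whitespace tokenization and MD5-based hashing are simplifications of a production implementation:

```python
import hashlib

def simhash(text, bits=64):
    """Compute a Simhash fingerprint: each token votes on each output bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

v1 = simhash("white house official site welcome to the white house")
v2 = simhash("white house official site welcome to our white house")
v3 = simhash("completely different page about tax forms and filing")
print(hamming(v1, v2), hamming(v1, v3))
```

Comparing fingerprints of the same URL across two administrations then reduces to one XOR and a popcount, which scales well to whole-domain crawls.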

Finally, a surprise new team joined: team “Nick”, presented by Nick Ruest (Digital Assets Librarian at York University). Nick has been exploring Twitter API mysteries, and he showed visualizations of some odd peaks that occurred.

After the teams presented their work, the judges announced the team with the most points; the winning team was “End of Term Transition”.

This workshop was extremely interesting and I enjoyed it fully. The fourth datathon, Archives Unleashed 4.0: Web Archive Datathon, was announced; it will take place at the British Library, London, UK, on June 11-13, 2017. Thanks to Matthew Weber, Ian Milligan, and Jimmy Lin for organizing this event, and to Jefferson Bailey, Vinay Goel, and everyone at the Internet Archive.

-Lulwah M. Alkwai