2016-06-27: Archives Unleashed 2.0 Web Archive Hackathon Trip Report

Members from WSDL who participated in the Hackathon 2.0
Last week, June 13-15, 2016, six members of the Web Science and Digital Library group (WSDL) from Old Dominion University had the opportunity to attend Archives Unleashed 2.0 at the Library of Congress in Washington, DC. This event is a follow-up to Archives Unleashed (Web Archive Hackathon 1.0), held in March 2015 at the University of Toronto Library in Toronto, Ontario, Canada. We (Mat Kelly, Alexander Nwala, John Berlin, Sawood Alam, Shawn Jones, and Mohamed Aturban) met with other participants from various countries and backgrounds: librarians, historians, computer scientists, and others. The main goal of the event is to build tools for web archives and to support an ongoing community with a common vision of how to access and extract data from web collections.

This event was made possible with generous support from the National Science Foundation, the Social Sciences and Humanities Research Council of Canada, the University of Waterloo’s Department of History, the David R. Cheriton School of Computer Science and the University of Waterloo, and the School of Communication and Information at Rutgers University.

The event was organized by Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (assistant professor, Department of History, University of Waterloo), Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo), Nicholas Worby (Government Information & Statistics Librarian, University of Toronto), and Nathalie Casemajor (assistant professor, Department of Social Sciences, University of Québec). Here are some details about different activities over the three days of Hackathon 2.0.

Day 1 (June 13, 2016)

The evening gathering on the first day of Hackathon 2.0 was held at Gelman Library, George Washington University. Its purpose was (1) for participants to briefly introduce themselves and their areas of research, and (2) to form groups to work on different Hackathon projects. To form groups, all participants were encouraged to write a few words on three separate sticky notes describing a general topic they were interested in (e.g., topic modeling, extracting metadata, studying tweets, or analyzing Supreme Court nominations), the kind of dataset they wanted to work on (e.g., collected tweets or datasets from the 2004/2008 elections), and what they wished to accomplish with the selected dataset.

Participants then placed sticky notes with similar ideas together. After that, the initial groups were formed and each group was given a few minutes to introduce its project idea. At the end of the first day, we all went to a restaurant to have dinner and socialize. Here is a list of the groups formed initially on the first day after the brainstorming session; I will explain later in some detail what each group accomplished:

Group Name | Members
Twitter Political Organization | Allie Kosterich, Nich Worby, John Berlin, Laura Wrubel, Gregory Wiedeman
Mojitos | Nathalie Casemajor, Federico Nanni, Alexander Nwala, Sylvain Rocheleau, Jana Hajzlerova, Petra Galuscakova
Museum | Ed Summers, Emily Maemura, Sawood Alam, Jefferson Bailey
I Know What You Hid Last Summer | Mat Kelly, Shawn Walker, Keesha Henderson, Jaimie Murdock, Jessica Ogden, Ramine Tinati, Niko Tsakalakis
The Supremes | Nicholas Taylor, Ian Milligan, Jimmy Lin, Patrick Rourke, Todd Suomela, Andrew Weber
Team Turtle | Mohamed Aturban, Niel Chah, Steve Marti, Imaduddin Amin
Counter-Terrorism | Daniel Kerchner, Emily Gade
Campaign: Origins | Allison Hegel, Debra A. Riley-Huff, Justin Littman, Shawn M. Jones, Kevin Foley, Ericka Menchen-Trevino, Nick Bennett, Louise Keen

Day 2 (June 14, 2016)

Colleen Shogan, from the Library of Congress, declared the Hackathon open. Colleen mentioned that researchers who have questions about politics, history, or any other aspect of cultural memory really need specialists like us to help them access data available in different repositories such as the Internet Archive and the Library of Congress. She emphasized the importance of such events and thanked the people who made this one possible, including the organizers and the steering committee.

Matthew Weber presented the agenda of the day, which included presentations, a brief tour of the Library of Congress, and revising the groups formed the day before. Matthew gave an example from his past dissertation work illustrating how difficult it is to use web archives to answer research questions without building tools. He stated that the aim of this ongoing community is to build a common vision for web archive tool development, to help access and extract data and to uncover important stories from web archives. Finally, Matthew listed several kinds of datasets available for the participants to work on. The 2004, 2008, and 2010 election data and the Supreme Court nominations are examples of such datasets.

Ian Milligan introduced Warcbase (slides), which was developed by a team of five historians, three computer scientists, and a network scholar. Ian showed how slow it is to browse web archives in the traditional way, by entering a URL in the Wayback Machine (keeping in mind that requiring the URL itself limits what you can find in the archives). Warcbase goes beyond that: it can be used to access, extract, manage, and process data from WARC files (e.g., extracting names, locations, plain text, URIs, and other data from WARC files and generating different formats such as network graphs or metadata in JSON). Warcbase supports filtering data based on dates, domain names, languages, etc. In addition, Warcbase is scalable, which means it can run on a laptop, a powerful desktop, or a cluster. Users may run Warcbase through command line tools as well as an interactive web-based interface.

Jefferson Bailey and Vinay Goel from the Internet Archive presented the Archive Research Services Workshop. Jefferson mentioned that the Internet Archive focuses on collecting web resources and providing access to those collections. The Internet Archive does not allow researchers to access its infrastructure for intensive research such as data mining. The Internet Archive has huge web collections, about 13 terabytes, and it collects about a billion URIs a week. Jefferson also indicated that WARC files are huge and difficult to work with. In addition, researchers might request huge collections in WARC format but end up using only a small portion. For those reasons, the Internet Archive is trying to support specific research questions: instead of providing data in WARC format, it gives users access to datasets in derived formats such as CDX, which consists of metadata about the original WARC files. Other formats include the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). Such formats yield much smaller datasets. For example, CDX is only one percent of the size of the WARC files.

Next, Vinay Goel from the Internet Archive continued on the same topic (Archive Research Services Workshop). Vinay gave a quick overview of ArchiveSpark, a tool that can help the community search and filter Internet Archive collections (e.g., filtering by date, MIME type, and HTTP response code). A research paper about ArchiveSpark was accepted and will be presented at JCDL 2016.
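
To make the idea of filtering on index metadata concrete, here is a minimal Python sketch that parses CDX-style lines and filters them by year, MIME type, and HTTP status. It is not ArchiveSpark and does not use its API. It assumes the common 11-field, space-separated CDX layout (URL key, timestamp, original URL, MIME type, status code, digest, redirect, meta tags, size, offset, file name), and the names `CdxRecord`, `parse_cdx_line`, `filter_records`, and the sample lines are all hypothetical, made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class CdxRecord(NamedTuple):
    """The subset of CDX fields this sketch cares about."""
    urlkey: str
    timestamp: str  # 14 digits: YYYYMMDDhhmmss
    original: str
    mimetype: str
    status: str
    filename: str


def parse_cdx_line(line: str) -> Optional[CdxRecord]:
    """Parse one line, assuming the 11-field layout; return None otherwise."""
    parts = line.split()
    if len(parts) != 11:
        # Header lines (" CDX N b a m s k r M S V g") and malformed lines land here.
        return None
    urlkey, timestamp, original, mimetype, status = parts[:5]
    return CdxRecord(urlkey, timestamp, original, mimetype, status, parts[10])


def filter_records(
    lines: Iterable[str],
    year: Optional[str] = None,
    mime_prefix: Optional[str] = None,
    status: Optional[str] = None,
) -> Iterator[CdxRecord]:
    """Yield records matching every criterion that was supplied."""
    for line in lines:
        record = parse_cdx_line(line)
        if record is None:
            continue
        if year is not None and not record.timestamp.startswith(year):
            continue
        if mime_prefix is not None and not record.mimetype.startswith(mime_prefix):
            continue
        if status is not None and record.status != status:
            continue
        yield record


# Hypothetical sample index lines (not real captures).
SAMPLE_LINES = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20040102030405 http://example.org/ text/html 200 AAAA - - 1234 0 sample.warc.gz",
    "org,example)/logo.png 20040102030406 http://example.org/logo.png image/png 200 BBBB - - 2345 1234 sample.warc.gz",
    "org,example)/old 20080506070809 http://example.org/old text/html 404 CCCC - - 345 3579 sample.warc.gz",
]

matches = list(
    filter_records(SAMPLE_LINES, year="2004", mime_prefix="text/html", status="200")
)
for record in matches:
    print(record.timestamp, record.original)
```

Because only the small index is read, a filter like this never has to open the much larger WARC files; the `filename` field tells you which file to fetch once a record of interest is found.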

Abigail Grotke and Andrew Weber introduced Library of Congress Data Resources. Abigail and Andrew work on the web archiving team at the Library of Congress. They indicated that most of the crawling is done at the Internet Archive. The Library of Congress has been collecting web resources for more than 16 years and has made some collections available on the web. These collections are indexed, searchable (though not by full text), and accessible through the Wayback Machine. In addition, the Library of Congress archive supports Memento. Most of the collections cannot be accessed on the web due to copyright issues and permissions policy, so researchers must be physically present at the Library of Congress to access them.

During the coffee break, we had the opportunity to take a short tour of the Library of Congress Jefferson Building, which was the first building in DC with electricity and an elevator in use.

After the coffee break, each group explained briefly their project idea, and what kind of dataset they were going to use. At this time, some participants moved to other groups as they found more interesting ideas.

Over lunch, we listened to five-minute lightning talks. Nicholas Taylor (from Stanford University) introduced WASAPI, Jefferson Bailey (Internet Archive) gave a short talk about Researcher Services, Ericka Menchen-Trevino (American University) presented Web Historian, Nathalie Casemajor, Petra Galuscakova, and Sylvain Rocheleau briefly explained NUAGE, Alexander Nwala (Old Dominion University) introduced the topic Generating Collections for Stories and Event, and finally John Berlin (Old Dominion University) presented Are Wails Electric?

After that, the groups were assigned to different rooms based on the equipment each team needed for its project. Each group met and had about 5 hours to work on its project idea (with a 30-minute coffee break after the first 2 hours). At the end of the second day, around 6 PM, we all came together and a representative of each group gave an update on the team's progress. Then, all participants were invited to dinner.

Day 3 (June 15, 2016)

Most of the last day was reserved for the groups to work intensively on their projects and produce final results. From breakfast at 8:30 AM in the Madison Atrium until the end of the day at 6:30 PM, we worked on our projects, pausing only for the coffee break and lunch. Some participants gave five-minute lightning talks during lunch. The audio in the Madison Atrium was not clear; Justin Littman even stood on a chair to deliver his talk, yet it was still hard to hear. For this reason, I will only briefly mention what those talks were about.

Laura Wrubel, Daniel Kerchner, and Justin Littman from George Washington University presented an introduction to the new Social Feed Manager, a sampling of research projects supported by Social Feed Manager, and the provenance of a tweet (as inspired by web archiving). Sawood Alam from Old Dominion University introduced MemGator – A Memento Aggregator CLI and Server in Go. Jaimie Murdock from Indiana, Polygraphic and Polymathic presented Into Thomas Jefferson’s Mind. Finally, Mat Kelly from Old Dominion University gave a short talk about Exploring Aggregation of Personal, Private, and Institutional Web Archives.

Final presentations

By the end of the day, each group presented the findings of the project that they were working on for the last couple of days:

  • Mojitos (Slides)

  • The team's goal was to detect and track the events discussed by politically opposed media covering Cuba. This was done by processing news data from the state-controlled Cuban newspaper (Granma) and a Florida-based outlet that caters to Cuba (el Nuevo Herald).

  • Campaign: Origins (Slides)

  • Using tweets with #election2016, @realDonaldTrump, and @HillaryClinton, this team searched for narratives using the content of the web pages linked to from these tweets, rather than just the tweets themselves. The tweets were collected on June 14-15. The team's intention is to use the Internet Archive's Save Page Now feature to capture the web pages as they are tweeted so that such a study can be repeated on a larger set of tweets in the future. They produced the following streamgraph.

  • The Supremes (Slides and more details)

  • This group analyzed web archive data, provided in ARC format by the Library of Congress, about the Supreme Court nominations of Justice Alito and Justice Roberts. The dataset is 92 GB and contains 2.2 million pages about both Alito and Roberts. The goal of the team was to explore and analyze the data and to generate further research questions. They used Warcbase to extract datasets from the ARC files. In addition, Warcbase can produce files in a format that can be opened directly in other platforms such as Gephi.

  • I Know What You Hid Last Summer (Slides)

  • The team took Twitter datasets from the UK and Canadian Parliament members, identified the deleted tweets, noted which tweets contained links, checked if those links died after the tweet was deleted, and tried to derive meaning from the deletion. Further visualization was also done.

  • Museum

  • This team tried to analyze CDX files from the Internet Archive's IMLS Museums crawl consisting of over 219 million captures. They also utilized the Museum Universe Data File from IMLS to enrich their findings. They evaluated the proportion of various content-types (such as images or PDFs) that were crawled. They also quantified the term frequencies in the URLs of each content type. Additionally, they demonstrated the domain name distribution in the collection in a hierarchical chart (using tree-map). A part of their analysis is published on GitHub.

  • Counter-Terrorism (Slides)

  • This team collected 383,527 tweets (between 2013 and 2016) from 1,153 accounts of suspected extremists. Approximately 300 people are associated with these accounts. The tweets are in a mix of English and Arabic. The goal is to identify ISIS supporters by running the tweets through an ideology classifier.

  • Team Turtle (Slides)

  • The team used a dataset from the 2004 presidential election provided by the Library of Congress. The dataset was collected on the day before the election, election day, and the day after the election. The goal of this team was to answer questions like (1) if one candidate spends more time talking about issues related to a particular state than the other candidate does, does this lead him to win the state? (2) do candidates give more time to the "swing" states than to others? and (3) what is the most important topic for each state? The dataset was available in ARC format, and Warcbase was used to extract text from those files. After that, the dataset was analyzed using techniques such as the Stanford NER tagger to tag places, people, and organizations, and the LDA model and TF-IDF to identify topics. Finally, the team produced an interactive visualization using D3.js.

  • Twitter Political Organization (Slides)

  • The team created a timeline relating mentions in candidate tweets to donations for the Service Employees International Union (SEIU) on Twitter, as well as a graph of retweets per day of the candidates. They also performed sentiment analysis (Naive Bayes classifier) on the candidates' tweets in an attempt to see whether donation amounts over time correlated with how positively or negatively the candidates tweeted.
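
    Team Turtle's description above mentions TF-IDF for identifying topics. As a rough illustration of what that scoring does, here is a minimal pure-Python sketch. It uses one common variant (term frequency normalized by document length, times the natural log of N over document frequency), which is an assumption and not necessarily what the team used. The function names and the per-state token lists are hypothetical, made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each term in each document with tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores: Dict[str, Dict[str, float]] = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(term_scores: Dict[str, float], k: int = 1) -> List[str]:
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical tokenized text per state (not the team's data).
docs = {
    "ohio": "jobs jobs economy taxes".split(),
    "florida": "medicare seniors economy taxes".split(),
    "iowa": "farms ethanol economy taxes".split(),
}

scores = tf_idf(docs)
for state, term_scores in scores.items():
    print(state, top_terms(term_scores, k=2))
```

    Terms that appear in every document (here "economy" and "taxes") get a score of zero, so the terms that remain at the top are the ones that distinguish one state's text from the others.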

    After all the groups presented their work, Jimmy Lin announced ArchivesUnleashed Inc. It is a Delaware non-profit corporation aiming to create knowledge around the scholarly use of web archives. The board of directors of this new organization includes:
    • Ian Milligan (assistant professor, Department of History, University of Waterloo)
    • Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University)
    • Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo)
    • Nathalie Casemajor (assistant professor, Department of Social Sciences, University of Québec)
    • Nicholas Worby (Government Information & Statistics Librarian, University of Toronto)

    The winning team

    Ian Milligan announced the winning team, Counter-Terrorism (congratulations to Daniel Kerchner and Emily Gade). In addition, the top four teams (Counter-Terrorism, Team Turtle, I Know What You Hid Last Summer, and Mojitos) were selected to present their work at the next day's event, Saving the Web.

    --Mohamed Aturban