2016-03-07: Archives Unleashed Web Archive Hackathon Trip Report (#hackarchives)

The Thomas Fisher Rare Book Library (University of Toronto)
From March 3 to March 5, 2016, librarians, archivists, historians, computer scientists, and others came together for the Archives Unleashed Web Archive Hackathon at the University of Toronto Robarts Library, Toronto, Ontario, Canada. The event gave researchers the opportunity to collaboratively develop open-source tools for web archives. It was organized by Ian Milligan (assistant professor of Canadian and digital history in the Department of History at the University of Waterloo), Nathalie Casemajor (assistant professor in communication studies in the Department of Social Sciences at the University of Québec in Outaouais (Canada)), Jimmy Lin (the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo), Matthew Weber (assistant professor in the School of Communication and Information at Rutgers University), and Nicholas Worby (the Government Information & Statistics Librarian at the University of Toronto’s Robarts Library).

Additionally, the event was made possible by the support of the Social Sciences and Humanities Research Council of Canada, the National Science Foundation, the University of Waterloo, the University of Toronto, Rutgers University, the University of Québec in Outaouais, the Internet Archive, Library and Archives Canada, and Compute Canada. Sawood Alam, Mat Kelly, and I joined researchers from Europe and North America to exchange ideas in an effort to unleash our web archives. The event was split across three days.


Ian Milligan kicked off the presentations by going over the agenda. He then presented his current research effort:

HistoryCrawling with Warcbase (Ian Milligan, Jimmy Lin)

The presenters introduced Warcbase as a platform for exploring the past. Warcbase is an open-source tool for managing web archives, built on Hadoop and HBase. Warcbase was introduced through two case studies and datasets: Canadian Political Parties and Political Interest Groups (2005 - 2015), and GeoCities.

Put Hacks to Work: Archives in Research (Matthew Weber)

Following Ian Milligan's presentation, Matthew Weber emphasized some important ideas to guide the development of tools for web archives, such as considering the audience.

Archive Research Services Workshop (Jefferson Bailey, Vinay Goel)

Following Matthew Weber's presentation, Jefferson Bailey and Vinay Goel presented a comprehensive introductory workshop for researchers, developers, and general users. The workshop addressed data mining and computational tools and methods for working with web archives.

Embedded Metadata as Mobile Micro Archives (Nathalie Casemajor)

Following Jefferson Bailey and Vinay Goel's presentation, Nathalie Casemajor presented her research effort for tracking the evolution of images shared on the web. She talked about how embedded metadata in images helped track dissemination of images shared on the web.

Revitalization of the Web Archiving Program at LAC (Tom Smyth)

Following Nathalie Casemajor's presentation, Tom Smyth of Library and Archives Canada presented their archiving activities, such as domain crawls of Federal sites, curation of thematic research collections, and preservation archiving of resources at risk. He also talked about their recent collections, such as the Federal Election 2015, First World War Commemoration, and Truth and Reconciliation collections.

After the first five short presentations, Jimmy Lin presented a technical tutorial on Warcbase. Helge Holzmann then presented ArchiveSpark, a framework built to make web archives easier for researchers to access, with support for data extraction and derivation.

After a short break, there were five more presentations targeting Web Archiving and Textual Analysis Tools:

Wordfish (Federico Nanni)

Federico Nanni presented Wordfish: an R program used to extract political positions from text documents. Wordfish is a scaling technique that does not need any anchoring documents to perform the analysis; it relies instead on a statistical model of word frequencies.

MemGator (Sawood Alam)

Following Federico Nanni's presentation, Sawood Alam presented a tool he developed called MemGator: a Memento aggregator CLI and server written in Go. Memento is a framework that adds the time dimension to the web. A timestamped copy of the representation of a resource is also called a Memento, and a list of such Mementos is called a TimeMap. MemGator can generate the TimeMap of a given URI or provide the closest Memento to a given time.
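
To make the "closest Memento" idea concrete, here is a minimal Python sketch. It is not MemGator's code (MemGator is written in Go and aggregates TimeMaps from many archives); the TimeMap below is hypothetical and is modeled simply as a list of (datetime, URI) pairs on the example.com and archive.example placeholder domains.

```python
from datetime import datetime

def closest_memento(timemap, target):
    """Return the (datetime, uri) entry whose datetime is nearest to target."""
    return min(timemap, key=lambda memento: abs(memento[0] - target))

# Hypothetical TimeMap for http://example.com/ (illustrative values only).
timemap = [
    (datetime(2005, 6, 1), "https://archive.example/20050601/http://example.com/"),
    (datetime(2010, 1, 15), "https://archive.example/20100115/http://example.com/"),
    (datetime(2015, 11, 30), "https://archive.example/20151130/http://example.com/"),
]

when, uri = closest_memento(timemap, datetime(2011, 1, 1))
print(when.date(), uri)
```

With a target of January 1, 2011, the 2010 capture is selected because it is nearer in time than the 2015 capture.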

Topic Words in Context (Jonathan Armoza)

Following Sawood Alam's presentation, Jonathan Armoza presented a tool he developed, TWIC (Topic Words in Context), by demonstrating LDA topic modeling of Emily Dickinson's poetry. TWIC provides a hierarchical visualization of LDA topic models generated by the MALLET topic modeler.

Twarc (Nick Ruest)

Following Jonathan Armoza's presentation, Nick Ruest presented Twarc: a Python command-line tool and library for archiving tweet JSON data. Twarc runs in three modes: search, filter stream, and hydrate.

Carbon Date (Alexander Nwala)

Following Nick Ruest's presentation, I presented Carbon Date: a tool originally developed by Hany SalahEldeen, which I currently maintain. Carbon Date estimates the creation date of a website by polling multiple sources for datetime evidence. It returns a JSON response that contains the estimated creation date of the website.
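
The general idea of combining datetime evidence from several sources can be sketched in a few lines of Python. This is only an illustration, not the Carbon Date implementation: it assumes the estimate is the earliest datetime any source reports, and the source names and JSON field names are hypothetical.

```python
import json
from datetime import datetime

def estimate_creation_date(evidence):
    """Pick the earliest datetime among the sources that returned one.

    `evidence` maps a source name to a datetime, or to None when the
    source found nothing. The field names in the result are hypothetical.
    """
    found = {source: dt for source, dt in evidence.items() if dt is not None}
    earliest = min(found.values()) if found else None
    return json.dumps({
        "estimated_creation_date": earliest.isoformat() if earliest else "",
        "sources": {source: dt.isoformat() for source, dt in found.items()},
    })

# Hypothetical evidence gathered for one URI.
evidence = {
    "web_archive": datetime(2009, 3, 2),
    "search_engine": datetime(2011, 7, 19),
    "last_modified_header": None,  # this source returned no datetime
}

print(estimate_creation_date(evidence))
```

Sources that return nothing are dropped before the comparison, so a single unresponsive source does not prevent an estimate.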

After the five short presentations on Web Archiving and Textual Analysis Tools, all participants engaged in a brainstorming session in which ideas were discussed and clusters of researchers with common interests were iteratively developed. The brainstorming session led to the formation of seven groups, namely:
  1. I know words and images
  2. Searching, mining, everything
  3. Interplanetary WayBack
  4. Surveillance of First Nations
  5. Nuage
  6. Graph‐X‐Graphics
  7. Tracking Discourse in Social Media

Following the brainstorming and group formation activity, all participants gathered at the Bedford Academy for a reception that went on late into the evening.


The second day of the Archives Unleashed Web Archive Hackathon began with breakfast, after which the groups formed on Day 1 met for about three hours to begin working on the ideas discussed the previous day. At noon, lunch was provided as more presentations took place:

Snowden Archive-in-a-Box (Evan Light)

Evan Light began the series of presentations by talking about a box he created called the Snowden Archive-in-a-Box. The box features a stand-alone wifi network and web server that allow researchers to use the files leaked by Edward Snowden and subsequently published by the media. The box, which serves as a portable archive, protects users from mass surveillance.

Mediacat (Alejandro Paz and Kim Pham)

Following Evan Light's presentation, Alejandro Paz and Kim Pham presented Mediacat: an open-source web crawler and archive application suite that enables ethnographic research into how digital news is disseminated and used across the web.

Data Mining the Canadian Media Public Sphere (Sylvain Rocheleau)

Following Alejandro Paz and Kim Pham's presentation, Sylvain Rocheleau talked about his research efforts to provide near real-time data mining of the Canadian news media. His research involves a mass crawl of about 700 Canadian news websites at 15-minute intervals, along with data mining processes that include Named Entity Recognition.

Tweet Analysis with Warcbase (Jimmy Lin)

Following Sylvain Rocheleau's presentation, Jimmy Lin gave another tutorial in which he showed how to extract information from tweets using the Warcbase platform.

A five-hour hackathon session followed, briefly suspended for a visit to the Thomas Fisher Rare Book Library.
After the visit, the hackathon session continued until the evening, after which all participants went for dinner at the University of Toronto Faculty Club.


The third and final day of the Archives Unleashed Web Archive Hackathon began much like the second: first breakfast, then a three-hour hackathon session, then presentations over lunch:

Malach Collection (Petra Galuscakova)

Petra Galuscakova started the series of presentations by talking about the Czech Malach Cross-lingual Speech Retrieval Test Collection: a collection of multimedia about the testimonies of survivors and other witnesses of the Holocaust.

Waku (Kyle Parry)

Digital Arts and Humanities Initiatives at UH Mānoa (or how to do interesting things with few resources) (Richard Rath)

After the presentations, the hackathon session continued until 4:30 pm EST, after which the group presentations began:


I know words and images (Kyle Parry, Niel Chah, Emily Maemura, and Kim Pham)

Inspired by John Oliver's #MakeDonaldDrumpfAgain, this team sought to research memes by processing words and images. They investigated what people say, how they use and modify the text and images of others, and how computers read text and classify images, etc.

Searching, mining, everything (Jaspreet Singh, Helge Holzmann, and Vinay Goel)

Interplanetary WayBack (Sawood Alam and Mat Kelly)

"Who will archive the archives?"

To answer this question, Sawood Alam and Mat Kelly presented an archiving and replay system called Interplanetary Wayback (ipwb). In a nutshell, during the indexing process ipwb consumes WARC files one record at a time, splits each record into headers and payload, pushes the two pieces into the IPFS (a peer‐to‐peer file system) network for persistent storage, and stores the references (digests) in a file format called CDXJ along with some other lookup keys and metadata. For replay, it finds the records in the index file and builds the response by assembling the headers and payload retrieved from the IPFS network and performing the necessary rewrites. The major benefits of this system include deduplication, redundancy, and shared open access.
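
The indexing and replay flow described above can be sketched in Python. This is a simplified illustration rather than ipwb's code: an in-memory dictionary keyed by SHA-256 digest stands in for the IPFS network, a raw HTTP response stands in for a WARC record, the index line only loosely follows the CDXJ layout, the field names are hypothetical, and the rewriting step is left out.

```python
import hashlib
import json

class ContentStore:
    """In-memory stand-in for IPFS: blobs are addressed by their digest."""

    def __init__(self):
        self.blobs = {}

    def add(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data  # identical content maps to the same key
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

def index_record(store, uri, timestamp, http_response):
    """Split a response into headers and payload, store both, return an index line."""
    headers, _, payload = http_response.partition(b"\r\n\r\n")
    refs = {
        "header_digest": store.add(headers),    # hypothetical field name
        "payload_digest": store.add(payload),   # hypothetical field name
    }
    return f"{uri} {timestamp} {json.dumps(refs)}"

def replay(store, index_line):
    """Rebuild the original response from the digests in an index line."""
    _, _, refs_json = index_line.split(" ", 2)
    refs = json.loads(refs_json)
    return store.get(refs["header_digest"]) + b"\r\n\r\n" + store.get(refs["payload_digest"])

store = ContentStore()
body = b"<html>hello</html>"
first = b"HTTP/1.1 200 OK\r\nDate: Thu, 03 Mar 2016 10:00:00 GMT\r\n\r\n" + body
second = b"HTTP/1.1 200 OK\r\nDate: Fri, 04 Mar 2016 10:00:00 GMT\r\n\r\n" + body

line_one = index_record(store, "http://example.com/", "20160303100000", first)
line_two = index_record(store, "http://example.com/", "20160304100000", second)

print(replay(store, line_one) == first)  # the stored pieces reassemble the response
print(len(store.blobs))                  # two header blobs plus one shared payload blob
```

Storing headers and payload separately is what makes deduplication possible here: the two captures differ only in their Date header, so the identical payload is stored once.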

Surveillance of First Nations (Evan Light, Katherine Cook, Todd Suomela, and Richard Rath)

Nuage (Petra Galuscakova, Neha Gupta, Rosa Iris R. Rovira, Nathalie Casemajor, Sylvain Rocheleau, Ryan Deschamps, and Ruqin Ren)

Graph‐X‐Graphics (Jeremy Wiebe, Eric Oosenbrug, and Shane Martin)

Tracking Discourse in Social Media (Tom Smyth, Allison Hegel, Alexander Nwala, Patrick Egan, Nick Ruest, Yu Xu, Kelsey Utne, Jonathan Armoza, and Federico Nanni)

This team processed ~11.2 million tweets and ~50 million Reddit comments that referenced the Charlie Hebdo and Bataclan attacks, in an effort to track the evolution of social media commentary about the attacks. The team sought to measure the attention span, the flow of information and misinformation, and the co-occurrence network of terms in order to understand the dynamics of commentary about these events.
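
As an illustration of what a co-occurrence network of terms involves, the Python sketch below counts how often pairs of terms appear in the same message. It is not the team's pipeline: the whitespace tokenization is deliberately naive and the three messages are made up for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how many documents each unordered pair of terms shares."""
    counts = Counter()
    for text in documents:
        # Naive tokenization: lowercase, split on whitespace, drop repeats.
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    return counts

# Made-up messages, for illustration only.
messages = [
    "paris attack news",
    "paris news",
    "attack rumor",
]

counts = cooccurrence_counts(messages)
for (term_a, term_b), weight in counts.most_common(3):
    print(term_a, term_b, weight)
```

Each counted pair is an edge in the network and its count is the edge weight, so the result can be loaded into a graph tool as a weighted edge list.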

The votes were tallied, and the Nuage team, having received the most votes, was declared the winner. The event concluded after some closing remarks.

-- Nwala (@acnwala)