2017-07-19: Archives Unleashed 4.0: Web Archive Datathon Trip Report

They: Hey Sawood, nice to see you again.
Me: Hi, I am glad to see you too.
They: Did you attend all hackathons, I mean datathons?
Me: Yes, I have attended all four Archives Unleashed events so far.
They: How did you like it?
Me: Well, there is a reason why I attended all of them, despite being a seemingly busy PhD researcher.
They: So, what is your research about?
Me: I am trying to profile various web archives to build a high-level understanding of their holdings, primarily for the sake of efficiently routing Memento aggregation requests, but there can be many more use cases for such profiles... [and the conversation continues...]

On day zero of Archives Unleashed 4.0 in London, conversations among many familiar and unfamiliar faces started with travel and lodging related questions, but soon evolved into mass storage challenges, scaling issues, quality and coverage of web archives, long-term maintenance of archival tools, documentation and discovery of libraries, and exchange of research ideas. Ian and Matt were looking fresh and welcoming at the reception of #HackArchives, as always. This was all familiar: it is how previous AU events started too, and it yielded great networking among the web archiving community members.

Previously, the Web Science and Digital Libraries Research Group (WSDL) has been well-represented at AU events, but this time visa issues and competing events meant that only Mat and I were able to attend.

The next day, on Monday, June 12, 2017, the main event started at the British Library in the morning with the usual registration process, a welcome kit, and strange, AU-branded, 3D-printed-looking red rubber balls (that no one had any idea what to do with). Dr. Matthew Weber and Dr. Ian Milligan began with the opening remarks and described the scope of the event, the available datasets, and other resources.

Next was the current efforts session, in which Ian, Jefferson, Tom, and Andy were supposed to talk about Warcbase, Internet Archive APIs, National Archives Datasets, and UK Web Archive respectively. Since Jefferson could not make it to the event on time, Ian had to morph into Jefferson for the corresponding talk about IA APIs. All of these talks were very insightful and offered a lot to learn from.

Possibly the most interesting aspect of AU events is the phenomenon of group formation. People and idea stickers flock around the room and naturally cluster into smaller groups with similar interests to come up with a more precise research question and datasets to use. This time, they formed a total of eight different groups with a diverse set of research questions and scopes.

After the lunch break, teams settled at their tables and started worrying about task refinement, computing resources, data acquisition, and action plans. One of the most difficult issues at AU events is the problem of dataset acquisition. Advertised datasets are often not readily accessible. Additionally, these datasets are often too large to be copied over to the respective computing instances in a feasible amount of time. Some preprocessing and sampling can be helpful. Complex (and often unknown) authentication barriers should also be removed from the data acquisition process. On one hand, it is part of the learning process to acquire and understand the data and learn about other tools to create derivative data, but on the other hand, I have consistently noticed that this process is difficult and limits the opportunity for actual data analysis.

Another very useful aspect of AU events is the opportunity for people to share their current projects and efforts in the field of web archiving using short lightning talks. In the past we have taken advantage of it to introduce various WSDL efforts such as MemGator, IPWB, CarbonDate, WhatDidItLookLike, and ICanHazMemento. Following the tradition, this time too there were a handful of lightning talks lined up for both days.
After the first round of five lightning talks, teams went back to their hacking tasks, mostly trying to acquire datasets, understand them, and adjust their ambitious plans to something more feasible within the short time limit. Then everyone left for dinner while discussing ideas and the scope of their work with their team members. The dinner was really good, but it did not stop people from exchanging world-shaking ideas.

The next morning, many teams were talking about how much data they had processed overnight and what to do next. The next couple of hours were critical for every team to come up with something that provided some answers to their proposed research questions. After another session of lightning talks, teams continued to work on their projects, but now they started thinking about the reporting and visualization of their findings as more and more results became apparent. The efforts continued during and after the short lunch break. One could see people multi-tasking to get everything done before the final presentations, which were only a coffee break away, but some people still had the courage to put everything aside for a while and go for a walk outside. Not every team was working on data analysis, but the overall experience was still generalizable. Finally, the time arrived for brief project presentations, to share the findings of the "Samudra Manthan" in front of three esteemed judges from the British Library.
  • Team Portuguese Archive presented the results of their archived image classification using TensorFlow. As a testbed, they used maps, aiming to distinguish contemporary maps from historic ones.
  • Team Intersect (of which I was a member) presented the archival coverage of the Occupy Wall Street movement in various collections and social media, along with the overlap among various datasets. The team found less than 1% overlap among different datasets, which suggests that the more collectors there are, the better the coverage. They also found that two-thirds of the outlinks from these collections were not archived.
  • The Olympians presented gender distribution in Olympic committees and found strong male bias.
  • Team Shipman Report analyzed the text of the Shipman Report and found it deadly and dark.
  • Team Links analyzed WARC files to find trends over time in the distribution of relative paths, absolute paths, and absolute URLs in anchor elements, along with the distribution of HTML elements around anchors.
  • Team Robots analyzed different types of robots.txt files in web archives with the intent of finding the impact on archival captures if robots.txt were honored. They found that the impact would not be huge.
  • Team Curated built a prototype of an upcoming Rhizome tool for better curation and annotation. They illustrated some wireframe prototypes of various components and the workflow.
  • Team WARCs peeked inside WARC files for traces of politics and elections in the US.
While the judges were deciding the winners, Ian wrapped up the event by looking back at the past two days and briefly mentioning the highlights of the event. He gave a vote of thanks to all the individuals and sponsoring organizations who supported the event in various ways, including data and computing resources, venue and logistics, and travel grants. The judges' verdict was in; Team Links, Team Robots, and Team Intersect were found guilty of being the best. Everyone was a winner, but some of them performed more efficiently than others within a very short span of time. I am sure every team had much more to show than what they could in the short five-minute presentation.

Now it was time to disperse and continue exchanging ideas over drinks and dinner while getting ready for the rest of the Web Archiving Week events.

They: So, Sawood, are you planning to continue attending all future AU events?  
Me: I hope so! ;-)

