2019-06-20: Web Archiving and Digital Libraries Workshop 2019 Trip Report

A subset of JCDL 2019 attendees assembled on June 6, 2019, at the Illini Union for the Web Archiving and Digital Libraries workshop (WADL 2019). As in previous years, the workshop was organized by Dr. Martin Klein, Dr. Zhiwu Xie, and Dr. Edward A. Fox. Martin opened the session by welcoming everyone and introducing the schedule for the day. He observed that WADL 2019 had equal representation of men and women, among both attendees and presenters. The Web Science and Digital Libraries Research Group (WS-DL) from Old Dominion University was represented by Dr. Michele C. Weigle, Alexander C. Nwala, and Sawood Alam (me), with two accepted talks.

Cathy Marshall from Texas A&M University presented her keynote talk entitled "In The Reading Room: what we can learn about web archiving from historical research". She told many fascinating stories and described the process she went through to collect the bits and pieces of those stories. Her talk shed light on many problems similar to those we see in web archiving. It reminded me of her presentation at the IIPC General Assembly 2015 entitled "Should we archive Facebook? Why the users are wrong and the NSA is right".

Corinna Breitinger from the University of Konstanz (now at the University of Wuppertal) presented her team's work entitled "Securing the integrity of time series data in open science projects using blockchain-based trusted timestamping". She discussed a service called OriginStamp that lets people create a tamper-proof record showing that they held some digital data at a given time, by creating a record in a blockchain. She mentioned the Blockchain_Pi project, which allows connecting a Raspberry Pi to a blockchain for timestamping various sensor data. A remarkable achievement of their project was being cited in a German Supreme Court ruling on a recording from a dashcam that was configured to trigger a timestamping call on a short clip whenever something unusual happened on the road.
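
To make the idea of trusted timestamping concrete, here is a minimal sketch of the hashing step that such services typically build on. It is not OriginStamp's API, which the talk summary does not describe: the function names `fingerprint`, `make_record`, and `verify` are hypothetical, SHA-256 is an assumed choice of digest, and the step of anchoring the digest in a blockchain is deliberately left out. Only the digest, not the data itself, would need to be published.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string (assumed digest choice)."""
    return hashlib.sha256(data).hexdigest()


def make_record(data: bytes) -> dict:
    """Build a local record of the digest and the time it was computed.

    A real service would submit the digest to a blockchain at this point;
    that step is not shown because its interface is not described in the source.
    """
    return {
        "sha256": fingerprint(data),
        "observed_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(data: bytes, record: dict) -> bool:
    """Check that the data still matches the digest stored in the record."""
    return fingerprint(data) == record["sha256"]


if __name__ == "__main__":
    clip = b"sensor reading or short video clip"
    record = make_record(clip)
    print(verify(clip, record))                 # unchanged data matches
    print(verify(clip + b" edited", record))    # any modification breaks the match
```

Because any change to the data changes the digest, a digest that is provably older than a dispute is evidence that the data existed in that exact form at that time.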

I, Sawood Alam, presented "Impact of HTTP Cookie Violations in Web Archives". This was a summary of two of our blog posts, "Cookies Are Why Your Archived Twitter Page Is Not in English" and "Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages", in which we performed a detailed investigation of two HTTP cookie-related issues in web archives. We found that long-lasting cookies in web archives have undesired consequences in both crawling and replay.
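
One way a cookie can leak between archived pages is through its scope. The sketch below is only an illustration under stated assumptions, not the analysis from the blog posts: the host `archive.example`, the URL paths, and the `lang` cookie value are made up, and `Cookie`, `path_matches`, and `cookies_for` are hypothetical helpers. The path check follows the path-match rule of RFC 6265.

```python
from dataclasses import dataclass


@dataclass
class Cookie:
    name: str
    value: str
    domain: str
    path: str


def path_matches(request_path: str, cookie_path: str) -> bool:
    """Path-match rule from RFC 6265, section 5.1.4."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # A cookie path ending in "/" covers everything below it.
        if cookie_path.endswith("/"):
            return True
        # Otherwise the next character of the request path must be "/".
        if request_path[len(cookie_path)] == "/":
            return True
    return False


def cookies_for(jar: list, host: str, path: str) -> list:
    """Return the cookies a client would send for a request to host + path."""
    return [c for c in jar if c.domain == host and path_matches(path, c.path)]


if __name__ == "__main__":
    # Hypothetical cookie set while replaying one archived page, scoped to "/".
    broad = Cookie("lang", "ur", "archive.example", "/")
    # The same cookie, scoped only to the page that set it.
    narrow = Cookie("lang", "ur", "archive.example", "/web/2018/https://twitter.com/a")

    other_page = "/web/2019/https://twitter.com/b"
    print(cookies_for([broad], "archive.example", other_page))   # cookie is sent
    print(cookies_for([narrow], "archive.example", other_page))  # nothing is sent
```

In this toy setup, every archived page shares the archive's host, so a broadly scoped cookie set by one replayed page accompanies requests for unrelated ones.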

Ed Fox from Virginia Tech presented his team's work entitled, "Users, User Roles, and Topics in School Shooting Collections of Tweets". They attempted to identify patterns in user engagement on Twitter regarding school shootings. They also created a tool called TwiRole (source code) that classifies a Twitter handle as "Male", "Female", or a "Brand" using multiple techniques.

Ian Milligan from the University of Waterloo presented his talk entitled "From Archiving to Analysis: Current Trends and Future Developments in Web Archive Use". He emphasized that historians of the future who write the history of the post-1996 era will need to understand the Web. Web archives will play a big role in writing the history of today. It is therefore important that they have tools beyond the Wayback Machine for interacting with web archives and understanding their holdings. He mentioned the Archives Unleashed Cloud as a step in that direction.

Jasmine Mulliken from the Stanford University Press (SUP) presented her talk entitled "Web Archive as Scholarly Communication". She described various SUP projects and related stories. She spent a fair amount of time describing the use of Webrecorder at SUP in projects like Enchanting the Desert. She also explained that SUP is in peril and mentioned the Save SUP site, which documents the timeline of recent events threatening the existence of SUP. While talking about this, she played a clip from the finale of Game of Thrones in which the dragon burns the Iron Throne.

Brenda Reyes Ayala from the University of Alberta presented her talk entitled "Using Image Similarity Metrics to Measure Visual Quality in Web Archives" (slides). Automated quality assurance of archival collections is a topic of interest for many IIPC members. Brenda shared her team's initial findings from using image similarity between captures with and without archival banners. She concluded that their results showed significant success in identifying poor- and high-quality captures, but that much more needs to be done to improve quality assurance automation.
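
As a rough illustration of what an image similarity score looks like, the sketch below compares two equally sized grayscale images using mean squared error. This is a generic metric chosen for simplicity, not necessarily one used in Brenda's study, and the function names `mse` and `similarity` are hypothetical. Images are represented as nested lists of pixel intensities so the example needs no external libraries.

```python
def mse(a, b):
    """Mean squared error between two grayscale images given as lists of rows."""
    if not a or not a[0]:
        raise ValueError("images must not be empty")
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def similarity(a, b, max_value=255):
    """Map MSE to a score in [0, 1], where 1.0 means the images are identical."""
    return 1.0 - mse(a, b) / (max_value ** 2)


if __name__ == "__main__":
    reference = [[0, 0], [255, 255]]
    capture = [[0, 0], [255, 0]]  # one of four pixels differs completely
    print(similarity(reference, reference))
    print(similarity(reference, capture))
```

A low score between a screenshot of the live page and a screenshot of its archived capture would flag the capture for closer inspection. In practice, perceptual metrics are often preferred over raw pixel differences.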

Sergej Wildemann from the L3S Research Center presented his talk entitled "A Collaborative Named Entity Focused URI Collection to Explore Web Archives". He started his talk by noting that the temporal aspect of named entities is often neglected when indexing the live web. Temporal changes associated with an entity become more important when exploring an archival collection related to that entity. He mentioned the beta version of a new Wayback Machine prototype that the Internet Archive released in 2016, which provided text search over an index built from the anchor text of links pointing to sites. Towards the end of his talk he showcased his tool, Tempurion, which supports searching for archived named entities with a temporal dimension attached, so that search results can be filtered by date range.

I, Sawood Alam, presented my second talk (and the last talk of the day) entitled, "MementoMap: An Archive Profile Dissemination Framework". This talk was primarily based on our JCDL submission that was nominated for the best paper award, but in the WADL presentation we focused more on technical details, use cases, and possible extensions, instead of experimental results. We also talked about the Universal Key Value Store (UKVS) format with some examples.

Once all the formal presentations were over, we started to discuss post-workshop plans. The first matter was making the proceedings available online or as a special issue of a journal. In previous years (except last year) WADL proceedings were published in the IEEE-TCDL Bulletin, which has now been discontinued. Potential fallback options include: 1) compiling all submissions with an introduction and publishing them as a single document on arXiv, although citing individual works would then be an issue, 2) publishing on OSF Preprints, and 3) using GitHub Pages, with the added advantage of providing supplementary material such as slides. To enable more effective communication, a proposal was made to create a mailing list (e.g., using Google Groups) for the WADL community. It was proposed that posters should not be included in the call for papers, because the number of submissions is usually small enough to give everyone a presentation slot. Fun fact: only Corinna brought a poster this time. We discussed the possibility of holding more than one WADL event per year, which may or may not be associated with a conference. Since the next JCDL will be in China, people had some interest in having an independent WADL workshop in the US. Finally, we discussed the possibility of adding a day ahead of JCDL for a hackathon and a day after it for the workshop, where hackathon results could be discussed in addition to the usual talks.

It was indeed a fun week of #JCDL2019 and #WADL2019, in which we got to meet many familiar and new faces and explored the spacious campus of the University of Illinois. You may also want to read our detailed trip report of JCDL 2019. We would like to thank the organizers and sponsors of both JCDL and WADL for making it happen. We would like to extend special thanks to Dr. Stephen Downie, without whom this event would not have been as organized and fun as it was. We would also like to thank NSF, AMF, and ODU for funding our travel expenses. Last but not least, I would personally like to thank the "WADL DongleNet", which made it possible for me to connect my laptop to the projector twice.

Sawood Alam

