2022-07-27: Web Archiving and Digital Libraries (WADL) Workshop 2022 Trip Report

The Web Archiving and Digital Libraries workshop (WADL), in conjunction with JCDL 2022, was held virtually on June 24, 2022. The workshop was organized by Drs. Martin Klein, Mat Kelly, Zhiwu Xie, and Edward A. Fox. The Web Science and Digital Libraries Research Group (@WebSciDL) from Old Dominion University had multiple presentations from different group members. Dr. Klein inaugurated the opening session by welcoming everyone and introducing the schedule for the day. This was followed by introductions from the Chairs and the attendees.


Talks 1

Karolina Holub from the National and University Library in Zagreb (NSK) started WADL with her invited talk on "A History of Web Archiving at the National and University Library in Zagreb". She talked about the Croatian Web Archive (HAW), developed by NSK in collaboration with the Zagreb University Computing Centre (SRCE) in 2004. She described the various approaches NSK employs to gather and preserve the Croatian web. The talk also covered the chronology of their work and how they moved from selective crawls to crawls of the national domain (.hr), thematic crawls, and event crawls. She also touched on their next goals: archiving Twitter, increasing local crawling by collaborating with other libraries, and adding full-text search. She also mentioned that they would be integrating SolrWayback, an open-source web application for searching and viewing ARC/WARC files.


Talks 2 

Yousef Younes from the GESIS Leibniz Institute for the Social Sciences gave a talk on "Where Are the Datasets? A case study on the German Academic Web Archive". Their case study addressed the research question, “How to find references to research datasets using web archives?” They looked at various identifiers, such as DOI, URL, and title, to find datasets. They also investigated how the volume of referenced datasets changed over time.
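To make the identifier-based approach concrete, here is a minimal sketch of how DOI mentions could be pulled out of archived page text. This is not the authors' method: the regular expression is a commonly used approximation of modern DOI syntax, and the function name and the DOI in the usage note are hypothetical.

```python
import re

# Approximate pattern for modern DOIs ("10.", a 4-9 digit registrant code,
# a slash, then a suffix). An assumption for illustration; it will miss
# some unusual DOIs.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text):
    """Return the distinct DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.findall(text):
        # Sentence punctuation often trails a DOI in running text.
        doi = match.rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found
```

For example, calling find_dois on a page that links to https://doi.org/10.1234/abcd.5678 and later mentions doi:10.1234/abcd.5678 (a made-up DOI) would report that DOI once. Matching by URL or title would need different, fuzzier logic.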

Himarsha Jayanetta from WS-DL, Old Dominion University, presented our work on "Comparison of Access Patterns of Robots and Humans in Web Archives." This work extends Dr. Yasmin AlNoamany's research. We analyzed anonymized web access logs from the Internet Archive (IA) and Arquivo.pt to detect how bots and humans access web archive holdings. We used various heuristics to identify robot sessions. We found that 88% of sessions were robots in the IA 2012 dataset, 70% in the IA 2019 dataset, and 97% in the Arquivo.pt 2019 dataset. This work has been accepted for publication at the 26th International Conference on Theory and Practice of Digital Libraries 2022.

Dr. Sawood Alam from the Internet Archive, who is also a WS-DL alumnus, presented "Wayback Machine Video Archiving Insights." He showed a dashboard demo that provides insights on videos archived by the Internet Archive. The dashboard displays the number of videos archived, the duration of videos, and the longest video archived on a particular day. It also shows a word cloud of the top tags associated with the videos, the top languages, an analysis of video durations, and the top 100 uploaders.

I, Kritika Garg from WS-DL, Old Dominion University, presented our work on “Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives.” We demonstrated how replaying an archived web page with carousels, widgets, etc., can generate wasteful requests. For instance, we showed a memento making 1098 requests per minute. We created a minimal reproducible example to show how missing embedded resources cause recurring requests to the web archive server. We demonstrated that we could mitigate the unnecessary requests by sending 404 responses with a Cache-Control header.
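As a rough illustration of the mitigation, the sketch below is a toy server that answers every request with a cacheable 404, so a browser can reuse the response instead of asking again. It is a minimal sketch and not the code from the talk: the handler, the helper names, and the one-hour max-age are all assumptions chosen for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical cache lifetime in seconds; not a value from the talk.
NOT_FOUND_MAX_AGE = 3600


def not_found_headers(max_age=NOT_FOUND_MAX_AGE):
    """Headers that let a client cache a 404 instead of re-requesting it."""
    return {
        "Cache-Control": f"public, max-age={max_age}",
        "Content-Length": "0",
    }


class ReplayHandler(BaseHTTPRequestHandler):
    """Toy stand-in for a replay server that holds no captures at all."""

    def do_GET(self):
        # Every resource is "missing" here, so always answer 404,
        # but mark the response as cacheable.
        self.send_response(404)
        for name, value in not_found_headers().items():
            self.send_header(name, value)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), ReplayHandler)
```

Calling make_server().serve_forever() starts the toy server. A real replay system would return 404 only for resources that are actually missing and would pick a cache lifetime that suits the archive.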

Talks 3

Marcel Tschöpe and Rafael Gieschke from the University of Freiburg presented their work on “Emulation-based long-term Access to Complex Websites.” They talked about using Emulation as a Service (EaaS) and its emulated network infrastructure to preserve web servers.

Travis Reid from WS-DL, Old Dominion University, presented “Web Archiving as Entertainment”. His work integrates gaming and web archiving by presenting archiving as a live stream that users can enjoy as a spectator sport. He uses the game configuration for an automated gaming live stream. He created demos of this using two video games: Gun Mayhem 2 and NFL Challenge.

WS-DL alumnus Dr. Mat Kelly from Drexel University presented the last talk of the session, titled "First steps in Identifying Academic Migration using Memento and Quasi-Canonicalization". They used web archive information to associate URI-Rs, identify faculty members, and determine whether they had moved between departments or universities.

Talks 4

Carrie Pirmann from Bucknell University and Erica Peaslee from Centurion Solutions LLC presented the invited talk on “Building a Community of Web Archivers: The Race to Save Ukrainian Cultural Heritage Online.” The SUCHO project was started in response to the Russia-Ukraine crisis that began on 24 February 2022. They are working with 1300+ cultural heritage professionals worldwide to preserve Ukrainian cultural heritage. The goal of SUCHO is to identify and archive at-risk Ukrainian cultural websites. SUCHO uses tools like Slack, Google Drive, the Webrecorder suite, and Browsertrix Crawler, which serves as the core crawling system. The more complex websites are either scraped manually or crawled using Webrecorder tools. SUCHO also includes sub-projects such as archiving memes on Facebook, Twitter, etc. They are also preserving 3-D tours of cultural museums that were previously not well captured. SUCHO has digitally preserved 40TB+ of websites, databases, and other digitized cultural property. SUCHO was also featured on PBS NewsHour. The complexity and scale of this project and its accomplishments are truly inspiring!

Talks 5

Dr. Sawood Alam presented the talk on “CDX Summary for Web Archival Collection Insights”. He demonstrated a CDX summary tool he developed to summarize web archive capture index (CDX) files. The tool allows you to see the distribution of MIME types and HTTP status codes in a WARC collection. It also shows an overview of path and query segments, hostnames, and temporal spread. He also built a Web Component and an interactive test interface for the tool.
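To give a feel for the kind of summary described, here is a minimal sketch that counts MIME types and HTTP status codes in CDX lines. It is not the CDX Summary tool's implementation: it assumes the common 11-column, space-separated CDX layout, and the function name and sample records are made up for illustration.

```python
from collections import Counter

# Assumed column order: the common 11-field, space-separated CDX layout.
FIELDS = (
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "length", "offset", "filename",
)


def summarize_cdx(lines):
    """Count MIME types and HTTP status codes across CDX records."""
    mime_types = Counter()
    status_codes = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the " CDX N b a m s ..." header line.
        if not line or line.startswith("CDX"):
            continue
        parts = line.split(" ")
        if len(parts) != len(FIELDS):
            continue  # not in the layout this sketch assumes
        record = dict(zip(FIELDS, parts))
        mime_types[record["mimetype"]] += 1
        status_codes[record["statuscode"]] += 1
    return mime_types, status_codes


# Made-up sample records, for illustration only.
SAMPLE = """\
 CDX N b a m s k r M S V g
com,example)/ 20220101000000 http://example.com/ text/html 200 AAAA - - 1234 0 sample.warc.gz
com,example)/logo.png 20220101000005 http://example.com/logo.png image/png 200 BBBB - - 512 1234 sample.warc.gz
com,example)/old 20220102000000 http://example.com/old text/html 404 CCCC - - 300 1746 sample.warc.gz
"""

if __name__ == "__main__":
    mimes, statuses = summarize_cdx(SAMPLE.splitlines())
    print(mimes.most_common())
    print(statuses.most_common())
```

The same loop could be extended to tally hostnames or capture years from the urlkey and timestamp fields, which is closer to the broader overview the tool provides.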

Grant Atkins from MITRE, a WS-DL alumnus, presented a talk on "Russia-Ukraine News on the Dark Web". He showed how many news sites were accessible through the dark web after the Russian invasion of Ukraine. He compared the surface sites with their dark counterparts (“.onion” domains that are reachable via the Tor network). He explained that many news sites usually try to mirror their content verbatim. Still, there are differences between the surface and onion sites, such as custom CSS, removal of JS, and different HTTP headers. He showed that the dark sites are not well archived. He stated that since the surface and dark sites are not exact mirrors, the dark web content should be archived to preserve the differences.

Emily Escamilla from WS-DL, Old Dominion University, presented her work on “Archiving Source Code in Scholarly Content: One in Five Articles References GitHub”. She talked about the increasing prevalence of references to scholarly source code on Git Hosting Platforms (GHPs). She emphasized the need to preserve the ephemera along with the software product. She found that one in five articles in arXiv references GitHub, which shows the importance of archiving GHPs. She has described this work in detail in a WS-DL blog post. Emily's work has been accepted for publication at the 26th International Conference on Theory and Practice of Digital Libraries 2022.

Talks 6

Helge Holzmann and Nick Ruest from Archives Unleashed presented "Arch-It!". They demonstrated ARCH (Archives Research Compute Hub), the latest product integrated closely with Archive-It. ARCH is built on top of the Archives Unleashed Toolkit (AUT). It provides Archive-It subscriber archivists with a number of datasets for analysis. These include information about the collections such as domain frequencies, WAT files, the plain text of web pages, domain/image/longitudinal/web graphs, etc. They also provided an example Jupyter Notebook showing how to use ARCH.

Ilya Kreymer from Webrecorder presented the Web Archive Collection Zipped (WACZ) format that allows web archives to be shared and distributed. Ilya shared the infrastructure and new capabilities of WACZ. The key points of the talk were that high-fidelity web archives should be easy to use, easy to host, and low maintenance.

Mark Phillips presented the last talk of WADL, titled "Moving the End of Term Web Archive to the Cloud to Encourage Research Use and Reuse." He described the End of Term (EOT) datasets and how the effort was inspired by Common Crawl. He explained the decisions and steps taken to move the data to Amazon S3 and to replay the archived content from the cloud by configuring pywb. Their goal is to provide greater access to the EOT data in order to encourage reuse and research.

Closing Session

The final event of WADL was a closing discussion to wrap up the workshop. All attendees participated and gave comments on the work presented, possible publications, collaboration opportunities, and recommendations for future workshops. It was a great experience to participate in WADL 2022. It was exciting to present my research and to witness, as an attendee, the great research going on in the web archiving and digital libraries field.


-- Kritika Garg (@kritika_garg)