2022-07-27: Web Archiving and Digital Libraries (WADL) Workshop 2022 Trip Report

The Web Archiving and Digital Libraries (WADL) workshop, held in conjunction with JCDL 2022, took place virtually on June 24, 2022. The workshop was organized by Drs. Martin Klein, Mat Kelly, Zhiwu Xie, and Edward A. Fox. The Web Science and Digital Libraries Research Group (@WebSciDL) from Old Dominion University had multiple presentations from different group members. Dr. Klein opened the workshop by welcoming everyone and introducing the schedule for the day. This was followed by introductions from the chairs and the attendees.

#JCDL2022 the "Web Archiving and Digital Libraries (WADL) 2022" has begun!!

Program Schedule: https://t.co/2j4ukN0M22 @jcdl2022 pic.twitter.com/XIfPkD9944
— kritika garg (@kritika_garg) June 24, 2022

Talks 1

Karolina Holub from the National and University Library in Zagreb (NSK) started WADL with her invited talk on "A History of Web Archiving at the National and University Library in Zagreb". She talked about the Croatian Web Archive (HAW), developed by NSK in collaboration with the Zagreb University Computing Centre (SRCE) in 2004. She described the various approaches NSK employs to gather and preserve the Croatian web, along with the chronology of their working process and how they moved from selective crawls to crawls of the national domain (.hr) as well as thematic and event crawls. She also touched upon their next targets: archiving Twitter, increasing local crawling by collaborating with other libraries, and adding full-text search. Finally, she mentioned they would be integrating SolrWayback, an open-source web application for searching and viewing ARC/WARC files.

I was pleased to talk today about the Croatian Web Archive #HAW_NSK at the #WADL2022 #JCDL2022
Many thanks to all co-chairs for an invite! @mart1nkle1n @jcdl2022 @NSK_Zagreb #webarchiving
More about HAW: https://t.co/Na4vhnHq2s
— Karolina Holub (@KarolHolu) June 24, 2022

Talks 2

Yousef Younes from the GESIS Leibniz Institute for the Social Sciences gave a talk on "Where Are the Datasets? A case study on the German Academic Web Archive". Their case study addressed the research question, “How to find references to research datasets using web archives?” They looked at various identifiers, such as DOI, URL, and title, to find datasets. They also investigated changes in the volume of referenced datasets over time.

The Talks 2 session in the #WADL2022 workshop at @jcdl2022 has started with Yousef Younes's talk on "Where are the Datasets? A case study on the German Academic Web Archive"#JCDL2022 pic.twitter.com/XeEoYWXjNa
— Yasasi (@Yasasi_Abey) June 24, 2022

Himarsha Jayanetta from WS-DL, Old Dominion University, presented our work on "Comparison of Access Patterns of Robots and Humans in Web Archives." This work extends Dr. Yasmin AlNoamany's research. In our work, we analyzed anonymized web access logs from the Internet Archive (IA) and Arquivo.pt to detect how bots and humans access web archive holdings. We used various heuristics to identify robot sessions. We found that 88% of sessions were robots in the IA 2012 dataset, 70% in the IA 2019 dataset, and 97% in the Arquivo.pt 2019 dataset. This work has been accepted for publication at the 26th International Conference on Theory and Practice of Digital Libraries 2022.
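To make the idea of heuristic robot detection concrete, here is a minimal sketch in Python. It is not the method from the talk: the two signals used here (a bot-like keyword in the user agent and a request for /robots.txt) are assumptions chosen for illustration, and the names BOT_UA_PATTERN, looks_like_robot, and robot_share are hypothetical.

```python
import re

# Hypothetical keyword pattern; a real study would use a curated list.
BOT_UA_PATTERN = re.compile(
    r"bot|crawler|spider|curl|wget|python-requests", re.IGNORECASE
)


def looks_like_robot(session):
    """Return True if a session trips any of the illustrative heuristics.

    `session` is a list of (user_agent, path) tuples for one visitor.
    """
    for user_agent, path in session:
        # Heuristic 1 (assumed): the user agent announces itself as a bot.
        if BOT_UA_PATTERN.search(user_agent):
            return True
        # Heuristic 2 (assumed): humans rarely request robots.txt.
        if path == "/robots.txt":
            return True
    return False


def robot_share(sessions):
    """Fraction of sessions flagged as robots (0.0 for empty input)."""
    if not sessions:
        return 0.0
    flagged = sum(1 for session in sessions if looks_like_robot(session))
    return flagged / len(sessions)
```

Given a list of sessions, robot_share(sessions) returns the kind of per-dataset percentage reported above, although the real analysis would rely on a richer set of signals.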

#WADL2022 @HimarshaJ from @WebSciDL is presenting "Comparison of Access Patterns of Robots and Humans in Web Archives" work by her and @kritika_garg @ibnesayeed @phonedude_mln @weiglemc

Robots behave differently than humans and we can detect this. pic.twitter.com/XdBWALcaqw
— Shawn M. Jones, PhD (@shawnmjones) June 24, 2022

Dr. Sawood Alam from the Internet Archive, who is also a WS-DL alumnus, presented on "Wayback Machine Video Archiving Insights." He showed a dashboard demo that provides insights on videos archived by the Internet Archive. The dashboard displays the number of videos archived, the duration of videos, and the longest video archived on a particular day. It also shows a word cloud of the top tags associated with the videos, the top languages, an analysis of video durations, and the top 100 uploaders.

@ibnesayeed from @internetarchive (also @WebSciDL alum) presented "Wayback Machine Video Archiving Insights" work by him, Bill O'Connor, @MarkGraham
highlighting their Video Archiving Insights dashboard that provides visibility into what has been archived. #WADL2022 pic.twitter.com/PPoIoEO4pM
— Shawn M. Jones, PhD (@shawnmjones) June 24, 2022

I, Kritika Garg from WS-DL, Old Dominion University, presented our work on “Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives.” We demonstrated how replaying an archived web page with carousels, widgets, etc., can generate wasteful requests. For instance, we showed a memento making 1098 requests per minute. We created a minimal reproducible example to show how missing embedded resources cause recurring requests to the web archive server. We demonstrated that we could mitigate the unnecessary requests by sending 404 responses with a Cache-Control header.
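The mitigation can be sketched with Python's standard http.server module. This is only an illustration of the idea of a cacheable 404, not the configuration of any real web archive: the one-hour max-age, the handler name CachedNotFoundHandler, and the helper not_found_headers are all assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed cache lifetime in seconds; no specific value is given above.
NOT_FOUND_MAX_AGE = 3600


def not_found_headers(max_age=NOT_FOUND_MAX_AGE):
    """Build the headers for a cacheable, empty 404 response."""
    return {
        "Cache-Control": f"public, max-age={max_age}",
        "Content-Length": "0",
    }


class CachedNotFoundHandler(BaseHTTPRequestHandler):
    """Toy replay server with no captures, so every GET is a 404."""

    def do_GET(self):
        self.send_response(404)
        for name, value in not_found_headers().items():
            self.send_header(name, value)
        self.end_headers()


if __name__ == "__main__":
    # Serve on localhost:8000 until interrupted.
    HTTPServer(("localhost", 8000), CachedNotFoundHandler).serve_forever()
```

With the Cache-Control header present, a browser that keeps asking for the same missing resource can answer repeat requests from its own cache for the stated lifetime instead of contacting the archive each time.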

#WADL2022 @kritika_garg from @WebSciDL presented "Optimizing Archival Replay by Eliminating Unnecessary Traffic to Web Archives" by her, @HimarshaJ @ibnesayeed @phonedude_mln @weiglemc

Improving web archive access by improving performance for end users. pic.twitter.com/fJULgEuBrs
— Shawn M. Jones, PhD (@shawnmjones) June 24, 2022

Talks 3

Marcel Tschöpe and Rafael Gieschke from the University of Freiburg presented their work on “Emulation-based long-term Access to Complex Websites.” They talked about using Emulation as a Service (EaaS) with an emulated network infrastructure to preserve web servers.

The Talks 3 session #WADL2022 has started with a talk on "Emulation-based long-term Access to Complex Web-sites" by Marcel Tschöpe, Rafael Gieschke and Klaus Rechert(@kurau5u)#JCDL2022 #WADL2022

Related blog: https://t.co/HFyMv5BxF6 @jcdl2022 pic.twitter.com/fIL1NIdN2m
— kritika garg (@kritika_garg) June 24, 2022

Travis Reid from WS-DL, Old Dominion University, presented “Web Archiving as Entertainment”. His work integrates gaming and web archiving by presenting archiving as a live stream that users can enjoy as a spectator sport. He uses the game configuration for an automated gaming live stream. He created demos of this using two video games: Gun Mayhem 2 and NFL Challenge.

#WADL2022 @TReid803 is presenting "Web Archiving as Entertainment" exploring how to integrate gaming with web archiving work with @phonedude_mln @weiglemc

Work funded by @NetPreserve: https://t.co/iVXbxWEvgQ pic.twitter.com/Ebh1lGUSHJ
— Shawn M. Jones, PhD (@shawnmjones) June 24, 2022

WS-DL alumnus Dr. Mat Kelly from Drexel University presented the last talk of the session, titled "First steps in Identifying Academic Migration using Memento and Quasi-Canonicalization". They used information from web archives to associate URI-Rs with faculty members and determine whether those faculty had moved between departments or universities.

Just wrapped up our presentation at the Web Archiving & Digital Libraries 2022 Workshop (#wadl2022) titled, "First steps in Identifying Academic Migration using Memento and Quasi-Canonicalization".

Here are the slides: https://t.co/X7gtqnHcTw #memento #webarchiving pic.twitter.com/jnfAMFkGpl
— Mat Kelly (@machawk1) June 24, 2022

Talks 4

Carrie Pirmann from Bucknell University and Erica Peaslee from Centurion Solutions LLC presented the invited talk on “Building a Community of Web Archivers: The Race to Save Ukrainian Cultural Heritage Online.” The SUCHO project was started in response to the Russia-Ukraine crisis that began on 24 February 2022. They are working with 1300+ cultural heritage professionals worldwide to preserve Ukrainian cultural heritage. The goal of SUCHO is to identify and archive at-risk Ukrainian cultural websites. SUCHO uses tools like Slack, Google Drive, and the Webrecorder suite, with Browsertrix crawler as the core crawling system. The more complex websites are either scraped manually or crawled using Webrecorder tools. SUCHO also includes sub-projects such as archiving memes on Facebook, Twitter, etc. They are also preserving 3-D tours of cultural museums that were previously not well captured. SUCHO has digitally preserved 40TB+ of websites, databases, and other digitized cultural property. SUCHO was also featured on PBS News Hour. The complexity and scale of this project and its accomplishments are truly inspiring!

.@librariancarrie and @erica_peaslee are presenting the SUCHO project where 1300+ people banded together to preserve Ukrainian cultural heritage online. They use @internetarchive and @webrecorder_io tools to archive the Ukrainian sites. #WADL2022 @jcdl2022 #JCDL2022 pic.twitter.com/jOG5nMlqW3
— kritika garg (@kritika_garg) June 24, 2022

Talks 5

Dr. Sawood Alam presented the talk on “CDX Summary for Web Archival Collection Insights”. He demonstrated CDX Summary, a tool he developed to summarize web archive capture index (CDX) files. The tool shows the distribution of MIME types and HTTP status codes in a WARC collection. It also gives an overview of path and query segments, hostnames, and temporal spread. He also built a Web Component and an interactive test interface for the tool.
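As a rough illustration of what such a summary involves, the sketch below counts MIME types and status codes from CDX lines. It is not the CDX Summary tool's implementation: the function name summarize_cdx is hypothetical, the sample records are made up, and the code assumes the common space-separated CDX layout in which the fourth and fifth fields hold the MIME type and HTTP status code.

```python
from collections import Counter


def summarize_cdx(lines):
    """Count MIME types and HTTP status codes in CDX index lines.

    Assumes space-separated records whose 4th and 5th fields are the
    MIME type and the status code.
    """
    mime_types = Counter()
    status_codes = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional " CDX ..." header line.
        if not line or line.startswith("CDX"):
            continue
        fields = line.split(" ")
        if len(fields) < 5:
            continue  # malformed record
        mime_types[fields[3]] += 1
        status_codes[fields[4]] += 1
    return mime_types, status_codes


# Made-up records for illustration only.
sample = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20220101000000 http://example.com/ text/html 200 AAAA - - 1234 0 example.warc.gz",
    "com,example)/logo.png 20220101000001 http://example.com/logo.png image/png 200 BBBB - - 2345 1234 example.warc.gz",
    "com,example)/missing 20220101000002 http://example.com/missing text/html 404 CCCC - - 345 3579 example.warc.gz",
]

if __name__ == "__main__":
    mimes, statuses = summarize_cdx(sample)
    print(mimes.most_common())
    print(statuses.most_common())
```

The same loop could be extended with counters for hostnames or capture years to approximate the other overviews mentioned above.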

.@ibnesayeed (also @WebSciDL alum) and MarkGraham from @internetarchive are now presenting "CDX Summary for Web Archival Collection Insights" at #WADL2022.

CDX Summary is a tool to summarize web archive capture index (CDX) files. Tool: https://t.co/hs34VNCwCj

@jcdl2022 @oducs pic.twitter.com/gd1LiNHJAe
— kritika garg (@kritika_garg) June 24, 2022

Grant Atkins from MITRE, a WS-DL alumnus, presented a talk on "Russia-Ukraine News on the Dark Web". He showed how many news sites were accessible through the dark web after the Russian invasion of Ukraine. He compared the surface sites and the dark sites (“.onion” domains that are reachable via the Tor network). He explained that many news sites usually try to mirror their content verbatim, but there are still differences between the surface and onion sites, such as custom CSS, removal of JS, and different HTTP headers. He showed that the dark sites are not well archived. He stated that since the surface and dark sites are not exact mirrors, the dark web content should be archived to preserve the differences.

.@grantcatkins (@WebSciDL alum) is presenting a talk on "Russia-Ukraine News on the Dark Web" at #WADL2022.@japharl @justinfbrunelle @jcdl2022 #JCDL2022 pic.twitter.com/KKhot4hkmG
— kritika garg (@kritika_garg) June 24, 2022

Emily Escamilla from WS-DL, Old Dominion University, presented her work on “Archiving Source Code in Scholarly Content: One in Five Articles References GitHub”. She talked about the increasing prevalence of references to scholarly source code on Git Hosting Platforms (GHPs). She emphasized the need to preserve the ephemera along with the software product. She found that one in five articles in arXiv references GitHub, which shows the importance of archiving GHPs. She has described this work in detail in a WS-DL blog post. Emily's work has been accepted for publication at the 26th International Conference on Theory and Practice of Digital Libraries 2022.

Scholars are increasingly citing repos in Git Hosting Platforms, but they aren't permanent. And preserving the code as a stand alone product isn't enough.

We need to archive the issues/wikis/pull requests that aid in reproducibility https://t.co/fqMRFnZjLo
— Emily Escamilla (@EmilyEscamilla_) June 24, 2022

Talks 6

Helge Holzmann and Nick Ruest from the Archives Unleashed project presented "Arch-It!". They demonstrated ARCH (Archives Research Compute Hub), their latest product, which is closely integrated with Archive-It. ARCH is built on top of the Archives Unleashed Toolkit (AUT). It provides Archive-It subscribers with a number of datasets for analyzing their collections, such as domain frequencies, WAT files, the plain text of web pages, and domain/image/longitudinal/web graphs. They also provided an example Jupyter Notebook showing how to use ARCH.

It's been phenomenal to watch the collaboration between @helgeho & @ruebot as they've developed ARCH (@unleasharchives + @archiveitorg integration solution to analyze #webarchives)!

Happening now @ #WADL2022 a @ruebot DEMO which tours through features + functionalities of ARCH pic.twitter.com/ivYj6Xi7OR
— The Archives Unleashed Project (@unleasharchives) June 24, 2022

Ilya Kreymer from Webrecorder presented the Web Archive Collection Zipped (WACZ) format that allows web archives to be shared and distributed. Ilya shared the infrastructure and new capabilities of WACZ. The key points of the talk were that high-fidelity web archives should be easy to use, easy to host, and low maintenance.

.@IlyaKreymer is presenting "WACZ" work by him, @edsu, and Cade Diehm at #WADL2022.

Web Archive Collection Zipped (WACZ) format that allows web archives to be shared and distributed.

py-wacz: https://t.co/1UJaXF1Uaw #JCDL2022 @jcdl2022 pic.twitter.com/orVlVwt1jz
— kritika garg (@kritika_garg) June 24, 2022

Mark Phillips presented the last talk of WADL, titled "Moving the End of Term Web Archive to the Cloud to Encourage Research Use and Reuse." He described the End of Term (EOT) dataset and how the effort was inspired by Common Crawl. He explained the decisions and steps involved in moving the data to Amazon S3 and replaying the archived content from the cloud by configuring pywb. Their goal is to provide greater access to EOT in order to encourage reuse and research.

#WADL2022 @vphill is presenting "Moving the End of Term Web Archive to the Cloud to Encourage Research Use and Reuse" work with @ibnesayeed

"We are also focusing on computational consumption of the collection rather than just replay."

For more info: https://t.co/H6PpNmtIba pic.twitter.com/rGchhXhZPf
— Shawn M. Jones, PhD (@shawnmjones) June 24, 2022

Closing Session

The final event of WADL was a closing discussion to wrap up the workshop. All attendees participated and commented on the work presented, possible publications, collaboration opportunities, and recommendations for future workshops. It was a great experience to participate in WADL 2022. It was exciting to present my research and to witness, as an attendee, the great research going on in the web archiving and digital libraries field.

#WADL2022 has come to a close. Thanks to all speakers, attendees, note-takers, tweeters (@WebSciDL), and our invited presenters @KarolHolu, @erica_peaslee, @librariancarrie @NSK_Zagreb @sucho_org #JCDL2022
— Martin Klein (@mart1nkle1n) June 24, 2022

#JCDL2022 came to an end with the closing of WADL. Thank you all for making @jcdl2022 a success. You all have a save trip home and looking forward to seeing you soon.

The organizing team https://t.co/x6oWlAoGEw
— JCDL2022 (@jcdl2022) June 25, 2022

-- Kritika Garg (@kritika_garg)
