2023-08-10: Web Archiving and Digital Libraries (WADL) Workshop 2023 Trip Report

The Web Archiving and Digital Libraries Workshop (WADL), held in conjunction with JCDL 2023, took place onsite and online on June 30th, 2023. The workshop was organized by Dr. Mat Kelly, Dr. Brenda Reyes Ayala, Dr. Zhiwu Xie, and Dr. Edward A. Fox. Notably, the Web Science and Digital Libraries Research Group (@WebSciDL) from Old Dominion University contributed significantly to the event with multiple presentations.

The opening session commenced with Dr. Mat Kelly welcoming all participants and providing an overview of the day's schedule. Subsequently, the Chairs and attendees were introduced, setting the tone for an interactive and productive workshop.

Talks 1 

Dr. Sawood Alam from the Internet Archive, also a WS-DL alumnus, started WADL with his presentation on "Synthesizing Daily Top News Summaries From Archived International TV Channels Using LLMs." He discussed how the Internet Archive (IA) uses computational methods and large language models (LLMs) to summarize daily TV news, and presented IA's new experimental service, which offers summaries of the top news stories extracted from archived TV news channels from around the world. To achieve this, audio content from these archives is transcribed and translated using Google Cloud services. The stories are then identified and summarized using LLMs such as Vicuna and GPT-3.5. In response to questions, he also touched on summary accuracy and hallucination, noting that evaluating both is essential. The earlier architecture led to more hallucinations because of its large context and "refine" prompt engineering. The new architecture uses smaller contexts and checks summaries against the source videos. The pipeline combines clustering with LLMs and supports multiple languages, though some low-resource languages remain challenging. Summarizing large and evolving data relies on vector databases, though addressing temporal evolution may require further engineering work. The code is available on GitHub.

Satvik Chekuri, a Ph.D. student from the DLRL group at Virginia Tech (VT), talked about "Identifying and Analyzing Twitter Data Related to Tunisia." He started with a brief overview of Tunisia's recent history. Tunisia's 2011 revolution inspired democratic reforms, but in 2021, President Saied reversed progress, suppressing freedoms and media outlets. In response to the suppression of speech and media, Tunisians turned to platforms like Twitter for truthful information. CS4624 Team 21 analyzed Twitter data on democracy, political reforms, and public sentiment in Tunisia since 2020. Collaborating with Virginia Tech's University Libraries, the team analyzed and visualized the data, aiming to revitalize research at VT and inspire new publications about Tunisia. Their results showed that tweets with negative sentiment consistently outnumbered those with positive sentiment. They also reported frequent spikes in tweets around major elections and protests. Future work includes addressing the English-language bias by including tweets in Arabic and French.

Dr. Sawood Alam presented his second talk on "IPARO: InterPlanetary Archival Record Object for Decentralized Web Archiving and Replay." He started by discussing the anatomy of a web archiving system. He talked about his previous tool, InterPlanetary Wayback (IPWB), which promotes permanence in web archives by distributing the contents of WARC files across the IPFS network. He then described the self-contained IPARO, which uses immutable linked lists for index-free discovery and resilience, along with content identifier (CID) grouping. The InterPlanetary Name System (IPNS), which works in a peer-to-peer fashion using a distributed hash table, is used for the resolution of the CID of URIs to enable immutability.
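To make the "immutable linked list" idea more concrete, here is a minimal sketch of a content-addressed chain of captures. It is not the actual IPARO format: the record fields, the `store` dictionary standing in for IPFS, and the use of a plain SHA-256 hex digest in place of a real CID are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical stand-in for a content-addressed network such as IPFS.
store = {}


def put(record):
    """Store a record under the SHA-256 hex digest of its canonical JSON."""
    data = json.dumps(record, sort_keys=True).encode("utf-8")
    cid = hashlib.sha256(data).hexdigest()  # assumption: real CIDs are encoded differently
    store[cid] = data
    return cid


def add_capture(uri, timestamp, body, prev_cid=None):
    """Create a self-contained capture record that links to the previous capture."""
    return put({"uri": uri, "timestamp": timestamp, "body": body, "prev": prev_cid})


def walk(head_cid):
    """Yield captures from newest to oldest by following 'prev' links, with no index."""
    cid = head_cid
    while cid is not None:
        record = json.loads(store[cid])
        yield record
        cid = record["prev"]


# Two captures of the same (example) URI; the second links back to the first.
c1 = add_capture("https://example.com/", "2023-01-01T00:00:00Z", "version 1")
c2 = add_capture("https://example.com/", "2023-06-30T00:00:00Z", "version 2", prev_cid=c1)
timestamps = [r["timestamp"] for r in walk(c2)]
```

Because each identifier is derived from the record's content, a stored record cannot be changed without changing its identifier, and every older capture is reachable from the newest one. In this sketch the newest identifier is just a variable; a name-resolution step would have to supply it in a real system.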

Breakout Session

Dr. Mat Kelly moderated the breakout session. During this session, participants were divided into two groups to delve into funding aspects of web archiving and social media archiving. The topics for discussion were suggested by the participants themselves.

In the funding group, the participants talked about ideas for potentially funded projects and institutions that actively support research in web archiving.

Meanwhile, I joined the social media group, which delved into the challenges of preserving social media content. Our discussions revolved around issues like temporal violations in archived content and using specific tools for data extraction, particularly in extracting tweets. Our conversations concluded with examining the ethical considerations associated with archiving social media content.

Overall, the breakout session provided an excellent platform for exchanging ideas.

Drop-in Talks 

The drop-in talks allowed attendees to present their work from JCDL or a budding WADL-related topic they're working on. 

Dr. Michele Weigle, a Professor from the WS-DL research group at Old Dominion University, presented her JCDL 2023 paper on "Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering." The paper discusses the impact of client-side rendering (CSR) on web archives, highlighting temporal violations that affect content integrity. Combining mementos from different crawlers can lead to temporal inconsistencies. Recommendations include using browser-based tools on CSR sites and implementing heuristics during crawling to detect CSR usage.
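As a rough illustration of what a temporal inconsistency looks like in data, the sketch below compares the capture time of a root HTML memento with the capture times of the resources it loads and flags those that are far apart. This is not the method from the paper: the function name, the one-day tolerance, and the sample datetimes are hypothetical.

```python
from datetime import datetime, timedelta


def temporal_spread(root_dt, resource_dts, tolerance=timedelta(days=1)):
    """Return resources captured more than `tolerance` away from the root page.

    The result maps each flagged resource to its signed offset from the root;
    a positive offset means the resource was captured after the HTML.
    """
    flagged = {}
    for url, dt in resource_dts.items():
        delta = dt - root_dt
        if abs(delta) > tolerance:
            flagged[url] = delta
    return flagged


# Hypothetical capture times for one replayed page.
root = datetime(2021, 5, 1, 12, 0)
resources = {
    "app.js": datetime(2021, 5, 1, 12, 0, 5),
    "data.json": datetime(2021, 9, 15, 8, 30),
}
flagged = temporal_spread(root, resources)
```

Here only `data.json` is flagged, with a positive offset, meaning the page would be rendered with data captured months after its HTML.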

Dr. Michael Nelson, a Professor from the WS-DL research group at Old Dominion University, demoed animated GIFs from Dr. Mohamed Aturban’s PLOS ONE paper, "Hashes are not suitable to verify fixity of the public archived web." Verifying the fixity of archived pages is crucial but difficult. A common technique is using cryptographic hash values to ensure fixity. However, this study on 16,627 mementos from 17 web archives found that 88.45% of mementos produce more than one unique hash value, and 16% of them consistently produce different hash values. This suggests a need for an archive-aware hashing function since conventional ones are not suitable for replayed archived web pages.
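The problem can be sketched in a few lines: hash the bytes returned by several replays of the same memento and count how many distinct digests appear. This is only an illustration, assuming the replay bodies have already been downloaded; the function name, the labels, and the sample bodies are hypothetical and do not reproduce the study's procedure.

```python
import hashlib


def classify_fixity(replays):
    """Classify repeated replays of one memento by how many distinct hashes they yield."""
    digests = [hashlib.sha256(body).hexdigest() for body in replays]
    unique = len(set(digests))
    if unique == 1:
        return "stable"            # every replay hashed identically
    if unique == len(digests):
        return "always different"  # no two replays hashed identically
    return "varies"                # some replays match, some do not


# Hypothetical replay bodies: same archived content, but the replay layer
# injects something that changes between requests (e.g., a banner timestamp).
page = b"<html><body>archived content</body></html>"
stable = [page, page, page]
varying = [page + b"<!-- t=1 -->", page + b"<!-- t=2 -->", page + b"<!-- t=3 -->"]

stable_label = classify_fixity(stable)
varying_label = classify_fixity(varying)
```

A single changing byte anywhere in the response is enough to change the digest, which is why a conventional hash over the whole replayed page is fragile.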

Dr. Alexander Nwala, an assistant professor at William & Mary and an alumnus of the WS-DL research group, presented StoryGraph. StoryGraph offers a set of tools that assess the news cycle by calculating the similarity between news stories from 17 different US news sources across the partisanship spectrum (left, center, and right). It generates a news similarity graph every 10 minutes that tracks the development of news events. Dr. Nwala also demonstrated the StoryGraphBot, which we built to track top news stories and create tweet threads (collections of tweets) that report updates (rising/falling/same attention) of the stories. 

Dr. Brenda Reyes Ayala from the University of Alberta presented “Gone, Gone, but Not Really, and Gone, But Not Forgotten: A Typology of Website Recoverability.” This paper analyzes webpage recoverability using web archives as a baseline. Three types of lost web pages were identified: unrecoverable, fully recoverable, and partially recoverable. The study aims to define degrees of recoverability and suggests methods for web archivists to discover lost content and improve web archives.

Talks 2

Dr. Michael Nelson presented "A Graduate Course in Web Archiving," which will be offered as a certificate course. He presented the status of an IMLS-funded effort to provide online, focused, in-depth graduate education in web archiving.

Dr. Sawood Alam presented the last talk of WADL on our late-breaking work published in JCDL 2023, "TrendMachine: A Temporal Webpage Resilience Portal." TrendMachine is an open-source interactive tool that uses a mathematical model to compute a normalized score measuring the resilience of a web page over time based on its mementos. The tool has versatile applications, including identifying points of interest, detecting dead links, and analyzing sections of large websites. Answering a question about the number of captures, Dr. Alam said that the tool is better suited for web pages with a large number of temporally diverse captures and may not be as effective for pages with only a few mementos. The code and demo for TrendMachine are available online.


Dr. Edward A. Fox moderated the closing of WADL 2023 with valuable discussions on R&D, publishing ideas, and collaboration opportunities in web archiving. Many important topics were discussed during the closing session, such as sponsors for funding research, important research problems, and appropriate datasets/collections.

It was a great experience to participate in WADL 2023. The workshop brought together experts and enthusiasts, facilitating valuable discussions and knowledge exchange on web archiving and digital libraries.

-- Kritika Garg (@kritika_garg)

Other WS-DL trip reports for JCDL 2023:

  1. Joint Conference on Digital Libraries (JCDL) 2023 Trip Report

  2. "Up and running with ARK persistable identifiers" JCDL Tutorial Trip Report

  3. JCDL'23  Doctoral Consortium Trip Report

  4. Joint Workshop of the 4th EEKE2023 Trip Report