2019-10-31: Continuing Education to Advance Web Archiving (CEDWARC)

Note: This blog post may be updated with additional links to slides and other resources as they become publicly available.

On October 28, 2019, web archiving experts met with librarians and archivists at the George Washington University in Washington, DC. As part of the Continuing Education to Advance Web Archiving (CEDWARC) effort, we covered several different modules related to tools and technologies for web archives. The event consisted of morning overview presentations and afternoon lab portions. Here I will provide an overview of the topics we covered.

Web Archiving Fundamentals

Before the event, Edward A. Fox, Martin Klein, Anna Perricci, and Zhiwu Xie created a brief tutorial covering the fundamentals of web archiving. This tutorial was distributed as a video to attendees in advance so they could familiarize themselves with the concepts we would discuss at CEDWARC.

Zhiwu Xie kicked off the event with a refresher of this tutorial. He stressed the complexity of archiving web content: reconstructing a web page at a later time requires many different resources. He noted that it is necessary not only to capture all of these resources, but also to replay them properly. Improper replay can lead to temporal inconsistencies, as has been covered on this blog by Scott Ainsworth. He further covered WARCs and other concepts, like provenance, related to the archiving and replay of web pages.



Now that the attendees were familiar with web archives, Martin Klein provided a deep dive into what they could accomplish with Memento. Klein covered how Memento allows users to find mementos for a resource in multiple web archives via the Memento Aggregator. He further touched on recent machine learning work to improve the performance of the Memento Aggregator.

Klein highlighted how to use the Memento browser extension, available for Chrome and Firefox. He mentioned how one could use Memento with Wikipedia, and echoed my frustration with trying to get Wikipedia to adopt the protocol. He closed by introducing the various Memento Time Travel APIs that are available.
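For readers who have not used the protocol, here is a minimal sketch of what a Memento datetime negotiation request looks like. It only builds the URL and header and sends nothing over the network. The Accept-Datetime header comes from the Memento specification (RFC 7089). The TimeGate prefix and the function name build_timegate_request are my assumptions for illustration, so check the Time Travel service documentation before relying on them.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: prefix of the public Time Travel TimeGate endpoint.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_url, when):
    """Return (url, headers) for a Memento datetime negotiation request.

    `when` should be a timezone-aware datetime; a naive datetime is
    interpreted as local time by astimezone().
    """
    # RFC 7089 expects Accept-Datetime as an HTTP-date expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    url = TIMEGATE_PREFIX + original_url
    headers = {"Accept-Datetime": accept_datetime}
    return url, headers
```

The returned pair can be handed to any HTTP client. A TimeGate that supports the protocol would then redirect to the memento closest to the requested datetime.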


Social Feed Manager

Laura Wrubel and Dan Kerchner covered Social Feed Manager, a tool by George Washington University that helps researchers build social media archives from Twitter, Tumblr, Flickr, and Sina Weibo. SFM does more than archive the pages of social media content. It also acquires content available via each API, preserving identifiers, location information, and other data not present on a given post's web page.


I presented work from the Dark and Stormy Archives Project on using storytelling techniques with web archives. I introduced the concept of a social card as a summary of the content of a single web page. Storytelling services like Wakelet combine social cards to summarize a topic. We can use this same concept to summarize web archives. I broke storytelling with web archives into two actions: selecting the mementos for our story and visualizing those mementos.

I briefly covered the problems of scale with selecting mementos manually before moving on to the steps of AlNoamany's Algorithm. WS-DL alumnus Yasmin AlNoamany developed this algorithm to produce a set of representative mementos from a web archive collection. AlNoamany's Algorithm can be executed using our Off-Topic Memento Toolkit.

Visualizing mementos requires that our social cards do a good job describing the underlying content. These cards should also avoid confusion. Because of the confusion introduced by other card services, we created MementoEmbed to produce surrogates for mementos. From MementoEmbed, we then created Raintale to produce stories from lists of memento URLs (URI-Ms).
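To make the idea of a URI-M more concrete, here is a small sketch that pulls the capture datetime and the original URL out of a Wayback-style URI-M. This is not how MementoEmbed or Raintale work internally. It assumes the common pattern of an archive prefix, a 14-digit timestamp with an optional two-letter modifier, and then the original URL. Not every archive uses that pattern, and the name parse_urim is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed Wayback-style layout:
#   <archive prefix>/<14-digit timestamp>[two-letter modifier + "_"]/<original URL>
URIM_PATTERN = re.compile(
    r"^(?P<prefix>.+?)/(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<original>.+)$"
)


def parse_urim(urim):
    """Return (capture datetime in UTC, original URL) for a Wayback-style URI-M."""
    match = URIM_PATTERN.match(urim)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + urim)
    captured = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), match.group("original")
```

Sorting a list of URI-Ms by the parsed datetime is one simple way to put a story's mementos in chronological order before handing them to a storytelling tool.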

In the afternoon, I conducted a series of laboratory exercises with the attendees using these tools.



Helge Holzmann presented ArchiveSpark for efficiently processing and extracting data from web archive collections. ArchiveSpark provides filters and other tools to reduce the data and provide it in a variety of accessible formats. It provides efficient access by first extracting information from CDX and other metadata files before directly processing WARC content.
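The metadata-first idea can be illustrated with a short sketch. This is plain Python, not ArchiveSpark's API (ArchiveSpark itself is used from Scala). It assumes the common 11-field CDX layout (URL key, timestamp, original URL, MIME type, status code, digest, redirect, meta tags, record length, offset, file name), and the names RecordPointer and select_records are hypothetical. The point is that the filter runs on small CDX lines, so only the matching records ever need to be read from the WARC files.

```python
from typing import Iterable, Iterator, NamedTuple


class RecordPointer(NamedTuple):
    """Where to find one record inside a WARC file."""
    filename: str
    offset: int
    length: int


def select_records(cdx_lines: Iterable[str],
                   mime: str = "text/html",
                   status: str = "200") -> Iterator[RecordPointer]:
    """Yield pointers to WARC records whose CDX metadata passes the filter."""
    for line in cdx_lines:
        fields = line.split()
        # Skip blank lines, the " CDX ..." header, and unexpected layouts.
        if len(fields) != 11 or fields[0] == "CDX":
            continue
        mimetype, statuscode = fields[3], fields[4]
        length, offset, filename = fields[8], fields[9], fields[10]
        # Some CDX files use "-" for unknown values; skip those rows.
        if not (length.isdigit() and offset.isdigit()):
            continue
        if mimetype == mime and statuscode == status:
            yield RecordPointer(filename, int(offset), int(length))
```

Each pointer identifies a byte range in a WARC file, so a later step can seek directly to the records of interest instead of scanning the whole collection.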

ArchiveSpark uses Apache Spark to run multiple processes in parallel. Users employ Scala to filter and process their collections to extract data. Helge emphasized that data is provided in a JSON format that was inspired by the Twitter API. He closed by showing how one can use ArchiveSpark with Jupyter notebooks.


Archives Unleashed

Samantha Fritz and Sarah McTavish highlighted several tools provided by the Archives Unleashed Project. WS-DL members have been to several Archives Unleashed events, and I was excited to see these tools introduced to a new audience.

The team briefly covered the capabilities of the Archives Unleashed Toolkit (AUT). AUT employs Hadoop and Apache Spark to allow users to perform collection analytics, text analysis, named-entity recognition, network analysis, image analysis, and more. From there, they introduced Archives Unleashed Cloud (AUK) for extracting datasets from one's own Archive-It collections. These datasets can then be consumed and further analyzed in Archives Unleashed Notebooks. Finally, they covered Warclight, which provides a discovery layer for web archives.


Event Archiving

Ed Fox closed out our topics by detailing the event archiving work done by the Virginia Tech team. He talked about the issues with using social media posts to supply URLs for events so that web archives could then quickly capture them. After some early poor results, his team has worked extensively on improving the quality of seeds through the use of topic modeling, named entity extraction, location information, and more. This work is currently reflected in the GETAR (Global Event and Trend Archive Research) project.

In the afternoon session, he helped attendees acquire seed URLs via the Event Focused Crawler (EFC). Using code from Code Ocean and Docker containers, we were able to perform a focused crawl to locate additional URLs about an event. In addition to EFC, we classified Twitter accounts using TwiRole.


Update on 2019/10/31 at 20:42 GMT: The original version neglected to include the afternoon Webrecorder laboratory session, which I did not attend. Thanks to Anna Perricci for providing us with a link to her slides and some information about the session.

In the afternoon, Anna Perricci presented a laboratory titled "Human scale web collecting for individuals and institutions" which was about using Webrecorder. Unfortunately, I was not able to engage in these exercises because I was at Ed Fox's Event Archiving session. Webrecorder was part of the curriculum because it is a free, open source tool that demonstrates some essential web archiving elements. Her session covered manual use of Webrecorder as well as its newer autopilot capabilities.


CEDWARC's goal was to educate and provide outreach to the greater librarian and archiving community. We introduced the attendees to a number of tools and concepts. Based on the response to our initial announcement and the results from our sessions, I think we have succeeded. I look forward to potential future events of this type.

-- Shawn M. Jones