2019-10-31: Continuing Education to Advance Web Archiving (CEDWARC)

Note: This blog post may be updated with additional links to slides and other resources as they become publicly available.

On October 28, 2019, web archiving experts met with librarians and archivists at the George Washington University in Washington, DC. As part of the Continuing Education to Advance Web Archiving (CEDWARC) effort, we covered several different modules related to tools and technologies for web archives. The event consisted of morning overview presentations and afternoon lab portions. Here I will provide an overview of the topics we covered.

Web Archiving Fundamentals

Prior to the event, Edward A. Fox, Martin Klein, Anna Perricci, and Zhiwu Xie created a brief tutorial covering the fundamentals of web archiving. This tutorial, shown below, was distributed as a video to attendees so they could familiarize themselves with the concepts we would discuss at CEDWARC.

Zhiwu Xie kicked off the event with a refresher of this tutorial. He stressed the complexities of archiving web content due to the number of different resources necessary to reconstruct a web page at a later time. He mentioned that it was necessary not only to capture all of these resources, but also to replay them properly. Improper replay can lead to temporal inconsistencies, as has been covered on this blog by Scott Ainsworth. He further covered WARCs and other concepts, like provenance, related to the archiving and replay of web pages.

Resources:

Slides: Web Archiving Fundamentals
Video: Web Archiving Fundamentals
Slides: Welcome and Web Archiving Fundamentals Recap

Memento

Now that the attendees were familiar with web archives, Martin Klein provided a deep dive into what they could accomplish with Memento. Klein covered how Memento allows users to find mementos for a resource in multiple web archives via the Memento Aggregator. He further touched on recent machine learning work to improve the performance of the Memento Aggregator.

At CEDWARC (https://t.co/J4nFrTuQRh), @mart1nkle1n presents Time Travel on the web with Memento. For more information, see https://t.co/yzXZEwKDvo pic.twitter.com/v8BSR0gZFy
— Shawn M. Jones (@shawnmjones) October 28, 2019

Klein highlighted how to use the Memento browser extension, available for Chrome and Firefox. He mentioned how one could use Memento with Wikipedia, and echoed my frustration with trying to get Wikipedia to adopt the protocol. He closed by introducing the various Memento Time Travel APIs that are available.
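To make the datetime side of Memento a little more concrete, here is a minimal Python sketch. It formats a datetime the way the protocol's Accept-Datetime request header expects, builds a request URL for the Time Travel JSON API, and picks the memento nearest to a desired datetime from a list of candidates. The URL pattern in `TIMETRAVEL_JSON_API` is my assumption about the service's layout and should be checked against its documentation. The function names and the `archive.example` and `other.example` URI-Ms are made up for illustration, and the real aggregator does far more than this.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed URL pattern for the Time Travel JSON API; confirm against the service docs.
TIMETRAVEL_JSON_API = "http://timetravel.mementoweb.org/api/json/{timestamp}/{uri}"


def accept_datetime(when: datetime) -> str:
    """Format a datetime as the HTTP-date value carried by Accept-Datetime."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def timetravel_url(uri: str, when: datetime) -> str:
    """Build a Time Travel API request URL for `uri` near `when`."""
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return TIMETRAVEL_JSON_API.format(timestamp=timestamp, uri=uri)


def closest_memento(mementos, when):
    """Return the (memento_datetime, uri_m) pair nearest to `when`."""
    return min(mementos, key=lambda pair: abs(pair[0] - when))


target = datetime(2019, 10, 28, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical mementos of the same page held by two different archives.
candidates = [
    (datetime(2019, 1, 1, tzinfo=timezone.utc),
     "https://archive.example/20190101000000/http://example.com/"),
    (datetime(2019, 10, 30, tzinfo=timezone.utc),
     "https://other.example/20191030000000/http://example.com/"),
]

print(accept_datetime(target))
print(timetravel_url("http://example.com/", target))
print(closest_memento(candidates, target)[1])
```

Choosing by absolute distance from the requested datetime means the best memento may come from before or after that moment, whichever capture is nearer.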

First presentation at #CEDWARC is about Memento, a web aggregator that accesses more than 2 dozen web archives. It uses a federated search and ranks the URIs by proximity to the date/time specified. Totally brilliant, and is has web extensions!! https://t.co/gBbJlfIiDp
— Haian (@haian_o) October 28, 2019

Resources:

Slides: Accessing and Using Web Archives

Social Feed Manager

Laura Wrubel and Dan Kerchner covered Social Feed Manager, a tool by George Washington University that helps researchers build social media archives from Twitter, Tumblr, Flickr, and Sina Weibo. SFM does more than archive the pages of social media content. It also acquires content available via each API, preserving identifiers, location information, and other data not present on a given post's web page.

At #CEDWARC, @liblaura and @DanKerchner present @SocialFeedMgr for capturing social media data from multiple networks for #webarchiving. For more information, see https://t.co/LaZltX6WWB pic.twitter.com/rVewiEFld3
— Shawn M. Jones (@shawnmjones) October 28, 2019

As part of the #CEDWARC presentation on @SocialFeedMgr, @liblaura also gives a shout out to other projects like @documentnow and twarc (https://t.co/C9pnaKZE0G). She emphasizes their partnerships with these projects and others. pic.twitter.com/rWRkbsE4tH
— Shawn M. Jones (@shawnmjones) October 28, 2019

Storytelling

I presented work from the Dark and Stormy Archives Project on using storytelling techniques with web archives. I introduced the concept of a social card as a summary of the content of a single web page. Storytelling services like Wakelet combine social cards to summarize a topic. We can use this same concept to summarize web archives. I broke storytelling with web archives into two actions: selecting the mementos for our story and visualizing those mementos.

I briefly covered the problems of scale with selecting mementos manually before moving on to the steps of AlNoamany's Algorithm. WS-DL alumnus Yasmin AlNoamany developed this algorithm to produce a set of representative mementos from a web archive collection. AlNoamany's Algorithm can be executed using our Off-Topic Memento Toolkit.

Existing platforms do not reliably produce social cards for #mementos! @shawnmjones has an alternative:https://t.co/m8Jez6GdXf #CEDWARC pic.twitter.com/l1mP0GIZlj
— Martin Klein (@mart1nkle1n) October 28, 2019

Visualizing mementos requires that our social cards do a good job describing the underlying content. These cards should also avoid confusion. Because of the confusion introduced by other card services, we created MementoEmbed to produce surrogates for mementos. From MementoEmbed, we then created Raintale to produce stories from lists of memento URLs (URI-Ms).
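As a rough illustration of what a card needs in order to avoid confusion, the Python sketch below pulls the archive host, the original URI, and the capture datetime out of a URI-M, then orders a list of URI-Ms into a simple story structure. It assumes Wayback-style URI-Ms, where a 14-digit timestamp precedes the original URI. The function names and the dictionary layout are hypothetical. This is not the MementoEmbed or Raintale API, and it is not their output format.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlparse

# Assumes Wayback-style URI-Ms:
#   <archive prefix>/<14-digit timestamp>[two-letter modifier_]/<original URI>
URIM_PATTERN = re.compile(
    r"^(?P<prefix>https?://.+?)/(?P<ts>\d{14})(?:[a-z]{2}_)?/(?P<orig>.+)$"
)


def describe_memento(urim: str) -> dict:
    """Extract the fields a card needs to keep archive and original source distinct."""
    match = URIM_PATTERN.match(urim)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {urim}")
    captured = datetime.strptime(match["ts"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return {
        "urim": urim,
        "archive": urlparse(urim).netloc,
        "original_uri": match["orig"],
        "original_domain": urlparse(match["orig"]).netloc,
        "memento_datetime": captured.isoformat(),
    }


def build_story(title: str, urims: list) -> dict:
    """Order the described mementos by capture time under a story title."""
    cards = sorted((describe_memento(u) for u in urims),
                   key=lambda card: card["memento_datetime"])
    return {"title": title, "elements": cards}


story = build_story("Example story", [
    "https://web.archive.org/web/20191028120000/http://example.com/page",
    "https://web.archive.org/web/20180101000000/http://example.org/",
])
for card in story["elements"]:
    print(card["memento_datetime"], card["original_domain"], "via", card["archive"])
```

Keeping the archive host and the original domain as separate fields lets a card show who authored the content and who preserved it.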

In the afternoon, I conducted a series of laboratory exercises with the attendees using these tools.

@shawnmjones rocking the Storytelling session at #CEDWARC pic.twitter.com/QNc6mPGYWi
— Martin Klein (@mart1nkle1n) October 28, 2019

Resources:

Slides: Storytelling With Web Archives
Tutorials: Laboratory Exercises

ArchiveSpark

Helge Holzmann presented ArchiveSpark for efficiently processing and extracting data from web archive collections. ArchiveSpark provides filters and other tools to reduce the data and provide it in a variety of accessible formats. It provides efficient access by first extracting information from CDX and other metadata files before directly processing WARC content.
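The metadata-first idea can be sketched in a few lines of Python. This is only an illustration of the concept, not ArchiveSpark's API, which is used from Scala. It assumes the common 11-field, space-separated CDX layout. The record type, the function names, and the sample lines are all made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple


class CdxRecord(NamedTuple):
    timestamp: str
    original: str
    mime: str
    status: str
    length: int    # compressed record size in bytes
    offset: int    # byte offset of the record inside the WARC file
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Parse one line, assuming the 11-field layout 'N b a m s k r M S V g'."""
    f = line.split()
    return CdxRecord(f[1], f[2], f[3], f[4], int(f[8]), int(f[9]), f[10])


def select_records(lines: Iterable[str], mime: str = "text/html",
                   status: str = "200") -> Iterator[CdxRecord]:
    """Filter on metadata only; no WARC file is opened at this stage."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("CDX"):  # skip blanks and the header
            continue
        record = parse_cdx_line(stripped)
        if record.mime == mime and record.status == status:
            yield record


# Hypothetical CDX lines for illustration.
cdx_lines = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20191028120000 http://example.com/ text/html 200 AAAA - - 1043 333 example.warc.gz",
    "com,example)/logo.png 20191028120001 http://example.com/logo.png image/png 200 BBBB - - 5120 1376 example.warc.gz",
    "com,example)/old 20191028120002 http://example.com/old text/html 404 CCCC - - 512 6496 example.warc.gz",
]

selected = list(select_records(cdx_lines))
for record in selected:
    # Only these byte ranges would need to be read from the WARC file.
    print(record.filename, record.offset, record.length)
```

Because the filter runs over small index lines, the much larger WARC payloads only need to be read for records that survive it.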

Fourth presentation is on Archive Spark, a powerful and super impressive tool to efficiently process and analyze web archives, with the example of downloading WARC and CDX files from the Internet Archive's Wayback Machine #CEDWARC https://t.co/yKDkU63Qco
— Haian (@haian_o) October 28, 2019

ArchiveSpark uses Apache Spark to run multiple processes in parallel. Users employ Scala to filter and process their collections to extract data. Helge emphasized that data is provided in a JSON format that was inspired by the Twitter API. He closed by showing how one can use ArchiveSpark with Jupyter notebooks.

“Twitter has a nice JSON API that many people and tools already understand” so ArchiveSpark tries to emulate this idea for data from #webarchive collections, according to @helgeho at #CEDWARC https://t.co/nPK6oHbCiN pic.twitter.com/P4Oe68zUqb
— Shawn M. Jones (@shawnmjones) October 28, 2019

Resources:

Slides: Efficient Web Archive Processing With ArchiveSpark

Archives Unleashed

Samantha Fritz and Sarah McTavish highlighted several tools provided by the Archives Unleashed Project. WS-DL members have been to several Archives Unleashed events, and I was excited to see these tools introduced to a new audience.

Archives Unleashed Project team giving a brief overview of the project’s mission and afternoon session! Boom 💥 #CEDWARC pic.twitter.com/NF2FsX9pN9
— Cal Murgu (@CalMurgu) October 28, 2019

The team briefly covered the capabilities of the Archives Unleashed Toolkit (AUT). AUT employs Hadoop and Apache Spark to allow users to perform collection analytics, text analysis, named-entity recognition, network analysis, image analysis, and more. From there, they introduced Archives Unleashed Cloud (AUK) for extracting datasets from one's own Archive-It collections. These datasets can then be consumed and further analyzed in Archives Unleashed Notebooks. Finally, they covered Warclight, which provides a discovery layer for web archives.

Presenters gave a plug for the Archives Unleashed Datathon, where participants work on collaboratively inspired web archives project at scale using the Archives Unleashed tools. Next one is in March at Columbia! #CEDWARC https://t.co/YBGcQFaH9B
— Haian (@haian_o) October 28, 2019

Resources:

Slides: Analyzing Web Archives with the Archives Unleashed Project

Event Archiving

Ed Fox closed out our topics by detailing the event archiving work done by the Virginia Tech team. He talked about the issues with using social media posts to supply URLs for events so that web archives could then quickly capture them. After some early poor results, his team has worked extensively on improving the quality of seeds through the use of topic modeling, named entity extraction, location information, and more. This work is currently reflected in the GETAR (Global Event and Trend Archive Research) project.

At #CEDWARC, @edwardafox talks about the Event Focused Crawler, to be demonstrated this afternoon. This paper may provide more information on the ideas behind this tool: https://t.co/Wc6rUM09WO pic.twitter.com/aqP7ohSdjv
— Shawn M. Jones (@shawnmjones) October 28, 2019

In the afternoon session, he helped attendees acquire seed URLs via the Event Focused Crawler (EFC). Using code from Code Ocean and Docker containers, we were able to perform a focused crawl to locate additional URLs about an event. In addition to EFC, we classified Twitter accounts using TwiRole.

At #CEDWARC, @edwardafox is now covering some of the work featured at https://t.co/XfbWQILkDM pic.twitter.com/F2oVIcbNpU
— Shawn M. Jones (@shawnmjones) October 28, 2019

Resources:

Slides: Event Archiving Introduction
Slides: Event Archiving Practicum
Document: Tutorial for EFC and TwiRole on Code Ocean

Webrecorder

Update on 2019/10/31 at 20:42 GMT: The original version neglected to include the afternoon Webrecorder laboratory session, which I did not attend. Thanks to Anna Perricci for providing us with a link to her slides and some information about the session.

In the afternoon, Anna Perricci presented a laboratory titled "Human scale web collecting for individuals and institutions" which was about using Webrecorder. Unfortunately, I was not able to engage in these exercises because I was at Ed Fox's Event Archiving session. Webrecorder was part of the curriculum because it is a free, open source tool that demonstrates some essential web archiving elements. Her session covered manual use of Webrecorder as well as its newer autopilot capabilities.

Resources:

Slides: Human scale web collecting for individuals and institutions

Summary

CEDWARC's goal was to educate and provide outreach to the greater librarian and archiving community. We introduced the attendees to a number of tools and concepts. Based on the response to our initial announcement and the results from our sessions, I think we have succeeded. I look forward to potential future events of this type.

-- Shawn M. Jones
