2022-06-15: IIPC Web Archiving Conference (WAC) Trip Report


This year's International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) (#IIPCWAC22) took place online. All of the presentations were pre-recorded and the live conference sessions were a Q&A format with the presenters. The pre-recorded videos and the session recordings will be publicly available at a later date and I will update this blog post with the links when they're available. 

How can we make Web Archives and their data available to researchers?

This was the main question that presenters were answering at this year's conference. Many presentations addressed lowering the barrier to entry for Web archives so that those outside of the Web archiving community, especially researchers, can access and leverage the holdings of the archives. Sawood Alam (@ibnesayeed) of the Internet Archive summed it up in the tweet below:
 

Day One

Session 1: Full-Text Search for Web Archives

Retrieving a representation of a URL from an archive is easy. Searching an archive the way we use a search engine is far more difficult; however, it is important for discoverability and for understanding what is in an archive's collection. To start off Session 1, Andy Jackson (@anjacks0n) from the UK Web Archive presented "The state of full-text search at the UK Web Archive". In the past few years, the tool ecosystem for full-text search has become more robust and now includes tools like ElasticSearch, OpenSearch, UKWA-UI, and SolrWayback. UKWA has partnered with the Danish Web Archive to implement SolrWayback as an internal tool and is working to develop use cases that better meet the needs of researchers.

Building on the topic of full-text search, Anders Klindt Myrvoll (@AndresKlindt) from the Royal Danish Library talked about the creation of SolrWayback, an interface that allows researchers to explore ARC and WARC files. SolrWayback supports typical full-text searching as well as exploratory ways of looking at the archive, including link graphs.

Ben O'Brien (@ob1_ben_ob) from the National Library of New Zealand talked about their pilot program to develop a full-text search platform. After running an options analysis, they decided to use a combination of Warclight, Solr, and Webarchive-Discovery hosted in the AWS cloud.

Session 2: BESOCIAL: Social Media Archiving at KBR in Belgium

This session was one of my favorites from the weekend. It was all about BESOCIAL, a project funded by Belspo (Belgian Science Policy Office) and led by KBR (Royal Library of Belgium) that is working to archive and preserve social media in Belgium.

Fien Messens (@FMessens, KBR) and Pieter Heyvaert (@HeyPieter, Ghent University) addressed the curation side of the project: what to select and how to harvest. They are archiving Twitter and Instagram to preserve Belgian cultural heritage on social media platforms, curating a combination of hashtags and accounts. The Twitter API provided an easy way to perform weekly harvesting of the text that constitutes a Tweet. However, harvesting Instagram was far more difficult due to the lack of an API, a rapidly changing interface, and strict anti-bot detection.

Eva Rolin continued the conversation by providing more detail on the curation of the Twitter and Instagram hashtags and accounts included in the representative corpus. They did not want to collect only a list of hashtags and predefined keywords due to the spontaneous nature of what is trending. Instead, they focused on identifying the accounts of Belgian personalities (politicians, singers, influencers, companies, etc.) and identifying other important accounts from their followers. They used machine learning to determine whether an account was Belgian and should be included.

Rolin also detailed the search for a research interface that would allow for queries. SolrWayback was initially appealing, but it was created for webpages rather than social media, so some of its webpage-oriented features were not applicable and features needed for social media were not available. As a result, they are creating a tool based on ElasticSearch to query the data.

Next, Lisa-Anne Denis (@LiseanneDenis) detailed the complex copyright laws and their exceptions that govern European Web archives and their collections. One of the main takeaways was the lack of legal certainty surrounding what Web archives are allowed to archive in Europe. 

Peter Mechant rounded out the conversation by providing a use case for the importance of archiving social media. He analyzed a Twitter dataset surrounding the events and debates in the Gorman-Rijneveld translation controversy. With the dataset, he was able to find key actors, identify trends, and create timelines, showing the power of capturing the public sentiment and debate that take place on social media platforms.

Session 3: Teaching the Whys and Hows of Creating and Using Web Archives

This session contained four presentations detailing some of the ways that researchers are lowering the barrier of entry to Web archiving. 

Tim Ribaric (@elibtronic) presented his experience and lessons learned leveraging Google Colab notebooks to make Web archiving data available to non-archivists and non-programmers. Karolina Holub (@KarolHolu) explained her work with librarians in Croatia to create local history collections together. Zhiwu Xie (@zxie) talked about the workshops and learning opportunities they created to train museum and library professionals to advance Web archiving. Lastly, Kirsty Fife (@DIYarchivist) detailed the motivation and process for developing low-cost workshops for non-developers and non-archivists who want to preserve their work in grassroots communities.

These presentations increased my awareness of the niche communities (large and small) that would hugely benefit from Web archiving as well as the steep barrier to entry that those outside the Web archiving community face. 

Session 4: Video/Stream Archiving

While the "streaming experience" (Netflix, YouTube, Spotify, TikTok, etc.) is a large part of our contemporary culture, it poses unique challenges for archiving. Andreas Lenander Aegidius (@a_aegidius) explained the various types of streaming that need to be preserved and the work they are doing to preserve them. They are currently testing the open explorative approach (an existing method of discovering streaming content) and are looking to archive highly personalized streaming experiences like Spotify in the future.

Archiving embedded videos is easy with traditional archiving methods. However, YouTube does not serve videos as simple embedded files. Instead, it uses a complex video player with dynamically changing video streams. Sawood Alam (@ibnesayeed) of the Internet Archive (@internetarchive) presented the work they have done to address the unique challenges of archiving and replaying videos hosted on YouTube.

Session 5: Lightning Talks

Anaïs Crinière-Boizet and Isabelle Degrange from the Bibliothèque nationale de France (BnF) talked about the guided tours offered by BnF. The guided tours are sophisticated collections created by BnF to highlight the holdings of the Web archive and to help researchers and the general public understand them.

Ricardo Basilio (@ricardobasilio_) from Arquivo.pt presented example exhibitions created using Arquivo.pt holdings. The example exhibitions were created to show that it is easy to create an exhibition and to highlight Arquivo.pt's holdings to the public.

Helena Byrne (@HBee2015) from the British Library presented "Web Archiving the Olympic and Paralympic Games" about the Olympic and Paralympic Games collection in Archive-It. The collection starts in 2010 and contains webpages from around the world in 48 different languages.

Daniel Gomes (@dcgomes77) from Arquivo.pt gave two lightning talks in this session. The first presentation highlighted the holdings of Arquivo.pt, a publicly available Web archive, as an open data provider. The second presentation summarized the book "The Past Web".

Tyng-Ruey Chuang (@trc4identica) and Chia-Hsun Ally Wang from Academia Sinica in Taiwan presented their website, created to allow the public to upload images from their daily lives, and their efforts to archive its holdings. They are using tools from the Internet Archive to preserve their collection.

Session 6: Design, Build, Use: Building a Computational Research Platform for Web Archives

Session 6 was another one of my favorites from the weekend. It was all about ARCH (Archives Research Compute Hub). "You shouldn't need to be a Web historian to use Web archives" is the sentiment driving the ARCH project to lower the barrier to access through community and infrastructure.

The session started with Jefferson Bailey explaining the need for supporting computational research. Web archives contain a vast amount of data that can be leveraged for research. However, massive files, a unique file format (WARC), the variety of files within a WARC, and complex curation details can create a barrier to use for researchers. ARCH is working to lower this barrier by providing flexible data delivery methods that better suit researchers.

Next, Ian Milligan (@ianmilligan1) detailed the infrastructure and implementation of ARCH. Prior to ARCH, the Archives Unleashed (@unleasharchives) project created the Archives Unleashed Cloud, which syncs Archive-It collections, analyzes them, and generates scholarly derivatives to work with. With that complete, ARCH is working to merge Archives Unleashed with the existing Archive-It platform. This will provide a UI that is familiar to Archive-It users and allow access to the Internet Archive datacenter for easy data computation.

Frédéric Clavert (@inactinique) and Valérie Schafer (@valerie_schafer) finished the session by talking about the real use cases that drive development in the ARCH project. 

During the live Q&A session, the ARCH team provided additional insight into the project. They explained that access to Web archives is the most important thing but also the most difficult thing. Additionally, they noted that the Wayback Machine is fantastic but fits a different use case. ARCH is working to provide Web archive access for data-driven work.

Session 7: Advancing Quality Assurance for Web Archives: Putting Theory into Practice

Session 7 consisted of two presentations: the first explained a theory for quality assurance in the archives and the second explained the implementation of that theory.

Dr. Brenda Reyes Ayala (@CamtheWicked) of the University of Alberta built a theory of information quality for Web archives that is grounded in human-centered data, or the way that users interpret the quality of an archive. The core facets of information quality are correspondence, relevance, and archivability. Correspondence is the similarity between the original and archived webpages. Relevance is the pertinence of the content of an archived webpage to the original webpage. Archivability is the ability of a webpage to be archived. In her discussion of archivability, Dr. Reyes Ayala cited "The impact of JavaScript on archivability" from Old Dominion University's Web Science and Digital Libraries research group.

Grace Thomas and Meghan Lyon (@aquatic_archive) of the Library of Congress presented their implementation of the theory in their curation and archiving process. They use the information quality theory to improve the quality of the captures and provide reasonable expectations of the usability of the archive. 

Session 8: Lightning Talks

Nicholas Taylor (@nullhandle) from Los Alamos National Laboratory kicked off the session by talking about the evolving use of Internet Archive Wayback Machine (IAWM) evidence in US federal court cases. There are several different legal strategies that can employ Wayback Machine evidence, but it is clear that there is a need to educate the legal community on what IAWM can provide. 

I, Emily Escamilla (@EmilyEscamilla_) from Old Dominion University, explained the increasing prevalence of references to scholarly source code on Git Hosting Platforms (GHPs) and the need for preservation. I found that 1 in 5 articles reference GitHub, which indicates the importance of preserving GHPs.

Sawood Alam (@ibnesayeed) from the Internet Archive presented CDX Summary, a WARC collection summarization tool. CDX Summary allows users to understand what is in a CDX file that indexes a large collection of WARC files by providing a human-readable summary. The summary includes the time spread, MIME types, top hosts, and sample URIs from the collection.
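The core idea is easy to sketch. Below is a minimal, hypothetical Python illustration of summarizing CDX records (this is not CDX Summary's actual implementation, and the field layout shown is a simplified subset of a real CDX index):

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical CDX records (simplified): "surt timestamp url mime status digest".
# Real CDX files carry more fields; this subset is enough to show the idea.
cdx_lines = [
    "org,example)/ 20220101120000 https://example.org/ text/html 200 AAA",
    "org,example)/style.css 20220101120001 https://example.org/style.css text/css 200 BBB",
    "com,sample)/img.png 20220615093000 https://sample.com/img.png image/png 200 CCC",
]

def summarize(lines):
    """Produce a small human-readable summary: time spread, MIME types, top hosts."""
    timestamps = []
    mimes = Counter()
    hosts = Counter()
    for line in lines:
        surt, ts, url, mime, status, digest = line.split()
        timestamps.append(ts)
        mimes[mime] += 1
        hosts[urlsplit(url).hostname] += 1
    return {
        "first": min(timestamps),           # earliest capture timestamp
        "last": max(timestamps),            # latest capture timestamp
        "mime_types": dict(mimes),          # count per MIME type
        "top_hosts": hosts.most_common(3),  # most frequently captured hosts
    }

print(summarize(cdx_lines))
```

The real tool also samples URIs and renders the summary for humans; the sketch above only covers the counting step.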

Himarsha Jayanetta (@HimarshaJ) from Old Dominion University presented her work comparing access patterns of robots and humans in Web archives. She found that analyzing request types, User-Agent headers, requests to robots.txt, browsing speed, and image-to-HTML ratio could accurately detect bots accessing the Web archives.
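As a rough illustration of how features like these might combine into a detector (the thresholds, field names, and scoring below are invented for this sketch and are not Jayanetta's actual model):

```python
def looks_like_bot(session):
    """Score a session of archive requests against simple heuristics.

    `session` is a dict of per-session features; every threshold here is
    illustrative, chosen only to demonstrate the combination of signals.
    """
    score = 0
    ua = session.get("user_agent", "").lower()
    if any(token in ua for token in ("bot", "crawler", "spider")):
        score += 1                                      # self-identified crawler
    if session.get("requested_robots_txt", False):
        score += 1                                      # humans rarely fetch robots.txt
    if session.get("image_to_html_ratio", 1.0) < 0.1:
        score += 1                                      # bots often skip embedded images
    if session.get("requests_per_minute", 0) > 60:
        score += 1                                      # faster than human browsing
    return score >= 2                                   # two or more signals => bot
```

A session from "ExampleBot/1.0" that fetched robots.txt would be flagged, while a normal browser session with typical image loading and speed would not.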

Kritika Garg (@kritika_garg) from Old Dominion University talked about optimizing archival replay by eliminating unnecessary traffic to Web archives. If a resource is missing during replay, the page repeatedly re-requests it from the Web archive server until it receives a 200 HTTP response. In one instance, a page was sending 945 requests per minute. Garg found that caching responses can minimize repeated requests and reduce unnecessary traffic to Web archives.
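The caching idea can be sketched in a few lines of Python (a toy model of the concept, not Garg's implementation):

```python
class MissCache:
    """Remember the status of archived URLs so replay stops re-requesting misses."""

    def __init__(self):
        self._status = {}   # url -> last known HTTP status
        self.fetches = 0    # count of real requests sent to the archive

    def fetch(self, url, archive_lookup):
        if url in self._status:
            return self._status[url]      # served from cache: no archive traffic
        self.fetches += 1
        status = archive_lookup(url)      # real request, e.g. 200 or 404
        self._status[url] = status
        return status

# Toy archive where one embedded resource is missing.
archive = {"https://example.org/": 200}
lookup = lambda url: archive.get(url, 404)

cache = MissCache()
for _ in range(945):                      # page retries the same missing resource
    cache.fetch("https://example.org/missing.js", lookup)

print(cache.fetches)                      # only the first retry reached the archive
```

With the cache in place, the 945 retries in the example collapse into a single request to the archive server.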

Session 9: Serving Researchers with Public Web Archive Datasets in the Cloud

Hosting Web archiving data on public clouds is one way to provide data access to researchers. In this session, three presenters explained how they use the public cloud to host Web archiving data.

Mark Phillips (@vphill) talked about the End of Term Web archiving project that works to capture key government websites during presidential transitions (2008, 2012, 2016, 2020). In order to make the collection data available to researchers, they have implemented an environment based on the Common Crawl data structure. The data is available via AWS in WARC, WAT, WET, CDX index, and Parquet index formats to meet a variety of research needs. The End of Term collections for 2008 and 2012 have been loaded into the environment and can be found at https://eotarchive.org/data/.

Sebastian Nagel (@sebnagel) presented on Common Crawl's experience using the cloud. The overarching goal of Common Crawl is to lower the barrier to "the web as a dataset". The cloud allows archives to not only host data, but also to run the crawler and process data using cloud computing resources. They are analyzing data usage (storage space and data requests) to more efficiently use cloud resources. 

How do we explore, make sense of, and analyze Web archives at scale? Benjamin Lee (@lee_bcg) from the Library of Congress presented their work to create a proof of concept pipeline for multimodal search and discovery. They analyzed 1000 PDFs from .gov websites as a sample dataset. There were three components to their analysis: metadata, textual, and visual. With the results of the analysis, they are working to make the pipeline more beneficial to users and accessible at scale. 

Day Two

Session 10: Researching Web Archives: Access & Tools

Youssef Eldakar and Olga Holownia kicked off Day 2 as they presented their work to republish IIPC Collections in alternative interfaces for researcher access. IIPC curates and archives special Web collections in the Archive-It platform. They are working to republish the data via SolrWayback and LinkGate as alternative access interfaces and transferring the data from Archive-It to Bibliotheca Alexandrina.

Researchers want to look at aggregated sites and the accompanying metadata. Jaime Mears (@JaimeMears) and Chase Dooley from the Library of Congress have been working to create indexes and derivatives that serve researchers' needs. They have been working with researchers to determine what data would be most beneficial and conducting a technical trial with some .gov datasets. From these datasets, they are able to provide researchers with the dataset, documentation, and Jupyter Notebooks.

To wrap up the session, Jennifer Morival and Dorothée Benhamou-Suesser talked about ResPaDon, their project to make archives available for researchers across disciplines. They have conducted user studies to determine how people use, access, and interact with the archives. They also created a remote access point at the University of Lille and provided a dataset as a proof of concept. They are using a combination of SolrWayback to explore the collection and the Archives Unleashed Toolkit to provide statistics and datasets for understanding the collection. Moving forward, they want to install additional access points at other universities to increase the availability of the data.

Session 11: Electronic Literature and Digital Art: Approaches to Documentation and Collecting

Giulia Carla Rossi (@giugimonogatari) and Tegan Pyke from the UK Web Archive talked about their efforts to archive the New Media Writing Prize Collection. The New Media Writing Prize is an annual award that celebrates innovative and interactive digital writing. The collection archives the website and all short-listed works and winners of the award. Archiving these works presents unique challenges due to the complex nature of the works. As a result, a variety of tools are used to capture the collection: W3ACT, Heritrix as the main Web crawler, Conifer with Webrecorder to patch Heritrix crawls, and ArchiveWeb.page with Ruffle to capture formats that need Flash. The collection includes 76 works and 970 total captures. They then analyzed the quality of the captures to ensure that the narrative, atmosphere, and theme of the digital literary works are reflected. Their presentation is available here.

Natalie Kane (@nd_kane) and Stephen McConnachie (@mcnatch) talked about their work to preserve and share born-digital and hybrid objects across the National Collection. As a result of their work, they created a set of case studies on born-digital and hybrid objects, a report outlining recommendations for future research, and a decision model for community investigation. They found that born-digital objects present unique challenges. For example, born-digital objects are commonly multi-part objects that rely on networks and infrastructures. Additionally, authorship and ownership of an object can be unclear and create legal challenges. 

Bostjan Spetic (@igzebedze) and Borut Kumperscak (@kumpri) talked about deep archiving from a museum curation point of view. Deep archiving is a systematic effort to identify websites of special significance that warrant a more thorough approach to archiving. Their goal is not just to capture snapshots (though they are better than nothing), but to preserve the code and user experience as well. They provided an interesting perspective on the value of code by saying that it contains intangible cultural heritage. They went on to explain that code stores human ingenuity on a technical level and encapsulates the design values of its time. Most archivists are primarily focused on capturing user experience, so hearing a focus on capturing the code itself was different.

Session 12: Research Use of the National Web Archives

How are researchers using the holdings of the Web archives? This session contained three presentations by researchers who are using the data in the Web archives. 

Liam Markey (@Liam_Markey94) from the University of Liverpool and British Library presented "Mediating Militarism: Chronicling 100 Years of British Military Victimhood from Print to Digital 1918-2018". He used captures of the Daily Mail, Daily Mirror, and The Times in the UK Web Archive to study the tension between militarism and military victimhood in media through the last 100 years and across print and digital media.

Sara Abdollahi from the University of Hannover presented "Building Event Collections from Web Archives". She explained an environment that allows researchers to collect and analyze large amounts of event-centric information on the Web. Collections are great for stable situations with a limited number of related URLs. However, some events are rapidly developing and are covered across a vast and unknown number of URLs. Given a Web archive, an event knowledge graph, and an event of interest, the environment can return a ranked list of relevant websites covering essential information. Part of the process is expanding the query to cast a broader net. For example, "Arab Spring" is expanded to "Yemeni Revolution, Tunisian Revolution, Algerian Protests, Tunisia, Egypt, Middle East". This expanded query can help create a more complete and evolving collection for researchers.

Márton Németh and Gyula Kalcsó of the National Széchényi Library presented their work to aggregate and visualize WARC files from Hungarian websites related to the war in Ukraine. Their goal is twofold: preserve endangered content that could disappear due to the war and create an animated word cloud that shows how the frequency of the most common words changes over the course of the war. They selected 445 seed URLs and created a public SolrWayback search interface instance to allow for full-text search of the dataset. With the collection established, they used the emtsv toolset to tokenize, filter, aggregate, and sort the contents of the WARC files to create the word cloud.

Session 13: Lightning Talks

Sanaz Baghestani from Alzahra University summarized the access functionalities of 31 archival institutions worldwide. She found that all institutions offer URL search, while alphabetical browsing is the least common access method. Additionally, institutions use a wide range of metadata, so there is an apparent need for a common standard for interoperability.

Vasco Rato from Arquivo.pt presented Arquivo404, a single line of JavaScript code that can be added to a server's 404 page. Arquivo404 searches the holdings of a variety of Web archives to find an archived version of the resource that is no longer available and presents it to the user. This implementation is a win for website owners, users, and Web archives.
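Arquivo404 itself is a JavaScript snippet; as a rough Python illustration of the same lookup idea, one could ask the public Memento TimeTravel aggregator for the closest archived copy of a missing URL (the endpoint is the aggregator's documented API; the JSON shape assumed below follows its documentation, and the helper names are my own):

```python
# Public Memento TimeTravel aggregator API (queries many Web archives at once).
TIMETRAVEL = "http://timetravel.mementoweb.org/api/json"

def memento_lookup_url(missing_url, datetime="20220615"):
    """Build a TimeTravel request for the memento closest to `datetime`."""
    return f"{TIMETRAVEL}/{datetime}/{missing_url}"

def closest_memento(response_json):
    """Pull the closest memento URI out of a TimeTravel JSON response.

    The response carries a "mementos" object whose "closest" entry lists
    one or more archived URIs; a 404 page could redirect the user there.
    """
    return response_json["mementos"]["closest"]["uri"][0]
```

A 404 handler would fetch `memento_lookup_url(request.path)` and, if the aggregator finds a match, offer the `closest_memento` link to the user instead of a dead end.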

Pedro Gomes from Arquivo.pt gave two lightning talks in this session. First, he presented SavePageNow, an on-demand archiving functionality inspired by Internet Archive's Save Page Now. Unlike the Internet Archive, SavePageNow runs on Webrecorder, so archiving a page requires user interaction to create a complete capture. This tool improves the quality of captures and complements existing collections. Second, Gomes presented the work by Arquivo.pt to archive cryptocurrencies. 

Ricardo Basilio (@ricardobasilio_) from Arquivo.pt talked about the online sessions and webinars they hosted during the COVID-19 pandemic. The sessions were successful, with over 680 participants across 23 sessions.

JavaScript is increasingly used in Web pages, with a 4x increase in JavaScript bytes over the last decade. Ayush Goel from the University of Michigan presented a way to reduce the amount of space JavaScript files take up and improve the fidelity of captures. They proposed reducing storage space by discarding JavaScript that serves no purpose on archived pages and improving fidelity by using server-side matching to account for differences in URLs.

Session 15: Rapid Response Collecting: What Are the New Workflows and Challenges?

Tom J. Smyth (@smythbound) from Library and Archives Canada (LAC) opened the session by presenting the methodology they use to archive major historic events, which by nature cannot be planned or anticipated. In 2011, creating a collection for a major event was ad hoc and reactionary. In 2016, LAC began daily harvesting of a broad range of news media, which allows archivists to retroactively add already-archived content to a collection once the importance of an event is realized. The automated collection of news media contains 335 seeds and captures front-page news, supplying ongoing documentation for major events. The collection is 18+ TB and rapidly growing. Alongside the collection for a major event, LAC publishes Web Archival Finding Aids: reports that detail how the scope of the project evolved, define what is in scope and out of scope, and include the metadata of the collection. These reports allow researchers to rapidly decide whether a dataset will answer their research question.

When a collection is massive and resources are limited, how do curators choose what is included? Melissa E. Wetheimer from the Library of Congress explained the archival appraisal rubric she created with and for her colleagues at the Library of Congress to select the seed URLs for the Coronavirus Web Archive. Over 2,200 URLs were nominated for inclusion, but the collection was limited to include 250 new seeds due to resource constraints. Wetheimer explained the criteria that the rubric considered and the resulting archive collection.

Session 16: Lightning Talks

James Kessenides from Yale University Library talked about the Web as hyperspace, a perceptual space distinct from a communication medium.

Kirk Mudle (@kirk_mudle) from New York University presented a comparison of three tools for archiving Discord, an instant messaging and social media platform. He found that DiscordChatExporter CLI was the most flexible and sophisticated of the tools and has the potential to be integrated into a workflow. However, like most social media platforms, Discord servers are dynamic and ephemeral, so snapshots quickly become outdated.

Travis Reid (@TReid803) explained his work to integrate gaming and Web archiving to make archiving entertaining so that it can be enjoyed like a spectator sport. He created an archiving live stream whose results affect the game configuration for an automated gaming live stream. So far, he has used two games: Gun Mayhem 2 and NFL Challenge.

Jessica Dame from the University of North Carolina at Greensboro presented their Triad COVID-19 Archive. The collection was created using Archive-It to archive content from the Piedmont Triad community in North Carolina, including local government and school Web pages. The collection only contains Web pages with COVID-related content, which means that whole websites are not captured.

Session 17: Saving Ukrainian Cultural Heritage Online 

The last session of the conference was about the Saving Ukrainian Cultural Heritage Online (SUCHO) project. The goal of SUCHO is to identify and archive at-risk websites and digital content in Ukrainian cultural heritage institutions during the Russo-Ukrainian war. 

Anna Kijas (@anna_kijas, Tufts University), Sebastian Majstorovic (@storytracer, Austrian Center for Digital Humanities and Cultural Heritage), and Quinn Dombrowski (@quinnanya, Stanford University) gave an overview of the SUCHO project. After initial crowd-sourcing via Twitter, SUCHO launched March 1, 2022 and now includes over 1,300 volunteers. They collaborate via Google Sheets and Slack to distribute work and communicate updates. The collection now includes 40+ TB of data and 4,500 WACZ Web archives. The collection is incredibly varied: it contains archives of public and private institutions, governments and churches, videos and online records, and everything in between. SUCHO experienced an incredibly fast ramp-up by leveraging the time and resources of individuals, but they are increasingly working with institutions to provide longevity that a collection of individuals cannot. Dombrowski compared emergency Web archiving to making Molotov cocktails: it makes for an inspiring story, but getting to that point indicates a failure. As a collective archiving body, we need more proactive policies so that reactive measures become unnecessary.

Ilya Kreymer (@IlyaKreymer, Webrecorder) presented the suite of Webrecorder tools being leveraged by SUCHO. They are using Browsertrix Crawler as the core crawling system. It is accessible via the command line and Docker, can be easily scaled, and produces WACZ files. Due to the heterogeneous nature of the collection, some websites, like museum virtual tours, are more difficult to archive with traditional crawlers. In these cases, volunteers "manually" archive the website using ArchiveWeb.page, which allows for user-directed recording through a Chromium-based browser. All archives were captured in the WACZ format, which is portable and supports random access to individual URLs without downloading the whole file. SUCHO used ReplayWeb.page to replay webpages in an in-browser viewer. They also used Browsertrix Cloud, an integrated UI for managing browser-based crawling, to run crawls, provide replay, and automatically upload the WACZ files to S3.

Dena Strong (@Dena_Strong, University of Illinois), Kim Martin (@antimony27, University of Guelph), and Erica Peaslee (@erica_peaslee, Centurion Solutions) talked about the administrative challenges that faced the project. With an all-volunteer team of 1,500+ people across 20 time zones and multiple languages, no planning time, and no budget, training and processes were both necessary and difficult. One of the biggest takeaways was "don't teach processes, teach signposts". A specific process can quickly become irrelevant in a project of this scale, so it is essential to teach people where to find the most recent information and how to stay up-to-date. They also discussed the metadata they associated with each WACZ file. They included the standard metadata used by the Internet Archive and added five custom fields: source URL, host institution, host location, original subject heading (in Ukrainian), and original description (in Ukrainian).

In hearing about SUCHO, I was consistently amazed at the organization and collaboration that allowed this initiative to be successful at such a large scale. It was also interesting to hear how very common archiving and collaboration tools (Google Sheets, Slack, Webrecorder) were being used in a very uncommon way. 

Conclusion

IIPC WAC was the first Web archiving conference I have attended and it broadened my understanding of the Web archiving field. We heard about a broad spectrum of Web archiving research including creating emergency collections, new features from established Web archives, preserving born-digital art collections, and providing researchers with access to Web archiving data to name a few. I left this conference with a renewed appreciation for the need for Web archiving as well as the diverse applications of Web archiving research.

We have trip reports for some of the prior IIPC Web Archiving Conferences: 2021, 2017, and 2016.
