2023-05-24: IIPC Web Archiving Conference (WAC) Trip Report

This year's International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) took place in Hilversum, The Netherlands, at The Netherlands Institute for Sound & Vision. It was the first in-person event since 2019 and the 20th anniversary of IIPC! The program offered between two and three tracks for attendees to choose from, so this trip report summarizes the sessions I was able to attend. For more information on the other sessions, check out the full conference schedule and the official hashtag (#IIPCWAC23).

Day One

To kick off Day One, Eppo van Nispen (@eppovannispen) from The Netherlands Institute for Sound & Vision gave the opening remarks.

Keynote

Eliot Higgins (@EliotHiggins), the founder of Bellingcat (@Bellingcat), gave the opening keynote. He presented the work Bellingcat is doing to fight disinformation on social media with open-source investigation and the help of over 6,000 trained volunteers.

Session #1: Research & Access

Samantha Fritz (@SamVFritz) from Archives Unleashed (@unleasharchives) presented "Through the ARCHWay: Opportunities to Support Access, Exploration, and Engagement with Web Archives". Archives are a largely untapped resource due to the complexity of archival data and the lack of tools available, so the Archives Unleashed Project is working to bridge the gap between researchers and the data available in the archives. 

Leontien Talboom (@makethecatwise) and Mark Simon Haydn presented "Research-Ready Collections: Challenges and Opportunities in Making Web Archive Material Accessible", their work with the Archive of Tomorrow Project. The project worked to curate a collection of 10k targets relating to health in the UK Web Archive (@UKWebArchive) and to explore ethical collection from the Web and responsible republishing. Legal limitations remain a significant barrier, but the project was able to achieve an increase from 1% to 8% of archived sites being publicly accessible!

Jennifer Morival (University of Lille), Sara Aubry (@saraaubry, BnF), and Dorothée Benhamou-Suesser (BnF) presented "Developing New Academic Uses of Web Archives Collections: Challenges and Lessons Learned from the Experimental Service Deployed at the University of Lille During the ResPaDon Project". BnF has worked with the University of Lille to allow full access to BnF holdings at the University of Lille. They also shared their experiences and lessons learned in helping researchers leverage the BnF's holdings through tools, datasets, and trained mediators.

Session #3: Panel: Supporting Digital Scholarship

For Session 3, Sarah Potvin (Texas A&M), Talya Cooper (@talya_cooper, New York University), and Emily Escamilla (@EmilyEscamilla_, Old Dominion University) presented "Institutional Web Archiving Initiatives to Support Digital Scholarship", a panel moderated by one of their collaborators, Martin Klein (@mart1nkle1n, Los Alamos National Lab). They talked about the need for archiving scholarly software hosted on the Web and what Texas A&M and NYU are doing to address the problem in their institutions. With the help of the CoSAI project, NYU is developing a workflow to archive scholarly source code developed by NYU scholars. Texas A&M is teaching graduate students about Web archives and how they can ensure that the URIs in their theses or dissertations are archived and the content is preserved.

Session #6: Social Media & Playback: Collaborative Approaches

Katrien Weyns (@KatrienWeyns) from meemoo (Flemish Institute for Archives) and Ellen Van Keer from KADOC kicked off Session 6 with their presentation "Archiving Social Media in Flemish Cultural or Private Archive, (How) Is It Possible". 

It is no secret that archiving social media presents unique and complex challenges. Zefi Kavvadia (@ZKavvadia), Katrien Weyns (@KatrienWeyns), Mirjam Schaap (@mrjmschaap), and Sophie Ham (@Sophies_posts) presented "Searching for a Little Help from My Friends: Reporting on the Efforts to Create an (Inter)national Distributed Collaborative Social Media Archiving Structure". They called for better collaboration between archives, institutions, and nations to tackle the complex challenges of archiving social media and the need for an improved legal policy to facilitate archiving social media as cultural heritage. They presented the results of a survey they conducted to gauge interest and challenges from potential collaborators. 

Clare Stanton (@clare__stanton) from Harvard's Library Innovation Lab (@harvardlil) and Perma.cc (@permacc) presented "Collaborating on the Cutting Edge: Client Side Playback". They created WACZ-Exhibitor, a wrapper for Webrecorder's replay tool that shifts the burden of upkeep to the browser and away from the institution's servers. Clare presented the process of creating a working prototype for the #MeToo Project with Schlesinger Library and creating tools to make the process easy for others to replicate.

Session #7: Collaborations & Outreach

Ricardo Basílio (@ricardobasilio_) from ROSSIO presented "Linking Web Archiving with Arts and Humanities: The Collaboration Between ROSSIO and Arquivo.pt". Together, they created an arts and humanities archive that's available on the live Web.

Inge Rudomino (@IngeRudomino) from the Croatian Web Archive presented "Building Collaborative Collections: Experience of the Croatian Web Archive". They are working with other libraries, researchers, and the public to curate archives of local online history. They hosted a "HAWathon" to promote the crowdsourcing project and citizen science.

Youssef Eldakar from Bibliotheca Alexandrina (@bibalexOfficial) presented "Your Software Development Internship in Web Archiving". He discussed their internship program and the ingredients that make it successful: intern, mentor, mini-project. Internships give interns real-world experience, and host institutions are able to make extra progress.

Session #10: Lightning & Drop-In Talks

We closed out Day One with six lightning and drop-in talks.

Day Two

Workshop #4

To start Day Two, I attended the "Browser-Based Crawling for All: Getting Started with Browsertrix Cloud" workshop hosted by Andy Jackson (@anjacks0n), Anders Klindt Myrvoll (@AndersKlindt), and Ilya Kreymer (@IlyaKreymer). They introduced Browsertrix Cloud, an integrated Web archiving system, and demoed the process of setting up and running a crawl. The UI allows users to create, watch, and manage crawls in real time. One of the coolest features was the ability to dynamically add exclusions. The user could enter a regular expression they wanted to exclude from the crawl, and the URIs currently in the queue that matched it were highlighted. This allows users to fix crawler traps in real time without having to stop or cancel the crawl. Additionally, Browsertrix Cloud can use credentials, which allows it to work behind paywalls.
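To make the exclusion idea concrete, here is a minimal Python sketch of previewing which queued URIs a proposed regular expression would remove. It is not Browsertrix Cloud's implementation; the function name, the example.org URIs, and the calendar-style crawler trap are all hypothetical and chosen only for illustration.

```python
import re


def matching_queue_entries(queue, exclusion_pattern):
    """Return the queued URIs that a proposed exclusion regex would match."""
    pattern = re.compile(exclusion_pattern)
    # search() matches anywhere in the URI, not only at the start
    return [uri for uri in queue if pattern.search(uri)]


# Hypothetical crawl queue containing a calendar-style crawler trap
queue = [
    "https://example.org/about",
    "https://example.org/calendar?month=2023-05",
    "https://example.org/calendar?month=2023-06",
    "https://example.org/news/1",
]

# URIs that would be highlighted (and dropped) if this exclusion were applied
trapped = matching_queue_entries(queue, r"/calendar\?month=")

# What would be left in the queue afterwards
remaining = [uri for uri in queue if uri not in trapped]
```

Previewing the matches before applying the exclusion is what makes it safe to adjust a running crawl: the user can see whether the pattern is too broad before any URIs are dropped.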

Session #12: Domain Crawls

Martin Klein (@mart1nkle1n) from Los Alamos National Lab (and ODU WSDL alum) presented "Laboratory Not Found? Analyzing LANL's Web Domain Crawl". This presentation was related to their previous work with LANL's institutional Web domain.

Session #13: Crawling, Playback, Sustainability

Ilya Kreymer (@IlyaKreymer) and Tessa Walsh (@bitarchivist) from Webrecorder (@webrecorder_io) had two presentations in Session 13. First, they presented "Developer Update for Browsertrix Crawler and Browsertrix Cloud". For Browsertrix Crawler, a Docker image to run a single browser-based crawl, they have implemented more consistent logging in addition to more robust status codes that reflect page completeness within the logs. For Browsertrix Cloud, an integrated crawl management service that uses Browsertrix Crawler, they are working to support collection curation and replay.

Second, they presented "Sustaining pywb through Community Engagement and Renewal: Recent Roadmapping and Development as a Case Study in Open Source Web Archiving Tool Sustainability". With limitations on time and resources, they have been roadmapping and evaluating future directions for pywb and inviting input from users. What features do users use most? What features are users looking for? Are others willing to contribute and in what ways? In this presentation, they presented the results of their survey and invited additional input via their online form.

Matteo Cargnelutti (@macargnelutti) from Perma.cc (@permacc) presented "Opportunities and Challenges of Client-Side Playback", a more technical overview of the project described by colleague Clare Stanton in Session 6. Client-side replay does not simplify the complexity of replay, but it moves the complexity from one end of the Web (server) to the other end (browser). He explained the security challenges of using iframes and the patches they have implemented in WACZ-Exhibitor, a tool that allows safe 2-way communication between the embedded archive and the embedding page. Matteo also described some of the other tools in the toolkit Perma.cc has been developing.

Lastly, Ayush Goel (@goelayu_sh) from the University of Michigan presented "Addressing the Adverse Impacts of JavaScript on Web Archives". JavaScript execution results in different renderings of the same Web page through various sources of non-determinism including browser, OS, screen dimensions, and current time. He argued that it does not make sense to remove all non-determinism and presented JavaScript Aware Web Archiving (JAWA) as a solution. JAWA removes non-determinism selectively, eliminating it only when it influences which resources are fetched.

Session #15: Data Considerations

To start Session 15, Emily Escamilla (@EmilyEscamilla_) from Old Dominion University's Web Science and Digital Libraries research group (@WebSciDL) presented "What if GitHub Disappeared Tomorrow?". Access to the original software used in a research experiment is crucial to reproducibility, a cornerstone of scientific research. Archived copies of software can be found in Zenodo, Software Heritage, and Internet Archive. She presented different ways to access software repositories archived in each of the digital libraries. However, if GitHub disappeared tomorrow, at least 15,000 scholarly repositories would be lost forever. 
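As one illustration of checking whether a repository page has an archived copy, the Python sketch below builds a query for the Internet Archive's Wayback Machine Availability API and reads the closest snapshot out of a response. It is not the method from the presentation; the helper names and the sample repository URI are hypothetical, the sample response is hand-written rather than real data, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint of the Wayback Machine Availability API
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_query(uri, timestamp=None):
    """Build the query URL asking whether `uri` has an archived snapshot."""
    params = {"url": uri}
    if timestamp:
        # Optional YYYYMMDD[hhmmss] hint: ask for the capture closest to it
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed JSON response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical repository URI, used only for illustration
query = availability_query("https://github.com/example/repo", "20230524")

# Hand-written sample in the shape the API returns (not real data)
sample_response = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20230101000000",
            "url": "http://web.archive.org/web/20230101000000/https://github.com/example/repo",
        }
    }
}
snapshot = closest_snapshot(sample_response)
```

Fetching `query` with any HTTP client and passing the parsed JSON to `closest_snapshot` would complete the lookup; an empty `archived_snapshots` object means no capture was found.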

Eld Zierau (@EldZierau) from the Royal Danish Library presented "Web Archives and FAIR Data: Exploring the Challenges for Research Data Management (RDM)", an overview of the WARCnet project. They presented the results of their semi-structured interviews on the Research Data Management (RDM) practices of those who engage in the Web Archiving Lifecycle (WAL). They specifically focused on FAIR principles (findable, accessible, interoperable, and reusable). 

Mark Phillips (@vphill) from the University of North Texas presented "Lessons Learned in Hosting the End of Term Web Archive in the Cloud". The End of Term Web Archive (@eotarchive) aims to document the transition in the Executive Branch of the United States by archiving federal government Web pages before and after each election cycle. They have captured the 2008, 2012, 2016, and 2020 transitions with the help of multiple institutions, including the University of North Texas and the Internet Archive. They recently moved the collections to Amazon S3 to allow for greater access and computational consumption of the collection.

Session #16: Preservation and Complex Digital Publications

Michael Kurzmeier (@mkrzmr) from University College Cork presented "Preservability and Preservation of Digital Scholarly Editions". He found that there is no universal solution to archiving Digital Scholarly Editions (DSEs), but existing approaches like Web archiving can be used for some purposes. Web archives are de facto important providers of DSE preservation. 

Ian Cooke (@IanCooke13) and Giulia Carla Rossi (@giuliacrossi) from the British Library presented "Collecting and Presenting Complex Digital Publications". Complex digital publications are publications that are born-digital and are typically multi-modal with hardware, software, and/or operating system dependencies. In the collection, they are working to represent the diversity of publishing within the UK. They presented some of the challenges associated with such an undertaking including access to non-browser-based material, developing a rights and re-use framework for contextual information, and discovering and linking related materials.

Next, Daniel Steinmeier and Susanne van den Eijkel (@SvandenEijkel) from KB Nationale Bibliotheek presented "What Can Web Archiving History Tell Us about Preservation Risks?" File format obsolescence is a problem for Web archiving. In migrating to a new format, archivists typically agree on significant properties with the producer. However, it can be difficult to identify significant properties when there is no clear producer and no way to know the original intent. They concluded by saying that, while obsolescence is a problem, completeness should be a preservation priority more urgent than solving obsolescence.  

Keynote

To close out IIPC WAC 2023, Marleen Stikker (@marleenstikker) from Waag Futurelab presented her keynote "Public Values in the Digital Domain". She talked about the history of the Internet and the impact of capitalism and large companies on the Internet as a public commons. She left the audience with lots to ponder regarding how we interact with the Internet and how it is or is not governed.

Conclusion

IIPC WAC 2023 was the first time I was able to attend IIPC WAC in person, and I had the opportunity to present both as part of a panel and in an individual presentation. I came away with a better understanding of the tools being developed in the Web archiving community and with some ideas on how to leverage them in the research I am doing. I also learned more about the limitations and challenges faced by various cultural heritage and archival institutions as well as the solutions they have implemented. Last year, IIPC WAC 2022 deepened my appreciation for the need for Web archiving, and this year's conference, IIPC WAC 2023, grew my understanding of the innovative solutions our community is developing and left me excited to investigate them further on my own.

We have trip reports for some of the prior IIPC Web Archiving Conferences and IIPC General Assemblies: 2022, 2021, 2017, 2016, 2015, 2014, 2012, 2011.

Other IIPC WAC 2023 Blog Posts:

Emily Escamilla 