2023-07-07: IIPC Web Archiving Conference (WAC) Online Day Trip Report


This year’s International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) marked the 20th anniversary of IIPC (#IIPC20Years, #IIPCWAC23). The conference was held in two parts: an online day and an on-site conference. The online day came first, on May 3rd, 2023, and had eight sessions. The presentations were pre-recorded and made available in advance for the attendees to watch, so during the online day presenters talked briefly and answered questions from the audience. These presentations will be available to the public in July through the IIPC YouTube channel.

In this blog post, we summarize and reflect on the online day. WS-DL has covered IIPC WAC in trip reports over the years: 2023 (onsite), 2022, 2021, 2017, 2016, 2015, 2014, 2012, 2011. UK Web Archive also reflected on this year’s IIPC WAC in two blog posts (1, 2). IIPC also posted reflections on IIPC WAC 2023.



Session 1: Reconstructions In National Domains: History, Collections & Corpora (Videos)


Susanne van den Eijkel and Sophie Ham from KB, National Library of the Netherlands, chaired Session 1. This session consisted of three research works discussing web archive collections. 


Bas Vercruysse from Ghent University presented “Uncovering the (paper) traces of the early Belgian web.” The early Belgian internet emerged in 1988 with the introduction of the .be domain. However, documentation about its early stages is limited. This study aims to complete the history of the early European web by focusing on the Belgian context. Available records include domain name lists, archived websites, and organizational archives. Combining these resources with interviews of key actors involved in developing the Belgian internet allows for reconstructing its history and understanding its dynamics.



Christian Cote from the University of Lyon and Alexandre Faye from the French National Library (BNF) presented “The Lifranum research project: Building a Collection on French-speaking Literature.” The Lifranum research project aims to create a thematic web archive incorporating advanced search features and automatic style analysis. Researchers and librarians collaborated to define the needs of the collection and develop methods for constructing the corpora and conducting crawls. The presentation discussed challenges encountered and experiences gained during the selection and crawl processes. Topics covered include text indexing, building thematic corpora using the Hyphe tool, managing quantity and quality on blogging platforms, and documenting data choices.



Sharon Healy from Maynooth University, Juan-José Boté from Universitat de Barcelona, and Helena Byrne from British Library presented “Developing a Reborn Digital Archival Edition as an Approach for the Collection, Organisation, and Analysis of Web Archive Sources.” This presentation explored the use of a Reborn Digital Archival Edition (RDAE) for collecting, organizing, and analyzing reborn digital materials accessible through web archives. The case study focused on the press statements of Irish politician Michael D. Higgins from 2002-2011. They used the NLI Web Archive and the Wayback Machine to find and collect traces of these statements, and Zotero for data collection, organization, and analysis, including text extraction and capturing screenshots. Omeka served as the platform for presenting the curated collection, offering search and discovery functions. DROID facilitated data organization for long-term preservation, and the Open Science Framework was used for sharing derivative materials and datasets.




Session 2: Barriers To Web Archiving In Latin America (Videos)


Eilidh MacGlone from the National Library of Scotland chaired Session 2. Alan Colin-Arce and Rosario Rogel-Salazar from Universidad Autónoma del Estado de México and Sylvia Fernández-Quintanilla from the University of Texas at San Antonio presented their work titled “Web Archiving en español: Barriers to Accessing and Using Web Archives in Latin America.” Web archiving is popular in Global North countries but faces barriers in Spanish-speaking Latin American countries. They showed that most of the collections in the region are either inactive or inaccessible. Lack of awareness among librarians and archivists and the high costs of web archiving services are major limitations. They also talked about linguistic barriers: most web archiving software is in English, which hinders multilingual collections. This unequal access risks widening the digital divide, in which the web of the Global North is well preserved while Latin America relies on foreign institutions to preserve its web content. Raising awareness through workshops and translating documentation can help overcome these barriers.




Session 3: Researching Web Archives (Videos)


Ben Els from the National Library of Luxembourg chaired Session 3, consisting of two projects.


Tim Ribaric and Sam Langdon from Brock University presented the All Our Yesterdays Tool Kit (AOY-TK) for exploring web archives in Google Colab. AOY-TK is built on the derivative output of the ARCH tool developed as a part of Archive-It. It helps analyze web archives with Google Drive integration and text analysis tools. Their demo showed how to generate text derivatives from WARC (Web ARChive) files, analyze the derivative text file, and perform topic modeling. The notebooks to perform all these processes are available online.


 


WS-DL alumnus Mat Kelly from Drexel University presented “Using Web Archives to Model Academic Migration and Identify Brain Drain.” This project examines academic mobility and brain drain from Historically Black Colleges and Universities (HBCUs) in the US. Web archives are used to extract past faculty information, and MemGator was used to collect mementos of HBCU sites. By analyzing changes in HBCUs' faculty over time, the project seeks to model academic migration and measure the extent of brain drain. The presentation discussed the challenges of efficient extraction, ethical dilemmas, and data quality and outlined the next steps in identifying brain drain.
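To make the memento-collection step a little more concrete, here is a minimal Python sketch of how a TimeMap in the link format (the format defined by the Memento protocol, which aggregators such as MemGator can return) might be parsed and summarized per year. This is only an illustration and not the project's actual code: the sample TimeMap, the archive.example and example.edu URLs, and the function names parse_timemap and mementos_per_year are all made up for this example.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# Hypothetical TimeMap in link format; the URLs are placeholders, not real captures.
SAMPLE_TIMEMAP = '''<http://example.edu/faculty>; rel="original",
<http://archive.example/web/20100115000000/http://example.edu/faculty>; rel="first memento"; datetime="Fri, 15 Jan 2010 00:00:00 GMT",
<http://archive.example/web/20100820120000/http://example.edu/faculty>; rel="memento"; datetime="Fri, 20 Aug 2010 12:00:00 GMT",
<http://archive.example/web/20150301093000/http://example.edu/faculty>; rel="last memento"; datetime="Sun, 01 Mar 2015 09:30:00 GMT"
'''

# One entry: <URI> followed by zero or more ;name="value" attributes.
ENTRY_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a sorted list of (datetime, memento URI) pairs from a link-format TimeMap."""
    mementos = []
    for uri, attr_text in ENTRY_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        # Keep only entries whose rel value includes "memento" and that carry a datetime.
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


def mementos_per_year(mementos):
    """Count how many mementos fall in each calendar year."""
    return Counter(dt.year for dt, _ in mementos)
```

A per-year count like this is one simple way to see which periods of a faculty page are well covered by the archives before attempting any extraction.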




Session 4: Sampling The Historical Web & Temporal Resilience Of Web Pages (Videos)


Laura Wrubel from Stanford University chaired Session 4. Sawood Alam and I presented the outcomes of our “Not Your Parents’ Web: Scope, Segmentation, Stability, Resilience, and Persistence” project. The common belief that web pages last for 40 to 100 days is outdated due to the evolving nature of the web. Our project aims to understand the longevity and resilience of web pages. 

 

I, Kritika Garg, a Ph.D. student from WS-DL, Old Dominion University, presented our work on “Lessons Learned From the Longitudinal Sampling of a Large Web Archive.” We collected 27.3 million URLs and 3.8 billion archived pages from the Internet Archive spanning 26 years (1996-2021). We used various sampling strategies based on time, MIME types, URL depth, and TLD. We filtered URLs to focus on HTML pages and adjusted our sampling to account for fewer URLs archived in the early years. While collecting the dataset, we also addressed the over-representation of popular domains in web archives and ensured fairness in domain and temporal representations. We conveyed the lessons learned from sampling the archived web, which could inform other studies that sample from web archives.
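As a rough illustration of the general idea of sampling per year while capping popular domains, here is a small Python sketch. It is a simplified stand-in and not the sampling procedure used in the study: the function name sample_urls, the per_year and max_per_domain parameters, the use of the URL hostname as the "domain", and the demo records are all assumptions made for this example.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_urls(records, per_year, max_per_domain, seed=0):
    """Sample up to per_year URLs for each year, with at most
    max_per_domain URLs from any single hostname within a year.

    records is an iterable of (url, year) pairs; returns a list of (url, year).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Stratify by year so sparsely archived early years are not swamped by later ones.
    by_year = defaultdict(list)
    for url, year in records:
        by_year[year].append(url)

    sample = []
    for year in sorted(by_year):
        urls = by_year[year]
        rng.shuffle(urls)
        domain_counts = defaultdict(int)
        picked = 0
        for url in urls:
            if picked == per_year:
                break
            # Assumption: the hostname stands in for the domain.
            host = urlsplit(url).hostname or ""
            if domain_counts[host] >= max_per_domain:
                continue  # this host already has its share for the year
            domain_counts[host] += 1
            sample.append((url, year))
            picked += 1
    return sample


# Hypothetical input: one dominant host in 2020, very few URLs in 1996.
DEMO_RECORDS = (
    [(f"http://popular.example/page{i}", 2020) for i in range(50)]
    + [(f"http://small{i}.example/", 2020) for i in range(5)]
    + [(f"http://early.example/{i}", 1996) for i in range(3)]
    + [("http://other.example/", 1996)]
)
```

Calling sample_urls(DEMO_RECORDS, per_year=4, max_per_domain=2) keeps at most two URLs from popular.example in 2020, even though that host accounts for most of the input for that year.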


 


Sawood Alam from the Internet Archive, also a WS-DL alumnus, demonstrated “TrendMachine: Temporal Resilience of Web Pages.” TrendMachine is an open-source interactive tool based on a mathematical model that uses the mementos of a page to calculate a normalized score, measuring its resilience over time. This model has various applications, including identifying points of interest, detecting dead links, and analyzing sections of large websites. The code and demo are available online.




Session 5: Preserving Social Media & Video Games (Videos)


Sawood Alam from the Internet Archive chaired Session 5, which consisted of two projects discussing the preservation of Social Media & Video Games.

 

Kirk Mudle from New York University and the Museum of Modern Art (MoMA) presented “A Gift to Another Age: Evaluating Virtual Machines for the Preservation of Video Games at MoMA.” This project focuses on using virtual machines to preserve video games, using the classic adventure game Myst (1993) as a sample record. They evaluated three virtualization options (SheepShaver, QEMU, and EaaSI) for the Mac OS 9 operating system. They documented the native performance of Myst on a PowerMac G4 and compared it to its performance under virtualization. A fully configured virtual machine was then created and tested in various computing environments. The project highlights the risks and challenges associated with using virtual machines for the long-term preservation of computer and software-based art.


Magdalena Sjödahl and Stefan Jacobson from Arkiwera presented “Experiences from archiving information from social media.” Social media has revolutionized public dialogue, news dissemination, and direct communication. However, it poses challenges for archival institutions and governmental organizations. To address this, a consultancy firm specializing in digital preservation developed Arkiwera, a system for preserving social media posts. Their presentation introduced the Swedish archival context and discussed the choices made from archival, regulatory, and ethical perspectives in creating Arkiwera.



Session 6: Collaborative Web Archiving (Videos)


Lauren Ko from the University of North Texas (UNT) chaired Session 6, consisting of two collaborative projects.

 

Quinn Dombrowski from Stanford University presented “Empowering Bibliographers to Build Collections: The Browsertrix Cloud Pilot at Stanford Libraries.” Subject-area librarians have expanded their responsibilities to include digital materials, but web archiving has not received the same attention. Stanford Libraries aimed to empower librarians by providing access to web archiving tools such as Webrecorder’s Browsertrix Cloud through a pilot program. The goals were to understand the librarians' engagement, challenges, and resource needs and to shape the strategic direction of web archiving at Stanford. The talk discussed the pilot design, outcomes, and perspectives of two participating librarians.


Anna Kijas from Tufts University, Quinn Dombrowski from Stanford University, and Andreas Segerberg from the University of Gothenburg presented “What Next? An update from SUCHO.” SUCHO, an international volunteer initiative, archived Ukrainian cultural heritage websites after Russia's invasion. Over 1,500 volunteers participated, creating a collection of over 5,000 websites and 50 TB of data. The focus was on digital repatriation, holding the data until Ukraine's cultural heritage sector could rebuild. The archiving phase used various methods, but the challenge now lies in curation. In this talk, they discussed the next steps for SUCHO, including reuniting archives with metadata, extracting data for rebuilding websites, and presenting the accomplishments of the volunteer community.



Session 7: Legal & Ethical Considerations (Videos)


Tom Smyth from Library and Archives Canada chaired Session 7, which discussed legal & ethical considerations in web archiving.


Di Yoong and Filipa Calado from CUNY Graduate Center and Corey Clawson from Rutgers University presented “Querying Queer Web Archives.” This work explores queer identities and community interactions in web spaces, using web archival records from queer online spaces. They discussed the search methods and ethical considerations, including privacy and anonymity. As queer spaces have evolved, they examined the shifting concept of anonymity and the ethical use of collected sites. 


Nicholas Taylor from Los Alamos National Laboratory Research Library presented “Beyond the Affidavit: Towards Better Standards for Web Archive Evidence.” The Internet Archive's standard affidavit is reliably used in litigation to authenticate Wayback Machine evidence. However, the legal community's understanding of web archives is limited, which can have consequences when the authenticity of IA's records is conflated with the authenticity of historical web pages. Taylor argued that the web archiving community should advocate for institution-agnostic standards for evaluating web archive authenticity. Existing frameworks like judicial precedents and commercial archiving companies provide some guidance but fall short. It is necessary to develop a more comprehensive set of criteria to ensure the trustworthiness of web archives for legal purposes.



Session 8: Browser-Based Crawling For All: The Story So Far (Videos)


Meghan Lyon from the Library of Congress chaired Session 8. This session provided an update on the development of Webrecorder's crawling tools, Browsertrix Crawler and Browsertrix Cloud, and the experiences of IIPC members using the tools.


Anders Klindt Myrvoll from Royal Danish Library presented “Browser-Based Crawling For All: The Story So Far.” He demonstrated the Browsertrix Cloud interface by crawling and replaying URI-Rs and by changing workflows and organization settings. The next steps include implementing automated quality assurance (QA), making final improvements to the crawling API, and understanding how well the system scales.


Sholto Duncan from the National Library of New Zealand shared the first user experience. They tested Browsertrix and were able to capture a range of websites that they had not been able to harvest using Heritrix, WCT, and Archive-It. However, they had issues capturing content with lazy loading, dynamic link loading, and images encoded with intrinsic width values.


Lauren Ko from UNT shared the second user experience. UNT Libraries now hosts Browsertrix Cloud to reduce the time spent on browser-based crawling. It expects its user-friendly interface to facilitate usage by staff in web archiving courses and collaborative projects with external contributors.


Jasmine Mulliken from Stanford University Press (SUP) shared the third user experience. SUP supports web-based digital scholarship that is challenging to archive. Scholars publishing non-traditional research struggle to prove longevity to tenure committees. SUP partners with Webrecorder and Browsertrix Cloud to create web-archived versions of their complex scholarly projects, ensuring their preservation in the scholarly record. These archived publications demonstrate the value of web archiving for innovative scholarly content. Jasmine Mulliken has also covered this session in her blog.



Andreas Predikaka and Antares Reich from the Austrian National Library (A) shared the last user experience. To improve crawl quality, they tested Browsertrix on websites that Heritrix had failed to crawl and found it effective. They also integrated Browsertrix into their workflow using its API.



Conclusion


This year was my third time presenting at IIPC WAC. It was a wonderful learning experience, as always. The Q&A sessions provided valuable feedback for my research. I also gained insights into the challenges and limitations currently being tackled by the researchers. Looking at the innovative initiatives and ongoing web archiving research at IIPC has deepened my understanding of the field. I am also looking forward to presenting at next year’s IIPC WAC.

 


- Kritika Garg (@kritika_garg)
