2025-01-06: 9th Computational Archival Science Workshop Trip Report

The 9th Computational Archival Science Workshop took place in Washington, DC in December 2024.

 

The 9th Computational Archival Science Workshop, a part of the IEEE Big Data conference, took place on December 17, 2024 in Washington, DC. The hybrid workshop featured publications from students and professors at information science departments, computer science departments, libraries, and business departments. The topics all focused on integrating artificial intelligence with archives, and ethics featured prominently in the presentations and discussions. The workshop included 18 papers from 21 institutions.

Session 1: Trends in Computational Archival Science

To start the workshop, Jennifer Proctor presented her work, "A Computational Review of the Literature of Computational Archival Science (CAS): Advancing Archival Theory in the Age of the Digital Tsunami and the Vanishing Box Problem." She analyzed all of the previous Computational Archival Science workshop publications to identify core artificial intelligence topics, social justice topics, and types of records. She created a Computational Review Dashboard in Tableau to showcase her findings. She also focused on the Digital Tsunami (exponential rate at which computational archival data is being generated) and the Vanishing Box (comparative lack of structure of digital archives compared to physical ones).

Session 2: Exploring and Using Archives

Lesley Frew presents "Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach." Photo Credit: Bipasha Banerjee.


In this session, I presented our work, "Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach" by Jhon Botello (@jhon_gbm12), myself, Dr. Jose Padilla of Storymodelers, and Dr. Michele C. Weigle. We used LLMs to detect semantic changes in the EDGI US Federal Environmental Webpages 2016-2020 dataset. Specifically, we combined WARC-GPT with prompt engineering to detect semantically meaningful changes in web archives, which can help people perform these kinds of analyses more quickly.

Next, Lori Perine (@ProfPerine) presented her work, "Historic Black Lives Matter: Recovering Hidden Knowledge in Archives Through Interactive Data Visualization." She used Tamara Munzner's data visualization framework to create an effective dashboard for manumissions at the Maryland State Archives. The viz framework helped her to encode effective attributes in her dashboard, such as using color to represent counties.

Session 3: AI for Archival Functions

In this session, Giulia Osti (@semanticnoodles) remotely presented her work, "Collaborating for Change? Assessing Metadata Inclusivity in Digital Collections with Large Language Models (LLMs)." She focused on the Robert Langmuir collection, which is known to have problematic human metadata. She investigated using LLMs to detect metadata issues, including incomplete descriptions, harmful and antiquated text, and the accessibility of the language, using the Inclusive Metadata Toolkit framework.

Next, Joel Pepper of Drexel CCI presented his work, "AI-Ready Data: Knowledge Extraction from Archival Lab Notebooks." He used OCR to extract text and tables from handwritten chemistry lab notebooks.

Bipasha Banerjee (@bipasha_bb), Assistant Professor and AI Research Scientist at Virginia Tech University Libraries, presented her work, "Automating Chapter-Level Classification for Electronic Theses and Dissertations." She compared the effectiveness of fine-tuning various LLMs to perform classification of ETD topics. She found that fine-tuning outperformed pretrained models on high-level classification, but models struggled with multi-class classification because the classes are not standardized, which hindered evaluation.

Session 4: Computer Vision

Gregory Jansen presented his work, "Sifting U. S. Census Records with Computer Vision and Machine Learning." This fascinating approach attempted to use computer vision to identify non-white US households in Sacramento, California, in order to enable further research on Japanese Americans after World War II. The census pages are handwritten and scanned, so in addition to the usual handwriting detection challenges, the pages are not always scanned at a perfect angle, and not all pages follow the same layout. He used the MNIST character database, but had to go back and add underlines through all of the Ws to match the census extraction formatting. This is a great case study of the use of computer vision in archives.

Session 5: Ethical Considerations

In the last session, Jason Clark (the domestic attendee who traveled the farthest, from Montana) presented his work, "A Tool for Responsible AI Implementation in Computational Archival Science." This was absolutely the star of the workshop, generating at least 30 minutes of discussion. He led us through the work that he and his research-practitioner team (researchers working with librarians) have done on this multi-year grant from IMLS. This paper in particular walked through the Ethical Reflection Aid for Responsible AI in Libraries and Archives and how it was applied to real scenarios with stakeholders. The framework that results from this grant will be essential to everyone doing any kind of work with artificial intelligence in computational archival science.

Conclusions

The interdisciplinary nature of this workshop provided a well-rounded perspective on archives, including how people are using them and how people want to use them. I also appreciated the focus on ethics. Many of the attendees have also published at JCDL, and the workshop featured presentations relevant to much ongoing WSDL research. Previously, WSDL published at this workshop in 2020 (Modeling Updates of Scholarly Webpages Using Archived Data by Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, and Michael L. Nelson). Next year's conference is likely to be held in Asia, but the workshop organizers are committed to a hybrid format so that everyone can participate.

-Lesley
