2026-02-24: The 10th Computational Archival Science (CAS) Workshop Trip Report

IEEE BigData 2025 - The 10th Computational Archival Science (CAS) Workshop Home Page

The 10th Computational Archival Science (CAS) Workshop was part of the 2025 IEEE Big Data Conference (IEEE BigData 2025). It was an online workshop held on Tuesday, December 9, 2025. It drew close to 70 participants and featured a keynote from Dr. Phang Lai Tee, National Archives of Singapore and Chair of the UNESCO Memory of the World Preservation Sub-Committee on Artificial Intelligence, along with 18 papers from 27 institutions in 8 countries spanning 5 continents: Canada, USA (North America) / Brazil (South America) / Scotland, Spain, Switzerland (Europe) / South Africa (Africa) / Korea (Asia).

Michael Kurtz, who passed away on December 17, 2022, launched the CAS initiative in 2016 with Victoria Lemieux, Mark Hedges, Maria Esteva, William Underwood, Mark Conrad, and Richard Marciano.

The 10th CAS workshop was organized by the CAS Workshop Chairs:

Mark Hedges from King’s College London UK
Victoria Lemieux from U. British Columbia CANADA
Richard Marciano from U. Maryland USA

The workshop started with a 10-minute welcome message from the CAS workshop chairs, followed by a 20-minute keynote from Dr. Phang Lai Tee, National Archives of Singapore, who presented "Applications and Challenges for Archives and Documentary Heritage in the Age of AI: Some Reflections". The talk was a timely reflection on how AI is reshaping archival and documentary heritage work, highlighting both opportunities and challenges. It was a strong presentation that emphasized practical challenges such as scale, access, cybersecurity, and regulation.

The workshop itself was divided into six sessions:

1: Blockchain & Archives [2 papers]

A. Blockchain and Responsible AI: Enhancing Transparency, Privacy, and Accountability through Blockchain Hackathon

Authors: Jiho Lee, Jaehyung Jeong, Victoria Lemieux, Tim Weingartner, and JaeSeung Song

PAPER — VIDEO — SLIDES

The presentation described a curriculum initiative in which participants used a blockchain-enabled fair-data ecosystem (Clio-X) in Blockathon to build privacy-preserving AI chatbots for archival datasets. It showed blockchain’s potential to improve transparency and accountability in AI workflows by making all actions traceable on-chain.

B. Cryptographic Provenance and AI-generated Images

Authors: Jessica Bushey, Nicholas Rivard, and Michel Barbeau

PAPER — VIDEO — SLIDES

The presentation highlighted how content credentials and cryptographic provenance frameworks can operationalize archival trustworthiness for born-digital assets and AI-generated images by embedding tamper-evident metadata into assets, which is a highly relevant and timely challenge given the proliferation of synthetic media. It effectively bridges archival theory (authenticity and provenance) with practical systems and discusses how blockchain and content credentials can support verifiable history of digital images, situating the work within computational archival science. Overall, it makes a strong conceptual and methodological contribution to trustworthy preservation of digital content.
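The idea of tamper-evident metadata can be illustrated with a toy example. The sketch below binds a metadata record to an image's exact bytes using a hash and a keyed signature, so that a later change to either one is detectable. It is not the Content Credentials format discussed in the talk: real provenance frameworks rely on certificate-based signatures and a standardized manifest, whereas this sketch swaps in an HMAC with a shared secret. The function names, field names, and key are assumptions made for illustration.

```python
import hashlib
import hmac
import json


def _canonical(claim):
    # Canonical JSON so the same claim always serializes to the same bytes.
    return json.dumps(claim, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_manifest(asset_bytes, metadata, key):
    """Build a manifest that binds `metadata` to the exact bytes of an asset."""
    claim = {
        "asset_sha256": hashlib.sha256(asset_bytes).hexdigest(),
        "metadata": metadata,
    }
    signature = hmac.new(key, _canonical(claim), hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(asset_bytes, manifest, key):
    """Return True only if neither the asset nor the claim has changed."""
    claim = manifest["claim"]
    expected = hmac.new(key, _canonical(claim), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # the claim (metadata or recorded hash) was altered
    return claim["asset_sha256"] == hashlib.sha256(asset_bytes).hexdigest()


if __name__ == "__main__":
    key = b"demo-key"              # hypothetical shared secret
    image = b"fake image bytes"    # stand-in for real image data
    manifest = make_manifest(image, {"generator": "hypothetical-model"}, key)
    print(verify_manifest(image, manifest, key))          # unchanged asset
    print(verify_manifest(image + b"x", manifest, key))   # modified asset
```

A shared-secret HMAC only lets key holders verify a record; public verifiability, which archival use would need, requires asymmetric signatures instead.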

2: Processing Analog Archives [4 papers]

A. Using an Ensemble Approach for Layout Detection and Extraction from Historical Newspapers

Authors: Aditya Jadhav, Bipasha Banerjee, and Jennifer Goyne

PAPER — VIDEO — SLIDES

The presentation focused on layout detection and Optical Character Recognition (OCR) for historical newspapers by proposing a modular, detector-agnostic ensemble pipeline combining OpenCV, Newspaper Navigator, and a fine-tuned TextOnly-PRIMA model to improve segmentation and extraction on variable scans. It’s strong in engineering detail and demonstrates practical improvements over commercial baselines like AWS Textract, especially on degraded material. Overall, it’s a solid methodological contribution with clear application value in large-scale digitization efforts.

B. PARDES: Automatic Generation of Descriptive Terms for Logical Units in Historical Handwritten Collections

Authors: Josepa Raventos-Pajares, Joan Andreu Sanchez, and Enrique Vidal

PAPER — VIDEO — SLIDES

The PARDES project presents a practical and scalable method for automatically generating descriptive terms from noisy handwritten text recognition (HTR) outputs in large historical collections, using probabilistic indexing and Zipf's Law to identify important terms. It’s strong in handling uncertainty in HTR.

C. From Analog Records to Computational Research Data: Building the AI-Ready Lab Notebook

Authors: Joel Pepper, Zach Siapno, Jacob Furst, Fernando Uribe-Romo, David Breen, and Jane Greenberg

PAPER — VIDEO — SLIDES

Similar to the previous presentation, this one addressed transforming analog, handwritten lab notebooks into AI-ready digital data to unlock valuable experimental records for computational analysis. It demonstrated promising performance. Overall, it’s a good step toward making analog scientific records computationally accessible and usable for AI systems.

D. Classification of Paper-based Archival Records Using Neural Networks

Authors: Jussara Teixeira, Juliana Almeida, Tania Gava, Raphael Lugon Campo Dall’Orto, and Jose Márcio Moraes Dorigueto

PAPER — VIDEO — SLIDES

The presentation demonstrates a practical application of supervised machine learning (ML) to classify unprocessed archival records, achieving high accuracy and scalability on a large real-world governmental dataset (Electronic Process System (SEP) of the State of Espirito Santo, Brazil). It effectively shows how a modular ML architecture can be integrated into existing archival systems, and how clustering similar records can reduce manual effort. Overall, it’s a solid empirical case study of ML enhancing a core archival function at scale.

3: Retrieval-augmented Generation [3 papers]

A. Developing a Smart Archival Assistant with Conversational Features and Linguistic Abilities: the Ask_ArchiLab Initiative

Authors: Basma Makhlouf Shabou, Lamia Friha, and Wassila Ramli

PAPER — VIDEO — SLIDES

This talk presented a compelling initiative to modernize archival practice by building a conversational AI assistant that integrates advanced Retrieval Augmented Generation (RAG) and semantic technologies to support fast, contextual, and professional‑level archival queries. It’s strong in conceptualizing how multilingual conversational agents can bridge gaps in access, complex metadata, and diverse user expertise. Overall, it’s an innovative approach with great potential to enhance usability and knowledge discovery in digital archives.

B. Index-aware Knowledge Grounding of Retrieval-Augmented Generation in Conversational Search for Archival Diplomatics

Authors: Qihong Zhou, Binming Li, and Victoria Lemieux

PAPER — VIDEO — SLIDES

This work presents an index‑aware chunking strategy to improve RAG pipelines for conversational search by grounding retrieval on structured index terms extracted from PDFs, aiming to reduce resource demands, accuracy issues, and hallucinations common in standard RAG workflows. It’s a practical contribution that addresses problems with traditional chunking strategies. Overall, it is an interesting methodological refinement with promising implications for archival conversational search but would benefit from broader validation.

C. Retrieval-augmented LLMs for ETD Subject Classification

Authors: Hajra Klair, Fausto German, Amr Ahmed Aboelnaga, Bipasha Banerjee, Hoda Eldardiry, and William A. Ingram

PAPER — VIDEO — SLIDES

This work presents a two‑stage RAG‑based pipeline that uses keyword extraction and guided question generation from Electronic Theses and Dissertations (ETD) abstracts to retrieve and synthesize core document content, tackling the challenge of long, full‑text processing. It addresses the challenge of subject classification at scale for ETD by capturing signatures that go beyond simple lexical similarity to improve classification accuracy and contextual richness. The evaluation shows improvements over traditional approaches. Overall, it’s a promising and well‑structured application of RAG methods to a real-world problem.

4: Archival Theory & Computational Practice [4 papers]

A. Archival Research Theory: Putting Smart Technology to Work for Researchers

Authors: Kenneth Thibodeau, Alex Richmond, and Mario Beauchamp

PAPER — VIDEO — SLIDES

This work extends archival theory beyond traditional archival management to a new Archival Research Theory (ART) framework that models archives as complex informational systems with informative potential responsive to researchers’ questions, grounded in semiotics, Constructed Past Theory, and type theory. It’s conceptually rich, offering a strong theoretical foundation for integrating smart technologies into archival research and emphasizing how meaning and context can be formally modeled to support diverse inquiry. Overall, it makes a thoughtful and potentially foundational contribution to bridging archival theory and computational practice.

B. Systems Thinking, Management Standards, and the Quest for Records and Archives Management Relevance

Author: Shadrack Katuu

PAPER — VIDEO — SLIDES

The presentation makes a case for records and archives management (RAM) within organizations by embedding RAM into widely adopted Management System Standards (MSS) like ISO frameworks, which currently drive visibility and measurable outcomes in areas such as quality and security. It uses systems thinking and standards practice to argue that RAM can gain institutional relevance and leadership buy‑in by aligning with structured MSS processes and the Plan‑Do‑Check‑Act cycle, thereby elevating archival functions beyond marginal roles. Overall, it’s a good management‑focused contribution that highlights the importance of standards and systemic framing for advancing archival relevance.

C. Can GPT-4 Think Computationally about Digital Archival Practices?

Authors: William Underwood and Joan Gage

PAPER — VIDEO — SLIDES

This work investigates whether GPT‑4o demonstrates computational thinking capabilities applied to digital archival tasks, grounding the analysis in a recognized computational thinking taxonomy. It surfaces compelling examples where the model exhibits knowledge across archival processes and computational practices, suggesting its potential as a learning partner or assistant in teaching archival computational methods. Overall, the paper offers a thought‑provoking exploration of LLM capabilities in a computational archival context, with promising avenues for further research.

D. Algorithm Auditing for Reliable AI Authenticity Assessment of Digitized Archival Objects

Author: Daniel F. Fonner

PAPER — VIDEO — SLIDES

This presentation shows how small variations in input image resolution can drastically affect AI‑based art authentication results, highlighting a key vulnerability in applying such models to archival or cultural heritage objects and raising important concerns about reliability and manipulation risk. It makes a strong case that algorithm auditing should be embedded in computational archival science practices to improve transparency, reproducibility, and accountability of automated analyses. Overall, it’s a practical contribution that urges the need for rigorous evaluation frameworks when deploying AI for authenticity and provenance tasks in digital archives.

5: Knowledge Organization & Retrieval [2 papers]

A. Ontologies Applied to Archival Records: a Preliminary Proposal for Information Retrieval

Authors: Thiago Henrique Bragato Barros, Maurício Coelho da Silva, Rafael Rodrigo do Carmo Batista, David Haynes, and Frances Ryan

PAPER — VIDEO — The slides were not posted

This paper presents an ontology‑driven approach to improve information retrieval (IR) over archival descriptions and digital objects by capturing archival contexts such as provenance, functions, agents, and events within a formal semantic model. It grounds its design in established ontology engineering and archival principles to support semantic indexing, reasoning, and query handling. Overall, it makes a decent conceptual contribution toward ontology‑enhanced archival IR.

B. Operationalizing Context: Contextual Integrity, Archival Diplomatics, and Knowledge Graphs

Authors: Jim Suderman, Frédéric Simard, Nicholas Rivard, Iori Khuhro, Erin Gilmore, Michel Barbeau, Darra Hofman, and Mario Beauchamp

PAPER — VIDEO — SLIDES

This paper lays out a context‑driven privacy framework for archival records that combines theories of contextual integrity, archival diplomatics, and knowledge graphs to make privacy‑relevant relationships machine‑legible and support informed decisions about sensitive information at scale. Its strength lies in operationalizing context rather than content alone, using GraphRAG and knowledge graphs to capture nuanced contextual features that traditional vector embeddings miss, thereby offering a richer basis for privacy assessment. Overall, it’s a promising conceptual advancement toward AI‑enabled privacy support in archives.

6: Web Archiving [3 papers]

This session highlights my contributions. The workshop designated two slots for my papers. The first slot was for presenting one of the papers and the second was for summarizing the remaining two, which is why there are three papers but only two videos. The slides for both slots are combined in one file. I want to thank Richard Marciano, Victoria Lemieux, and Mark Hedges for giving me the opportunity to present and for being flexible with the workshop registration, since my work is not funded and we were unable to pay the registration fees.

SLIDES

A. Arabic News Archiving is Catching Up to English: A Quantitative Study

PAPER

In the first paper, I presented a quantitative analysis of web archiving coverage for Arabic versus English news content over a 23‑year period, revealing that while English pages are still archived at a higher rate, Arabic archival coverage has increased significantly in recent years. I showed the heavy dependence on the Internet Archive (IA) for web archiving and that other public web archives contribute very little, exposing a centralization risk where loss of IA would make most archived content inaccessible. This paper is a continuation of previous work "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages".

B. The Gap Continues to Grow Between the Wayback Machine and All Other Web Archives

PAPER

The second paper I presented is a quantitative study showing that the Internet Archive (IA) overwhelmingly dominates public web archiving, preserving 99.74% of archived Arabic and English news pages in the dataset I constructed (1.5 million URLs), while all other web archives combined account for only a tiny fraction. I highlighted the risk this poses: if the IA became unavailable, the vast majority of archived online news would be lost or irretrievable, which underscores a critical vulnerability in web preservation. My analysis offers clear results, but the paper could benefit from a broader discussion of why other web archives are shrinking and what practical strategies could diversify preservation efforts. Overall, it is an important wake‑up call about concentration in web archiving and the fragility of our collective digital memory. This paper is a continuation of previous work "Profiling web archive coverage for top-level domain and content language".
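To make the concentration measure concrete, the sketch below computes each archive's share of archived URLs and the fraction that would be lost if one archive disappeared. The URLs, archive labels, and helper names are placeholders made up for illustration, not data or code from the paper. Because one page can be held by several archives, the shares do not need to sum to 100%.

```python
def archive_shares(holdings):
    """Map each archive to the fraction of archived URLs it holds.

    `holdings` maps a URL to the set of archives that have a copy of it.
    URLs with no copies are ignored, since they are not archived at all.
    """
    archived = [archives for archives in holdings.values() if archives]
    counts = {}
    for archives in archived:
        for name in archives:
            counts[name] = counts.get(name, 0) + 1
    total = len(archived)
    return {name: n / total for name, n in counts.items()} if total else {}


def lost_if_unavailable(holdings, archive):
    """Fraction of archived URLs whose only copy is in `archive`."""
    archived = [archives for archives in holdings.values() if archives]
    if not archived:
        return 0.0
    only_there = sum(1 for archives in archived if archives == {archive})
    return only_there / len(archived)


# Placeholder data: archive names and URLs are illustrative only.
HOLDINGS = {
    "https://example.com/a": {"ia"},
    "https://example.com/b": {"ia", "other"},
    "https://example.com/c": {"ia"},
    "https://example.com/d": {"other"},
    "https://example.com/e": set(),  # never archived
}

if __name__ == "__main__":
    print(archive_shares(HOLDINGS))
    print(lost_if_unavailable(HOLDINGS, "ia"))
```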

C. Collecting and Archiving 1.5 Million Multilingual News Stories’ URIs from Sitemaps

PAPER

The third paper I presented introduced JANA1.5, a large dataset of 1.5 million Arabic and English news story URLs collected from news site sitemaps, and demonstrated an effective sitemap‑based collection method that outperforms alternatives like RSS, X (formerly Twitter), and web scraping. I also discussed ways to reduce noise. I ended by explaining how this dataset will be submitted to the IA.
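As a rough illustration of sitemap-based collection, the sketch below pulls story URLs out of a sitemap document using only Python's standard library. It is not the code used to build JANA1.5. The sample XML, the example.com URLs, and the function name `extract_sitemap_urls` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real run would download this from a news site.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/story-1</loc><lastmod>2025-01-02</lastmod></url>
  <url><loc>https://example.com/news/story-2</loc></url>
  <url><loc> https://example.com/news/story-1 </loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text):
    """Return the unique <loc> URLs of a sitemap, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    seen = set()
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()  # tolerate stray whitespace
        if url and url not in seen:     # drop empty and duplicate entries
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

This sketch handles only a flat `urlset`. Sitemap index files, which point to further sitemaps, would need an extra step that follows each listed sitemap.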

One of the standout aspects of the CAS workshop was its responsiveness and quick turnaround. Reviewers' comments were actionable and came back quickly, decisions were clear, and the entire process moved at a fast pace that made it possible to focus on the work itself rather than waiting on it. The entire process from submission to publishing and presenting the work takes about a month. It’s the kind of efficiency every venue should strive for. Attending the 10th CAS Workshop was great. It underscored issues related to computational archival science including centralization, authenticity, and who gets to be remembered. It was a rewarding experience to present my work at the CAS workshop exploring web archiving’s dependence on the Internet Archive. The discussion highlighted just how vital the Internet Archive is to our digital memory, and it was inspiring to see how their work motivates us all to take action and contribute to preserving our online heritage.

Hussam Hallak


Web Science and Digital Libraries Research Group
