2023-07-26: ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2023 Trip Report

The ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2023 was a hybrid conference, with the in-person event at the Hilton Santa Fe Historic Plaza, New Mexico, and virtual attendees joining via Zoom. The conference took place June 26-30 and was hosted by Los Alamos National Laboratory. JCDL is a major international forum focusing on digital libraries and associated technical, practical, and social issues.

Members of our Web Science and Digital Libraries (WS-DL) research group (current and former) presented nine papers and posters and won three awards at JCDL 2023.
  • Best Student Paper Award: Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives (Lesley Frew, Michael Nelson, and Michele Weigle)
  • Best Short Paper Award: MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries (Muntabir Hasan Choudhury, Lamia Salsabil, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, and Edward A. Fox)
  • Best Poster Award: The Memento Tracer Toolset for Human-Guided Focused Crawling of Dynamic Web (Lyudmila Balakireva, Emily Escamilla, Talya Cooper, Michael L. Nelson, and Michele C. Weigle)
Four Ph.D. students from WS-DL also presented their doctoral work at the Doctoral Consortium (JCDL 2023 Doctoral Consortium Trip Report). Members of WS-DL also presented papers and posters at the Web Archiving and Digital Libraries (WADL) 2023 workshop (trip report), held in conjunction with JCDL 2023. JCDL also featured additional workshops, such as EEKE, and tutorials, such as ARKs.

Conference Venue - Hilton Santa Fe Historic Plaza, New Mexico

Opening and Keynote #1: Oksana Bruy

Day 1 of the JCDL main conference started off with some information about the Santa Fe area before the first keynote speaker.

Oksana Brui is the Director of the Scientific and Technical Library of the National Technical University of Ukraine (Igor Sikorsky Kyiv Polytechnic Institute). She was featured in the Guardian article, “‘Our mission is crucial’: meet the warrior librarians of Ukraine.” Oksana talked about how many libraries have been damaged or destroyed during the Russo-Ukrainian war. Unfortunately, fewer than 1% of cultural documents had been digitized before the war, so the damage to these documents and the libraries that housed them has resulted in a loss of Ukraine’s cultural heritage. Some individual libraries have digitized certain collections, such as the Kyiv Polytechnic Digital Collection, Kyiv Mohyla Academy Digital Collection, and the Yaroslav Wise National Library of Ukraine Digital Library of Culture of Ukraine, but there is no central Ukrainian Digital Library. The challenges Ukraine has faced in digitizing documents include a lack of equipment, storage, software, and staff training. A grassroots campaign, SUCHO (Saving Ukrainian Cultural Heritage Online), was started to address this problem. Ukraine is also actively developing the framework for a national Digital Library and is curating a collection about the war.

Session 1: Digital Libraries

Session 1: Digital Libraries was chaired by Alexander Nwala (@acnwala), an assistant professor of data science at William & Mary and an alumnus of the WS-DL research group.

First up in the Digital Libraries session, Christin Kreutz (@kreutzch) presented “Evaluating Digital Library Search Systems by using Formal Process Modelling”. The motivation for this paper is that existing options for evaluating digital library systems don’t identify discrepancies between the user’s ideal information-seeking behavior and the capabilities of the system. The paper’s concept is to create three models: the user’s ideal strategy, how that strategy would be translated to a digital library system, and what the user actually did. The models are built from oral interviews and observed behavior. For evaluation, 13 users were given two tasks to find two papers. The authors observed that users switched digital library systems mid-task and did not follow their ideal model in practice. Some steps, such as getting help from a person, couldn’t be translated into the models. Other steps that the user wanted to perform couldn’t be translated because the system didn’t have those capabilities. This evaluation concept can be used to find discrepancies between what users want to do and what they can do.

Next up, Satvik Chekuri (@SatvikChekuri) and Bipasha Banerjee (@bipasha_bb) presented “Integrated Digital Library System for Long Documents and their Elements”, which was nominated for the best student paper award. The motivation for this paper is that modern improvements, such as extraction of figures and automatic extraction and classification of chapters, have not been incorporated into digital library systems for electronic theses and dissertations (ETDs). They created a framework that supports diverse user personas, including curators, researchers, and experimenters working on data mining, and that handles new UI requirements. The framework specifically supports ETDs. It is an extensible IR system with an automated workflow that supports all three user personas, with 16 actions available in an API. For evaluation, they collected a dataset of 500k ETDs. Of those, they indexed 57k and further processed 5k with NLP. They plan to evaluate the accuracy of the NLP in a future study.

Liz Woolcott (@LizWoolcott) presented “A Conceptual Framework for the Design of Digital Repository Interfaces Supporting Digital Content Reuse Assessment.” In the white paper “Surveying the landscape,” they identified no prior research on reuse of digital content. Their goal was to identify use cases and develop a toolkit. They conducted six focus groups and identified 85 examples of use and reuse in a literature review. The presentation ended with a slide containing many resources.

The last presentation, for the paper “SMAuC - The Scientific Multi-Authorship Corpus”, was pre-recorded by the authors from the Webis Group (@webis_de). They created a dataset from 3.3 million scientific publications from CORE. The dataset can be used for authorship analysis and is hosted on Zenodo.

Session 2: Scientometrics

In parallel to the Digital Libraries session, the Scientometrics session was held on day 1 of the JCDL conference. This session was chaired by Dr. Michele Weigle from Old Dominion University and the WS-DL research group. This session included one long paper, two short papers, and three late-breaking datasets, and had one best short paper nominee. The session began with Hardik Arora presenting their long paper, “Deciphering the Reviewer's Aspectual Perspective: A Joint Multitask Framework for Aspect and Sentiment Extraction from Scholarly Peer Reviews”. This work introduces a novel multitask deep neural architecture to jointly discover the aspects and associated sentiments from peer review texts.

Next, Tarek Saier from Karlsruhe Institute of Technology, Germany presented their late-breaking dataset, “CoCon: A Data Set on Combined Contextualized Research Artifact Use”. CoCon is a large scholarly dataset reflecting the combined use of research artifacts, contextualized in the full text of academic publications. The second late-breaking work in the Scientometrics session, "CitePrompt: Using Prompts to Identify Citation Intent in Scientific Papers", was presented by Avishek Lahiri from the Indian Association for the Cultivation of Science. This paper presents the CitePrompt tool, which uses the hitherto unexplored approach of prompt learning for citation intent classification. Tarek Saier then presented a second dataset, "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network". The unarXive 2022 dataset is a new version of the unarXive dataset.

Also in the Scientometrics session, Akhil Pandey Akella from Northern Illinois University presented a short paper titled "Laying Foundations to Quantify the Effort of Reproducibility". Next, Muntabir Choudhury, a graduate student from Old Dominion University and a member of WS-DL, presented their short paper "MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries", which proposes MetaEnhance, a framework that uses state-of-the-art artificial intelligence methods to improve the quality of key metadata fields of electronic theses and dissertations (ETDs). This paper won the Best Short Paper Award at JCDL 2023.

Session 3: Web Archiving

The third paper session of the conference, "Web Archiving", started after the lunch break and was chaired by Dr. Shawn M. Jones, a postdoc at the Los Alamos National Laboratory and an alumnus of the WS-DL research group. This session included two long papers, both nominated for best paper awards, one short paper, which was nominated for the best short paper award, and two late-breaking datasets. Each paper presented in this session featured at least one co-author who is affiliated with WS-DL (current members or alumni).

The web archiving session began with Lesley Frew from Old Dominion University presenting their long paper, “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives”. In this paper, they introduce a change-text search engine that allows users to find changes in web pages, which current web archive interfaces do not support. The architecture of this new search engine includes acquisition, indexing, and replay. The animated differences tool allows users to view changes in context. They evaluated the system with the EDGI federal environmental webpages 2016-20 dataset. They verified that some EDGI terms, like “climate,” were among the most commonly deleted terms in the dataset, and found new frequently deleted terms, like “science.” This paper won the Best Student Paper Award at JCDL 2023.
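To make the idea of "commonly deleted terms" concrete, here is a minimal sketch of counting terms that disappear between two captures of the same page. It is not the authors' implementation: the function names `tokenize`, `deleted_terms`, and `most_deleted` are hypothetical, and the sketch assumes each capture has already been reduced to plain text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def deleted_terms(old_text, new_text):
    """Count terms that occur fewer times in new_text than in old_text."""
    old_counts = Counter(tokenize(old_text))
    new_counts = Counter(tokenize(new_text))
    # Counter subtraction keeps only the terms with a positive difference.
    return old_counts - new_counts

def most_deleted(capture_pairs, n=10):
    """Aggregate deletions over (older, newer) text pairs and rank them."""
    total = Counter()
    for old_text, new_text in capture_pairs:
        total.update(deleted_terms(old_text, new_text))
    return total.most_common(n)
```

A real system would also need to index the changes for search and handle page boilerplate; this sketch covers only the term counting step.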

Next, Dr. Michele Weigle from Old Dominion University presented “Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering”. This paper was nominated for the Vannevar Bush Best Paper Award. This work describes how the top-level CNN.com page has used client-side rendering (CSR) and the impact of this rendering model on web archives. Combining mementos collected from conventional (non-browser-based) and browser-based crawlers can result in temporal violations that affect the integrity of the content presented. One example showed a CNN.com front page memento in the Internet Archive with a Memento-Datetime of September 17, 2015 that contained the "Hero" story content from July 29, 2016. Recommendations to mitigate this problem include using browser-based tools on high-profile sites that use CSR and developing a heuristic to detect sites using CSR at crawl time.
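The kind of temporal violation described above can be illustrated with a small check. This is only a sketch under stated assumptions, not the method from the paper: it assumes we already know the Memento-Datetime of the root page and of each embedded resource used during replay, and the function name, the one-day tolerance, and the example URIs are all hypothetical.

```python
from datetime import datetime, timedelta

def find_temporal_violations(root_datetime, embedded, tolerance=timedelta(days=1)):
    """Return (uri, delta) for embedded resources captured after the root page.

    root_datetime: Memento-Datetime of the root HTML page.
    embedded: iterable of (uri, memento_datetime) pairs used during replay.
    tolerance: how much later a resource may be before it is flagged
               (an assumed threshold chosen for illustration).
    """
    violations = []
    for uri, captured in embedded:
        delta = captured - root_datetime
        if delta > tolerance:
            violations.append((uri, delta))
    return violations

# Dates taken from the CNN.com example; the URIs are placeholders.
root = datetime(2015, 9, 17)
resources = [
    ("https://example.com/style.css", datetime(2015, 9, 17, 0, 5)),
    ("https://example.com/hero.json", datetime(2016, 7, 29)),
]
for uri, delta in find_temporal_violations(root, resources):
    print(uri, "is", delta.days, "days newer than the root page")
```

The sketch only flags resources that are newer than the root page, since that is the case in the CNN.com example; flagging large gaps in either direction would be an equally reasonable choice.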

Next, Dr. Sawood Alam from the Internet Archive presented their late-breaking work, "TrendMachine: A Temporal Webpage Resilience Portal". TrendMachine is an interactive webpage resilience portal. It uses a mathematical model to calculate a normalized score that quantifies the temporal resilience of a web page as time-series data, based on the historical observations of the page in web archives. Following that, Mark Phillips gave a late-breaking/dataset presentation titled “End of Term Web Archive Dataset: Longitudinal Web Archive of .GOV and .MIL Domains”.

The final presentation of the Web Archiving session at JCDL 2023 was by Himarsha Jayanetti from Old Dominion University on “Less than 4% of Archived Instagram Account Pages for the Disinformation Dozen are Replayable”. This paper was one of the nominees for the Best Short Paper Award at JCDL 2023. Himarsha highlighted how web archives are an essential resource for studying content posted on banned social media accounts, but most archived Instagram account pages are unfortunately redirected to the login page. Their study revealed that less than 4% of the mementos for the Disinformation Dozen were replayable, and only 1% replayed with all the post images intact. The first author of this paper, Haley Bragg, was an undergraduate who participated in our NSF Research Experience for Undergraduates (REU) program at ODU Computer Science.

Session 4: Information Retrieval

The conference had a session on information retrieval in parallel to the session on web archiving, chaired by Dr. Martin Klein from the Los Alamos National Laboratory. The session included three long paper presentations, of which two had nominations for the best paper awards. The session began with Timo Breuer from TH Köln presenting their study "Bibliometric Data Fusion for Biomedical Information Retrieval," which proposes to improve information retrieval by incorporating metadata such as citations and altmetrics. Next, Souvick Ghosh from San Jose State University presented their paper, "Toward Connecting Speech Acts and Search Actions in Conversational Search Tasks," which uses speech acts in utterances to predict system-level actions in conversational information retrieval. The session's final presentation was "Binding Data Narrations - Corroborating the Plausibility of Scientific Narratives by Open Research Data" by Denis Nagel from TU Braunschweig. Their proposal includes structuring narratives and identifying plausible supporting data for a given scientific claim using the open data repository of the World Health Organization. For example, unlike the widely accepted connection between smoking and lung cancer, the plausibility of scientific narratives linking smoking and tuberculosis is often questioned, so such narratives require strong evidence and argumentation to be credible. Their BiND system computes flexible bindings between scientific narratives and open research data to provide the missing evidence.

Panel 1: Research data without borders

The panel “Research data without borders” focused on the work that has been done by the National Research Data Infrastructure Germany (NFDI). The five panelists represented the following topics: metadata (Brigitte Mathiak), infra (Sonja Schimmler - @sonjas0815), edutrain (Sonja Herres-Pawlis - @HerresLab), Ethical, Legal, and Social Aspects (Oliver Vettermann), and industry (Ulrich Krieger - @uli_krieger).

Poster Session

The poster session started with Minute Madness, where each poster presenter had one minute to summarize their work.

After Minute Madness, conference attendees could walk around to view the posters and talk to the presenters about their work.

Keynote #2: Jessica Polka

Jessica Polka, the Executive Director of ASAPbio, a nonprofit organization led by researchers, delivered the second keynote speech in person. Her presentation was titled “How preprints are changing biomedical publishing” and she offered a diverse range of thought-provoking viewpoints on scholarly communication based on preprints.

During her keynote, she highlighted several key benefits of preprints, including rapid dissemination, rapid feedback, and rapid correction. She emphasized that preprints enable individuals to collaborate and communicate at an earlier stage, giving them the opportunity to receive feedback on their papers, expand their professional network, and welcome new collaborators. She also made a reference to a popular XKCD comic. As avid fans of XKCD ourselves at WS-DL, we couldn't resist including this delightful tidbit in the trip report.

Session 5: Knowledge Graphs and Knowledge Organization

On the second day of the JCDL 2023 main conference, the second keynote was followed by session 5, "Knowledge Graphs and Knowledge Organization". This session was chaired by Dr. Zeyd Boukhers. The session included a short paper presentation and two long paper presentations. The first speaker of this session, Hermann Kroll from TU Braunschweig, Germany, presented their long paper titled "Enriching Simple Keyword Queries for Domain-Aware Narrative Retrieval". This work introduces a method that deduces patterns from keyword searches to bridge the gap between the ease of keyword search and sophisticated narrative retrieval. The next presenter, Christof Bless from Lucerne University of Applied Sciences and Arts, Switzerland, presented their long paper, "SciKGTeX - A LaTeX Package to Semantically Annotate Contributions in Scientific Publications". This paper presents a LaTeX package, Scientific Knowledge Graph TeX (SciKGTeX), that allows authors of scientific publications to mark the main contributions of their work directly in LaTeX source files. The package is being developed as open source and allows the community to build their own domain-specific templates for annotation. This paper was one of the nominees for the Vannevar Bush Best Paper Award. On Wednesday night, the awards were announced and the Vannevar Bush Best Paper Award went to this paper. The final presentation of this session was “Exploring Multiview Interactive Visualization for Chinese Historical Texts Using Knowledge Aggregation” by Shaojian Li from the Renmin University of China.

Session 6: Digital Humanities and Teaching

Following the keynote session, the digital humanities and teaching session was held in parallel to session five, with Dr. Wolf-Tilo Balke from TU Braunschweig chairing the session. It included a diverse set of publications: one long paper, three short papers, and one late-breaking study.

The session began with Nandana Kumara from the University of Waikato presenting "Reading Lists Systems' Pedagogical Features: A Comparative Analysis." In the study, the authors explore the features of existing reading list systems along with their perceived value. They find that existing reading list systems do not fully meet academic expectations and identify possible improvements based on their observations. Next was a short paper by Caitlin Burge from the University of Luxembourg, titled "A King's Counsel: A Network(ed) Approach; Digitizing the Privy Council Registers of Henry VIII." The study analyzes digitized historical records to identify details on power and influence among historical figures, highlighting the significance of digitized records over digitalized records.

The third presentation was the late-breaking study, "Yes but.. Can ChatGPT Identify Entities in Historical Documents?" by Carlos-Emiliano González-Gallardo from the University of La Rochelle. The study explores zero-shot named entity recognition and classification by large language models in historical documents and identifies several shortcomings, such as the inaccessibility of historical archives for training these models. Next was a short paper titled "FastCat Catalogues: Interactive Entity-based Exploratory Analysis of Archival Documents" by Pavlos Fafalios from ICS-FORTH, Greece. The study proposes an application that helps researchers of maritime history search archival documents with a wide variety of features, such as entity browsing. The session's final presentation was the short paper "MINE - A Text Analysis Service for Digital Humanities Scientists" by Triet Ho Anh Doan from GWDG, Göttingen. The project aims to address problems with text analysis and accessibility in large-scale data by simplifying the process and offering it as a service.

Panel 2: AI and Public Archives

The third day of the conference began with a panel discussion titled “AI and Public Archives: Collaborative Leadership for Responsible Adoption”. The four panelists were William Ingram from Virginia Tech, Rebecca Dikow from Smithsonian Institution Data Science Lab, Abigail Potter from Library of Congress, and Jill Reilly from U.S. National Archives and Records Administration (NARA).

Jill Reilly talked about using artificial intelligence tools to automate previously manual work for digital access (public access to government records). Rebecca Dikow talked about how off-the-shelf AI models misidentify historical objects. One example from the Smithsonian was the automated tagging of shackles used in the movie "Roots" as a necklace with 91% probability. Another example was tagging entries in a botanical archive gathered by Mary Vaux Walcott as being contributed by her husband, Charles Walcott, because the entry was listed as "Mrs. Charles Walcott", which was a common way that married women identified themselves during this time period. Emphasizing the significance of an AI values statement at the Smithsonian, she highlighted the institution's role as a trusted source. Their aim is to guarantee that the integration of new technology does not undermine public trust. Abigail Potter discussed the role of values as a fundamental component of an AI strategy at the Library of Congress. She introduced a framework and various tools for effective AI planning.

After their short presentations, the panelists engaged in discussions and a Q&A session with the audience. 

The discussions and Q&A resulted in several noteworthy points and suggestions, including but not limited to the following:

  • Explicitly inform users when captions are generated automatically, thereby encouraging feedback for improvement.
  • Leverage general-purpose crowdsourcing to facilitate corrections and gather public input on various aspects.
  • Involve students who are studying practical applications in testing out models.
  • Promote transparency by sharing models and datasets with users and researchers.
  • Create a centralized registry where institutions can contribute their unique developments, allowing for cross-institutional learning and innovation. Notable platforms such as Hugging Face Hub and AI4LAM were mentioned as valuable resources for accessing and sharing models and datasets.

Out of the panelists, only William Ingram attended the event in person. The remaining panelists participated virtually and expressed their gratitude to William Ingram for his excellent handling of the panel.

Session 7: AI/ML/Entity Extraction

The final paper session, on Artificial Intelligence, Machine Learning, and Entity Extraction, was held on the third day of the main conference. Dr. Sawood Alam from the Internet Archive chaired the session, which comprised six publications. The first paper was the late-breaking study/dataset, "Mining the History Sections of Wikipedia Articles," by Wolfgang Kircheis from Leipzig University. The authors propose a dataset comprising science and technology Wikipedia articles with their extracted history sections. Next was the long paper "Efficient Ultrafine Typing of Named Entities" by Alejandro Sierra-Múnera from Hasso-Plattner-Institut. The study addresses the complexities in ultrafine named entity recognition, such as the requirement of a large training dataset or the costly operation of comparing against all entity types.

The third presentation was the paper "Mining Semantic Relations in Data References to Understand the Roles of Research Data in Academic Literature," by Lizhou Fan from the University of Michigan. The study presented a workflow for identifying the relationships between publications, studies, and authors. The next presentation was the short paper "Extreme Classification for Answer Type Prediction in Question Answering" by Vinay Setty from the University of Stavanger. The author proposes to improve answer type prediction by incorporating transformer models when predicting the top-k knowledge graph types, producing state-of-the-art results.

The session's final presentation was the late-breaking study, "Zero-shot Entailment of Leaderboards for Empirical AI Research," by Salomon Kabongo Kabenamualu from Leibniz University of Hannover. In the study, the authors investigate the generalizability of state-of-the-art models in identifying entailment, the directional relation between two text fragments.

Keynote #3: Sarah Lamdan

Professor Sarah Lamdan from the City University of New York School of Law delivered the final keynote, titled “Data Cartels and the Future of Digital Information Access”. The talk focused mainly on publishers' transformation into data analytics companies and its impact on the roles of library professionals as they adapt to digital platforms and products. She used LexisNexis as an example, where records containing overlapping data points are assigned a unique identifier called LexID, which is not derived from personally identifiable information. These identity profiles (LexIDs) are continuously enriched by adding new records. She discussed how this information is being sold to various entities such as governments, tenant screening companies, healthcare systems, and insurance companies. She discussed some incidents where incorrect data from LexisNexis had real-world implications, causing harm to the public. She also highlighted the notion that data analytics has become pervasive across all industries (with a few companies dominating all the informational markets), with publishers simply adapting to this prevailing trend. For instance, she cited the exposure of personal data through IoT devices, the emergence of technological trends like smart thermometers and smart clothing, and how contemporary cars gather extensive data about drivers.

For more information on this topic, interested readers can explore Sarah Lamdan's insightful book titled “Data Cartels: The Companies That Control and Monopolize Our Information”.

Conference Dinner and Awards

On the night of Wednesday, June 28, the conference dinner was held at La Fonda on the Plaza, Santa Fe. Following the dinner, the best paper and poster awards were announced.

Best Poster Award

The Best Poster Award is given to the best poster presented at the JCDL conference. All posters presented are eligible, and JCDL attendees vote to select the winner. The award went to “The Memento Tracer Toolset for Human-Guided Focused Crawling of Dynamic Web” by Lyudmila Balakireva, Emily Escamilla, Talya Cooper, Michael L. Nelson, and Michele C. Weigle. Himarsha R. Jayanetti presented this poster on behalf of the authors.

Best Short Paper Award

The Best Short Paper Award at JCDL 2023 went to “MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries” by Muntabir Hasan Choudhury, Lamia Salsabil, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, and Edward A. Fox.

Best Student Paper Award

This award is given to the best paper presented at JCDL that has a student as the first author. All full papers with a student as the first author that are accepted for presentation are eligible. The full paper “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives” by Lesley Frew, Michael Nelson, and Michele Weigle from WS-DL won the Best Student Paper Award.

Vannevar Bush Best Paper Award

The Vannevar Bush Best Paper Award has been given since 1998 to the best paper presented at JCDL (and earlier ACM DLs). All full papers that are accepted for presentation are eligible, and the JCDL Steering Committee selects the winner. This year, the award went to “SciKGTeX - A LaTeX Package to Semantically Annotate Contributions in Scientific Publications” by Christof Bless, Ildar Baimuratov, and Oliver Karras.


For all of us, it was our first time attending the JCDL conference in person. We were thrilled to present our research before a live audience and had the privilege of interacting with numerous academic experts from across the globe. The JCDL 2023 conference held in Santa Fe, New Mexico, provided us with a fantastic experience filled with enjoyment of local cuisine and appreciation of Pueblo-Spanish architecture. Representing the WS-DL research group at this event was truly an honor.

--Yasasi (@Yasasi_Abey), Bhanuka (@mahanama94), Lesley (@lesley_elis), and Himarsha (@HimarshaJ)