2022-07-25: ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022 Trip Report

This year, the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2022) was held at Art'otel in Cologne, Germany from June 20-24, 2022. It was held in a hybrid format, with participants attending either in person at Art'otel or virtually via Zoom. Current and former members of our Web Science and Digital Libraries (WSDL) research group presented five papers at JCDL 2022.

Invited Paper - D-Lib Magazine pioneered Web-based Scholarly Communication (Michael Nelson and Herbert Van de Sompel)
Investigating Bloom Filters for Web Archives Holdings (Martin Klein et al.) (WSDL alumni)
StreamingHub: Interactive Stream Analysis Workflows (Yasith Jayawardana et al.)
Visual Descriptor Extraction from Patent Figure Captions: A Case Study of Data Efficiency Between BiLSTM and Transformer (Xin Wei et al.)
Memento Validator: A toolset for Memento compliance testing (Bhanuka Mahanama et al.)

Members of WSDL also presented papers and posters at the Web Archiving and Digital Libraries (WADL) 2022 workshop, which was held in conjunction with JCDL 2022 (WADL 2022 Trip Report).

Day 1: 2022-06-20

The first day of the conference was dedicated to two tutorial sessions and the Doctoral Consortium. The two tutorial sessions ran in parallel.

Tutorial Sessions

The tutorial "OpenRefine to Wikibase" was conducted by Lucia Sohmen and Lozana Rossenova from TIB Hannover, Germany. The tutorial "Building Digital Library Collections with Greenstone 3" was conducted by David Bainbridge from University of Waikato, New Zealand.

Doctoral Consortium

The Doctoral Consortium took place in person and via Zoom, with five PhD students presenting their research ideas.

The first presentation was by Sameh Frihat from University of Duisburg-Essen, Germany titled “Context-Sensitive, Personalized Search at the Point of Care”. He talked about his work, which aims to develop a context-sensitive and personalized medical search engine for medical practitioners, focusing on medical doctors and researchers, by integrating the users' interests and knowledge levels into the retrieval process. The search engine currently covers medical research articles and clinical trials. They hope to add electronic health records to the document corpus in the future while keeping ethics and privacy concerns in mind.

The second presentation was titled “Integration of models for linked data in cultural heritage and contributions to the FAIR principles” by Inês Koch from University of Porto, Portugal. The main objective of her work is to promote the access and reuse of structured data originating from heritage institutions. She proposes to carry out a study that includes both existing data models for cultural heritage and the models that emerged with the web. The research builds upon the EPISA project.

The third presenter at the Doctoral Consortium was Bipasha Banerjee from Virginia Tech, USA presenting through Zoom on the topic “Opening Scholarly Documents Through Text Analysis”. She uses a collection of around 300,000 born-digital ETDs for her research. The aim of her research is to provide comprehensive metadata in the form of chapter labels to help readers understand the topic being discussed in the chapter and chapter-level summaries to help readers find the specific sections in the ETD without having to read the entire document. For this purpose, she uses a custom ETD-oriented language model to better understand the vocabulary in the corpus.

After a short coffee break, Luiz Barboza from CESAR School, Brazil presented his research on “The Effect of Data Science Teaching for Non-STEM Students”. The technological world has changed, bringing with it extensive processing power and modern programming languages such as Python and R. Because data science is a multidisciplinary field, students from non-STEM backgrounds face technical barriers when learning it. As a solution, they propose a Data Science program for such students. Through this, they aim to support students' development and prepare them for the future job market.

The final presentation was by Yuerong Hu from University of Illinois, USA on “Synthesizing Digital Libraries and Digital Humanities Perspectives for Illuminating Under-investigated Complexities associated with User-generated Book Reviews”. This research examines how to combine Digital Humanities and Digital Libraries perspectives to illuminate the under-examined intricacies of user-generated book reviews. It also explores how to improve the usability and interpretability of such reviews. To empirically study the complexity of user-generated book reviews, they conducted case studies using data from two social reading and networking platforms, Goodreads and Douban.

Each presentation was followed by a Q&A session in which participants and speakers discussed the research that had been outlined. Students also had a chance to ask the experts in the audience for advice and feedback on any challenges they were experiencing in their research.

Day 2: 2022-06-21

The second day marked the beginning of the main conference. The day consisted of four paper sessions, one keynote, and an invited talk.

Minute Madness

The day started off with minute madness videos of the conference presentations.

Paper Session 1

The minute madness video session was followed by the first paper session of the day, "Natural Language Processing." This session was chaired by Martin Klein (WSDL alumni) from Los Alamos National Laboratory, USA.

The session began with Timo Spinde from University of Konstanz, Germany presenting their paper "A Domain-adaptive Pre-training Approach for Language Bias Detection in News."
Next, Sotaro Takeshita from University of Mannheim, Germany presented their paper "X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents".
Next, Allard Oelen from TIB Hannover, Germany presented their paper "TinyGenius: Intertwining Natural Language Processing with Microtask Crowdsourcing for Scholarly Knowledge Graph Creation".
The session ended with Azeddine Bouabdallah from University of Koblenz-Landau, Germany presenting their paper "Vision and Natural Language for Metadata Extraction from PDF Documents: A Multimodal Approach".

Paper Session 2

Next, the second paper session of the day "Information Retrieval and Access" started after a brief coffee break. This session was chaired by Norbert Fuhr from University of Duisburg-Essen, Germany.

The session began with Malte Ostendorff from University of Konstanz, Germany presenting their paper "Specialized Document Embeddings for Aspect-based Similarity of Research Papers".
Next, Dwaipayan Roy from GESIS, Germany presented their paper "Studying Retrievability of Publications and Datasets in an Integrated Retrieval System."
The session ended with Hermann Kroll from TU Braunschweig, Germany presenting their paper "What a Publication Tells You - Benefits of Narrative Information Access in Digital Libraries."

Keynote 1

The first keynote of the conference, "Human-information Behavior and Interaction: Envisioning a New Paradigm Shift" was delivered by Dr. Dania Bilal from University of Tennessee, USA. Dr. Bilal discussed the current status of human-information behavior and interaction and presented her thoughts on how these interactions are changing. She drew attention to how web search engines have become part of the process of obtaining information. As a result of recent advancements in artificial intelligence (AI), many innovative system interfaces, such as voice assistants and different kinds of robots, have been introduced in various library and information contexts to improve user experience (UX). She raised two questions about this change: What part do information specialists, system designers, governments, and industry play in promoting and assisting it? What kinds of procedures and guidelines are necessary to introduce novel and successful user-AI system interactions? She also discussed why and how libraries must remain vital institutions during the paradigm shift.

Paper Session 3

The keynote was followed by the third paper session of the day, "Search and Recommendation". This session was chaired by Wolf-Tilo Balke from TU Braunschweig, Germany.

The session began with Yunqi Li from Rutgers University, USA presenting their paper "Causal Factorization Machine for Robust Recommendation". She discussed how causal feature selection in Factorization Machines (FMs) can be used to enhance the robustness of recommendation. An FM predicts users' preferences on items based on their feature vectors. They created a personalized causal feature selection method for FMs and emphasized that the causal features selected for recommendation should be personalized to satisfy users' different preferences. They also conducted experiments showing that their method enhances the robustness of recommendations and improves recommendation accuracy under the non-i.i.d. setting.
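To make the FM idea concrete, here is a minimal sketch of how a standard second-order factorization machine scores one feature vector. It is not the authors' causal method and does not include their feature selection step; the function name `fm_predict` and all parameter values are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def fm_predict(x, w0, w, v):
    """Score one feature vector with a second-order factorization machine.

    x  : list of feature values (user and item features concatenated)
    w0 : global bias
    w  : one linear weight per feature
    v  : one latent vector (list of floats) per feature
    All names and shapes here are illustrative assumptions.
    """
    # Linear part: bias plus weighted sum of the features.
    linear = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))

    # Pairwise part: each feature pair interacts through the dot
    # product of the two features' latent vectors.
    pairwise = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        pairwise += dot * x[i] * x[j]

    return linear + pairwise


# Toy example with three features and two latent dimensions.
example_score = fm_predict(
    x=[1.0, 0.0, 1.0],
    w0=0.1,
    w=[0.2, 0.3, 0.4],
    v=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
print(example_score)
```

In this framing, a feature selection method would decide which entries of `x` are allowed to contribute to the score; how the paper does that per user is not shown here.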

Next, Sumanta Kashyapi from University of New Hampshire, USA presented their paper "Query-specific Subtopic Clustering". He described their new method, the Query-Specific Siamese Similarity Metric (QS3M), which is used for query-specific clustering of text documents. Given a query and a set of documents, their subtopic clustering model produces better query-specific subtopic clusters than previous methods like Sentence-BERT, TF-IDF, and topic models. Their approach also generalizes to unseen queries and different domains.

The session ended with Sourav Saha from Indian Statistical Institute, India presenting their paper "On Modifying Evaluation Measures to Deal with Ties in Ranked Lists." He discussed the tie-aware version of Hit@k that they proposed, named Tie-aware Hit@k (ta-Hit@k). Hit@k is an evaluation metric that can be used to evaluate recommender systems and question answering systems. They also presented an alternative derivation of a tie-aware formula for Reciprocal Rank (RR), named Tie-aware RR (ta-RR).
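One common way to make a rank-based metric tie-aware is to average it over every possible ordering of the tied items. The sketch below applies that idea to Hit@k and RR for a single relevant item. It is an assumption for illustration and may differ from the paper's ta-Hit@k and ta-RR definitions; the function names are hypothetical.

```python
def expected_hit_at_k(scores, relevant_index, k):
    """Expected Hit@k when tied items are ordered uniformly at random."""
    target = scores[relevant_index]
    higher = sum(1 for s in scores if s > target)
    tied = sum(1 for s in scores if s == target)  # includes the relevant item
    # The relevant item is equally likely to land on any rank in
    # higher+1 .. higher+tied; count how many of those ranks are <= k.
    ranks_inside_top_k = min(max(k - higher, 0), tied)
    return ranks_inside_top_k / tied


def expected_reciprocal_rank(scores, relevant_index):
    """Expected reciprocal rank under the same random tie-breaking."""
    target = scores[relevant_index]
    higher = sum(1 for s in scores if s > target)
    tied = sum(1 for s in scores if s == target)
    return sum(1.0 / (higher + i) for i in range(1, tied + 1)) / tied


# Item 2 is the relevant one; it ties with items 1 and 3 for ranks 2-4.
scores = [0.9, 0.7, 0.7, 0.7, 0.2]
print(expected_hit_at_k(scores, relevant_index=2, k=2))
print(expected_reciprocal_rank(scores, relevant_index=2))
```

When there are no ties, both functions reduce to the ordinary Hit@k and RR, which is the property a tie-aware variant is expected to preserve.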

Paper Session 4

Following another coffee break, the fourth paper session of the day "Web Archives" started. This session was chaired by Thomas Risse (@risse691) from Goethe University Frankfurt, Germany.

The session began with Helge Holzmann (@helgeho) from Internet Archive presenting their paper "ABCDEF - The 6 key features behind scalable, multi-tenant web archive processing with ARCH". He discussed ABCDEF (Archive, Big data, Concurrent, Distributed, Efficient, and Flexible), six principles that guide the design and development of a system that processes web archive data. ARCH (Archives Research Compute Hub), Sparkling, and their Web Archive Datasets were also discussed during this presentation. ARCH is a cloud-based system that was designed to meet all six principles of ABCDEF. It builds on past work by the Internet Archive (Archive-It) and the Archives Unleashed Project (Archives Unleashed Cloud). The Sparkling Data Processing Library is a multi-purpose generic toolkit that is designed for web archive processing and can work with temporal web data. They have published their Web Archive Datasets, which currently consist of three collections: Early Web Datasets, Friendster Datasets, and GeoCities Datasets.

Next, Martin Klein from Los Alamos National Laboratory, USA (WSDL alumni) presented their paper "Investigating Bloom Filters for Web Archives Holdings." They tackled the problem that most archival holdings are largely unknown to the public (and to other web archives) because archives do not share CDX files for privacy reasons. He discussed how Bloom Filters can be used for discovering archived resources and sharing entire archival holdings. A Bloom Filter is a probabilistic data structure for testing set membership: it reports that an element is either definitely not in the set or possibly in it. For the Bloom Filter, they used a database of hashed URIs. The advantages of using Bloom Filters are that lookup is faster and that index sharing does not require publishing plain-text URIs. He mentioned that this approach is most likely suitable for smaller archives, individual collections, and live lookup of URLs during a distributed crawl of a topic collection.
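The idea of sharing holdings without exposing plain-text URIs can be sketched with a small Bloom filter. This is a minimal illustration under assumed parameters, not the implementation from the paper; the class name, the SHA-256 double-hashing scheme, and the example URIs are all hypothetical choices.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter over URI strings (illustrative only)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
        ln2 = math.log(2)
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(false_positive_rate) / ln2**2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, uri):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(uri.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, uri):
        for pos in self._positions(uri):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, uri):
        # True means "possibly archived"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(uri)
        )


# An archive adds the URIs it holds, then shares only the bit array.
archived = ["https://example.com/", "https://example.org/about"]
holdings = BloomFilter(expected_items=1000)
for uri in archived:
    holdings.add(uri)

print("https://example.com/" in holdings)  # True: added URIs are always found
```

A lookup only needs the bit array and the hashing scheme, so the list of URIs itself never has to be published. The cost is a tunable false positive rate: a "possibly archived" answer still has to be confirmed with the archive.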

The session ended with Yasith Jayawardana (@yasithdev) from Old Dominion University, USA (WSDL) presenting their paper "StreamingHub: Interactive Stream Analysis Workflows." He discussed StreamingHub, a framework for building metadata-propagating, interactive stream analysis workflows using visual programming. This framework was created to address the problems of data/code reuse and reproducible analyses. They also proposed a metadata format and two platform heuristics. The metadata format, named Data Description System (DDS), was created to enable data reuse and is used to collectively describe datasets, streams, and analytics. The two platform heuristics are Fluidity (F) and Growth Factor (GF). Fluidity is a heuristic for evaluating computational bottlenecks in a transformation. Growth Factor is a heuristic for evaluating the change in data volume through a transformation. They conducted two case studies to show how StreamingHub simplified the research process by allowing users to build reproducible experiments that generate verifiable results. In the case studies, their platform heuristics helped to make workload distribution and chaining decisions.

Invited Talk 1

Next, Michael Nelson (@phonedude_mln) from Old Dominion University, USA (WSDL) and Herbert Van de Sompel (@hvdsomp) from Data Archiving and Networked Services (DANS), Netherlands presented “D-Lib Magazine pioneered Web-based Scholarly Communication”. They discussed the innovations pioneered by D-Lib Magazine. D-Lib Magazine was an experiment in electronic publishing that had neither peer review nor an editorial board, yet it was technically innovative and frequently cited in peer-reviewed literature. The innovations that D-Lib helped pioneer are Open Access (OA), HTML-only publication, persistent identifiers and stable URLs, metadata discovery, rapid publication, and community engagement. Some of the experiments mentioned during the talk were the use of Screencams, MPEG animations, and JavaScript (to inject annotations on links) in the articles.

Day 3: 2022-06-22

The third day of the conference consisted of four paper sessions, the dataset and demos session, and an invited talk. The day started off with a replay of the minute madness videos.

Paper Session 5

The minute madness session was followed by the first paper session of the day "Biblio/Alt-Metrics". This session was chaired by Hermann Kroll from TU Braunschweig, Germany.

The session began with Prajjwal Bhattarai from New York University Abu Dhabi, United Arab Emirates presenting their paper "Open-source Code Repository Attributes Predict Impact of Computer Science Research."
Next, Yusra Shakeel from Otto-von-Guericke-University, Magdeburg, Germany presented their paper "Altmetrics and Citation Counts: An Empirical Analysis of the Computer Science Domain."
The session ended with Masaya Ohagi from University of Tokyo, Japan presenting their paper "Pre-trained Transformer-Based Citation Context-Aware Citation Network Embeddings."

Paper Session 6

Next, the second paper session of the day "Information Extraction" started after a brief coffee break. This session was chaired by Thomas Mandl from University of Hildesheim, Germany.

The session began with Philipp Scharpf from University of Konstanz, Germany presenting their paper "Mining Mathematical Documents for Question Answering via Unsupervised Formula Labeling."
The session ended with Xin Wei from Old Dominion University, USA (WSDL) presenting their paper "Visual Descriptor Extraction from Patent Figure Captions: A Case Study of Data Efficiency Between BiLSTM and Transformer."

Paper Session 7

Next, the third paper session of the day "Search and Recommendation II" started. This session was chaired by J. Stephen Downie from University of Illinois, USA.

The session began with Qian Wu from Nanyang Technological University, Singapore presenting their paper "Asking for Help in Community Question-Answering: The Goal-Framing Effect of Question Expression on Response Networks."
The session ended with Corinna Breitinger from University of Konstanz, Germany and University of Göttingen, Germany presenting their paper "Recommending Research Papers to Chemists: A Specialized Interface for Chemical Entity Exploration."

Datasets and Demos Session

This session was followed by lunch, and subsequently the datasets and demos session. This session had a total of 11 datasets and demos.

The session began with Yinlin Chen from Virginia Tech, USA presenting "DevOps Practices in Digital Library Development."
Next, Lozana Rossenova from TIB Hannover, Germany presented "Collaborative annotation and semantic enrichment of 3D media: A FOSS toolchain."
Next, Shaurya Rohatgi from Pennsylvania State University, USA presented "S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications."
Next, Ming Jiang from University of Illinois, USA presented "A Prototype Gutenberg-HathiTrust Sentence-level Parallel Corpus for OCR Error Analysis: Pilot Investigations."
Kamal Kaushik from Indian Institute of Technology Patna, India presented "HedgePeer: A Dataset for Uncertainty Detection in Peer Reviews."
After a small break, Christin Katharina Kreutz from Trier University, Germany presented "SchenQL: A query language for bibliographic data with aggregations and domain-specific functions."
Next, Thanasis Vergoulis from IMSI, Greece presented "BIP! Scholar: A Service to Facilitate Fair Researcher Assessment."
Next, Janete Saldanha Bach from GESIS, Germany presented "The hurdles of current data citation practices and the adding-value of providing PIDs below study level."
Next, Agnieszka Mykowiecka from Polish Academy of Sciences, Poland presented "TermoPL – A Tool for Extracting and Clustering Domain Related Terms."
Next, Matthew Yang from University of Waterloo, Canada presented "Integration of Text and Geospatial Search for Hydrographic Datasets Using the Lucene Search Library."
The session ended with Bhanuka Mahanama from Old Dominion University, USA (WSDL) presenting "Memento Validator: A toolset for Memento compliance testing."

Paper Session 8

Following another coffee break, the fourth paper session for the day "User Behavior" started. This session was chaired by Thomas Mandl from University of Hildesheim, Germany.

The session began with Jiqun Liu from University of Oklahoma, USA presenting their paper "Leveraging User Interaction Signals and Task State Information in Adaptively Optimizing Usefulness-Oriented Search Sessions."
Next, Allen Riddell from Indiana University, USA presented their paper "Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models."
Next, Orland Hoeber from University of Regina, Canada presented their paper "Information Seeking within Academic Digital Libraries: A Survey of Graduate Student Search Strategies."
The session ended with Tyler Brown from University of Oklahoma, USA presenting their paper "A Reference Dependence Approach to Enhancing Early Prediction of Session Behavior and Satisfaction."

Invited Talk 2

Next up was an invited talk by Dr. Heike Winschiers-Theophilus from Namibia University of Science and Technology, Namibia, titled "Bridging Worlds: Indigenous Knowledge in the Digital World". In this talk she outlined the epistemological clashes between indigenous knowledge and technology and discussed methods for co-designing technology and digital presentations of indigenous knowledge, based on their community projects in Namibia. One of the main lessons learned from this talk was the need for indigenous knowledge holders and communities to join forces in the development of technologies and the digitalization of their own knowledge systems, as well as the need for digital libraries to grow in order to accommodate cutting-edge knowledge-sharing methods like virtual reality.

Day 4: 2022-06-23

The fourth day of the conference consisted of three paper sessions, one keynote, two workshops (EEKE and NKOS), and a satellite event (NFDI). It began with a replay of the minute madness videos and the NFDI satellite event in parallel.

Paper Session 9

This was followed by the first paper session of the day "Scholarly Communications I". This session was chaired by Helge Holzmann (@helgeho) from Internet Archive.

The session began with Yi Bu from Peking University, China presenting their paper "Comparing different perspectives of characterizing interdisciplinarity of scientific publications: Author vs. publication perspectives."
Next, Chifumi Nishioka from National Institute of Informatics, Japan presented their paper "The Influence of Author Affiliations on Preprint Citation Count."
The session ended with Sandeep Kumar from Indian Institute of Technology Patna, India presenting their paper "DeepASPeer: Towards an Aspect-level Sentiment Controllable Framework for Decision Prediction in Peer Reviews."

Paper Session 10

Next, the second paper session of the day "Scholarly Communications II" started after a brief coffee break. This session was chaired by Ralph Ewerth from TIB Hannover, Germany.

The session began with Yuerong Hu from University of Illinois, USA presenting their paper "Complexities Associated with User-generated Book Reviews in Digital Libraries: Temporal, Cultural, and Political Case Studies."
Next, Jing Ren from Federation University, Australia presented their paper "The Significance and Impact of Winning an Academic Award: A Study of Early Career Academics."
The session ended with Gustavo Fernandes from Federal University of Minas Gerais, Brazil presenting their paper "Between Acceptance and Rejection: Challenges for an Automatic Peer Review Process."

This session was followed by the Town Hall meeting and subsequently lunch.

Paper Session 11

Next, the third paper session of the day "Classification" started. The session was chaired by Dwaipayan Roy from GESIS, Germany.

The session began with Tamara Heck from Heinrich-Heine-University Düsseldorf, Germany presenting their paper "Is one source enough? The effects of literature databases on the outcomes of systematic reviews."
Next, Arthur Brack from TIB Hannover, Germany presented their paper "Cross-Domain Multi-Task Learning for Sequential Sentence Classification in Research Papers."
The session ended with Hermann Kroll from TU Braunschweig, Germany presenting their paper "A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries."

Keynote 2

The second keynote of the conference, "German National Research Data Infrastructure (NFDI) - Structure and Perspective" was presented by Dr. York Sure-Vetter from Karlsruhe Institute of Technology, Germany and National Research Data Infrastructure (NFDI), Germany. He discussed the FAIR principles, NFDI, the NFDI consortia, and some projects that NFDI is involved in. The FAIR principles state that data should be Findable, Accessible, Interoperable, and Reusable. NFDI is the German National Research Data Infrastructure, and its goal is to make relevant data available according to the FAIR principles. The NFDI consortia are collaborations between different scientific institutions that focus on NFDI's goal for research data management. NFDI is also involved in international projects like Gaia-X and the European Open Science Cloud (EOSC).

Closing Ceremony and Awards

Following the keynote, it was time to bid farewell to JCDL 2022. The closing ceremony started by announcing the recipients of JCDL 2022 awards.

Best student paper award(s):

Vannevar Bush best paper award:

Investigating Bloom Filters for Web Archives Holdings (Martin Klein et al.) (WSDL alumni)

With this, JCDL 2022 came to an end.

-- Yasith Jayawardana (@yasithdev), Himarsha Jayanetti (@HimarshaJ), Travis Reid (@TReid803), Emily Escamilla (@EmilyEscamilla_)
