2019-06-05: Joint Conference on Digital Libraries (JCDL) 2019 Trip Report

Alma Mater, a bronze statue at the University of Illinois by sculptor Lorado Taft. Photo by Illinois Library, used under CC BY 2.0 / Cropped from original
It's June, so this means it's time for the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019). This year's JCDL was held at the University of Illinois at Urbana-Champaign (UIUC) from June 2 to 6. Similar to last year's conference, we (members of WSDL) attended paper sessions, workshops (Web Archiving and Digital Libraries), tutorials, and panels, in which researchers from multiple disciplines presented the findings or progress of their respective research efforts. Unlike previous years, we did not feature any students or faculty in this year's JCDL doctoral consortium. We regret this and hope to resume next year.

Day 1

Following a welcome statement by Dr. Stephen Downie, Professor and Associate Dean for Research at the School of Information Sciences at UIUC, Day 1 began with a keynote from Dr. Patricia Hswe (pronounced "sway"), the program officer for Scholarly Communications at The Andrew W. Mellon Foundation. The title of her keynote was: Innovation is Dead! Long Live Innovation!
Her keynote proposed rethinking the purpose of innovation in the Digital Libraries domain: what is being built does not have to be entirely new, and innovation should include adaptation, reuse, recovery, etc., instead of rushing to build the "Next New Shiny Thing."

Three parallel paper sessions followed the keynote after a break:
  1. Generation and Linking
  2. Analysis and Curation, and 
  3. Search Logs

Generation and Linking Session

Pablo Figueira began this paper session with a full paper presentation titled: Automatic Generation of Initial Reading Lists: Requirements and Solutions. They proposed an automatic method for generating reading lists of scientific articles to help researchers familiarize themselves with existing literature by presenting four existing requirements, and one novel requirement for generating reading lists.
Next, Lucy McKenna, a PhD student at Trinity College Dublin, presented a full paper titled: NAISC: An Authoritative Linked Data Interlinking Approach for the Library Domain. They showed that Information Professionals such as librarians, archivists, and cataloguers have difficulty in creating five star Linked Data. Consequently, they proposed NAISC, an approach for assisting Information Professionals in the Linked Data creation process.
Next, Rohit Sharma presented a short paper titled: BioGen: Automated Biography Generation. They proposed BioGen, a system that automatically creates biographies of people by generating short sets of biographical sentences related to multiple life events. They also showed their system produced biographies similar to those manually generated by Wikipedia.
The Generation and Linking session ended with a short paper presentation by Tinghui Duan, PhD student at the University of Jena, titled: Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Their work proposes a method of building a Digital Humanities corpus by searching and extracting fragments of high-quality digitized versions of artifacts from the Web.

Analysis and Curation Session

Dr. Antoine Doucet, professor of Computer Science at the University of La Rochelle, France, began this paper session by presenting their full paper: Deep Analysis of OCR Errors for Effective Post‐OCR Processing. They presented the results of a study of five general types of Optical Character Recognition (OCR) errors: misspellings (real-word and non-word errors), edit operations, length effects, character position errors, and word boundary errors. Subsequently, they recommended different approaches to design and implement effective OCR post-processing systems.
Next, Colin Post, a doctoral candidate in the Information and Library Science program at the University of North Carolina, Chapel Hill, presented a full paper (best paper nominee) titled: Digital curation at work: Modeling workflows for digital archival materials. This research provides insight into digital curation in practice by studying and comparing the digital curation workflows of 12 cultural heritage institutions, focusing on the use of open-source software in their workflows.
Next was a presentation from Julianna Pakstis, Metadata Librarian at the Department of Biomedical and Health Informatics (DBHi) at the Children's Hospital of Philadelphia (CHOP), and Christiana Dobrzynski, Digital Archivist at DBHi. Their short paper presentation was titled: Advancing Reproducibility Through Shared Data: Bridging Archival and Library Practice. This research highlights the work of a team of librarians and archivists at CHOP. This team implemented Arcus, an initiative of the CHOP Research Institute with the purpose of making the biomedical research data archive and discovery catalog more broadly available within their institution.
The session concluded with Ana Lucic's short paper presentation titled: Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents. This research addresses the problem of separating the main text of a work from its surrounding paratext, a task common to the processing of large collections of scanned text in the Digital Humanities domain. The paratext often has to be removed to avoid distorting word counts, the locating of references, etc. They proposed a method for detecting the paratext based on a smoothed unsupervised clustering technique, and showed that their method improved subsequent text processing once the paratext was removed.

Search Logs Session

This session began with the first (best paper nominee) of three full paper presentations, from Behrooz Mansouri, Computer Science PhD Student at the Rochester Institute of Technology, titled: Toward math-enabled digital libraries: Characterizing searches for mathematical concepts. The work explores what queries people use to search for mathematical concepts (e.g., "Taylor series") by studying a dataset of 392,586 queries from a two-year query log. Their results show that math search sessions are typically longer and less successful than general search sessions, and that their queries are more diverse. They claim these findings could aid in the design of search engines for processing mathematical notation.
Next, Maram Barifah presented a full paper titled: Exploring Usage Patterns of a Large-scale Digital Library, in which they proposed a framework for assisting librarians and webmasters in exploring the usage patterns of Digital Libraries.
Finally, Yasunobu Sumikawa presented the final full paper of the session titled: Large Scale Analysis of Semantic and Temporal Aspects in Cultural Heritage Collection's Search. In this presentation they reported the results of a study of a 15-month snapshot of query logs from the online portal of the National Library of France, conducted to understand the interests of users and how users find cultural heritage content.

Classification, Discovery and Recommendation Sessions

Following a lunch break, Abel Elekes presented the first full paper titled: Learning from Few Samples: Lexical Substitution with Word Embeddings for Short Text Classification. To help in the classification of short text, this paper proposes clustering semantically similar terms when training data is scarce to improve the performance of text classification tasks.
Next, Andrew Collins, a researcher at Trinity College Dublin, presented a short paper titled: Document Embeddings vs. Keyphrases vs. Terms for Recommender Systems: A Large‐Scale Online Evaluation. They compared a standard term-based recommendation approach to document embedding and keyphrases - two methods used for related-article recommendation in digital libraries, by applying the algorithms to multiple recommender systems.
Next, Corinna Breitinger, a PhD student at the University of Konstanz, presented her short paper titled: 'Too Late to Collaborate': Challenges to the Discovery of in-Progress Research. She presented the findings from an investigation of how computer science researchers from four disciplines currently identify ongoing research projects within their respective fields. Additionally, she outlined the challenges faced by researchers, such as avoiding duplicate research while protecting the progress of their own research for fear of idea plagiarism.
Finally, Norman Meuschke, a PhD candidate at the University of Wuppertal, presented a full paper titled: Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations. He presented their approach for addressing the problem of detecting concealed plagiarism (heavy paraphrasing, translation, etc.) in scholarly text which consists of a two-staged detection that combines similarity assessments of mathematical content, academic citations, and text, as well as similarity measures that consider the order of mathematical features.
Minute Madness followed after Norman's presentation, wrapping up the scholarly activities of Day 1 of JCDL. In Minute Madness, poster presenters were given one minute to advertise their respective posters to the conference attendees. The poster session began after the minute madness.

Day 2

Day 2 of JCDL 2019 began with a keynote from Dr. Robert Sanderson, the Semantic Architect for the J. Paul Getty Trust: Standards and Communities: Connected People, Consistent Data, Usable Applications. The keynote highlighted the value of Web/Internet standards in providing the underlying foundation that makes the connected world possible. Additionally, the keynote explored the relationship between standards and their target communities, including some common inverse relationships such as the trade-offs between completeness and usability, production and consumption, etc.
The Web Archives session followed the keynote.

Web Archives 1 Session

Sawood Alam, a PhD student at Old Dominion University and member of the WSDL group, presented a full paper on behalf of Mohamed Aturban titled: Archive Assisted Archival Fixity Verification Framework. Sawood presented two approaches, Atomic and Block, to establish and check the fixity of archived resources (testing whether an archived resource has remained unaltered since it was captured). The Atomic approach involves storing the fixity information of each web page in a JSON file and publishing the fixity content before it is disseminated to multiple on-demand Web archives. In contrast, the Block approach involves merging the fixity information of multiple archived pages into a single file before its publication and dissemination to the archives.
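To make the difference between the two approaches concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the field names (`uri`, `captured_at`, `sha256`), the choice of SHA-256, and the function names are hypothetical and are not taken from the paper, and the publication and dissemination steps are left out entirely.

```python
import hashlib
import json


def fixity_record(uri, captured_at, body):
    """Build the fixity information for one archived page (hypothetical fields)."""
    return {
        "uri": uri,
        "captured_at": captured_at,
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def atomic_manifest(record):
    """Atomic style: one JSON document per archived page."""
    return json.dumps(record, sort_keys=True)


def block_manifest(records):
    """Block style: merge the records of several archived pages into one JSON document."""
    return json.dumps({"records": records}, sort_keys=True)


def verify(manifest_json, body):
    """Return True if the body still hashes to the value stored in an atomic manifest."""
    stored = json.loads(manifest_json)["sha256"]
    return hashlib.sha256(body).hexdigest() == stored


# Example: two archived pages, one manifest each (Atomic) or one shared manifest (Block)
page_a = fixity_record("http://example.com/a", "2019-06-03T10:00:00Z", b"<html>a</html>")
page_b = fixity_record("http://example.com/b", "2019-06-03T10:05:00Z", b"<html>b</html>")
atomic_a = atomic_manifest(page_a)
block = block_manifest([page_a, page_b])
```

Checking fixity later amounts to recomputing the hash of the archived page and comparing it with the stored value, for example `verify(atomic_a, b"<html>a</html>")`.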

Next, Dr. Martin Klein, a research scientist at the Los Alamos National Laboratory, presented a short paper titled: Evaluating Memento Service Optimizations. He explained the problem of long response times experienced by services that use the Memento Aggregator. This problem arises because search requests are broadcast to all Web archives connected to the Aggregator, even though some URI requests can only be fulfilled by some Web archives. He subsequently reported results of performance optimizations of the Memento Aggregator, such as caching and Machine Learning-based predictions.
Finally, Sawood Alam, again, presented a full paper (best paper nominee) titled: MementoMap Framework for Flexible and Adaptive Web Archive Profiling. Sawood proposed the MementoMap framework as a flexible and adaptive means of efficiently summarizing the holdings of a Web archive, and showed its application by summarizing the holdings of a Portuguese Web archive collection (http://arquivo.pt/) consisting of 5 billion mementos (archived copies of web pages).
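The broadcast problem and the caching idea can be sketched with a toy in-memory model. This is my own simplified illustration, not the Memento Aggregator's actual implementation: the `Aggregator` class, the dictionary of archives standing in for real HTTP lookups, and the absence of cache expiry are all assumptions made for brevity.

```python
class Aggregator:
    """Toy aggregator that remembers which archives answered for a URI."""

    def __init__(self, archives):
        # archives: dict mapping archive name -> set of URIs it holds
        # (a stand-in for sending real requests to each Web archive)
        self.archives = archives
        self.cache = {}          # URI -> names of archives that held it last time
        self.requests_sent = 0   # counts simulated requests to archives

    def lookup(self, uri):
        # With a cache entry, ask only the archives that answered before;
        # without one, broadcast the request to every archive.
        targets = self.cache.get(uri, list(self.archives))
        hits = []
        for name in targets:
            self.requests_sent += 1
            if uri in self.archives[name]:
                hits.append(name)
        self.cache[uri] = hits
        return hits


agg = Aggregator({"archive_a": {"x"}, "archive_b": {"x", "y"}, "archive_c": set()})
first = agg.lookup("x")    # broadcast: three simulated requests
second = agg.lookup("x")   # cached: only the two archives that answered before
```

A real cache would also need entries to expire, since an archive that held nothing for a URI yesterday may have captured it today.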

Other papers were presented concurrently in the Analysis and Processing session.

Analysis and Processing Session

In this session, Felix Hamborg, a PhD candidate at the University of Konstanz, presented a full paper titled: Automated Identification of Media Bias by Word Choice and Labeling in News Articles. Felix presented their research on an automatic method to detect a specific form of news bias - Word Choice and Labeling (WCL). WCL often occurs when journalists use different terms (e.g., "economic migrants" vs. "refugees") to refer to the same concepts.
Next, Drahomira Herrmannova presented a full paper (Vannevar Bush best paper award winner) titled: Do Authors Deposit on Time? Tracking Open Access Policy Compliance. This paper presented the findings from an analysis of 800,000 research papers published over a 5-year period. They investigated whether the time lag between the publication date of research papers and the dates the papers were deposited in a repository can be tracked across thousands of repositories globally.
Following a break, the paper sessions continued.

Web Archives 2 Session

Sergej Wildemann, a researcher at the L3S Research Center, began with a full paper presentation titled: Tempurion: A Collaborative Temporal URI Collection for Named Entities, where he introduced Tempurion, a collaborative service for enriching entities (e.g., People, Places, and Creative Work) by linking them with URLs that best describe them. The URLs are dynamic in nature and change as the associated entities change.
Next, I (Alexander Nwala) presented a full paper (best paper nominee) titled: Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections. I highlighted the importance of Web Archive collections as a means of traveling back in time to study events (e.g., Ebola Virus Outbreak and Flint Water Crisis) that may not be properly represented on the live Web due to link rot. These archived collections begin with seed URLs that are often manually selected by experts or crowdsourced. Because collecting seed URLs for Web Archive collections is time-consuming, it is common for major news events to occur without the creation of a Web Archive collection to memorialize them, which justifies the need for automatically generating seed URLs. I showed that social media Micro-collections (curated lists created by social media users) provide the opportunity for generating seeds, and produce collections with properties distinct from those of conventional collections generated by scraping Web and Social Media Search Engine Result Pages (SERPs).

Next, Dr. Ian Milligan, history professor at the University of Waterloo, presented a short paper titled: The Cost of a WARC: Analyzing Web Archives in the Cloud. Dr. Milligan explored and answered (US$7 per TB) the question he posed: "How much does it cost to analyze Web archives in the cloud?" He used the Archives Unleashed platform as an example to show some of the infrastructural and financial costs associated with supporting scholarship in the humanities and social sciences.
Finally, Dr. Ian Milligan, again, presented another short paper titled: Building Community and Tools for Analyzing Web Archives through Datathons. In his second talk of the session, Dr. Milligan highlighted lessons learned from conducting the Archives Unleashed Datathons. The Archives Unleashed Datathons started in March 2016, as a collaborative Data hackathon in which social scientists, humanists, archivists, librarians, computer scientists, etc. work together for 2-3 days on analyzing Web archive data.
Another series of paper sessions followed after a break.

User Interface and Behavior Session

Dr. George Buchanan and Dr. Dana Mckay, researchers at the University of Melbourne School of Computing and Information Systems, presented a full paper titled: One Way or Another I'm Gonna Find Ya: The Influence of Input Mechanism on Scrolling in Complex Digital Collections. They presented their findings from comparing the effect of input modality (touch and scrolling) on navigation in book browsing interfaces, by reporting user satisfaction associated with horizontal and two-dimensional scrolling.
Next, Dr. Dagmar Kern, a Human Computer Interaction and User Interface Engineering researcher at Gesis, presented a short paper titled: Recognizing Topic Change in Search Sessions of Digital Libraries Based on Thesaurus and Classification System. She presented their thesaurus and classification-based solution for segmenting user session information of a social science literature into its topical components.
Finally, Cole Freeman, a researcher at Northern Illinois University, presented the last short paper of the session titled: Shared Feelings: Understanding Facebook Reactions to Scholarly Articles, where he presented a new dataset of Facebook Reactions to research papers, and the results of analyzing it.

Citation Session

Dattatreya Mohapatra, a recent Computer Science graduate of Indraprastha Institute of Information Technology, presented a full paper (best student paper award winner) titled: Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees. He presented a novel data structure, the Influence Dispersion Tree (IDT), to model the impact of a scientific paper without relying on citation counts, instead capturing the relationship of follow-up papers and their citation dependencies.
Next, Leonid Keselman, a researcher at Carnegie Mellon University, presented a full paper titled: Venue Analytics: A Simple Alternative to Citation‐Based Metrics. He presented a means for automatically organizing and evaluating the quality of Computer Science publishing venues, by producing venue scores for conferences and journals, done by formulating venue authorship as a regression problem.
Day 2 ended with the conference banquet and awards presentation at the Memorial football stadium.
The best demo award was given to MELD: a Linked Data Framework for Multimedia Access to Music Digital Libraries, by Dr. Kevin Page, David Lewis, and Dr. David M. Weigl
The best student paper award was given to Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees, by Dattatreya Mohapatra, Abhishek Maiti, Dr. Sumit Bhatia and Dr. Tanmoy Chakraborty
The Vannevar Bush best paper award was given to Do Authors Deposit on Time? Tracking Open Access Policy Compliance by Drahomira Herrmannova, Nancy Pontika and Dr. Petr Knoth

Day 3

Day 3 of JCDL 2019 began with a keynote from Dr. John Wilkin, the Dean of Libraries and University Librarian at the University of Illinois at Urbana-Champaign. His keynote was titled: How do you lift an elephant with one hand? and explored the challenges overcome in building the HathiTrust Digital Library, a large-scale digital repository that offers millions of titles digitized from libraries around the world.
Following the keynote was an ACM Digital Library (DL) panel session titled: Towards a DL by the Communities and for the Communities. The ACM Digital Library & Technology Committee is headed by Dr. Michael Nelson and Dr. Ed Fox, and the panel session featured talks from Dr. Daqing He, Dr. Dan Wu, Wayne Graves, and Dr. Martin Klein. During the panel, Dr. Daqing He presented usage statistics of the ACM DL; Wayne Graves, Director of Information Systems at ACM, presented the redesigned ACM DL website (available soon) and received feedback on existing and future services; and Dr. Martin Klein presented Piloting a ResourceSync Interface for the ACM Digital Library. Dr. Dan Wu invited the researchers to Wuhan University, the host of the JCDL 2020 conference, and introduced the audience to the city. Subsequently, Dr. Stephen Downie gave the conference closing remarks.

I would like to thank the organizers and sponsors of the conference and the hosts, Dr. Stephen Downie and the University of Illinois, in Urbana-Champaign (UIUC), and Corinna Breitinger for taking and uploading additional photos of the conference.

-- Alexander C. Nwala (@acnwala)