2018-06-08: Joint Conference on Digital Libraries (JCDL) 2018 Trip Report

The gathering place at the Cattle Raisers Museum, Fort Worth, Texas 
This year's 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) took place at the University of North Texas (Fort Worth, Texas). From June 3 to 6, members of WSDL attended paper sessions, workshops, tutorials, panels, and a doctoral consortium.

The theme of this year's conference was "From Data to Wisdom: Resilient Integration across Societies, Disciplines, and Systems." The conference provided researchers across multiple disciplines ranging from Digital Libraries and Web science research to Libraries and Information science, with the opportunity to communicate the findings of their research.

Day 1 (June 3, 2018)

The first day of the conference was dedicated to the doctoral consortium, tutorials, and workshops. The doctoral consortium provided an opportunity for Ph.D. students in the early phases of their dissertation to present their thesis and research plans and receive constructive feedback. I will provide a link to the Doctoral Consortium blogpost when it becomes available.

Day 2 (June 4, 2018)

The conference officially began on the second day with Dr. Jiangping Chen's introduction of the conference and the keynote speaker, Dr. Trevor Owens. Dr. Owens is a librarian, researcher, and policy maker, and the first head of Digital Content Management for library services at the Library of Congress. His talk was titled: "We have interesting problems."

It started with a highlight of Ben Shneiderman's The New ABCs of Research, which gives students guidance on how to succeed in research, and gives senior researchers and policy makers guidance on how to respond to new problems and apply new technologies. The New ABCs of Research may be roughly summarized with two acronyms included in the book: ABC (Applied, Basic, and Combined) and SED (Science, Engineering, and Design).
Additionally, he presented NDP@3, an IMLS framework for investments in digital infrastructures for libraries. He also presented multiple IMLS-funded projects, such as Image Analysis for Archival Discovery (AIDA), which explores various ways to use millions of images representing the digitized cultural record.
Next he talked about some resources at the Library of Congress Labs such as:
  • Library of Congress Colors: provides the capability of exploring the colors in the Library of Congress collections.
  • LC for Robots: provides a list of APIs, data and tutorials for exploring the digital collections at the Library of Congress.
Following the keynote were three concurrent paper sessions with the themes Use, Collection Building, and Semantics & Linking. I will briefly describe the papers discussed in two of these paper sessions.

Paper session 1B (Day 2)

Myriam Traub (best paper nominee), a PhD student at Centrum Wiskunde & Informatica (CWI), presented a full paper titled: "Impact of Crowdsourcing OCR Improvements on Retrievability Bias." She discussed how crowd-sourced correction of OCR errors affects the retrievability of documents in a historic newspaper corpus in a digital library.
Three short papers followed Traub's presentation. First, Karen Harker, a Collection Assessment Librarian at the University of North Texas Libraries, presented: "Applying the Analytic Hierarchy Process to an Institutional Repository Collection." She discussed the application of the Analytic Hierarchy Process (AHP) to create a model for evaluating collection development strategies of institutions. Second, Douglas Kennard presented: "Computer-Assisted Crowd Transcription of the U.S. Census with Personalized Assignments for Better Accuracy and Participation," where he introduced the Open Genealogy Data census transcription project that strives to make census data readily available to researchers and digital libraries. This was achieved through the use of automatic handwriting recognition to bootstrap their census database, and subsequent crowd-sourced correction of the data through a web interface. Finally, Mandy Neumann, a research associate at the Institute of Information Science at TH Köln, presented: "Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp." She explored different features for ranking conference candidates by using a pseudo-relevance assessment.

Paper session 1C (Day 2)

Dr. Federico Nanni (best paper nominee), a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim presented the first of three full papers titled: "Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context," in which he introduced a method for obtaining specific descriptions of entities in text by retrieving the most related section from Wikipedia.
Next, Gary Munnelly, a PhD student at the School of Computer Science and Statistics (SCSS) at Trinity College Dublin, presented: "Investigating Entity Linking in Early English Legal Documents," discussing the effectiveness of different entity linking systems for the task of disambiguating named entities in 17th-century depositions obtained during the 1641 Irish rebellion.
Finally, Dr. Ahmed Tayeh presented: "An Analysis of Cross-Document Linking Mechanisms," where he discussed different strategies for linking or associating information across physical and digital documents. The titles of other papers presented in a parallel session (1A) include:

Open Cross-Document Linking Service Based on a Plug-in Architecture from Ahmed Tayeh

Paper session 2A (Day 2)

Two full papers were presented after a break. The first, titled "Putting Dates on the Map: Harvesting and Analyzing Street Names with Date Mentions and their Explanations," was presented by Rosita Andrade. She presented her research about the automated analysis of street names with date references around the world, and showed that "temporal streets" are frequently used to commemorate important events such as a political change in a country.
Next, Dr. Philipp Mayr, a deputy department head and a team leader at the GESIS department Knowledge Technologies for the Social Sciences, presented: "Contextualised Browsing in a Digital Library's Living Lab." He presented two approaches that contextualize browsing in a digital library. The first approach is based on document similarity and the second utilizes implicit session information (e.g., queries and document metadata from sessions of users).

Paper session 3A (Day 2)

Three concurrent paper sessions followed Dr. Philipp Mayr's presentation. Dr. Dominika Tkaczyk, a researcher and a data scientist at the Applied Data Analysis Lab at the University of Warsaw (Poland), presented: "Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers," in which she presented the results of the comparison of different methods for parsing scholarly article references.
Anne Lauscher, a PhD student at the University of Mannheim presented: "Linked Open Citation Database: Enabling Libraries to Contribute to an Open and Interconnected Citation Graph." She presented the current state of the workflow and implementation of the Linked Open Citation Database project, which is a distributed infrastructure based on linked data technology for efficiently cataloging citations in libraries.

Paper session 3C (Day 2)

Norman Meuschke, a PhD student at the University of Konstanz, presented: "An Adaptive Image-based Plagiarism Detection Approach," in which he discussed his analysis of images in academic documents to detect disguised forms of plagiarism with approaches such as perceptual hashing, ratio hashing and position-aware OCR text matching. 
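
To give a rough sense of what perceptual hashing means, here is a minimal Python sketch of an average hash, one simple member of that family. This is only an illustration and not the approach from the paper: the function names, the toy 4x4 grayscale grids, and the choice of average hashing are all my own assumptions. The idea is that visually similar images should produce hashes that differ in few bits.

```python
def average_hash(pixels):
    """Return a bit tuple: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Toy 4x4 grayscale "images" (values 0-255). These are made-up inputs;
# a real pipeline would first downscale an actual figure to a small grid.
original = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same pattern, uniformly darkened: a crude stand-in for a re-encoded copy.
darkened = [[max(value - 5, 0) for value in row] for row in original]
# A different pattern (vertical stripes).
unrelated = [
    [200, 10, 200, 10],
    [200, 10, 200, 10],
    [200, 10, 200, 10],
    [200, 10, 200, 10],
]

print(hamming_distance(average_hash(original), average_hash(darkened)))   # 0
print(hamming_distance(average_hash(original), average_hash(unrelated)))  # 8
```

A small Hamming distance suggests the two images look alike even if their bytes differ, which is why this kind of hash can be useful for spotting reused figures.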

Hisham Benotman presented his work: "Extending Multiple Diagram Navigation with Internal Diagram and Collection Connections." He discussed extending multiple diagram navigation (MDN) so that diagram-to-content queries reach related collection documents not directly connected to the diagrams.
Other papers presented in a parallel session (3B) include:
Minute madness followed the paper sessions. The minute madness was an activity in which poster presenters were given 1 minute to advertise their respective posters to the conference attendees. The poster session began after the minute madness.

Day 3 (June 5, 2018)

Day 3 of the conference began with Dr. Niall Gaffney's keynote. Dr. Niall Gaffney is an Astronomer and Director of Data Intensive Computing at the Texas Advanced Computing Center (TACC). He started by emphasizing the importance of scientific reproducibility before moving on to show some of the projects supported by the computational machinery at TACC such as Firefly.
Two concurrent paper sessions followed a short break.

Paper session 4A (Day 3)

Dr. Gianmaria Silvello, an assistant professor at the Department of Information Engineering of the University of Padua presented a full paper titled: "Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents." He discussed his project about the development of an evaluation framework for validating the conformance of long-term preservation by assessing correctness, usability and usefulness.
Next, Dr. Pavlos Fafalios, a researcher at L3S Research Center in Germany, presented a full paper titled: "Ranking Archived Documents for Structured Queries on Semantic Layers," in which he proposed two ranking models that rank archived documents and consider the similarity of documents to entities, the timeliness of documents, and the temporal relations between the entities.
The final paper presented (not by an author of the paper) in this session was a short paper titled: "Modeling Author Contribution Rate With Blockchain." Three concurrent paper sessions (all full papers) followed after a break.

Paper session 4B (Day 3)

Florian Mai, a graduate student at Kiel University in Germany was the first presenter of the paper session on Text Collections. He presented a full paper titled: "Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text," in which he presented the findings from investigating how deep learning models obtained from training on titles compare to deep learning models obtained from training on full-texts.
Next, Chris Holstrom, a PhD student from the Information School at the University of Washington presented a short paper: "Social Tagging: Organic and Retroactive Folksonomies," in which he showed that tags on MetaFilter and AskMetaFilter follow a power law distribution and retroactive taggers do not use "organization" tags like professional indexers.
Next, Jens Willkomm, a PhD student at the Karlsruhe Institute of Technology in Germany, presented a full paper titled: "A Query Algebra for Temporal Text Corpora." He proposed a novel query algebra for accessing and analyzing words in large text corpora.

Paper session 5A (Day 3)

Omar Alonso (best paper nominee) presented a full paper titled: "How it Happened: Discovering and Archiving the Evolution of a Story Using Social Signals." He introduced a method of showing the evolution of stories from the perspective of social media users as well as the articles that include social media as supporting evidence.
Tobias Backes, a researcher at GESIS, presented his paper titled: "Keep it Simple: Effective Unsupervised Author Disambiguation with Relative Frequencies." He addressed the problem of author name homonymy in the Web Science domain by proposing a novel probabilistic similarity measure for author name disambiguation based on feature overlap.
The last paper (best paper nominee) presented in this session was titled: "Digital History meets Microblogging: Analyzing Collective Memories in Twitter."

Paper session 5B (Day 3)

Noah Siegel, a researcher at the Allen Institute for Artificial Intelligence, presented a full paper titled: "Extracting Scientific Figures with Distantly Supervised Neural Networks," where he introduced a system for extracting figures from a large number of scientific documents without human intervention.
Next, André Greiner-Petter presented his full paper titled: "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context." He presented a new approach for mathematical format conversion that utilizes textual information to reduce the error rate. Additionally, he evaluated state-of-the-art tools for mathematical conversions and provided a public, manually created gold standard dataset for mathematical format conversion.

Next, Yuta Kobayashi presented a paper titled: "Citation Recommendation Using Distributed Representation of Discourse Facets in Scientific Articles," presenting the effectiveness of using facets of scientific articles such as "objective," "method," and "result" for citation recommendation by learning a multi-vector representation of scientific articles, in which each vector represents a facet in the article.

Paper session 5C (Day 3)

Catherine Marshall, an adjunct professor at Texas A&M University, presented: "Biography, Ephemera, and the Future of Social Media Archiving." She presented her findings from answering the following question: "Will the addition of new digital sources such as records repositories, digital libraries, social media, and collections of ephemera change biographical research practices?" She demonstrated how new digital resources unravel a subject's social network, thus exposing biographical information formerly invisible.
Next, I presented our full paper titled: "Scraping SERPs for Archival Seeds: It Matters When You Start" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson. In my presentation, first, I highlighted the importance of web archive collections for studying important historical events ranging from elections to disease outbreaks. Next, I showed that search engines (specifically Google) can be used to generate seeds. Finally, I showed that it becomes harder to find the older URLs of news stories over time, so seed generators that utilize search engines should begin early and persist to capture the evolution of an event.

Next, Mat Kelly (best paper nominee), a fellow PhD student at Old Dominion University and member of WSDL, presented his full paper titled: "A Framework for Aggregating Private and Public Web Archives." He showed his framework that provides a means of combining public web archive captures and private web captures (e.g., banking and social media information) without compromising sensitive information included in the private captures. This work utilizes Sawood Alam's MemGator, a Memento aggregator that supports multiple serialization formats such as Link, JSON, and CDXJ.

Paper session 6A (Day 3)

The last paper session on Topic Modeling and Detection consisted of three full papers. First, Julian Risch (best paper nominee), a PhD student at Hasso-Plattner Institute (Germany) presented: "My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections." He presented a topic model combined with automatic domain term extraction and phrase segmentation that distinguishes collection-specific and collection-independent words based on information entropy.
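
To illustrate the intuition behind using information entropy to separate collection-specific from collection-independent words, here is a minimal Python sketch. It is not the topic model from the paper: the function name, the toy corpora, and the decision to normalize per-collection relative frequencies are all my own assumptions. A word spread evenly across collections gets high entropy, while a word confined to one collection gets an entropy of zero.

```python
import math
from collections import Counter


def collection_entropy(word, corpora):
    """Shannon entropy (in bits) of a word's relative frequency across collections.

    `corpora` maps a collection name to its list of tokens.
    """
    rates = []
    for tokens in corpora.values():
        counts = Counter(tokens)
        rates.append(counts[word] / len(tokens))
    total = sum(rates)
    if total == 0:
        raise ValueError(f"{word!r} does not occur in any collection")
    # Normalize the per-collection rates into a probability distribution,
    # skipping zero entries because they contribute nothing to the entropy.
    probabilities = [rate / total for rate in rates if rate > 0]
    return sum(-p * math.log2(p) for p in probabilities)


# Made-up toy corpora standing in for two domain-specific text collections.
corpora = {
    "patents": "the apparatus comprises the sensor".split(),
    "papers": "the approach uses the model".split(),
}

for word in ("the", "apparatus", "approach"):
    print(word, collection_entropy(word, corpora))
```

In this toy setup, "the" is shared equally by both collections and reaches the maximum of one bit, whereas "apparatus" and "approach" each occur in a single collection and score zero.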
Next, Dr. Ralf Krestel, the head of Web Science Research Group & Senior Researcher at Hasso-Plattner Institute (Germany) presented his full paper titled: "WELDA: Enhancing Topic Models by Incorporating Local Word Context." He proposed a new topic model called WELDA that combines word embeddings (WE) and Latent Dirichlet Allocation (LDA).
Finally, Angelo Salatino, a PhD student at the Knowledge Media Institute (UK) presented a full paper titled: "AUGUR: Forecasting the Emergence of New Research Topics." He introduced AUGUR, which is a new approach for the early detection of research topics in order to help stakeholders such as universities, institutional funding bodies, academic publishers and companies recognize new research trends.

A dinner at the Fort Worth Museum of Science and History followed after a break. The best poster award was presented to Mohamed Aturban, a fellow PhD student at Old Dominion University and member of WSDL, for his poster "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation."
Dr. Federico Nanni (Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context) and Myriam Traub (Impact of Crowdsourcing OCR Improvements on Retrievability Bias) tied for the Vannevar Bush best paper award. Myriam Traub also won the best student paper award.

Day 4 (June 6, 2018)

Day 4 began with a keynote from Dr. Carly Strasser, director of Strategic Development for the Collaborative Knowledge Foundation. Her keynote "Open Source Tech for Scholarly Communication: Why It Matters," illustrated the problems in the submission, production and delivery of scholarly communication. She talked about the problem of the disjoint nature (silos) of the various stages of scholarly communication, as well as the expensive delivery, slow production, static and less interoperable output.

She also presented a vision of scholarly communication that consists of living documents that link to open source code and data, a cheaper delivery system, faster production and more interoperable and dynamic output. Additionally, she talked about the organizations working to achieve various aspects of this vision.
The main conference gave way to workshops and a preview of JCDL 2019 which is scheduled to take place at the School of Information Sciences at the University of Illinois, Urbana-Champaign from June 2-6, 2019.
I would like to thank the organizers of the conference, the hosts, University of North Texas (UNT) College of Information and UNT Health Science Center, as well as SIGIR for the travel grants. Here are other trip reports: the Doctoral Consortium (from Shawn Jones); a preview of the WADL (Web Archiving and Digital Libraries) workshop from Jasmine Mulliken, Digital Production Associate at Stanford University Press; Mat Kelly's WADL trip report; and Corren McCoy's Knowledge Discovery From Digital Libraries (KDDL) Workshop Trip Report. Dr. Min-Yen Kan set up a repository for all the slides from JCDL 2018; please upload your slides if you have not already done so.

-- Nwala (@acnwala)

