2020-09-09: Theory and Practice of Digital Libraries 2020 (TPDL 2020) Non-Trip Report

The 2020 Theory and Practice of Digital Libraries conference (TPDL 2020) was planned to take place in Lyon, France, but was instead hosted virtually via BigBlueButton. It was a joint conference with ADBIS 2020 and EDA 2020. TPDL was a fascinating look into the various projects and research efforts undertaken by members of the digital library community. Due to time zone differences and technical issues, I could not attend all of the sessions. As usual, I will summarize some of those I did attend here.

On the Persistence of Persistent Identifiers of the Scholarly Web

Martin Klein and Lyudmila Balakireva, my teammates from the Los Alamos National Laboratory Research Library's Prototyping Team, won the Best Paper Award at TPDL 2020 for "On the Persistence of Persistent Identifiers of the Scholarly Web."

In their paper, Martin and Luda investigated how consistently scholarly publishers respond to common HTTP requests against Digital Object Identifiers (DOIs) that identify scholarly artifacts on the web. They analyzed the length of the DOI redirect chain and the HTTP response code at the end of that chain for different HTTP request methods and HTTP clients. They found significant differences between responses to clients and methods that closely resemble machines "browsing" the web and responses to requests more closely resembling human browsing behavior. Fewer than 50% of DOIs returned the same response code across all requests. Overall, requests sent with the popular web browser Chrome (the method most closely resembling a human browsing) returned the most successful HTTP responses. Among the odd behaviors they noticed was that, for example, the simple HTTP HEAD method resulted in a sizable number of "404 Not Found" responses, yet 25% of those DOIs returned a "200 OK" response when any other type of HTTP request was sent. Martin and Luda further investigated differences when sending requests from different network environments (with and without commercial publisher subscriptions) and when sending requests against DOIs that identify Open Access versus non-Open Access content. Given the noticeable inconsistencies in responses to simple HTTP requests, they raise questions about trust in the persistence of these widely used persistent identifiers. Please read the paper for more details on the study and the implications of its findings.
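To make the consistency finding concrete, here is a minimal sketch (not the authors' actual analysis code) of how one might tally agreement across request types: given the final HTTP status code observed for each DOI under several methods and clients, compute the fraction of DOIs that answer identically across all of them. The DOIs and status codes below are purely illustrative.

```python
def consistent_fraction(observations):
    """observations maps DOI -> {request_type: final_status_code}.

    Returns the fraction of DOIs whose final status code is the same
    for every request type tried against them."""
    if not observations:
        return 0.0
    consistent = sum(
        1 for codes in observations.values() if len(set(codes.values())) == 1
    )
    return consistent / len(observations)

# Illustrative data only; these are not real DOIs or measured responses.
sample = {
    "10.1000/example-a": {"HEAD": 404, "GET": 200, "Chrome GET": 200},
    "10.1000/example-b": {"HEAD": 200, "GET": 200, "Chrome GET": 200},
}
print(consistent_fraction(sample))  # 0.5
```

A real replication would, of course, need to issue the requests, follow each redirect chain, and record the terminal status code per client before feeding the results to a tally like this.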


Keynotes

As noted above, time zone differences and technical issues meant I was only able to attend two keynotes.

Ioana Manolescu from the National Institute for Research in Digital Science and Technology (Inria) shared "Integrating (very) heterogeneous data sources: a structured and an unstructured perspective." She noted "unprecedented data generation rates by human, software and physical objects." This explosion of data has produced many heterogeneous data sources that do not match on schema, keys, or even interpretations. However, integrating these datasets produces new knowledge that is invaluable to research and public discourse. She provided overviews of two projects that help with data integration. She mentioned that there are many data management systems, some excellent for specific tasks but poor at others. ESTOCADA tries to leverage these disparate systems' strengths and produce views into the data across different data stores. She mentioned how ESTOCADA was applied to the MIMIC-III dataset to provide views that join relational, JSON, and Solr data to answer medical questions. ConnectionLens integrates heterogeneous documents for journalism. Manolescu covered examples of applying ConnectionLens to unearth connections within the different datasets surrounding the Panama Papers. ConnectionLens converts each dataset into a graph connected by the named entities discovered in the data. With the Decodex project, users can leverage this information to determine whether a web page is reliable, is satirical, has published fake news, or requires a re-check. She closed by noting that there is still much work to be done in this area.
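The entity-based linking idea behind ConnectionLens can be sketched very roughly as follows (this is a hypothetical illustration, not the project's implementation): each document contributes a node, and documents that mention the same named entity become connected through that entity. A real system would use a named entity recognition model; simple substring matching stands in for it here, and all inputs are invented.

```python
def build_entity_graph(docs, known_entities):
    """docs maps a document name to its text; known_entities is an
    iterable of entity strings to look for.

    Returns entity -> set of document names mentioning it, i.e. the
    entity nodes through which documents are linked."""
    graph = {}
    for name, text in docs.items():
        lowered = text.lower()
        for entity in known_entities:
            if entity.lower() in lowered:
                graph.setdefault(entity, set()).add(name)
    return graph

# Illustrative inputs only.
docs = {
    "leak.json": "Offshore accounts held by Acme Corp in Panama",
    "registry.csv": "Acme Corp, registered 1999",
}
graph = build_entity_graph(docs, ["Acme Corp", "Panama"])
# Both documents mention "Acme Corp", so they are linked through that node.
```

Here the two documents share the "Acme Corp" node, which is exactly the kind of cross-dataset connection a journalist digging through the Panama Papers would want surfaced.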

Verónika Peralta from the Université de Tours shared "From source data to data narratives: accompanying users in the way to interactive data analysis." Peralta noted that data narration is narrating with data visualization, incorporating analysis, synthesis, and visualization to tell a story. Her talk (available on YouTube) contains a wealth of references on narrative discourse, data analysis, visualization, and more. She covered a four-layer model for building a data narrative. The factual layer is where we collect and analyze data. With the intentional layer, we create our message based on our findings. The structural layer organizes these messages into units that can be discussed. The presentational layer takes these messages and visualizes them to tell our narrative. Peralta broke each of these layers into individual tasks, demonstrating the work necessary to bring a data narrative to life. She also covered how the OLAP III project is further developing these ideas to provide insights into data, trying to address which queries and models to execute against data and which highlights to select for an interesting story. She provided many different references about measures of interestingness before closing with the many open challenges surrounding building these narratives.

Selected Papers

Brenda Reyes from the University of Alberta presented "Correspondence as the Primary Measure of Quality for Web Archives: A Grounded Theory Study" (slides). Reyes analyzed issues reported against Archive-It to construct a Theory of Quality grounded in evaluations of correspondence (similarity between the live page and its memento), relevance (whether content is on-topic compared to the original), and archivability (how difficult the content is to preserve, similar to memento-damage). She found that Archive-It users have trouble evaluating the relevance of content and are concerned as much about the "overabundance of content" as they are about coverage of their collection topic.
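As a hypothetical illustration of the correspondence dimension (this is not Reyes's method), one crude way to quantify similarity between a live page and its memento is a textual similarity ratio over their extracted text:

```python
# Assumption: both pages have already been reduced to plain text.
from difflib import SequenceMatcher

def correspondence(live_text, memento_text):
    """Return a similarity ratio in [0, 1]; 1.0 means identical text."""
    return SequenceMatcher(None, live_text, memento_text).ratio()

print(correspondence("Hello, Web!", "Hello, Web!"))  # 1.0
```

A serious correspondence measure would also need to account for layout, images, and embedded resources, which is where tools like memento-damage come in.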

Andrea Mannocci from CNR-ISTI presented "Context-driven Discoverability of Research Data," where he noted how research data is often considered "ancillary material" and, as such, has no common practices, incentives, or mandates for its care and use. This presents issues with discoverability and reusability. To aid in discoverability, Mannocci proposes applying semantic relations connecting datasets to the documents whose access is already facilitated by existing discovery tools. He closed with a demo of his solution.

Rasa Bočytė from the ReTV Project presented "Online News Monitoring for Enhanced Reuse of Audiovisual Archives," where the authors reused news archives to gain insights. Their project analyzes cross-platform and multilingual data and produces a variety of visualizations of news across different topics and localities. Their visual dashboard produces graphs, but unlike StoryGraph, these graphs connect concepts rather than news sources. It also produces flow diagrams, bubble charts, and word clouds for topic analysis, as well as streamgraphs demonstrating how news changes over time.


TPDL 2021 will likely occur in concert with ADBIS 2021 at the University of Tartu in Estonia. The community covered a lot of good work in 2020. It is a shame that I did not get to view it all. I look forward to next year.

--Shawn M. Jones