2022-10-11: Theory and Practice of Digital Libraries (TPDL) 2022 Trip Report
Prato della Valle in Padua, Italy
This year, the 26th International Conference on Theory and Practice of Digital Libraries (#TPDL2022) returned to an in-person, hybrid format after taking place solely online for the last two years. TPDL was held in beautiful Padova, Italy at the Istituto Sant’Antonio Dottore from September 20-23, 2022. Emily Escamilla and Himarsha Jayanetti from the Web Science and Digital Libraries (WSDL) research group attended the conference in person and presented four papers. Including one talk from a recent WSDL alumnus, there were five WSDL-related presentations:
- Robots Still Outnumber Humans in Web Archives, But Less Than Before (Himarsha Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, and Michele C. Weigle)
- A Chromium-based Memento-aware Web Browser (Abigail Mabe, Michael L. Nelson, and Michele C. Weigle) presented by Emily Escamilla
- CDX Summary: Web Archival Collection Insights (Sawood Alam and Mark Graham) (WSDL alumnus)
- Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists (Himarsha Jayanetti, Shawn Jones, Martin Klein, Alex Osborne, Paul Koerbin, Michael L. Nelson, and Michele C. Weigle)
- The Rise of GitHub in Scholarly Publications (Emily Escamilla, Martin Klein, Talya Cooper, Vicky Rampin, Michele C. Weigle, and Michael L. Nelson)
The full proceedings have been published in the Lecture Notes in Computer Science (LNCS).
The proceedings of #TPDL2022 are online with @SpringerNature: https://t.co/TsuVgyJbTt
— TPDL2022 (@tpdl2022) September 16, 2022
Day 1: 2022-09-20
Linked Archives Workshop
Linked Archives workshop (Presentation 2) “Mining typewritten digital representations to support archival description” by Mariana Dias and Carla Teixeira Lopes. #tpdl2022 @WebSciDL @tpdl2022 pic.twitter.com/bVxFxXC4c5
— Himarsha R. Jayanetti (@HimarshaJ) September 20, 2022
For the keynote, Kerstin Arnold (@kerstarno) from Archives Portal Europe presented “No Archive is an Island - A Tale of Exploring a Brave New World”. Archives Portal Europe is a project that allows for aggregation and discovery across over 600,000 collections from over 7,100 institutions. In her keynote, Arnold explained the current General International Standard Archival Description (ISAD(G)) standard, the standards being implemented by Archives Portal Europe, and the lessons they have learned along the way.
Artificial Intelligence and Archives
Luís Filipe da Costa Cunha from the Department of Informatics at University of Minho presented “Fine-Tuning BERT models to extract Named Entities from Archival Finding Aids”. Their work is an improvement on the NER model specifically for the Portuguese language. They created an API, a web platform, and an automatic annotator.
Manual document mining of born-physical cultural heritage objects to create metadata is time-consuming. Mariana Dias from the University of Porto presented “Mining Typewritten Digital Representations to Support Archival Description”, part of their Entity and Property Inference for Semantic Archives (EPISA) Project. They proposed an architecture that combines optical character recognition (OCR), information extraction, and ontology population to conduct document mining for automatic metadata records.
Infrastructures for Archives and Linked Data
@CamtheWicked is presenting her work “Detecting content drift on the Web using Web archives and textual similarity” at @tpdl2022 #tpdl2022 pic.twitter.com/PkTa4x1n2C
— Emily Escamilla (@EmilyEscamilla_) September 20, 2022
Can Web page titles be used to detect content drift? Brenda Reyes Ayala (@CamtheWicked) from the University of Alberta presented “Detecting content drift on the Web using Web archives and textual similarity”. She proposed leveraging Web page titles to detect content drift and found 92.1% recall across three collections. Additionally, the run time was short compared to other methods that have been used to detect content drift. Her work was inspired by “Scholarly context adrift: Three out of four URI references lead to changed content” by Shawn M. Jones (@shawnmjones), Herbert Van de Sompel (@hvdsompel), Harihar Shankar (@hariharshankar), Martin Klein (@mart1nkle1n), Richard Tobin, and Claire Grover.
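As a rough illustration of the title-based idea, the sketch below compares the title of an archived page with the title of its live counterpart using a simple token-overlap (Jaccard) score. The similarity measure, the 0.5 threshold, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
import re


def title_tokens(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def title_similarity(archived_title, live_title):
    """Jaccard similarity between the token sets of two titles (0.0 to 1.0)."""
    a, b = title_tokens(archived_title), title_tokens(live_title)
    if not a and not b:
        return 1.0  # two empty titles are treated as identical
    return len(a & b) / len(a | b)


def has_drifted(archived_title, live_title, threshold=0.5):
    """Flag content drift when the two titles share too few tokens.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    return title_similarity(archived_title, live_title) < threshold
```

Comparing only titles keeps the check cheap, which fits the short run time mentioned above, although a real pipeline would likely need more careful text normalization.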
Sérgio Nunes from the University of Porto presented “EPISA Platform: A Technical Infrastructure to Support Linked Data in Archival Management”, another portion of the EPISA project mentioned above. The presentation included a demonstration of the EPISA ArchClient, which they created to provide archivists with a graphical user interface to access, manage, and describe collections.
Models for Linked Archives
Architectural artifacts cannot be easily accessed or searched with traditional finding aids. Daria Mikhaylova from the University of Pisa presented “An extension of RiC-O for architectural archives”. Their solution models an architectural project including its different phases, types of records, and architectural artifacts. The extension also provides a structured and formal representation of the archive that is compatible with existing standards.
Alex Green and Faith Lawrence from The National Archives of the UK presented “The Shock of the New: Testing the Pan-Archival Linked Data Catalogue with Users”. The National Archives of the UK are in the process of creating a canonical Linked Data catalog that is user-focused. However, as with all major changes to crucial work products, the process has been full of discussions between stakeholders (i.e., users, archivists, developers). They talked about the goals they are working to achieve, the challenges they have faced, and the lessons learned.
Doctoral Consortium
Doctoral Consortium (Presentation 1) Sefika Efeoglu, a PhD student of Freie Universitaet Berlin is presenting her preliminary research titled “A Continual Relation Extraction Approach for Knowledge Graph Completeness”. @tpdl2022 @WebSciDL pic.twitter.com/MKRVo1wyzs
— Himarsha R. Jayanetti (@HimarshaJ) September 20, 2022
Three doctoral students presented their doctoral research work to senior researchers and other participants in an informal setting. First, Sefika Efeoglu, a Ph.D. student in the Corporate Semantic Web Group at Freie Universitaet Berlin, presented her preliminary research titled “A Continual Relation Extraction Approach for Knowledge Graph Completeness”. Next, Nicolò Pratelli, a Ph.D. student at the University of Pisa, presented his work titled “A Geographical Extension for NOnt Ontology”. Finally, Nikos Vasilogamvrakis from the National Documentation Centre in Greece presented his work titled “The Ontological Approach of Modern Greek Morphology” via Zoom. It was a great opportunity for doctoral students to receive constructive feedback on their preliminary research work.
Day 2: 2022-09-21
“Source code provides a view into the mind of the designer” makes me wonder what the foobar is going on in coders' brains sometime. https://t.co/rzIRR1Z4O6 @rdicosmo @tpdl2022 #tpdl2022
— Andrea Mannocci (@andremanno) September 21, 2022
Here in beautiful Padua for #TPDL2022 - fascinating keynote from @rdicosmo on why preserving software is a vital task. pic.twitter.com/bsgr2pEAAK
— David Pride (@davejavupride) September 21, 2022
“Programs must be written for people to read, and only incidentally for machines to execute.” -- Harold Abelson @rdicosmo @tpdl2022 #tpdl2022
— Andrea Mannocci (@andremanno) September 21, 2022
"If you want an archive, you need an archive. You cannot expect something that is not an archive to be an archive"
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Session 1: Web Archiving
I am excited and honored to present our work titled “Robots Still Outnumber Humans in Web Archives, But Less Than Before” with @kritika_garg @ibnesayeed @phonedude_mln @weiglemc at @TPDL2022 Session 1: Web Archiving. @WebSciDL Conf. Proceedings: https://t.co/JKJl9DP3U1
— Himarsha R. Jayanetti (@HimarshaJ) September 21, 2022
How do users and robots access the archives? Himarsha Jayanetti (@HimarshaJ) from Old Dominion University’s Web Science and Digital Libraries (WSDL) research group presented “Robots Still Outnumber Humans in Web Archives, But Less Than Before”, in which they analyzed the differences between robot and human usage patterns and their temporal preferences. They found that the share of bots detected in the Internet Archive's 2019 sample (70%) was lower than in the 2012 sample (91%), while robots accounted for 98% of all requests to arquivo.pt in 2019.
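To give a feel for what separating robots from humans in access logs can look like, here is a minimal sketch that labels requests by looking for bot-like keywords in the User-Agent string. This is only one naive heuristic under assumed inputs; the keyword list and function names are hypothetical, and the paper's actual detection method is not reproduced here.

```python
from collections import Counter

# Hypothetical keyword list; real bot detection combines several heuristics.
BOT_KEYWORDS = ("bot", "crawler", "spider", "curl", "wget", "python-requests")


def looks_like_robot(user_agent):
    """Return True when the User-Agent string contains a bot-like keyword."""
    ua = user_agent.lower()
    return any(keyword in ua for keyword in BOT_KEYWORDS)


def robot_share(user_agents):
    """Fraction of requests whose User-Agent looks like a robot."""
    if not user_agents:
        return 0.0
    counts = Counter(looks_like_robot(ua) for ua in user_agents)
    return counts[True] / len(user_agents)
```

A User-Agent check alone misses robots that imitate browsers, which is why it should be read as a starting point rather than a full classifier.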
.@EmilyEscamilla_ is doing a great job presenting @abigail_mabe’s work on behalf of her. The paper is titled “A Chromium-Based Memento-Aware Web Browser”. @phonedude_mln @weiglemc #tpdl2022 @tpdl2022 @WebSciDL Conf. Proceedings: https://t.co/iZ2GWP8smk pic.twitter.com/3sXigMYUCC
— Himarsha R. Jayanetti (@HimarshaJ) September 21, 2022
Emily Escamilla (@EmilyEscamilla_) presented “A Chromium-based Memento-aware Web Browser” on behalf of Abigail Mabe (@abigail_mabe) from Old Dominion University’s Web Science and Digital Libraries (WSDL) research group. Abigail used Chromium to create a prototype of a Memento-aware browser. The browser was able to detect the presence of Mementos in the open tab. She also enhanced the bookmark functionality by allowing users to archive web pages within the bookmarking process. The paper presented at TPDL 2022 was a shortened version of her Master’s Degree project “A Chromium-based Memento-aware Web Browser”.
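For readers unfamiliar with how a client can tell that a page is a memento, the sketch below checks HTTP response headers for the Memento-Datetime header defined by the Memento protocol (RFC 7089). It is a Python illustration of the general idea only; the function name and the sample headers are assumptions, and it does not reflect how the Chromium prototype is implemented.

```python
from email.utils import parsedate_to_datetime


def memento_datetime(headers):
    """Return the archival datetime if the response headers mark a memento.

    RFC 7089 mementos carry a Memento-Datetime header with an HTTP-date value;
    header names are matched case-insensitively. Returns None otherwise.
    """
    for name, value in headers.items():
        if name.lower() == "memento-datetime":
            return parsedate_to_datetime(value)
    return None


# Hypothetical response headers for an archived page.
example_headers = {
    "Content-Type": "text/html",
    "Memento-Datetime": "Tue, 20 Sep 2022 10:00:00 GMT",
}
archived_at = memento_datetime(example_headers)
```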
is presenting his work on CDX Summary, a tool that generates statistical reports from the data available in the CDX indexes of WARC files. The paper can be found in the proceedings: https://t.co/KCmW6vXogY #TPDL2022 pic.twitter.com/v6giUhU0kB
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Slides: https://t.co/O6EWKMt0Vn Recording: https://t.co/eu8dsanAXy Repo: https://t.co/A6Q50xc9am
— Sawood Alam (@ibnesayeed) September 21, 2022
Sawood Alam (@ibnesayeed) from Internet Archive (and an ODU WSDL research group alum) presented “CDX Summary: Web Archival Collection Insights”. CDX Summary is a tool that generates machine and human readable reports based on metadata like URLs, hosts, query parameters, status codes, and more.
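To illustrate the kind of report such a tool can produce, here is a small sketch that counts status codes, MIME types, and hosts from CDX index lines. It assumes space-separated records whose first five fields are urlkey, timestamp, original URL, MIME type, and status code; that layout and the function name are assumptions for this example, and this is not the CDX Summary implementation.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_cdx(lines):
    """Count status codes, MIME types, and hosts in space-separated CDX lines.

    Assumed field order: urlkey timestamp original mimetype statuscode ...
    """
    summary = {"status": Counter(), "mime": Counter(), "host": Counter()}
    for line in lines:
        if line.lstrip().startswith("CDX"):
            continue  # skip the format header line
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed records
        _urlkey, _timestamp, original, mime, status = fields[:5]
        summary["status"][status] += 1
        summary["mime"][mime] += 1
        summary["host"][urlsplit(original).hostname or ""] += 1
    return summary
```

The resulting counters can be dumped as JSON for machines or printed as a table for people, which mirrors the machine and human readable reports described above.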
What is the quality of the archived Web page? Theresa Elstner from @webis_de is presenting her work to automate visual comparison for quality assessment of archived Web pages.
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Theresa Elstner from the Webis group (@webis_de) at Leipzig University presented “Visual Web Archive Quality Assessment”. She categorized the perceivable reproduction error types, including existence errors and positional errors. She also created and tested a system of visually aligning page segments to more accurately measure pixel difference and, as a result, the quality of an archived Web page.
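The pixel difference mentioned above can be pictured with a minimal sketch that compares two screenshots of the same size and reports the fraction of pixels that differ. Screenshots are represented here as nested lists of pixel values, which is an assumption for illustration; the segment alignment step from the talk is deliberately left out.

```python
def pixel_difference(original, archived):
    """Fraction of pixels that differ between two equally sized screenshots.

    Each screenshot is a list of rows, and each row is a list of pixel values.
    """
    if len(original) != len(archived) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(original, archived)
    ):
        raise ValueError("screenshots must have the same dimensions")
    total = sum(len(row) for row in original)
    if total == 0:
        return 0.0
    differing = sum(
        1
        for row_a, row_b in zip(original, archived)
        for px_a, px_b in zip(row_a, row_b)
        if px_a != px_b
    )
    return differing / total
```

A raw comparison like this penalizes a page heavily when content merely shifts position, which is the motivation for aligning page segments before measuring the difference.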
I am excited to present our work titled “Creating Structure in Web Archives with Collections: Different Concepts from Web Archivists” with @shawnmjones, @mart1nkle1n, Alex Osbourne, Paul Koerbin, @phonedude_mln, @weiglemc. @WebSciDL Conf. Proceedings: https://t.co/veNg2yXy4H
— Himarsha R. Jayanetti (@HimarshaJ) September 21, 2022
Himarsha Jayanetti wrapped up the session with her presentation “Creating Structure in Web Archives with Collections: Different Concepts From Web Archivists”. She presented how eight Web archive platforms utilize collections as well as the different types of “collection structures” they follow. She emphasized, for instance, that some web archive platforms support private collections while others do not, and that some collections have sub-collections while others do not. She also identified two main types of navigational hierarchies followed by those web archive platforms. The first type is where the original resource (URI-R) supports the collection’s theme and the second type is where the memento (URI-M) supports the collection’s theme. A much more detailed technical report is available on arXiv.
Booster Session 1
After the lunch break, the conference held a booster session of quick 3-minute presentations, one for each paper accepted to the “Accelerating Innovation Papers” track. Eight presenters shared their new ideas and late-breaking results, and each paper was also exhibited as a poster during the poster session at the end of the day.
#TPDL2022 Booster Session 1: Paper 1 “Automatic Knowledge Extraction from a Digital Library and Collaborative Validation” By Eleonora Bernasconi, Miguel Ceriani, Massimo Mecella and Alberto Morvillo
— Himarsha R. Jayanetti (@HimarshaJ) September 21, 2022
Session 2: Cultural Heritage
To kick off Session 2 @tpdl2022, Agathi Papanoti presented the work done by https://t.co/kBYGqjm6iA to aggregate key figures in Greek history and culture. They created mappings between key figures and bibliographic information to enrich records. https://t.co/AtrMf2r3hx #TPDL2022 pic.twitter.com/6BSYytvTZI
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Session 2 on “Cultural Heritage” began with a presentation by Agathi Papanoti from National Documentation Centre in Greece on their work titled “Enriching the Greek National Cultural Aggregator with Key Figures in Greek History and Culture: Challenges, Methodology, Tools and Outputs”. She discussed their approach, challenges, and the technologies that had been employed over the past two years for the process of enhancing the metadata of Cultural Heritage Objects (CHOs) that had been collected by the Greek cross-domain Cultural Data Aggregator, SearchCulture.gr.
The second presenter was Hille Ruotsalainen from Tampere University who presented her paper titled “Searching Wartime Photograph Archive for Serious Leisure Purposes”. She described evaluating user success scores and user engagement using the user engagement scale (UES) in a recorded presentation that was played during the session. She also suggested research implications based on the results.
This visualization maps keywords, when the word appeared in a letter, and who received a letter containing that word pic.twitter.com/Ry4puQFVIW
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Pierre Cubaud from Le CNAM presented his paper titled “Overview visualizations for large digitized correspondence collections: a design study”. He introduced “overview visualization” as a useful alternative to search engines in digital libraries. The tool was built for a large correspondence collection: 20K letters from the Godin-Moret archive. They also created a video mockup of the system.
@harryhalpin presented "The Knowledge Trust: A Proposal for a Blockchain Consortium for Digital Archives". The project proposes integrating blockchain with LOCKSS protocol with the goal of ensuring integrity in the archives. https://t.co/WkM7iBCBoA #TPDL2022 pic.twitter.com/ghV7SLbWJH
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Harry Halpin from the American University of Beirut presented his work titled “The Knowledge Trust: A Proposal for a Blockchain Consortium for Digital Archives”. He presented their model, “The Knowledge Trust”, in which existing digital libraries use blockchain technology to build on their own curation skills and to run integrity checks that help identify data loss in digital archives.
Daniel Zilio presented "Design and evaluation of a mobile application for an Italian UNESCO site". They have worked to develop an app to publicize the Padua UNESCO site and provide tourists with a tool to engage with the site in a different way. https://t.co/4gvON5G9Ae #TPDL2022 pic.twitter.com/wG6gc2xPPt
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
Daniel Zilio from the University of Padova presented “Design and evaluation of a mobile application for an Italian UNESCO site: Padova Urbs picta”. He described the design and evaluation of a smartphone application that promotes Padua’s fourteenth-century fresco cycle, which is inscribed on the UNESCO World Heritage List.
Session 3: Scholarly Communication I
Asheesh Kumar presented their project to predict whether a paper would be accepted or rejected based on reviewer comments. This would assist conference chairs in what is otherwise a time consuming task to create a recommendation and rating. https://t.co/9XaAcM3ndg #TPDL2022 pic.twitter.com/Dd6w7JK6U5
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
As the first presentation of session 3, Asheesh Kumar from the Indian Institute of Technology Patna presented his work titled “Investigations on Meta Review Generation From Peer Review Texts Leveraging Relevant Sub-tasks in the Peer Review Pipeline”. He presented their novel method to automatically generate decision-aware meta-reviews that additionally take into account a number of pertinent sub-tasks in the peer-review process.
Are you a researcher with a common name like @phonedude_mln? @ZBoukhers presented Whois?, a model to disambiguate author names based on bibliographic data. https://t.co/6gLOFs9r8s #TPDL2022 pic.twitter.com/p06ij1bfWE
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
The second presentation, “Whois? Deep Author Name Disambiguation Using Bibliographic Data”, was given by Zeyd Boukhers from the University of Koblenz-Landau. The study proposes an Author Name Disambiguation (AND) approach that uses co-authors and research domain to link author names to their real-world entities. They developed a neural network model that learns from representations of the co-authors and titles.
Tove Faber Frandsen presented their study tracking the distribution of institutions as contributors to journals. They classified institutions as continuants, movers, newcomers, and transient and looked at the portion of each in each journal. https://t.co/82hOyTABCF #TPDL2022
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
The next presentation was by Tove Faber Frandsen from the University of Southern Denmark on their work titled “Exploring research fields through institutional contributions to academic journals”. They examined which institutions contribute to Library and Information Science journals, whether those institutional contributions have remained consistent over time, and whether they vary among journals. They found that for some journals, only around 10% of the contributing institutions are continuants. An institution is a continuant if it published in the journal in a given year and also published at least one paper in the same journal within the previous three years or the following three years.
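The continuant definition can be written down as a short check. The sketch below follows the definition as summarized in this report (a publication in the given year plus at least one more within three years before or after); the function name and the representation of an institution's record as a set of publication years are assumptions, and the other categories from the talk are not modeled because their definitions are not given here.

```python
def is_continuant(publication_years, year, window=3):
    """Check whether an institution counts as a continuant for a journal in `year`.

    `publication_years` holds the years in which the institution published
    in that journal.
    """
    years = set(publication_years)
    if year not in years:
        return False
    # Years within the window before or after, excluding the year itself.
    neighbours = set(range(year - window, year + window + 1)) - {year}
    return bool(neighbours & years)
```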
Rand Alchokr presented "A Closer Look into Collaborative Publishing at Software-Engineering Conferences". RQ1: Are collaboration patterns in software engineering stable over time? RQ2: What are the frequent collaborator patterns for SE? https://t.co/wQ7BV874Er #TPDL2022 pic.twitter.com/GSjpWWawyL
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
The fourth presentation of session 3 was by Rand Alchokr from Otto von Guericke University Magdeburg. She presented their work titled “A Closer Look into Collaborative Publishing at Software-Engineering Conferences” where they studied two properties of research collaborations in software engineering: the number of authors and their research experience. They discovered that collaborative research (multi-author) is increasingly common today with a decline in the percentage of single-author papers. Their research revealed that two to four researchers was the most common team size and that in order to publish at prestigious conferences, junior researchers seemed to require the support of experienced co-authors.
They also found that the accuracy of their model was high. However, missing altmetrics negatively effected precision and the F1-score pic.twitter.com/iB6KzCmgD1
— Emily Escamilla (@EmilyEscamilla_) September 21, 2022
The final presentation of the “Scholarly Communication I” session was by Yusra Shakeel from Karlsruhe Institute of Technology on “Weighted Altmetric Scores to Facilitate Literature Analyses”. She discussed their most recent work, which proposes weighted altmetric scores for a more reliable and precise analysis of papers to support the labor-intensive manual literature analysis procedure. Overall, their method performed well with positive results, but further research would help validate the potential of weighted metrics.
Poster Session 1
At the end of the day, the presenters from the booster session had the opportunity to present their posters to the conference participants during the poster session, which was held in the venue's lobby.
Vibrant discussions during the first poster session at #tpdl2022 pic.twitter.com/TFmjl7jcYF
— TPDL2022 (@tpdl2022) September 21, 2022
Social Dinner
On Wednesday evening, TPDL hosted a social dinner at Caffè Pedrocchi, a historic café that opened in 1831.
#tpdl2022 social dinner is kicking off in historical caffè Pedrocchi. #padova pic.twitter.com/HJEKM675ng
— TPDL2022 (@tpdl2022) September 21, 2022
While at the dinner, the program committee presented the Best Paper Award. Three papers were nominated:
- Robots Still Outnumber Humans in Web Archives, But Less Than Before by Himarsha Jayanetti, Kritika Garg, Sawood Alam, Michael L. Nelson, and Michele C. Weigle
- Whois? Deep Author Name Disambiguation using Bibliographic Data by Zeyd Boukhers and Nagaraj Bahubali Asundi
- DETEXA: Declarative Text Analysis Through SQL by Yannis Foufoulas, Eleni Zacharia, Harry Dimitropoulos, Natalia Manola, and Yannis Ioannidis
The following three papers have been nominated for best paper at #tpdl2022. Tonight during the social dinner we will announce the winner! (A thread... 1/4)
— TPDL2022 (@tpdl2022) September 21, 2022
We were excited to receive the Best Student Paper Award for "Robots Still Outnumber Humans in Web Archives, But Less Than Before", which was accepted by Himarsha Jayanetti. The Best Paper Award was presented to Zeyd Boukhers for "Whois? Deep Author Name Disambiguation using Bibliographic Data." Congratulations to all of the award-winning authors!
Our paper titled “Robots Still Outnumber Humans in Web Archives, But Less Than Before” won the best student paper award at @TPDL2022 😍 Thank you @TPDL2022 for this acknowledgment & recognition of our work! @kritika_garg @ibnesayeed @phonedude_mln @weiglemc @WebSciDL for life! pic.twitter.com/lRKDtiENzI
— Himarsha R. Jayanetti (@HimarshaJ) September 21, 2022
Congratulations to @HimarshaJ and @ZBoukhers for the best student paper and the best paper award! @ocorcho and @paolomanghi are doing a great job announcing the awards! pic.twitter.com/ETUnO5DglG
— TPDL2022 (@tpdl2022) September 21, 2022
.@paolomanghi and @ocorcho announcing the best student paper and the best paper awards during a stunning #tpdl2022 social dinner @tpdl2022 pic.twitter.com/xZiSPX1HrR
— Nicola Ferro (@frrncl) September 21, 2022
Day 3: 2022-09-22
The second keynote speaker of the conference was Georgia Koutrika from Athena Research Center in Greece. Her talk, “Democratizing Data Access: What if we could just talk to our data?”, was about making data easily accessible and useful to humans. She began by noting how important data has become and how the benefits of exploring it have grown more prominent. As the title indicates, the main focus of the talk was on how a human user may engage with data in a natural way through a system referred to as an intelligent data assistant. Real-world data is not readily accessible: querying it requires complex SQL commands, which in turn require expertise and an understanding of the data schemas. Instead, imagine a system that lets users interact and collaborate with it as if it were a human in order to explore data and find solutions.
She also discussed the challenges of developing these intelligent data assistants, such as the fact that different words can share a meaning (for example, "movies" and "films") and that the same idea can be expressed in various ways (for example, "how many people live in" can refer to the "population" column in a dataset). These challenges show why translating a natural language query into a structured query that the machine understands is hard. She also talked about a conversational system in which the data assistant could talk back, ask questions for clarification (SQL2NL), and explain the results (QR2T). Through this keynote, she emphasized that what matters is how much data we can explore, not just how much data is present.
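To make the vocabulary gap concrete, here is a tiny sketch of the "how many people live in" example using Python's built-in sqlite3 module. The table, the city names, and the numbers are all made up for illustration, and the SQL is written by hand; it only shows the query a data assistant would have to produce, not how such a system derives it.

```python
import sqlite3

# Hypothetical table and made-up values, used only to illustrate the gap
# between a question and the query that answers it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Alphaville", 1000), ("Betatown", 2500)],
)

question = "How many people live in Betatown?"

# The phrase "how many people live in" never mentions the column name,
# so a data assistant has to map it to `population` before it can query.
sql = "SELECT population FROM cities WHERE name = ?"
answer = conn.execute(sql, ("Betatown",)).fetchone()[0]
```

Nothing in the question matches the schema literally, which is why the translation step needs more than keyword lookup.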
The 2nd Day @tpdl2022 main conf. Keynote titled “Democratizing Data Access: What if we could just talk to our data?” by Georgia Koutrika (@gkoutrika) is happening now. @WebSciDL #tpdl2022 pic.twitter.com/e32Iq4PrBs
— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
.@gkoutrika keynote at #tpdl2022 on democratizing access to data and how to allow users to explore data more easily pic.twitter.com/uIAz3KU45S
— Nicola Ferro (@frrncl) September 22, 2022
@gkoutrika opened Day 3 of @tpdl2022 with her keynote “Democratizing Data Access: What if we could just talk to our data?” pic.twitter.com/zmRATrwzWE
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Session 4: FAIR and Open Data
Leon Martin is now presenting their work titled “RDFtex: Knowledge Exchange between LaTeX-based Research Publications and Scientific Knowledge Graphs”
— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
Leon Martin kicked off Session 4 @tpdl2022 by presenting their creation of RDFTex, which allows users to import and export contributions to SciKGs to be referenced by other researchers. Code: https://t.co/Qa470Ux40u
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Leon Martin from the University of Bamberg presented “RDFtex: Knowledge Exchange between LaTeX-based Research Publications and Scientific Knowledge Graphs”. He presented RDFtex, a tool that allows the import and export of contributions from and to Scientific Knowledge Graphs (SciKGs). RDFtex can be integrated into automated workflows and implemented with only four additional LaTeX commands.
Session 4 @tpdl2022 Presentation 2: “Analysing User Involvement in Open Government Data Initiatives” By Dagoberto Jose Herrera-Murillo, Abdul Aziz, Javier Nogueras-Iso and Francisco Javier Lopez-Pellicer
— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
Dagoberto Jose Herrera-Murillo and Abdul Aziz from ODECO (@ODECO_etn) and IAAA Lab (@IAAA_Lab) presented “Analyzing User Involvement in Open Government Data Initiatives”. They are working to shift from the traditional supplier-driven data catalogs that are currently used by Open Data Initiatives (ODIs) to a user-driven solution. They also explored user interactions with ODI portals in the EU. Ideally, their findings would influence portals to modify their approaches to be more user-driven.
Ian Bigelow presented "Conducting the Opera: The Evolution of the RDA Work to Share-VDE Opus and BIBFRAME Hub" https://t.co/E5hLXmFWWd #TPDL2022 pic.twitter.com/eYPA4XDE26
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Ian Bigelow from the University of Alberta Library presented “Conducting the Opera: The Evolution of the RDA Work to the Share-VDE Opus and BIBFRAME Hub”. His presentation focused on the developmental trajectory that has led to the current status of the RDA and BIBFRAME models.
How can users better access Oral History Interviews? Maria Vrachliotou presented "Ontology-based metadata integration for Oral History Interview". They have evaluated EBUCore and CIDOC CRM to be implemented on an existing OH collection. https://t.co/xr4WA5gvxW #TPDL2022 pic.twitter.com/orJk4ixI66
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Oral history collections contain a large variety of contents and metadata; however, users do not want to go through long interviews to find information. Maria Vrachliotou from the Department of Archives at Ionian University presented “Ontology-based metadata integration for Oral History Interviews”. She presented a solution that indexes and segments interviews as well as creates a model for semantic representation of interviews using ontologies. The next step in the project is to test the model on existing oral history collections.
FAIR principles are important but can be perceived as a burden to researchers. To wrap up Session 4, Lyudamila Balakireva presented "Making FAIR Practices Accessible and Attractive", their efforts to create a FAIR-ready data mgmt framework. https://t.co/cGZILwaOEN #TPDL2022 pic.twitter.com/CLZmRO38zR
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
How can organizations make FAIR principles, typically viewed as cumbersome, accessible and attractive to researchers to encourage adoption? Lyudmila Balakireva from Los Alamos National Laboratory presented “Making FAIR Practices Accessible and Attractive”. They created a FAIR-ready data management framework that makes it easier to require FAIR principles through automation.
Booster Session 2
After the lunch break, the second booster session of the conference featured eight presentations of papers accepted for the “Accelerating Innovation Papers” track. As with the first booster session, each of these papers was also exhibited as a poster at the end-of-day poster session.
#TPDL2022 Booster Session 2: Paper 1 “Solutions for data sharing and storage: a comparative analysis of data repositories” By Joana Rodrigues and Carla Teixeira Lopes
— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
Session 5: Scholarly Communication II
Session 5 @tpdl2022: “Scholarly communication” just started! Presentation 1: “The way we cite: common metadata used across disciplines for defining bibliographic references” By Erika Alves dos Santos et al.
— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
@essepuntato presenting "The way we cite: common metadata used across disciplines for defining bibliographic references" @tpdl2022 #tpdl2022 pic.twitter.com/fmTQDEs3hj
— Andrea Mannocci (@andremanno) September 22, 2022
Citations are standardized in theory, but in practice there are a wide variety of "standards" across journals and disciplines. To start Session 5 @tpdl2022, @essepuntato presented their work "The Way We Cite", analyzing the metadata used in bibliographic references. #TPDL2022 pic.twitter.com/WZscbKfopI
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
The first presentation of session 5 was by Silvio Peroni, an associate professor at the University of Bologna in Italy. He presented their work titled “The way we cite: common metadata used across disciplines for defining bibliographic references” (slides). He discussed how they looked into various citation techniques that were used in articles to reference various types of entities. Despite the fact that citations are standardized, numerous "standards" exist amongst journals and disciplines in practice. They examined around 34k bibliographic references extracted from a vast set of journal articles on 27 different subject areas which enabled them to highlight the most used metadata for defining bibliographic references across the subject areas.
Scholarly publications include links to code repos. But how often? Today I am presenting our work studying the "Rise of GitHub in Scholalry Publications" @TPDL Proceedings: https://t.co/Y1VaWyXO5I arXiv: https://t.co/nDjZTBPPVq Slides: https://t.co/rCZpswU6dl @WebSciDL
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Next, Emily Escamilla (@EmilyEscamilla_) from Old Dominion University’s Web Science and Digital Libraries research group presented her work titled “The Rise of GitHub in Scholarly Publications” (slides) where she discussed how GitHub is increasingly being referenced in scholarly publications, highlighting the importance of archiving GitHub repositories for reproducibility. She emphasized that although links to Git Hosting Platforms (GHPs) are becoming more prevalent in scholarly publications, GHPs are not permanent. For instance, they found out that 1 out of every 5 publications in arXiv in 2021 has at least one link to GitHub. She mentioned the need for improved archiving strategies for GHPs to preserve scholarly records as she wrapped up her talk.
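Counting publications that link to GitHub starts with pulling repository links out of the text. The sketch below shows one simplified way to do that with a regular expression; the pattern, the function name, and the normalization choices are assumptions for illustration and are not the extraction pipeline used in the paper.

```python
import re

# Matches links of the form github.com/<owner>/<repo>; a simplified pattern
# that ignores deeper paths, gists, and other GitHub subdomains.
GITHUB_REPO = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def github_repos(text):
    """Return the distinct (owner, repo) pairs linked from a publication's text."""
    repos = set()
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")  # drop a sentence-ending period
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        repos.add((owner.lower(), repo.lower()))
    return repos
```

A publication then counts as linking to GitHub when `github_repos` returns a non-empty set for its text.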
@essepuntato also presented "Structured References from PDF Articles: Assessing the Tools for Bibliographic Reference Extraction and Parsing"They found that Anystyle and Cermine were the most successful tools overallhttps://t.co/oNuZOi9VCK#TPDL2022— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Silvio Peroni, the session's first presenter, gave his second presentation titled “Structured references from PDF articles: assessing the tools for bibliographic reference extraction and parsing” (slides). He pointed out in this talk that more literature means more data, which also translates into more metadata, and that publishers (big or small) have to put a lot of effort into successfully extracting the metadata in structured forms. Adopting off-the-shelf bibliographic reference extraction tools that automatically extract references from PDF files is the authors' suggested remedy for the problem. They evaluated different tools that can be used in extracting and parsing bibliographic references of academic papers and found out that Anystyle and Cermine were the tools that worked the best overall.
What kinds of institutions benefit most from OA?@davejavupride presented "Cui Bono? Cumulative Advantage in Open Access Publishing", an analysis of the how a variety of institutions consume and produce OA literature https://t.co/mFcbTZQX5F#TPDL2022 pic.twitter.com/SnYNcMq5cw
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
The next presentation was by David Pride from The Knowledge Media Institute at The Open University on their work titled “Cui Bono? Cumulative Advantage in Open Access Publishing”. Their study looked at open access (OA) production, OA consumption, and who benefits the most from current OA publishing policies. He discussed how they tested whether institutional prestige variables correlate with the consumption of OA resources. Based on their data, they found a moderate to strong correlation for OA production and consumption, with a stronger correlation for OA consumption by higher-ranked institutions than by lower-ranked ones. This indicates that existing OA efforts are more beneficial to prestigious institutions.
To wrap up the session, @CesareConcordia presented "The SSH Data Citation Service, A Tool to Explore and Collect Citation Metadata"https://t.co/jNINgZq0h2#TPDL2022 pic.twitter.com/M4ydpgF4Cf
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Cesare Concordia from the Institute of Information Science and Technologies (IIST), Italian National Research Council (CNR) wrapped up the session with his presentation “The SSH Data Citation Service, a tool to explore and collect citation metadata”. He presented the SSH Data Citation Service (DCS), a piece of software for locating and assessing metadata about digital objects, particularly datasets, that are referenced in citation strings. The DCS follows a traditional client-server architecture: the client, known as the Citation Metadata Viewer, displays the metadata and offers actionability functions, while the backend handles the discovery and management of metadata. The client and server components interact through a REST API.
Session 6: Text Analysis and Extraction
Session 6 @tpdl2022: “Text Analysis and extraction” just started!Presentation 1: “Declarative text analysis through SQL”By Yannis Foufoulas, Eleni Zacharia, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
Yannis Foufoulas kicked off Session 6 @tpdl2022 with "DETEXA: Declarative Extensible Text Exploration and Analysis"They found that DETEXA was effective and efficient compared to other approaches and outperformed PySpark in most settings https://t.co/05fgkAs1KO#TPDL2022 pic.twitter.com/P65F50RfQ6— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Text analysis on digital libraries involves diverse data sources and complex processing tasks. Yannis Foufoulas from the National and Kapodistrian University of Athens presented “Declarative text analysis through SQL” (nominated for Best Paper Award). He proposed DETEXA, a library of reusable User-Defined Functions (UDFs) for text analysis built on top of YeSQL. Their approach outperformed PySpark in most settings.
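To give a feel for the general idea of exposing text analysis as SQL user-defined functions, here is a minimal sketch using Python's built-in `sqlite3` module. It is not DETEXA or YeSQL and says nothing about how they are implemented; the function name `token_count`, the table `docs`, and the sample rows are all hypothetical and chosen only for illustration.

```python
import re
import sqlite3


def token_count(text):
    """Count word tokens in a string; a NULL (None) input counts as zero."""
    if text is None:
        return 0
    return len(re.findall(r"\w+", text))


def run_demo():
    """Register the UDF, load two hypothetical documents, and query them."""
    conn = sqlite3.connect(":memory:")
    # Make the Python function callable from SQL under the name token_count.
    conn.create_function("token_count", 1, token_count)
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs (id, body) VALUES (?, ?)",
        [(1, "Digital libraries hold text."), (2, "Short note")],
    )
    # The analysis step is expressed declaratively, as part of the query.
    rows = conn.execute(
        "SELECT id, token_count(body) FROM docs ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

Calling `run_demo()` returns one `(id, token count)` pair per document, so the text processing lives inside the query rather than in a separate script.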
@astroblend presented "Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features""TL;DR: We build a great model for extracting figures and captions in scientific lit"https://t.co/9rx2GfyA5q#TPDL2022 pic.twitter.com/CIgClWqhGF— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Jill Naiman from the University of Illinois, Urbana-Champaign presented “Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features”. They built a model for extracting figures and captions from scientific literature. They achieved good precision with low false positive rates and F1-scores of 90.9%, which is a significant improvement over other state-of-the-art methods.
How do authors indicate funder information in publications? How can that be extracted?Jonas Mielck presented "Extracting Funder Information from Scientific Papers - Experiences with Question Answering"— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Funding acknowledgments do not always appear in articles in a standardized format. How can we automatically identify funder recognition? Jonas Mielck from stackOcean presented “Extracting funder information from scientific papers - experiences with question answering”. The three main approaches typically used to solve this type of problem are rule-based methods, regular expressions, and machine learning (including language models). They decided to use a question-answering approach and achieved an accuracy of approximately 0.8.
Triet Ho Anh Doan presented "MINE – Workspace as a Service for Text Analysis"They created a search portal and a text analysis workspace for digital humanities scientists to work with datahttps://t.co/GXU8k1dBnp#TPDL2022 pic.twitter.com/cBPvR2q59n— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Triet Ho Anh Doan from GWDG presented “MINE - Workspace as a Service for Text Analysis”. They worked to create MINE, a search portal and a text analysis workspace for digital humanities scientists. The workspace allows users to build their own analysis workflows and run them within the workspace.
Maria Inës Bico wrapped up Session 6 @tpdl2022 with "Early Experiments on Automatic Annotation of Portuguese Medieval Texts"https://t.co/POt8Be0KY7#TPDL2022 pic.twitter.com/swoMW9Rk02
— Emily Escamilla (@EmilyEscamilla_) September 22, 2022
Maria Inês Bico from the University of Lisbon presented “Early Experiments on Automatic Annotation of Portuguese Medieval Texts”. They explained their early efforts to manually annotate a large text, train an automatic annotation model, and test this model through two iterations of experimentation. The results of the second automatic annotation model showed 77.3% precision with a textual variant of the same text and 82.4% precision with a new, unseen text.
Poster Session 2
The presenters from the earlier booster session had the opportunity to display their posters during the second poster session which was held in the conference lobby.
The second poster session @tpdl2022 is happening right now.The authors also presented their work at the booster session (3-mins) earlier today.https://t.co/WS3PwmH8os (thread) pic.twitter.com/K2TygUm88E— Himarsha R. Jayanetti (@HimarshaJ) September 22, 2022
Day 4: 2022-09-23
Session 7: Open Science
Esteban González kicked off Session 6 @tpdl2022 with "FAIROs: Towards FAIR assessment in Research Objects"They analyzed the FAIRness of research objects and their resources using available metadata https://t.co/cz2qKS8CFXhttps://t.co/lHgvaqxV3N#TPDL2022 pic.twitter.com/jcGGKYAKcE— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
The first presentation of session 7 was by Esteban González from the Polytechnic University of Madrid. He presented their work titled “FAIROs: Towards FAIR assessment in Research Objects”. He discussed research objects such as datasets, software, and publications that can be used to model the scientific production of research. When publishing their research results, academics are increasingly using the FAIR principles as guidance, but the results of scientific research are rarely published separately. He introduced FAIROs, a method for evaluating how well a Research Object adheres to the FAIR principles. They discussed the benefits and drawbacks of various scoring systems and validated FAIROs against 165 Research Objects.
Can we send notification to all Belgian institutional repos about publication <-> research data linkages?@hochstenbach presented "Event Notification in Value-Adding Networks" and their work to implement a service node https://t.co/LybN2jhN8N#tpdl2022 @hvdsomp pic.twitter.com/xsT7ISBs9C
— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Next, Patrick Hochstenbach from Ghent University in Belgium presented his work titled “Event Notification in Value-Adding Networks”, where he discussed the criteria for interoperability when using Linked Data Notifications to exchange real-time life cycle information about web resources referred to as “artifacts” (any research outputs like datasets, software, preprints, and peer-reviewed articles). He also presented a use case where they demonstrated how to leverage a national service node to distribute Scholix data-literature links to a network of institutional repositories in Belgium.
Paula Oset Garcia from @EoscPillar presented "Developing the EOSC-Pillar RDM Training and Support Catalogue"She described the process of defining metadata in an effort to curate training and support materials and make them more findable. https://t.co/aixWhT0jcC#TPDL2022 pic.twitter.com/iWsbzSHlaX— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Paula Oset Garcia from EOSC-Pillar presented her work titled “Developing the EOSC-Pillar RDM Training and Support Catalogue”. She talked about the proposed web application catalog that includes operational and training resources for research data management (RDM) and other FAIR and open science actors. She also mentioned the challenges currently involved, such as metadata standards, curation, and quality control.
Provided scholarly registries are authoritative sources, will their contents always be available?@andremanno presented "Knock knock! Who’s there? A study on long-term availability of scholarly publications" https://t.co/n1WJnswBX0#TPDL2022 pic.twitter.com/kIeydBSKuL
— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
The next presentation was by Andrea Mannocci from the Institute of Information Science and Technologies (IIST), Italian National Research Council (CNR) on their work titled “Knock knock! Who’s there? A study on long-term availability of scholarly publications”. He discussed how scholarly repositories are quite dynamic and are often updated, moved, merged, or discontinued, which makes them as prone to link rot over time as any other web resource. Based on data extracted from four well-known scholarly registries and over 13k unique repository URLs, they found that one out of every four repositories registered in scholarly registries is inaccessible.
To wrap up Session 6, Ivan Heibi presented "Enabling Portability and Reusability of Open Science Infrastructures"He proposed an approach to transform OpenCitations from a monolithic structure to a distributed framework with APIshttps://t.co/aNGUmwRx3o#TPDL2022 pic.twitter.com/njEECluOHF— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
To wrap up the session, Ivan Heibi from the University of Bologna presented “Enabling Portability and Reusability of Open Science Infrastructures”. As implied by the title, the main topic of his presentation was how to create an open science infrastructure that is distributed and containerized, making it easier to reuse, replicate, and port to different environments. He discussed their methodology's four key steps (analysis, design, definition, and managing and provisioning), accompanied by examples of potential applications to OpenCitations.
Session 8: NLP and Recommendation
Session 8 @tpdl2022: “NLP and Reccomendation” just started!Presentation 1: “Implementation and evaluation of a multilingual search pilot in the Europeana digital library”By Mónica Marrero and Antoine Isaac— Himarsha R. Jayanetti (@HimarshaJ) September 23, 2022
Mónica Marrero kicked off Session 8 @tpdl2022 with her presentation "Implementation and Evaluation of a Multilingual Search Pilot in the Europeana Digital Library"https://t.co/0919NsfXxr#TPDL2022 pic.twitter.com/7TWhyh2kjO
— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Session 8 began with a presentation by Mónica Marrero from the Europeana Foundation titled “Implementation and Evaluation of a Multilingual Search Pilot in the Europeana Digital Library”. Europeana is a digital library that aggregates content from libraries, archives, and museums from all over Europe. She discussed how the design and implementation of a multilingual information retrieval system, based on translating queries and metadata to English, is part of Europeana's strategy for improving multilingual experiences. Their work tests query translation from Spanish to English for the website's Spanish-language version in order to surface results that have English metadata.
Eman Abdelrahman presented "Improving Accessibility to Arabic ETDs Using Automatic Classification"She created a deep learning pretrained language model that achieved an accuracy of approx 0.83 in classifying ETDshttps://t.co/3CRxpQViq0#TPDL2022 pic.twitter.com/ILbzle65kX— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Eman Abdelrahman from Virginia Tech presented her work titled “Improving Accessibility to Arabic ETDs Using Automatic Classification”. She talked about how they used data from the AskZad Digital Library to collect key metadata from Arabic Electronic Theses and Dissertations (ETDs). She also discussed the use of several machine learning and deep learning approaches for automatic subject classification of those ETDs.
You wrote a research paper. Where should you publish it?Elias Entrap from @oa_bison presented "B!SON: A Tool for Open Access Journal Recommendation"B!SON recommends the most applicable Open Access journals based on your title and abstracthttps://t.co/PrziPGXZbh#TPDL2022 pic.twitter.com/ILzN4iIPN1— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
The next presentation was by Elias Entrup from the Leibniz Information Centre for Science and Technology. He presented his work titled “B!SON: A Tool for Open Access Journal Recommendation”. He pointed out that as more open-access journals become available, it is harder to locate the best venue for publishing research findings. B!SON is a web-based journal recommendation system that recommends the most applicable open-access journals based on the title, abstract, keywords, and references provided by the user.
@saber_zerhoudi presented "Simulating User Querying Behavior using Embedding Space Alignment"https://t.co/7AxEuuc0Nj#TPDL2022 pic.twitter.com/dzu7EKlywj
— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Saber Zerhoudi from the University of Passau was next, presenting “Simulating User Querying Behavior using Embedding Space Alignment”. When user interaction data is inadequate, simulation is used as an experiment to provide Information Retrieval (IR) systems and digital libraries with more realistic directives. Through the course of his talk, he addressed the questions of whether embedding alignment approaches can be used to simulate user querying behavior and to what extent simulated query search sessions can replace or complement sample-based ones.
What if we could automate curation of VR "Museums" with image features and clustering?Florian Spiess wrapped up Session 8 @tpdl2022 with "Automation Generation of Coherent Image Galleries in Virtual Reality"https://t.co/fJ9CZKHc9ihttps://t.co/uO89l4csHm#TPDL2022 pic.twitter.com/0sCro7ajxk— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
The session concluded with the presentation titled “Automatic Generation of Coherent Image Galleries in Virtual Reality” by Florian Spiess from the University of Basel. He discussed the rapidly growing size of multimedia collections in archives and museums and the significance of making such vast collections not just available but also accessible. He presented their proposal to use Self-Organizing Maps (SOMs) to automatically create coherent image galleries, enabling intuitive, user-driven exploration of massive multimedia collections in virtual reality (VR). More than 300 people participated in a successful pilot test of the proposed system at the Basel Historical Museum.
Session 9: Research and CH Data
Lots of people use lists like Alexa Top 1 million Sites in research. But are they a good choice?To kick off Session 9 @tpdl2022, @daswesen presented "Analyzing the Web: Are Top Websites Lists a Good Choice for Research?"https://t.co/eg5oTcZiqg#TPDL2022 pic.twitter.com/DuSKP33qxP— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Due to the vast nature of the Web, selecting a representative sample of web pages for research is a difficult task. Many researchers use Alexa Top 1 million Sites and other top websites lists in their research. Tom Alby from Humboldt University of Berlin presented “Analyzing the Web: Are Top Websites Lists a Good Choice for Research?” They found that top sites lists miss frequently visited websites. As a result, they created a heuristic-driven alternative based on the Common Crawl.
Are DOIs permanent? Why not?Jiro Kikkawa presented "Analysis of the Deletions of DOIs" where they analyzed broken DOIs.They found over 700k broken DOIs with typos/incorrect formatting being one of the culpritshttps://t.co/FraqUDtU40#TPDL2022 pic.twitter.com/LfqTnpdgja— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Digital Object Identifiers (DOIs) are intended to be persistent identifiers; however, they are not always persistent. Jiro Kikkawa from the University of Tsukuba presented “Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent?” The authors refer to DOIs that no longer resolve as deleted DOIs. They investigated the number and content of deleted DOIs and provided guidance for avoiding deleted DOIs and making DOIs more stable. They found over 708,000 DOIs that existed in March 2017 and did not exist in January 2021. They also found that typos and incorrect formatting contributed to the appearance of deleted DOIs.
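As a small illustration of how typos and formatting problems can be caught before a DOI is recorded, here is a minimal sketch of a syntax check in Python. It is not the method used in the paper. The pattern is a simplified assumption based on the common shape of modern DOIs (`10.`, a 4 to 9 digit registrant code, a slash, and a suffix), the helper names are hypothetical, and a string that passes this check may still fail to resolve.

```python
import re

# Assumed, simplified shape of a modern DOI: "10.", a 4-9 digit registrant
# code, a slash, then a non-empty suffix containing no whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common resolver URLs and labels that are often pasted along with a DOI.
RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Strip surrounding whitespace and a leading resolver URL or 'doi:' label."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in RESOLVER_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def looks_like_doi(raw):
    """Return True if the normalized string matches the assumed DOI shape."""
    return bool(DOI_PATTERN.match(normalize_doi(raw)))
```

A check like this would flag an embedded space or a letter "O" typed in place of a zero in the prefix, but it cannot tell whether a well-formed DOI was ever registered.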
Robert B. Allen presented "Implementation Issues for a Highly Structured Research Report"He created an instance of a successful model based on Pasteur's experiment. This work is a step towards direct representation of research reporthttps://t.co/JqOmkAL8Wu#TPDL2022 pic.twitter.com/Vx0KMjYFug— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Robert B. Allen presented “Implementation Issues for a Highly Structured Research Report”. He explained that research reports contain structured knowledge. He presented the application of a highly structured report model to Pasteur’s swan-neck flask experiment and the challenges they have faced. Overall, this study is part of ongoing work and a step towards the direct representation of research reports.
Chiara Mannari presented "PH-Remix Prototype"They created a prototype platform that combined a film archive, AI extraction, and a Remix applicationProceedings: https://t.co/CcpjbNz4SE— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
Chiara Mannari from the University of Pisa presented “PH-remix Prototype: A non-relational approach for exploring AI-generated content in audiovisual archives”. They created PH-remix, a prototype platform that combines a film archive, AI extraction, and a remix application. They leverage AI techniques for searching content in audiovisual archives. Their presentation also included a demonstration of the functionality of PH-remix.
Narratives allow us connect things in a story. But is that story plausible?For the last presentation of @tpdl2022, @HermannKroll presented "On Dimensions of Plausibility for Narrative Information Access to Digital Libraries" https://t.co/PODouwFL0M#TPDL2022 pic.twitter.com/5BhuUAwgub— Emily Escamilla (@EmilyEscamilla_) September 23, 2022
To wrap up TPDL 2022, Hermann Kroll from the Institute for Information Systems at TU Braunschweig presented “On Dimensions of Plausibility for Narrative Information Access to Digital Libraries”. Narratives allow us to communicate information in a sequence so that someone else can follow our line of thinking. In narrative information access, there is a need to bind the narrative to real-world data and ensure the bindings are context-compatible. This presentation dug into the challenges of determining the plausibility of a narrative and proposed a set of dimensions that need to be considered in narrative information access.
Closing and TPDL 2023
Following the Research and CH Data session, TPDL 2022 came to an end. The next TPDL will be held at the University of Zadar, Croatia. This was our first time attending an in-person academic conference. We had the opportunity to present our research in front of a live audience and meet a number of academic experts from around the globe. Overall, the TPDL 2022 conference in Padua, Italy, was a great experience and it was an honor to represent the WS-DL research group!
And it is a wrap! #TPDL 2022 came to an end. We hope you enjoyed the conference and the good company.Now get ready for the next great edition at the University of Zadar, Croatia. #TPDL2023— TPDL2022 (@tpdl2022) September 23, 2022
- Himarsha Jayanetti (@HimarshaJ) and Emily Escamilla (@EmilyEscamilla_)