Web Science and Digital Libraries Research Group

Posts

Showing posts with the label Digital Library

2024-12-31: The 27th International Symposium on Electronic Theses and Dissertations (ETD 2024) Trip Report

By Lamia Salsabil - December 31, 2024

ETD 2024 took place in Livingstone, Zambia

I had the privilege of participating in the 27th International Symposium on Electronic Theses and Dissertations (ETD 2024), which took place as a hybrid event in Livingstone, Zambia, from November 4th to 6th, hosted by the University of Zambia. The conference provided a unique opportunity for professionals in the fields of digital libraries, open science, and graduate education to gather, collaborate, and explore advancements in Electronic Theses and Dissertations (ETDs). The hybrid nature of the event made it possible for global audiences to participate, with sessions spanning a wide range of topics, including ETD implementation use cases, open access to ETDs, the intersection of open science and ETDs, long-term preservation, the global visibility of ETDs, and the transformative role of large language models in ETD research.

Day 1 at ETD 2024

Workshops

ETD 2024 kicked off with workshops designed for all experience levels. "ETDs 101: N...

2023-06-03: A Trip report on Bill Ingram's Visit to ODU

By Lamia Salsabil - May 29, 2023

On Friday, March 24, 2023, we had the pleasure of hosting William A. Ingram, associate dean and executive director for information technologies in the University Libraries at Virginia Tech. During his visit, he gave a presentation entitled "Maximizing Access to Long Scholarly Documents." The talk provided an overview of his recent research, focusing on data analysis, automatic metadata extraction, and strategies for enhancing accessibility to long scholarly documents.

Bill Ingram presenting "Maximizing Access to Long Scholarly Documents"

Graduate Seminar Talk by Bill Ingram

During his talk, he presented his work "Building A Large Collection of Multi-domain Electronic Theses and Dissertations," on making the collection of long scholarly documents computationally driven and excavating knowledge from this rich information source, focusing on electronic theses and dissertations.

A Large Collection of Mu...

2022-12-12: Trip Report -- Visit to Virginia Polytechnic Institute and State University (Virginia Tech)

By Muntabir Choudhury - December 12, 2022

Virginia Polytechnic Institute and State University

In 2019, Dr. Jian Wu, my research advisor, in collaboration with Bill Ingram and Dr. Ed Fox at Virginia Tech, received a grant from IMLS (Grant# LG-37-19-0078-198) on mining book-length documents, represented by electronic theses and dissertations (ETDs), using machine learning and deep learning. The project addresses the lack of research on book-length documents, such as extracting metadata and segmenting scanned and born-digital long documents, using ETDs as a case study. In early summer 2022, our ODU team was invited to visit the Digital Library Research Lab (DLRL) at Virginia Tech Computer Science. Our agenda included a meeting with the Virginia Tech DLRL team, student presentations from CS 5604: Information Storage and Retrieval, a graduate seminar talk by Dr. Jian Wu, and a focused discussion with the ETD team. Dr. Wu delivered a presentation titled "Towards Automatically Understanding Scientific Papers," whic...

2021-09-19: Conditional Random Field with Textual and Visual Features to Extract Metadata From Scanned ETDs

By Muntabir Choudhury - September 19, 2021

As our previous blog described, Electronic Theses and Dissertations (ETDs) from before 1997, and a significant fraction of ETDs after 1997, are scanned from physical copies. These ETDs are valuable for digital library preservation, but they must be indexed to be accessible. Many ETD repositories come with incomplete, little, or no metadata, which makes the documents hard to find. For example, advisor names appearing on the scanned ETDs may not be available in the metadata provided in the library repository. An automatic approach is therefore needed to extract metadata from scanned ETDs. We proposed a conditional random field (CRF) based sequence tagging model that combines textual and visual features. The source code can be found in our GitHub repository.

Introduction

Automatic metadata extraction is important to build scalable digital library search engines. Most existing tools such as GROBID [1], CERMINE [2], and ParsCit [3] developed and applied t...
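To make the idea of combining textual and visual features concrete, here is a minimal sketch of per-token feature dictionaries of the kind a CRF sequence tagger can consume. It is not the feature set from our repository: the `Token` class, the `token_features` function, and every feature name below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One OCR token with a hypothetical bounding box, in pixels."""
    text: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    height: float  # box height, used as a rough proxy for font size


def token_features(tokens, i, page_width, page_height):
    """Build a feature dict for tokens[i], mixing textual and visual cues."""
    tok = tokens[i]
    feats = {
        # Textual features (illustrative choices)
        "lower": tok.text.lower(),
        "is_upper": tok.text.isupper(),
        "is_title": tok.text.istitle(),
        "is_digit": tok.text.isdigit(),
        # Visual features, normalized by page size (illustrative choices)
        "rel_x": round(tok.x / page_width, 2),
        "rel_y": round(tok.y / page_height, 2),
        "rel_height": round(tok.height / page_height, 3),
    }
    # Neighboring tokens give the tagger some sequence context.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].text.lower()
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].text.lower()
    else:
        feats["EOS"] = True  # end of sequence
    return feats


def sequence_features(tokens, page_width, page_height):
    """Feature dicts for a whole cover page, one per token."""
    return [
        token_features(tokens, i, page_width, page_height)
        for i in range(len(tokens))
    ]
```

A list of such dictionaries, paired with one label per token (for example title, author, or year), is the usual input shape for a linear-chain CRF trainer. The intuition is that position and size on the page help where the text alone is ambiguous.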

2020-06-07: Regular Expression — A Powerful Tool to Parse Text with Visually Identifiable Patterns

By Muntabir Choudhury - June 07, 2020

In the previous blog, I discussed how tesseract-OCR performed on scanned Electronic Theses and Dissertations (ETDs). As described in my earlier blog, the process started by converting the cover pages of scanned ETDs into images. Then tesseract-OCR was applied and the extracted text was saved into text files. We also saw that OpenCV OCR failed on scanned ETDs. We could try a widely used open-source tool such as GROBID, designed for scholarly papers. However, this article shows that GROBID is intended for extracting bibliographic metadata from born-digital academic papers. Finally, we decided to apply tesseract-OCR to extract the text from the cover page of scanned ETDs. Afterward, a series of regular expressions (RegEx) was applied to extract seven metadata fields, including titles, authors, academic programs, institutions, advisors, and years. In this blog, I will introduce how RegEx can be a powerful tool to quickly p...
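As a rough illustration of the RegEx step, the sketch below pulls a few fields out of OCR text from a cover page using Python's standard `re` module. The sample text, the patterns, and the `extract_fields` helper are all hypothetical and are not the expressions used in the original post; real cover pages vary much more and need more patterns.

```python
import re

# Hypothetical OCR output for a cover page (made up for this example).
SAMPLE_COVER_TEXT = """\
A STUDY OF EXAMPLE SYSTEMS
by
Jane Q. Doe
A Dissertation Submitted to the Faculty of
Example State University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Advisor: Dr. John Smith
May 1995
"""

# A four-digit year starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
# A line such as "Advisor: Dr. John Smith"; the title prefix is optional.
ADVISOR_RE = re.compile(
    r"^(?:Advisor|Director|Chair)[ \t]*:[ \t]*(?:Dr\.[ \t]*)?(.+)$",
    re.IGNORECASE | re.MULTILINE,
)
# A line containing only "by", followed by the author on the next line.
AUTHOR_RE = re.compile(r"^by[ \t]*\n[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def extract_fields(text):
    """Return the fields found in the text; missing fields are omitted."""
    fields = {}
    # Assume the first non-empty line is the title.
    for line in text.splitlines():
        if line.strip():
            fields["title"] = line.strip()
            break
    author = AUTHOR_RE.search(text)
    if author:
        fields["author"] = author.group(1).strip()
    advisor = ADVISOR_RE.search(text)
    if advisor:
        fields["advisor"] = advisor.group(1).strip()
    year = YEAR_RE.search(text)
    if year:
        fields["year"] = year.group(0)
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_COVER_TEXT))
```

Each field gets its own small pattern anchored on a visual cue ("by" on its own line, a label followed by a colon), which keeps the expressions easy to adjust when OCR noise breaks one of them.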