2023-06-03: A Trip Report on Bill Ingram's Visit to ODU

On Friday, March 24, 2023, we had the pleasure of hosting William A. Ingram, associate dean and executive director for information technologies in the University Libraries at Virginia Tech. During his visit, he gave a presentation entitled "Maximizing Access to Long Scholarly Documents." The talk gave an overview of his recent research, focusing on data analysis, automatic metadata extraction, and strategies for making long scholarly documents more accessible.

Bill Ingram presenting "Maximizing Access to Long Scholarly Documents"


Graduate Seminar Talk by Bill Ingram

During his talk, he presented his work "Building A Large Collection of Multi-domain Electronic Theses and Dissertations," which aims to make collections of long scholarly documents computationally accessible and to extract knowledge from this rich information source, focusing on electronic theses and dissertations.


A Large Collection of Multi-domain Electronic Theses and Dissertations


Bill provided an update on an ongoing effort to build a collection of more than 530,000 Electronic Theses and Dissertations (ETDs), encompassing both full texts and associated metadata. The primary objective of this work is to bridge the accessibility gap between long and short textual documents, while also opening up new research possibilities for the scholarly community. To achieve this, an ETD Crawling and Ingestion Framework was developed to automate the retrieval of ETD metadata and PDFs from university libraries.


Applications of Data Analysis on Scholarly Long Documents


Theses and dissertations play a vital role in documenting the research conducted by graduate students, usually as a requirement for their degree. These documents hold valuable information that represents the students' investigation into their research topics. Bill emphasized the importance of Electronic Theses and Dissertations (ETDs) and the challenges of accessing knowledge from these long documents. He also highlighted the significance of making these documents more computationally accessible.


Metadata quality improvement


Bill also talked about his research on metadata quality improvement. Metadata often suffers from issues like incompleteness, inconsistency, and errors. Their study "MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries" focuses on improving scholarly metadata by automatically detecting, correcting, and standardizing it, using electronic theses and dissertations (ETDs) as a case study.
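
To make the three kinds of metadata problems concrete, here is a minimal, rule-based sketch of detecting missing fields and standardizing a year value in a single record. It does not reproduce the methods of the MetaEnhance study; the field names, the `REQUIRED_FIELDS` list, and the helper functions are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical set of fields we expect every ETD record to carry.
REQUIRED_FIELDS = ("title", "author", "year", "university")


def find_missing(record):
    """Detect incompleteness: return required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def normalize_year(raw):
    """Standardize a free-text date to a four-digit year, or None if none is found."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", str(raw or ""))
    return match.group(1) if match else None


def clean_record(record):
    """Collapse stray whitespace in every field and standardize the year field."""
    cleaned = {k: " ".join(str(v).split()) for k, v in record.items() if v is not None}
    if "year" in cleaned:
        year = normalize_year(cleaned["year"])
        if year:
            cleaned["year"] = year
        else:
            # Drop an unparseable year so it is reported as missing.
            del cleaned["year"]
    return cleaned


if __name__ == "__main__":
    raw = {
        "title": "  A   Study of ETDs ",
        "author": "Doe, Jane",
        "year": "May 2019",
        "university": None,
    }
    cleaned = clean_record(raw)
    print(cleaned)
    print(find_missing(cleaned))
```

A real pipeline would need far more than regular expressions, for example to correct a wrong author name or to map university name variants to one canonical form, which is where learned models become useful.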


Focused Research Discussion


Bill Ingram and the members of the WSDL lab held a focused discussion on various research problems in ETDs and other topics. During the discussion, I (Lamia) presented my research on ETD crawling and a computational reproducibility study using URLs that link to open-access datasets and software. We also explored the potential of a generalized ETD crawling pipeline for all US schools.

A PhD student, Muntabir Choudhury, shared his research on ETD Metadata Quality Improvement and Augmentation, specifically addressing issues related to classifying ETD pages in both scanned and born-digital documents. He employed techniques such as data augmentation, classification using the VGG16 model, and LayoutLMv2. He also discussed "Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations" which proposed a conditional random field (CRF) model to extract metadata from scanned documents such as ETDs.

Bill Ingram's presentation and the focused discussion were enlightening. I learned a great deal from this experience.

-- Lamia Salsabil (@liya_lamia)
