2022-12-12: Trip Report -- Visit to Virginia Polytechnic Institute and State University (Virginia Tech)

Virginia Polytechnic Institute and State University

In 2019, Dr. Jian Wu, my research advisor, in collaboration with Bill Ingram and Dr. Ed Fox at Virginia Tech, received a grant from IMLS (Grant# LG-37-19-0078-198) on mining book-length documents, represented by electronic theses and dissertations (ETDs), using machine learning and deep learning. The project addresses the lack of research on book-length documents, such as extracting metadata from and segmenting scanned and born-digital long documents, using ETDs as a case study. In early summer 2022, our ODU team was invited to visit the Digital Library Research Lab (DLRL) in the Department of Computer Science at Virginia Tech. Our agenda included a meeting with the Virginia Tech DLRL team, student presentations from CS 5604: Information Storage and Retrieval, a graduate seminar talk by Dr. Jian Wu, and a focused discussion with the ETD team. Dr. Wu delivered a presentation titled "Towards Automatically Understanding Scientific Papers," which summarized his recent work on understanding the content of scientific papers using artificial intelligence.

CS 5604: Information Storage and Retrieval

CS 5604 was offered in the Fall of 2022 by Dr. Edward A. Fox. The course covered search engines, recommender systems, machine learning, natural language processing, and software engineering practices, including Docker, Kubernetes, and CI/CD. The class was divided into five teams, each led by Ph.D. students from the DLRL. Team 1 was responsible for the database, file system, and knowledge graph. Team 2 was responsible for search, indexing, ranking, and the recommendation system. Team 3 focused on object detection (e.g., tables, figures, equations, and algorithms) in ETDs using Detectron2 and YOLOv2; they also worked on topic modeling. Team 4 was responsible for language modeling tasks such as chapter-level ETD summarization and classification, using the transformer-based BigBirdPegasus model for chapter-level summarization and SciBERT for classification. Lastly, Team 5 focused on integration, containerization, and workflow generation.
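
To give a flavor of what Team 4's summarization step might look like in practice, here is a minimal sketch of chapter-level summarization with a pretrained BigBirdPegasus checkpoint from the Hugging Face transformers library. The checkpoint name and generation settings are our illustrative assumptions, not the team's actual configuration.

```python
# A minimal sketch of chapter-level summarization with BigBirdPegasus.
# The checkpoint and generation settings are illustrative assumptions;
# the CS 5604 team may have used a different (possibly fine-tuned) model.
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

checkpoint = "google/bigbird-pegasus-large-arxiv"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(checkpoint)

def summarize_chapter(chapter_text: str, max_summary_tokens: int = 256) -> str:
    """Summarize one ETD chapter; BigBird's sparse attention handles long inputs."""
    inputs = tokenizer(chapter_text, truncation=True, max_length=4096,
                       return_tensors="pt")
    summary_ids = model.generate(**inputs, max_length=max_summary_tokens,
                                 num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summarize_chapter(open("chapter1.txt").read()))  # hypothetical input file
```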

Graduate Seminar Talk by Dr. Jian Wu

Dr. Jian Wu was invited to give a talk on his research for the graduate seminar, where he presented various research studies and solutions to research problems. He raised three crises related to scholarly big data: paper reading, reproducibility, and scientific disinformation. He then described how his research helps address these crises through work on information extraction, reproducibility assessment, scientific disinformation, and big data infrastructure.

Information Extraction From Scientific Papers

Dr. Wu highlighted three research studies on information extraction from scientific papers. He introduced his paper titled "A Comparative Study of Sequential Tagging Methods for Domain Knowledge Entity Recognition in Biomedical Papers," which investigated sequential tagging methods for extracting domain knowledge entities (DKEs) from biomedical papers on Lyme disease. This work compared models such as conditional random fields (CRFs) and bidirectional long short-term memory (BiLSTM) networks, and he explained how transformer-based models such as BERT could further boost the performance of DKE extraction.
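
To illustrate the sequential tagging setup, here is a minimal CRF tagger sketch using the sklearn-crfsuite package with BIO labels. The features, toy data, and tag names are our illustrative assumptions rather than the paper's actual configuration.

```python
# Minimal CRF sequence-tagging sketch (sklearn-crfsuite) for entity extraction
# with BIO labels. Features, toy data, and tags are illustrative, not the paper's.
import sklearn_crfsuite

def token_features(sent, i):
    """Per-token feature dictionary built from the word and its neighbors."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data: one tokenized sentence with BIO tags marking DKE spans.
train_sents = [["Borrelia", "burgdorferi", "causes", "Lyme", "disease", "."]]
train_tags = [["B-DKE", "I-DKE", "O", "B-DKE", "I-DKE", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)
print(crf.predict(X))
```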

The second paper he presented was "Theory Entity Extraction for Social and Behavioral Sciences Papers using Distant Supervision." The authors proposed a framework based on distant supervision to mitigate data sparsity issues when extracting theory entities from scientific papers with supervised learning methods.
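
The core idea behind distant supervision is to automatically label training data by projecting a curated lexicon of known entities onto raw text, accepting some label noise in exchange for scale. A toy sketch of that labeling step (with made-up seed theory names, not the paper's actual seed list) might look like this:

```python
# Toy distant-supervision labeler: project a seed lexicon of known theory
# names onto raw sentences to produce BIO training labels automatically.
# The lexicon entries here are illustrative examples, not the paper's seeds.
KNOWN_THEORIES = ["social learning theory", "theory of planned behavior"]

def distant_label(tokens):
    """Return BIO tags by greedily matching lexicon entries in the token list."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for theory in KNOWN_THEORIES:
        t_toks = theory.split()
        for i in range(len(lowered) - len(t_toks) + 1):
            if lowered[i:i + len(t_toks)] == t_toks:
                tags[i] = "B-THEORY"
                for j in range(i + 1, i + len(t_toks)):
                    tags[j] = "I-THEORY"
    return tags

sent = "We draw on Social Learning Theory to explain adoption".split()
print(list(zip(sent, distant_label(sent))))
```

The automatically labeled sentences would then serve as (noisy) training data for a supervised tagger like the CRF sketched above.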

Finally, he presented a paper titled "Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations." The authors proposed a CRF model that combines textual and visual features. Furthermore, he explained the challenges and importance of metadata extraction from scanned Electronic Theses and Dissertations (ETDs). A summary of the complete work can be found in our previous blog post.
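
Conceptually, such a model augments per-line text features with layout cues recovered from the scanned page. The hypothetical feature function below (the feature names and thresholds are our guesses, not the paper's) shows what combining the two feature types might look like; dictionaries like these could feed a CRF such as the one sketched earlier.

```python
# Hypothetical combined feature function for a CRF over scanned-page text lines:
# textual cues plus visual/layout cues (font size, position) from OCR output.
# Field names and thresholds are illustrative assumptions, not the paper's.
def line_features(line_text: str, font_size: float, y_position: float,
                  page_height: float) -> dict:
    return {
        # Text-based features
        "lower": line_text.lower(),
        "istitle": line_text.istitle(),
        "has_digits": any(c.isdigit() for c in line_text),
        "n_tokens": len(line_text.split()),
        # Visual features from the page image / OCR bounding boxes
        "font_size": font_size,
        "is_large_font": font_size > 14.0,
        "relative_y": y_position / page_height,   # 0.0 = top of page
        "in_top_third": y_position < page_height / 3,
    }
```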

Reproducibility Assessment

Dr. Wu described how open access datasets/software (OADS) could be leveraged to study computational reproducibility by introducing the paper "A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software." In this work, a hybrid classifier was presented to automatically identify OADS URLs in a scientific paper. Moreover, he analyzed how the use of OADS URLs depends on subject categories and how it has changed over the past 20 years using ETD data. A summary of the complete work can be found in our previous blog post.
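
One plausible way to realize such a hybrid classifier is to combine high-precision rules for well-known dataset and software hosts with a learned fallback model for everything else. The sketch below assumes that design; the host list, features, and training examples are illustrative, not the paper's actual setup.

```python
# Sketch of a hybrid OADS-URL classifier: rules for well-known dataset/software
# hosts, with a character n-gram model as fallback. The host list and training
# examples are illustrative assumptions, not the paper's actual configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

OADS_HOSTS = ("github.com", "zenodo.org", "figshare.com", "bitbucket.org")

train_urls = ["https://github.com/user/repo", "https://example.org/about-us"]
train_labels = [1, 0]  # 1 = links to open-access data/software, 0 = other

fallback = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression(),
)
fallback.fit(train_urls, train_labels)

def is_oads_url(url: str) -> bool:
    if any(host in url for host in OADS_HOSTS):  # high-precision rule
        return True
    return bool(fallback.predict([url])[0])      # learned fallback

print(is_oads_url("https://zenodo.org/record/123456"))
```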

The next work he presented was "Predicting the Reproducibility of Social and Behavioral Science Papers Using Supervised Learning Models," funded by DARPA. This work investigated predicting the reproducibility of social and behavioral science papers using machine learning models trained on a set of paper-level features. Dr. Wu elaborated on the top nine features for predicting reproducibility.
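
Framed as supervised learning, this amounts to training a classifier on paper-level features and inspecting which features matter most. Here is a minimal sketch under that assumption; the feature names and data are hypothetical stand-ins, not the study's curated feature set.

```python
# Sketch of reproducibility prediction as supervised learning over paper-level
# features, with feature importances for ranking. Feature names and data are
# hypothetical; the DARPA-funded study used its own curated feature set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["n_authors", "sample_size", "p_value", "has_open_data",
                 "citation_count"]
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))   # stand-in feature matrix
y = rng.integers(0, 2, 200)                 # 1 = replicated, 0 = not replicated

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance, mirroring the "top features" analysis.
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```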

Scientific Disinformation

Regarding scientific disinformation, the research was motivated by the acceleration of misinformation and disinformation on social media (MIDIS). Most studies on MIDIS have focused on political news, and little research has addressed debunking misinformation and disinformation using scientific papers. Dr. Wu presented two papers and one ongoing research project related to this problem.

Data Infrastructure

Dr. Wu concluded his talk by addressing the importance of data infrastructure. He reviewed the design, implementation, and operational experiences and lessons of CiteSeerX -- a real-world digital library search engine -- and talked about the strengths and weaknesses of its current design. He then discussed the newly proposed architecture in the paper titled "Building an Accessible, Usable, Scalable, and Sustainable Service for Scholarly Big Data." He also described his work on "Building A Large Collection of Multi-domain Electronic Theses and Dissertations." There are currently 500k ETDs in this repository, and he showed the top ten US universities that contributed the most to this collection.

Focused Research Discussion

The ETD team from the DLRL and our ODU team from the Lamp-Sys Lab (a member of WSDL) had a focused discussion on various research problems in ETDs. I (Muntabir) presented my research on ETD metadata quality improvement and on data augmentation for segmenting ETD pages. I addressed multiple issues related to ETD segmentation for scanned and born-digital documents using data augmentation, classification with a VGG16 model, and LayoutLMv2. Aman Ahuja from the DLRL also presented his work on segmenting born-digital ETDs using YOLOv7. He talked about annotating ETD data and presented results and an evaluation of classifying ETD pages into 21 categories (e.g., title page, chapter, abstract, tables, and algorithms). We discussed a possible way to merge our tasks into a single pipeline that can classify an ETD regardless of whether it is born-digital or scanned. Although the performance of the existing model for classifying born-digital ETD pages is impressive (around 80%-96% F1 score across the 21 categories), we are still working toward a unified model that segments pages of both scanned and born-digital ETDs.
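
As a rough illustration of the page classification approach, here is a minimal sketch of a VGG16 backbone with a 21-way classification head in Keras. The input size, pooling, and training choices are our assumptions, not the settings of the models we presented.

```python
# Minimal sketch of an ETD page-image classifier: a VGG16 backbone with a
# 21-way softmax head (one class per page category). Input size, pooling,
# and training details are assumptions, not the presented models' settings.
import tensorflow as tf

NUM_PAGE_CLASSES = 21  # e.g., title page, abstract, chapter, references, ...

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # start by training only the classification head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_PAGE_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```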

Lamia Salsabil (another Ph.D. student from the ODU team) also presented her work titled "A Summary of Contributions to the ETD Project," in which she talked about ETD collection, metadata harvesting, ETD annotation, and text extraction using AWS Textract. She also discussed the study of computational reproducibility using URLs linking to open-access datasets and software in scholarly articles, and her future directions for this work.
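
For the text extraction step, a minimal sketch of calling AWS Textract on a single scanned page via boto3 might look like the following; the file name is a placeholder, and multi-page PDFs would instead use Textract's asynchronous document-text-detection API.

```python
# Sketch of extracting text from a scanned ETD page with AWS Textract via
# boto3. The file name is a placeholder; multi-page PDFs would go through
# Textract's asynchronous StartDocumentTextDetection API instead.
import boto3

textract = boto3.client("textract")

with open("etd_page.png", "rb") as f:   # hypothetical scanned page image
    response = textract.detect_document_text(Document={"Bytes": f.read()})

# Textract returns a list of blocks; LINE blocks carry the recognized text.
lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
print("\n".join(lines))
```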

Presentation by Lamia Salsabil: A Summary of Contributions to the ETD Project

Our visit to the DLRL at Virginia Tech was a great learning experience, and we are scheduled to meet the team again.

-- Muntabir Choudhury (@TasinChoudhury) and Lamia Salsabil (@liya_lamia)
