2021-09-23: Real-time Header Extraction from Scientific PDF Documents: Summer Research Internship Experience at Los Alamos National Laboratory



This Summer, I was accepted as a Graduate Student in the Institutional Scientific Content (ISC) Team, a subdivision of the Research Library at Los Alamos National Laboratory (LANL). LANL is a United States Department of Energy national laboratory located in Los Alamos, New Mexico, in the southwestern United States. Its mission is to solve national security challenges through scientific excellence. LANL continued its student internship program for Summer 2021, with approximately 1,500 students joining the laboratory to work on various projects. Due to social distancing restrictions, most internships were limited to remote work off laboratory property.

My internship was a 12-week program that started on June 7, 2021. During this program, I worked remotely as a Research Intern under the supervision of Brian Cain. Throughout the program, I attended development sessions, progress meetings with Brian and the development team, and meetings with the entire ISC team. The meetings with the development team were to report my progress and to gather feedback for resolving issues or improving the solution. The full-team meetings were held bi-weekly via Webex, where all team members gave updates on their work. I kept in touch with my supervisor via email and Google Chat throughout the internship.

Project

Features to be Extracted from Scientific PDF Documents

The project I worked on focused on extracting features such as the title, authors, affiliations, abstract, and keywords from scientific PDF documents using open-source, machine-learning-based tools such as GROBID. The goal of my project was to extract these features in real time and process the data into a desired structure. The structured data can then be ingested into forms used to populate library systems such as RASSTI (Review and Approval System for Scientific and Technical Information). RASSTI is the starting point for collecting LANL's documented scientific output, and it provides workflows for authors and reviewers: it accepts author-entered metadata and the associated PDF for review, and it outputs reviewer approvals containing a Los Alamos Unlimited Release (LA-UR) number along with a cover sheet bearing that number merged into the original PDF. Once the extraction and data structuring work convincingly, the goal is to integrate the developed system into the production submission workflow for RASSTI, which will help LANL researchers more easily submit their materials for review.


RASSTI Review and Approval System

I began my internship project by exploring open-source, machine-learning-based tools for header extraction from scientific PDF documents. I first familiarized myself with GROBID by reading its documentation and exploring its functionalities, such as header extraction and parsing, name parsing, affiliation parsing, and full-text extraction. I also discovered SciWING, a scientific document processing toolkit built on PyTorch that includes pre-trained models to extract features such as the title, abstract, authors, affiliations, and keywords from scientific documents. However, when I tested SciWING, I observed that it takes about one minute on average to extract features from a scientific PDF document. Since speed is one of the key requirements of the development, we decided not to use SciWING.

TEI XML Output of GROBID corresponding to the “processHeaderDocument” service

Then, I proceeded to use GROBID's web services, "processHeaderDocument" and "processFulltextDocument", to extract features from scientific PDF documents. Both services accept a PDF file and return the extracted content in TEI XML format. I implemented a feature extractor based on GROBID's "processHeaderDocument" service to process this output and structure the extracted features in tabular form. The fields included in the table are the file name, title, list of author names, list of keywords, abstract, and list of all affiliations. I also curated a dataset of 20 sample PDFs with their features, including the title, abstract, authors, affiliations, and keywords, to perform a quantitative evaluation. I created two different sets of ground truth (the ideal expected result in tabular format): one including the authors' affiliations and another without them. I compared the ground truth with real-time GROBID extraction and calculated the accuracy of the extracted titles (95%), authors (85%), keywords (75%), and abstracts (80%). During the evaluation, I noticed that GROBID's output sometimes gets saved with Unicode characters, which reduced accuracy. After discussing the initial implementation and the evaluation, the development team suggested changing the code to save the output without these Unicode characters.
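For illustration, a minimal sketch of this step might look as follows, assuming a GROBID server running locally on its default port (8070); the function name extract_header and the simplified TEI paths are mine, not the production code.

    # Minimal sketch: send a PDF to GROBID's "processHeaderDocument" service and
    # pull a few header fields out of the returned TEI XML.
    import requests
    import xml.etree.ElementTree as ET

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    def extract_header(pdf_path, grobid_url="http://localhost:8070"):
        # The service expects the PDF in a multipart form field named "input".
        with open(pdf_path, "rb") as pdf:
            response = requests.post(f"{grobid_url}/api/processHeaderDocument",
                                     files={"input": pdf})
        response.raise_for_status()

        root = ET.fromstring(response.text)  # TEI XML returned by GROBID
        title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
        abstract = " ".join(p.text or "" for p in root.findall(".//tei:abstract//tei:p", TEI_NS))
        keywords = [t.text for t in root.findall(".//tei:keywords//tei:term", TEI_NS) if t.text]
        authors = []
        for pers in root.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
            first = pers.findtext("tei:forename[@type='first']", default="", namespaces=TEI_NS)
            last = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
            authors.append(f"{first} {last}".strip())
        return {"title": title, "authors": authors, "keywords": keywords, "abstract": abstract}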

Next, I created a JSON schema to store the extracted features, including the file name, title, abstract, keyword list, and author list with each author's name and affiliations. After discussing with the development team, we decided not to proceed with the tabular structure and instead to use JSON objects to store GROBID's output. For the authors' affiliations, the development team suggested using only the institution rather than the raw affiliation, which also includes the street address. When I changed the JSON to store only institution names as affiliations, I discovered that affiliations are sometimes associated with authors and sometimes not. Based on Brian's suggestion, I proposed also including in the JSON a separate, complete list of affiliations, independent of the authors, to allow users of RASSTI to select affiliations from a drop-down in the interface.

JSON schema used to store extracted features
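For a rough sense of the structure (the actual schema is shown in the figure above), a single extracted record might look like this; the field names are my paraphrase of the description and the values are made up:

    # Roughly the shape of one extracted record. Field names paraphrase the
    # description above and the values are invented; this is not the exact schema.
    record = {
        "file_name": "example-paper.pdf",
        "title": "An Example Paper Title",
        "abstract": "A short abstract of the paper ...",
        "keywords": ["metadata extraction", "GROBID"],
        "authors": [
            {
                "first_name": "Jane",
                "middle_name": "",
                "last_name": "Doe",
                "affiliations": ["Example University"],
            },
        ],
        "affiliations": ["Example University"],  # complete list, independent of the authors
    }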


Then I evaluated the JSON files. For this, I created ground truth JSON files (the ideal expected result in JSON format) for all 50 PDF files. The fields of the JSON objects are the abstract, the author list (each author with an affiliation list, first name, middle name, and last name), the file name, the keyword list, and the title. I then compared the ground truth JSON files with the JSON files generated from GROBID's output and calculated the accuracy of the extracted titles (96%), keywords (86%), abstracts (78%), authors' first names (84%), authors' middle names (78%), and authors' last names (78%).

For the affiliation evaluation, I measured the accuracy of affiliations alone (48%) and of authors with their affiliations (28%), along with micro-average precision (96.37%), macro-average precision (72.99%), micro-average recall (64.56%), and macro-average recall (62.59%). The micro-average precision of 96.37% means that, pooling all PDFs together, 96.37% of the extracted author affiliations were actually correct; micro-averaging aggregates the contributions of all PDFs, so PDFs with more affiliations carry more weight. The macro-average precision of 72.99% means that, on average per PDF, 72.99% of the extracted author affiliations were correct; macro-averaging scores each PDF separately and then averages, treating all PDFs equally. Similarly, the micro-average recall of 64.56% means that, pooled over all PDFs, 64.56% of the true author affiliations were correctly identified, while the macro-average recall of 62.59% is the corresponding per-PDF average, again treating all PDFs equally.
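A sketch of how these two kinds of averages can be computed from per-PDF counts of true positives, false positives, and false negatives (function and field names are illustrative, and it assumes every PDF has at least one extracted and one expected affiliation, so there is no division by zero):

    # Micro vs. macro averaging from per-PDF counts.
    def micro_macro(per_pdf_counts):
        # Micro: pool the counts over all PDFs, so PDFs with more affiliations weigh more.
        tp = sum(c["tp"] for c in per_pdf_counts)
        fp = sum(c["fp"] for c in per_pdf_counts)
        fn = sum(c["fn"] for c in per_pdf_counts)
        micro_precision = tp / (tp + fp)
        micro_recall = tp / (tp + fn)
        # Macro: score each PDF separately, then average, so every PDF counts equally.
        macro_precision = sum(c["tp"] / (c["tp"] + c["fp"]) for c in per_pdf_counts) / len(per_pdf_counts)
        macro_recall = sum(c["tp"] / (c["tp"] + c["fn"]) for c in per_pdf_counts) / len(per_pdf_counts)
        return micro_precision, micro_recall, macro_precision, macro_recall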

During the evaluation, I also noticed that GROBID's output sometimes contains duplicate affiliations; this happens when an author is associated with two different departments in the same institution. Brian and the development team suggested removing duplicate affiliations. In addition, GROBID sometimes identifies institutions as departments; since I was only extracting institutions, a misclassified institution would be missing from the output. After discussing this with the team, we decided to include both the department and the institution in the affiliations. I then modified the JSON output to include a combined affiliations list in the form of department plus institution, as well as a list of all departments and a list of all institutions. I also modified the authors object to include the same three lists of affiliations (all combined affiliations, all departments, and all institutions), as in the sketch below.
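A simplified sketch of that post-processing step, assuming each affiliation arrives as a department/institution pair (the input structure and names here are illustrative):

    # Drop duplicate affiliations while keeping the department and institution
    # parts both separately and combined.
    def combine_affiliations(affiliations):
        combined, departments, institutions = [], [], []
        for aff in affiliations:
            dept = aff.get("department", "")
            inst = aff.get("institution", "")
            full = ", ".join(part for part in (dept, inst) if part)
            if full and full not in combined:
                combined.append(full)
            if dept and dept not in departments:
                departments.append(dept)
            if inst and inst not in institutions:  # e.g., two departments of the same institution
                institutions.append(inst)
        return {"affiliations": combined,
                "departments": departments,
                "institutions": institutions}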

Moving forward, I created two web services on top of the GROBID extractor: /extractfeatures and /extractfulltext. The /extractfeatures service accepts an uploaded PDF document, extracts the features (title, authors, keywords, and abstract), and returns them in a JSON response. The /extractfulltext service accepts an uploaded PDF document, extracts the title and the full text including the abstract, and returns them in a JSON response. I built these services using the Flask framework and used Postman to invoke them by sending research papers as files and inspecting the JSON responses.
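A minimal Flask sketch of how such endpoints can be wired up (simplified; extract_header is the extractor sketched earlier, extract_fulltext and the extractor module are hypothetical stand-ins for the "processFulltextDocument" wrapper, and the form field name and port are assumptions, not necessarily what the production code uses):

    import os
    import tempfile
    from flask import Flask, request, jsonify
    from extractor import extract_header, extract_fulltext  # hypothetical module wrapping GROBID

    app = Flask(__name__)

    def save_upload(file_storage):
        # Write the uploaded PDF to a temporary file and return its path.
        fd, path = tempfile.mkstemp(suffix=".pdf")
        os.close(fd)
        file_storage.save(path)
        return path

    @app.route("/extractfeatures", methods=["POST"])
    def extract_features():
        path = save_upload(request.files["file"])
        try:
            return jsonify(extract_header(path))      # title, authors, keywords, abstract
        finally:
            os.remove(path)

    @app.route("/extractfulltext", methods=["POST"])
    def extract_full_text():
        path = save_upload(request.files["file"])
        try:
            return jsonify(extract_fulltext(path))    # title plus full text including the abstract
        finally:
            os.remove(path)

    if __name__ == "__main__":
        app.run(port=5000)

With the server running, the services can be invoked from Postman, or with curl, for example: curl -F "file=@paper.pdf" http://localhost:5000/extractfeatures.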

MiSuSup 2021

My supervisor, Brian Cain, and Dr. Martin Klein organized the Mini Summer Student Symposium (MiSuSup 2021) to give the summer students in the LANL Research Library an opportunity to present the work we accomplished during our internships. MiSuSup 2021 was held on August 25th. There, I presented my project, which automatically extracts metadata such as titles, abstracts, authors, affiliations, and keywords from PDF documents in real time using GROBID and processes this data into JSON objects. This structured data can be ingested into other library systems, including RASSTI forms, for easier user submission. I outlined the developed prototype and the JSON responses, and demonstrated the services I implemented.

Me presenting at MiSuSup 2021

Overall Experience

This was my first internship in the USA, outside of graduate assistantships at ODU. During this internship, I learned how to extract metadata from PDFs using machine-learning-based tools and how to structure the extracted features. I had an incredible experience working as a Summer Research Intern at LANL for 12 weeks. Even though I worked remotely due to the social distancing restrictions caused by the COVID-19 pandemic, the entire ISC team was supportive. The experience I gained through this internship will be invaluable for my future career.

Thank You!

I would like to express my gratitude to my PhD advisor, Dr. Sampath Jayarathna, for encouraging me to apply for Summer internship programs, to my internship supervisor, Brian Cain, for guiding me throughout the internship by providing feedback and suggestions, and to Dr. Martin Klein, for recommending me for this position. I am thankful for the opportunity to work at LANL as a Summer Research Intern with Brian Cain and the ISC team!

--Gavindya Jayawardena (@Gavindya2)
