2021-09-19: Conditional Random Field with Textual and Visual Features to Extract Metadata From Scanned ETDs

Our previous blog described Electronic Theses and Dissertations (ETDs). Most ETDs before 1997, and a significant fraction of ETDs after 1997, are scanned from physical copies. These ETDs are valuable for digital library preservation, but making them accessible requires indexing them. Many scanned ETDs are accompanied by incomplete, little, or no metadata, posing challenges for accessibility. For example, advisor names appearing on scanned ETDs may not be available in the metadata provided in the library repository. Thus, an automatic approach should be adopted to extract metadata from scanned ETDs. We proposed a conditional random field (CRF) based sequence tagging model that combines textual and visual features. The source code can be found in our GitHub repository.


Automatic metadata extraction is important for building scalable digital library search engines. Most existing tools, such as GROBID [1], CERMINE [2], and ParsCit [3], were developed for and applied to born-digital documents. However, they often fail to extract metadata from scanned ETDs. Extracting metadata from scanned ETDs is challenging due to poor image resolution, typewritten text, varying ETD templates, and the imperfections of OCR technology. We adopted Tesseract-OCR, an open-source OCR tool, to extract text from the cover pages of scanned ETDs. Tesseract supports printed and scanned documents and more than 100 languages, and it returns output in plain text, hOCR, PDF, and other formats. Our previous model used regular expressions to identify and extract metadata for seven fields: title, author, institution, year, degree, academic program, and advisor. However, heuristic methods usually do not generalize well; they often fail when applied to new sets of data with different feature distributions. Therefore, in this blog, we introduce a learning-based approach from our paper titled "Automatic Metadata Extraction Incorporating Visual Features from Scanned ETDs," accepted by the 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2021).
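As a minimal sketch of the OCR step (file names are hypothetical), the Tesseract command-line tool can be asked to emit both the plain text and the hOCR output at once; the hOCR file carries the bounding boxes used later for visual features:

```python
# Sketch (hypothetical paths): run the Tesseract CLI on a scanned cover page,
# producing both plain text and hOCR output in one pass.
import subprocess

def ocr_cover_page(image_path, out_base="cover"):
    """OCR one scanned page; writes <out_base>.txt and <out_base>.hocr."""
    subprocess.run(
        ["tesseract", image_path, out_base, "txt", "hocr"],  # two output configs
        check=True,
    )
    return out_base + ".txt", out_base + ".hocr"

# Usage: ocr_cover_page("etd_cover_page.png") -> ("cover.txt", "cover.hocr")
```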

Figure 1: ETD cover pages from various US universities


We created a dataset of 500 ETDs (Figure 1): 450 from 15 US universities and 50 from 6 non-US universities, as illustrated in Figure 2. These ETDs were published between 1945 and 1990. The set comprises 350 STEM (Science, Technology, Engineering, and Mathematics) and 150 non-STEM ETDs, covering 467 doctoral, 27 master's, and 5 bachelor's degrees. The dataset is publicly available on Google Drive.

Figure 2: Distribution of metadata fields: (a) University (b) Year (c) Program in the corpus of 500 ETDs

Framework Architecture

From Figure 3, we can see that Tesseract-OCR was first applied to the scanned ETDs. We saved the result as text and applied four classifiers: Heuristic, CRF with Sequential Labeling (CRF), CRF with Visual Features (CRF-visual), and BiLSTM-CRF. These classifiers are customizable. We then extracted the metadata from the text, compared it against the ground truth data (i.e., GT-meta and GT-rev), and measured the performance of each model. We also compared the CRF model against other baseline models.

Figure 3: Metadata Extraction System

CRF with Sequential Labeling (CRF model)

We first annotated each metadata field using GATE. Then, we tagged each annotated field token with the BIO (begin, inside, outside) tagging schema (e.g., Figure 4) and part-of-speech (POS) tags.

Figure 4: Sequence tagging using BIO schema
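The BIO schema can be sketched in a few lines; the tokens and field below are made up for illustration and do not come from the dataset:

```python
# Illustrative only: tag a tokenized cover-page line with BIO labels for the
# "title" field, mirroring the schema in Figure 4 (tokens are invented).
def bio_tags(tokens, field_tokens, field):
    """Label the tokens of `field_tokens` as B-/I-<field>; everything else is O."""
    tags, i = [], 0
    for tok in tokens:
        if i < len(field_tokens) and tok == field_tokens[i]:
            tags.append(("B-" if i == 0 else "I-") + field)
            i += 1
        else:
            tags.append("O")
    return tags

tokens = ["a", "study", "of", "heat", "transfer", "by", "john", "doe"]
print(bio_tags(tokens, ["a", "study", "of", "heat", "transfer"], "title"))
# ['B-title', 'I-title', 'I-title', 'I-title', 'I-title', 'O', 'O', 'O']
```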

Our model extracts the following text-based features:
  1. Whether all the characters in a word are uppercase.
  2. Whether all the characters in a word are lowercase.
  3. Whether all the characters in a word are numeric. 
  4. Last three characters of the current word.
  5. Last two characters of the current word.
  6. POS tag of the current word.
  7. First two characters of the POS tag of the current word (a coarse POS category).
  8. POS tags of the two tokens after the current word.
  9. Whether the first character of consecutive words is uppercase. For example, a title may contain consecutive words that start with an uppercase letter.
  10. Whether the first character of a word is uppercase when the word is not at the beginning or end of the document. This is useful for metadata fields such as author, advisor, program, degree, and university, since these fields generally do not appear at the beginning or end of the document.
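The ten features above can be sketched as a feature function in the token-to-dict style that CRF toolkits such as sklearn-crfsuite expect; the function name and exact feature keys are our own, not necessarily those in the paper's code:

```python
# Hedged sketch of the ten text-based features; `sent` is a list of
# (token, pos_tag) pairs for one cover page.
def word2features(sent, i):
    word, pos = sent[i]
    feats = {
        "word.isupper": word.isupper(),   # 1. all characters uppercase
        "word.islower": word.islower(),   # 2. all characters lowercase
        "word.isdigit": word.isdigit(),   # 3. all characters numeric
        "suffix3": word[-3:],             # 4. last three characters
        "suffix2": word[-2:],             # 5. last two characters
        "postag": pos,                    # 6. POS tag
        "postag2": pos[:2],               # 7. coarse POS category
    }
    if i + 2 < len(sent):                 # 8. POS of the next two tokens
        feats["+1:postag"] = sent[i + 1][1]
        feats["+2:postag"] = sent[i + 2][1]
    if i > 0 and word.istitle() and sent[i - 1][0].istitle():
        feats["consec.istitle"] = True    # 9. consecutive capitalized words
    if 0 < i < len(sent) - 1:
        feats["mid.istitle"] = word.istitle()  # 10. capitalized, not at doc edges
    return feats
```

Feature dicts of this shape can then be fed to a linear-chain CRF trainer (e.g., `sklearn_crfsuite.CRF`), one list of dicts per page.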

CRF with Visual Features (CRF-visual)

Both CRF and the heuristic method rely on text-based features. However, when annotating the documents, humans not only rely on the text but also on visual features, such as the positions of the text and their lengths. For example, thesis titles usually appear in the upper half of the cover page, and the authors and advisors usually appear in the lower half of the cover page. This inspires us to investigate whether incorporating visual features can improve the performance of the model.

Figure 5: Bounding Box measurement

Visual information is represented by the corner coordinates of the bounding box (bbox) of a text span. We extracted the x-coordinate values (x1, x2) and y-coordinate values (y1, y2) for each token. This information is available from the hOCR and XML files output by Tesseract. Figure 5 illustrates the bounding box of a token; all coordinates are measured with respect to the bottom-left corner of the page. x1 is the distance from the left margin to the left edge of the bbox, y1 is the distance from the bottom margin to the bottom edge of the bbox, x2 is the distance from the left margin to the right edge of the bbox, and y2 is the distance from the bottom margin to the top edge of the bbox. To enhance the model's performance, we incorporated three visual features, each normalized between 0 and 1:
  1. Left margin (x1) for all tokens in the same line.
  2. Upper edge position (y2) for all tokens.
  3. Bottom edge position (y1) for all tokens.
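A minimal sketch of extracting and normalizing these three features from a Tesseract hOCR word entry (the hOCR snippet and page size are illustrative; hOCR itself uses a top-left origin, so the y values are flipped into the bottom-up frame described above):

```python
# Sketch: pull bbox coordinates from an hOCR word span and normalize the three
# visual features to [0, 1]; the snippet and page dimensions are invented.
import re

hocr = '<span class="ocrx_word" title="bbox 612 1030 1188 1106">THESIS</span>'
page_w, page_h = 2550, 3300  # e.g., 8.5 x 11 in scanned at 300 dpi

x1, y1, x2, y2 = map(int, re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", hocr).groups())
# hOCR y grows downward from the top, so flip to measure from the bottom margin.
features = {
    "left_margin": x1 / page_w,             # feature 1: x1
    "top_edge": (page_h - y1) / page_h,     # feature 2: y2 in the bottom-up frame
    "bottom_edge": (page_h - y2) / page_h,  # feature 3: y1 in the bottom-up frame
}
print(features)
```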

Figure 6: Text alignment: OCR output text (i.e., noisy) alignment with clean text


Transferring visual features directly from the OCRed text to the ground truth text is non-trivial because the OCRed text may not be well aligned with the ground truth text: the characters in the rectified text are not necessarily aligned with Tesseract's output, and position information is only available for the OCR output. We therefore applied text alignment based on the longest common subsequence [4]. In bioinformatics, sequence alignment is commonly applied to protein, DNA, and RNA sequences, usually represented as strings of characters. We used an open-source tool, Edlib, to align the noisy OCR text with the clean text; Edlib computes the minimum edit distance between two sequences along with the corresponding alignment. We then mapped the position of each token from the OCRed text to the clean text. Figure 6 illustrates an example of the alignment.
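The position-mapping idea can be sketched self-contained with the standard library; here `difflib.SequenceMatcher` stands in for Edlib's edit-distance alignment (the paper uses Edlib, and the strings below reuse the title example from our evaluation):

```python
# Sketch of OCR-to-ground-truth alignment: difflib (stdlib) stands in for Edlib
# to map character positions from the noisy OCR text to the clean text.
from difflib import SequenceMatcher

noisy = "thermo- fluid dynamics of separated two - phase flow"
clean = "thermo-fluid dynamics of separated two-phase flow"

sm = SequenceMatcher(None, noisy, clean, autojunk=False)
# Build a noisy-index -> clean-index map from the matching blocks.
pos_map = {}
for blk in sm.get_matching_blocks():
    for k in range(blk.size):
        pos_map[blk.a + k] = blk.b + k

# "dynamics" starts at different offsets in the two strings, yet maps correctly.
assert pos_map[noisy.index("dynamics")] == clean.index("dynamics")
```

With Edlib itself, the same mapping can be recovered from the CIGAR string its `align` function returns in `task="path"` mode.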


We divided the samples into two sets: 350 for training and 150 for testing. The CRF model predicts labels for each metadata field at the token level. However, we must glue the predicted tokens back together for each metadata field before comparing them against the ground truth datasets.
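The gluing step can be sketched as a small decoder from token-level BIO predictions back to whole field strings (function name and example tokens are ours, for illustration):

```python
# Sketch: merge token-level BIO predictions into whole metadata fields before
# comparing them against the ground truth (GT-meta / GT-rev).
def bio_to_fields(tokens, tags):
    fields, current = {}, None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start a new span for this field
            current = tag[2:]
            fields.setdefault(current, []).append([tok])
        elif tag.startswith("I-") and current == tag[2:]:
            fields[current][-1].append(tok)  # continue the current span
        else:
            current = None                # O tag (or stray I-) ends the span
    return {f: [" ".join(span) for span in spans] for f, spans in fields.items()}

tokens = ["john", "doe", "virginia", "tech"]
tags = ["B-author", "I-author", "B-university", "I-university"]
print(bio_to_fields(tokens, tags))
# {'author': ['john doe'], 'university': ['virginia tech']}
```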

Figure 7: Comparing predicted metadata field against the ground truth metadata

Figure 8: Performance (F1) comparison of the models

When comparing the title field, some predicted titles did not match the ground truth exactly, differing only by small details such as a punctuation mark or a space character. For example, Figure 7 shows the model predicted the title as "thermo- fluid dynamics of separated two - phase flow," whereas GT-meta records it as "thermo-fluid dynamics of separated two-phase flow." These small offsets are not caused by the model but by line breaks and additional punctuation introduced by text justification, so the predicted span should still be counted as a true positive. We therefore used a fuzzy matching algorithm based on Levenshtein distance, with a threshold of 0.95, when matching predicted and ground truth titles. Figure 8 illustrates the performance of our models: the CRF model outperformed the baseline, and CRF-visual outperformed both the baseline and CRF.
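A minimal, self-contained sketch of such a fuzzy match (the normalization `1 - distance / max_length` is one common choice; the paper may normalize differently, and the example title below is invented):

```python
# Sketch of the fuzzy title match: normalized Levenshtein similarity with a
# 0.95 threshold, so near-identical titles count as true positives.
def levenshtein(a, b):
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(pred, truth, threshold=0.95):
    sim = 1 - levenshtein(pred, truth) / max(len(pred), len(truth), 1)
    return sim >= threshold

pred = "a study of heat transfer in micro channels"   # one stray space
truth = "a study of heat transfer in microchannels"
print(fuzzy_match(pred, truth))  # True (similarity ~0.976)
```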


We applied sequence tagging models to automatically extract metadata from the cover pages of scanned ETDs. Our best model (CRF-visual) achieved F1 scores of 81.3%–96% across the seven metadata fields. Incorporating visual features into the CRF model boosts F1 by 0.7% to 10.6%, depending on the metadata field.


[1] P. Lopez, “GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications,” in Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, ECDL’09, pp. 473–474, Springer-Verlag, 2009.

[2] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, and L. Bolikowski, “CERMINE: Automatic Extraction of Structured Metadata from Scientific Literature,” Int. J. Doc. Anal. Recognit., vol. 18, no. 4, p. 317–335, 2015.

[3] I. Councill, C. L. Giles, and M.-Y. Kan, “ParsCit: an Open-source CRF Reference String Parsing Package,” in Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), 2008.

[4] J. Fonseca and K. Taghva, “Aligning Ground Truth Text with OCR Degraded Text,” in Intelligent Computing. CompCom 2019. Advances in Intelligent Systems and Computing, vol 997. Springer, Cham., pp. 815–833, Springer, Cham, 2019.

-- Muntabir Choudhury (@TasinChoudhury)