2022-07-11: A Summary of "Document Domain Randomization for Deep Learning Document Layout Analysis" (Ling et al. 2021 ICDAR)

Document Understanding is the task of automatically parsing and ingesting the content of documents into a system, using artificial intelligence methods to support downstream tasks such as information retrieval, question answering, and textual and non-textual analysis. Document Understanding is increasingly important for processing digital documents at scale. Many documents are visually rich, meaning layout and visual information are critical to understanding their content. In the scholarly domain, layout analysis is challenging due to the variety of document templates (e.g., single- or double-column papers), which contain title pages, section headings, tables, figures, algorithms, equations, references, and so on. To build an intelligent system for such downstream tasks, annotating a large number of documents is laborious. Moreover, developing training data with an equal number of samples for each template is challenging and may not be attainable at a large scale. Thus, we often see imbalanced class samples in the dataset and overfitting problems while training neural networks.

The problems can be mitigated in the following ways:

  • (a) Annotating more publications with crowd-sourced or smart annotation tools (e.g., PDFMiner).
  • (b) Decoding markup languages (e.g., XML, LaTeX) to produce high-fidelity realistic pages with the correct semantics and figures.
  • (c) Manipulating pixels or using an encoder-decoder to synthesize document pages.

Among these solutions, solution (c) is ideal, especially for building an effective document understanding model to process relatively long documents such as Electronic Theses and Dissertations (ETDs).

Ling et al. recently published "Document Domain Randomization for Deep Learning Document Layout Analysis" (DDR) at ICDAR 2021. In this blog post, I will work through the dataset, methods, and evaluation of this paper. We will also discuss how we can adapt the approach of generating artificially augmented data for book-length documents such as ETDs.


Figure 1: Illustration of Document Domain Randomization (DDR) Approach

Research Goals

Inspired by the work of Chatzimparmpas et al. on utilizing domain randomization [1] techniques in vision science, Ling et al. proposed the idea of document domain randomization (DDR) to generate pseudo training samples that do not need human annotation. The authors randomized layout, font styles, and semantics through graphical depictions during page generation. The idea is that, with enough page-appearance randomization, the actual page would appear to the model as just another variant. The challenge, however, is to determine which styles and semantics can be randomized while creating pseudo-pages, so that the models learn the essential features of interest and can correctly detect nine classes (i.e., abstract, algorithm, author, body-text, caption, equation, figure, table, and title) in document layout analysis with no additional training data.

Therefore, with the DDR method (Figure-1), the authors randomized three main aspects of a document:

  • Font styles and sizes in textual components (e.g., paragraphs, section titles, captions)
  • Non-textual components (e.g., number of pages, tables, figures)
  • The relationship between textual and non-textual components (e.g., position relative to text and columns, heights and distances to other components)
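To make the randomization step concrete, here is a minimal sketch of sampling a randomized page specification. The component and font pools below are invented for illustration; the paper's generator randomizes real font styles, sizes, and VIS30K content, not these toy lists.

```python
import random

# Invented pools for illustration; the paper draws from real fonts and VIS30K content.
FONTS = ["Times", "Helvetica", "Computer Modern"]
COMPONENTS = ["abstract", "algorithm", "author", "body-text",
              "caption", "equation", "figure", "table", "title"]

def sample_page_spec(rng):
    """Sample one randomized page-layout specification."""
    n_components = rng.randint(4, 9)
    spec = []
    for _ in range(n_components):
        spec.append({
            "class": rng.choice(COMPONENTS),       # semantic class of the component
            "font": rng.choice(FONTS),             # randomized font style
            "font_size": rng.choice([8, 9, 10, 11, 12]),
            "column": rng.choice([0, 1]),          # position in a two-column layout
        })
    return spec

rng = random.Random(0)
page = sample_page_spec(rng)
```

With enough such variation, a real page looks to the trained model like just another sampled variant.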

DDR Dataset

The framework of the DDR method follows two steps:

  • Model to Create Content -- the authors use the VIS30K [2] dataset for algorithms, equations, figures, and tables. VIS30K covers 30 years of IEEE VIS publications in four major visualization conferences. It is a collection of modern, high-quality digital prints and low-quality scans of document pages. The authors utilized SciGen to generate the textual content, including authors, body text, captions, section headings, and titles.
  • Render to Manage Visual Appearance -- the authors rendered the styles, text font sizes, columns, and captions by incorporating the appearance of actual scholarly article pages.
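A key consequence of this two-step framework is that ground-truth bounding boxes come for free: because the generator decides where each component goes, it already knows every label. The sketch below illustrates this with a crude top-to-bottom placement in two columns; the page dimensions and height heuristic are invented for the example and are not the authors' renderer.

```python
def compose_page(spec, page_w=612, page_h=792, col_gap=20):
    """Place components top-to-bottom per column and return their labels.
    Because we control placement, the ground-truth boxes are exact by construction."""
    col_w = (page_w - col_gap) // 2
    y_cursor = {0: 40, 1: 40}            # running y offset per column
    labels = []
    for comp in spec:
        col = comp["column"]
        h = comp["font_size"] * 6        # crude stand-in for the rendered height
        x0 = 0 if col == 0 else col_w + col_gap
        box = (x0, y_cursor[col], x0 + col_w, y_cursor[col] + h)
        y_cursor[col] += h + 10          # vertical spacing between components
        labels.append({"class": comp["class"], "bbox": box})
    return labels

spec = [{"class": "title", "font_size": 12, "column": 0},
        {"class": "figure", "font_size": 10, "column": 1}]
labels = compose_page(spec)
```

This is why DDR can claim 100% accurate labels: no annotator or markup decoder ever has to recover the boxes after the fact.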



Figure 2: DDR Workflows

Model

The authors used the Faster R-CNN architecture because of its success in structural analysis for table detection in PubLayNet [3]. As Figure 2 illustrates, the authors used 15K training and 5K validation input images rendered with random figures, tables, algorithms, and equations chosen from VIS30K. They also reused authors' names and fixed their format to the IEEE visualization conference style, and they used the semantically meaningful textual content of SciGen, which they then rendered into images. As for rendering choices, they incorporated text font styles and sizes and used the variation of the target domain (ACL+VIS, ACL, or VIS). This approach requires no feature engineering and, apart from the style randomization, little work beyond previous approaches. With DDR, the authors could quickly create 100% accurate ground-truth labels without decoding markup languages (e.g., XML, LaTeX) or managing document-generation engines.


Figure 3: DDR Sample (Left) and Result of Correctly Labelled Image -- Layout Parsing (Right)

Evaluation

The proposed model aims to output bounding-box locations and detect nine classes: abstract, algorithm, author, body-text, caption, equation, figure, table, and title. The authors reported accuracy, recall, F1 score, and mean average precision (mAP). The authors performed four empirical experiments based on the following hypotheses:
  • Comparison against benchmark performance:
    • DDR achieves competitive extraction accuracy.
    • Target-domain-adapted DDR training data leads to better test performance -- tested via a style mismatch between the train and test sets.
  • Faster R-CNN sensitivity:
    • Reducing the amount of training data lowers performance.
    • Increased labeling noise impacts performance -- tested by varying the error rates of label types within a reasonable range.
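The detection metrics above hinge on intersection-over-union (IoU): a predicted box counts as correct only when its overlap with the ground truth exceeds a threshold (the paper's mAP experiments use IoU = 0.8). A minimal IoU implementation, as a sketch of how such matching works:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detection matches the ground truth when IoU >= 0.8.
match = iou((0, 0, 100, 100), (10, 0, 100, 100)) >= 0.8
```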

Experiment 1

Figure 4 illustrates that DDR achieved competitive results in extracting two classes -- tables and figures. For this experiment, the model was trained on DDR pages in the CS-150x style and tested on the CS-150x dataset. Compared against the benchmark, the model achieved a competitive 91.6% F1 score for figure extraction (whereas PDFFigures [4] achieved a 93.6% F1 score) and 94.3% for table extraction.
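For readers less familiar with the F1 score: it is the harmonic mean of precision and recall, so a single number summarizes both error types. A small sketch, with hypothetical precision/recall values chosen only for illustration (the paper reports the F1 scores directly):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical inputs for illustration; not values from the paper.
score = f1(0.90, 0.93)
```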


Figure 4: Performance Result of Experiment 1 (Figure and Table extraction)

Experiment 2

Figure 5 illustrates whether a style mismatch between the train and test sets impacts performance. The experiment was to detect all nine classes. The authors trained on DDR input images of ACL+VIS, ACL only, and VIS only, and tested on ACL 300 and VIS 300. The performance results (mean average precision at IoU = 0.8) across the six train/test pairs show that DDR-VIS and DDR-ACL perform worse when they are not adapted to the target domain. This experiment supports the hypothesis that style matching is essential.


Figure 5: Performance Result of Experiment 2 (Style mismatch)

Experiments 3 and 4

Experiments 3 and 4 both used DDR input images of ACL+VIS and tested on ACL 300 and VIS 300. These experiments show the robustness of the model. Figures 6 (a) and (b) illustrate the accuracy of Faster R-CNN. The results show that repeatedly halving the training data (from 100% down to 6.25%) reduced the accuracy, which supports the hypothesis that a smaller set of samples decreases detection accuracy. Also, for Experiment 4, we can see from Figures 6 (c) and (d) that the model is not sensitive to noisy input.


Figure 6: Performance Results of Experiments 3 and 4 (Faster R-CNN Sensitivity)

Conclusion

The authors presented the DDR method, which can automatically generate a theoretically arbitrary amount of training data to train a document layout parser using deep learning algorithms such as Faster R-CNN. The advantage of DDR is that it does not require human annotation: it is fast, effective, and generates ground truth automatically.

The Adaptability of the Idea of Generating Synthetic Data for ETD Segmentation

We have seen how artificially augmented data can be produced with randomization techniques and without feature engineering. The authors of the DDR paper also applied low-fidelity simulation by altering lighting, viewpoint, and shading. Low-fidelity simulation is common in image augmentation for image classification tasks. If we consider born-digital and non-born-digital ETDs, we will see an imbalance in class samples when segmenting ETDs at the page level. Figure 7 illustrates the number of pages for each page type in a manually annotated set of 500 scanned ETDs. We can see that there are 16 classes. Most of the samples fall into a few categories, while many classes have only a small number of samples. This is an example of imbalanced classes in a dataset, where a possible solution could be augmenting only those classes with fewer samples.
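A first step toward such targeted augmentation is simply identifying which page types are under-represented. Here is a minimal sketch; the class names, counts, and the 5% threshold are invented for illustration and are not the actual statistics from the 500 annotated ETDs.

```python
from collections import Counter

def augmentation_targets(page_labels, min_fraction=0.05):
    """Return classes whose share of pages falls below min_fraction.
    Toy heuristic: augment only the under-represented page types."""
    counts = Counter(page_labels)
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total < min_fraction)

# Invented label counts for illustration (not the actual 500-ETD statistics).
labels = ["body"] * 400 + ["title-page"] * 10 + ["dedication"] * 5 + ["toc"] * 85
targets = augmentation_targets(labels)
```

Only the classes returned by this kind of check would then be fed to a DDR-style page generator, leaving the majority classes untouched.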



Figure 7: Imbalanced Class Samples in 500 Scanned ETDs

For the augmentation task, we have seen that low-fidelity simulation and changes to font sizes and styles impacted performance. Therefore, we can adopt the idea proposed in the DDR paper to augment ETD pages of different types, which supports training a robust supervised model that classifies ETD page images into their respective categories. Later, we will release another blog post on augmenting born-digital and non-born-digital ETDs, which requires minimal feature engineering.

References

[1] A. Chatzimparmpas, R. M. Martins, I. Jusufi, K. Kucher, F. Rossi, and A. Kerren, "The state of the art in enhancing trust in machine learning models with the use of visualizations," Computer Graphics Forum, vol. 39, no. 3, pp. 713–756, 2020. https://doi.org/10.1111/cgf.14034

[2] J. Chen, M. Ling, R. Li, P. Isenberg, T. Isenberg, M. Sedlmair, T. Möller, R. S. Laramee, H.-W. Shen, K. Wünsche, and Q. Wang, "VIS30K: A collection of figures and tables from IEEE visualization conference publications," IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 9, pp. 3826–3833, 2021. https://doi.org/10.1109/TVCG.2021.3054916

[3] X. Zhong, J. Tang, and A. Jimeno Yepes, "PubLayNet: Largest dataset ever for document layout analysis," in 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022, 2019. https://doi.org/10.1109/ICDAR.2019.00166

[4] C. Clark and S. Divvala, "Pdffigures 2.0: Mining figures from research papers," in 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL), pp. 143–152, 2016. https://doi.org/10.1145/2910896.2910904

-- Muntabir Choudhury (@TasinChoudhury)
