2022-12-29: A Summary of "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents"

                      Figure 1: Traditional and Deep Learning Approaches to Table Recognition (Hashmi et al.)

Table recognition refers to the process of using optical character recognition (OCR) and machine learning (ML) models to identify the rows, columns, and individual text cells of tables in digital documents, whether born-digital or scanned PDFs. The task has been under investigation for more than two decades as a way to automatically extract textual information from a variety of tables [Kieninger et al., Wei et al.]. Automatic table recognition can be very challenging because tables have different structures, data types, and misaligned data entries (Figure 2). For instance, some tables have text spanning multiple rows or columns. Also, some tables have clearly defined borders, while others have no borders at all (borderless) or are only partially bordered. These complexities make it difficult for template-based and ML-based approaches to extract tables from diverse PDFs. In addition, extracted tables and table data may not retain their original contextual and hierarchical structure, which forces users to manually correct table structure and content.

                                          Figure 2: Example of a complex table structure

                                           (source: https://doi.org/10.3390/app9235102)

Table extraction involves solving two problems: table detection (TD) and table structure recognition (TSR). Prior works solved the two problems independently using rule-based and classical machine learning methods such as conditional random fields (Wei et al.). Recent works use end-to-end deep learning solutions to solve the two problems together. Below, I give a brief summary of "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents" by Devashish Prasad et al., published in the CVPR 2020 Workshops.

CascadeTabNet

                                                                               Figure 3: CascadeTabNet Pipeline (Prasad et al.)

CascadeTabNet uses a single convolutional neural network (CNN) to solve both TD and TSR. Specifically, the authors proposed a Cascade Mask R-CNN model with a High-Resolution Network (HRNet) backbone that simultaneously detects table regions and identifies the structures (rows and columns) of the detected tables. Table detection is cast as instance segmentation: detecting each distinct object of interest in an image and demarcating its boundaries. The authors first identified and extracted the tables from each document image. Then, they identified the bounding boxes of the cells with text content inside the extracted tables. The CascadeTabNet model takes a document image with zero or more tables as input and classifies each detected table into one of two types, bordered or borderless. Each detected table type is then further processed to improve the quality of the model's output.

                                                      Figure 4: Detecting the tables and cells in Document Images (Prasad et al.)

Structure Recognition for Borderless Tables

Borderless tables are tables with partial or no ruling-based lines (Prasad et al.). CascadeTabNet predicts the segmentations of table cells for borderless tables. When a table is classified as borderless, the cell positions are automatically labeled using the predicted row and column ids. The authors use the positions of the identified rows and columns, in combination with a contour-based text detection algorithm (Liu et al.), to estimate the missing table lines (row and column borders). Based on the estimated table lines, they applied the contour-based text detection algorithm again to recover previously undetected cells.
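To make the line-estimation idea concrete, here is a minimal sketch (not the authors' implementation) of one way to estimate missing column borders for a borderless table, assuming the detected text cells are given as axis-aligned bounding boxes. The function name `estimate_column_borders` is hypothetical.

```python
def estimate_column_borders(boxes, min_gap=5):
    """Estimate x-coordinates of missing vertical borders for a borderless table.

    boxes: list of (x0, y0, x1, y1) text-cell bounding boxes.
    Projects the boxes onto the x-axis, merges overlapping intervals into
    column clusters, and places a border in the middle of each wide gap.
    """
    if not boxes:
        return []
    # Occupied x-intervals, sorted left to right.
    intervals = sorted((x0, x1) for x0, _, x1, _ in boxes)
    merged = [list(intervals[0])]
    for x0, x1 in intervals[1:]:
        if x0 <= merged[-1][1]:                  # overlaps current column cluster
            merged[-1][1] = max(merged[-1][1], x1)
        else:                                    # a new column cluster starts
            merged.append([x0, x1])
    # One separator per sufficiently wide gap between clusters.
    return [(right_end + left_start) / 2
            for (_, right_end), (left_start, _) in zip(merged, merged[1:])
            if left_start - right_end >= min_gap]
```

For a three-column layout with text boxes clustered around x ranges 0–40, 60–100, and 120–160, this would place separators at x = 50 and x = 110. The same projection idea applied to the y-axis would yield row borders.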

Structure Recognition for Bordered Tables

Bordered tables are tables with clearly defined ruling-based lines. The authors used rule-based algorithms to detect the lines of bordered tables. Then, they identified the cells using the line intersection points. Finally, the text regions within each cell were detected using a contour-based text detection algorithm.
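The intersection step is simple once the ruling lines are known: adjacent pairs of vertical and horizontal lines bound one cell each. The sketch below illustrates this; it is my own minimal illustration, not code from the paper.

```python
def cells_from_grid_lines(xs, ys):
    """Build cell bounding boxes from detected ruling-line positions.

    xs: x-coordinates of vertical lines; ys: y-coordinates of horizontal lines.
    Adjacent line pairs bound one cell, so n lines yield n - 1 rows/columns,
    and the cell corners are exactly the line intersection points.
    """
    xs, ys = sorted(xs), sorted(ys)
    return [(x0, y0, x1, y1)
            for y0, y1 in zip(ys, ys[1:])      # one row per horizontal gap
            for x0, x1 in zip(xs, xs[1:])]     # one column per vertical gap
```

For example, three vertical lines at x = 0, 50, 100 and three horizontal lines at y = 0, 20, 40 yield a 2x2 grid of four cells, each of which can then be searched for text contours.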
                                             

Dataset Preparation

The authors created a new dataset for the task of TD by merging three datasets: ICDAR 19, Marmot, and a dataset from GitHub. The merged dataset contains a total of 1,934 document images with 2,835 tables. To create a more robust model, the authors applied image-augmentation techniques to the original training data, namely dilation and smudge transforms. Before applying either operation, the original image was first converted to a binary image. The dilation operation thickens the black regions of the image, while the smudge operation spreads the black pixel regions, producing a smeared, blurred black region (see Figure 5).
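The paper does not spell out implementation details for the two transforms, but their effect can be sketched in plain NumPy. In this illustration, ink pixels of the binarized image are encoded as 1; `dilate` and `smudge` are hypothetical helper names, not the authors' code.

```python
import numpy as np

def dilate(binary, iters=1):
    """Thicken foreground (ink) pixels with a 3x3 cross-shaped neighborhood."""
    out = binary.astype(bool)
    for _ in range(iters):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]   # grow downward
        grown[:-1, :] |= out[1:, :]   # grow upward
        grown[:, 1:] |= out[:, :-1]   # grow rightward
        grown[:, :-1] |= out[:, 1:]   # grow leftward
        out = grown
    return out.astype(binary.dtype)

def smudge(binary, k=3):
    """Spread ink into a smeared gray region via a k x k box blur."""
    padded = np.pad(binary.astype(float), k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.mean(axis=(-2, -1))  # values in [0, 1]: blurred ink
```

Applied to a binarized table image, `dilate` fattens every stroke (a single ink pixel grows into a cross of five), while `smudge` replaces hard black edges with a soft gray halo, roughly matching the visual effect shown in Figure 5.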

                    Figure 5: An example of applying dilation and smudge transformation on a table image (Prasad et al.)


In order to analyze the effectiveness of the image-augmentation process, the authors created four training datasets to train different baseline models. The first set contains the original images. The second set contains both the original and dilated images. The third set contains both the original and smudged images. The last set contains the original, dilated, and smudged images. 

For the task of TSR, the authors manually annotated 342 randomly selected images out of the 600 images in the ICDAR 19 (Track A Modern) training set. The annotated data contains 114 bordered and 429 borderless tables.

Results and Analysis

The table-detection model was evaluated using IoU (Intersection over Union) thresholds: precision, recall, and F1 scores were calculated at IoU thresholds of 0.6, 0.7, 0.8, and 0.9. The Weighted-Average F1 (WAvg. F1) was then calculated by assigning to each F1 value a weight equal to its corresponding IoU threshold. The models were evaluated on the test set of ICDAR 19 (Track A Modern).
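Under that weighting scheme, the WAvg. F1 can be computed as below. This is my reading of the ICDAR 19 protocol (higher IoU thresholds count more), not code from the paper.

```python
IOU_THRESHOLDS = (0.6, 0.7, 0.8, 0.9)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_avg_f1(f1_by_threshold):
    """WAvg. F1: each threshold's F1 score is weighted by the threshold itself,
    so performance at the strictest IoU (0.9) contributes the most."""
    numerator = sum(t * f1_by_threshold[t] for t in IOU_THRESHOLDS)
    return numerator / sum(IOU_THRESHOLDS)
```

For instance, a model with F1 = 1.0 at every threshold gets WAvg. F1 = 1.0, while a model that degrades at stricter thresholds is penalized more heavily than a plain average would penalize it.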

                                               Figure 6: F1-scores of the baseline models (Prasad et al.)

The results above show that the image-augmentation techniques help the model learn more effectively. Thus, the authors used both augmentation techniques for further experiments. The authors also compared the performance of the CascadeTabNet model with those of other participants in the ICDAR 19 competition.

                                       Figure 7: Comparing TD results with participants of ICDAR 19 (Prasad et al.)

The TSR model was evaluated on the ICDAR 19 (Track B2) dataset. The evaluation required the TSR model to return the coordinates of a polygon defining the convex hull of each cell's contents. Additionally, the model was required to return the start/end column/row information for each cell. As in the TD task, precision, recall, and F1 scores were calculated at IoU thresholds of 0.6, 0.7, 0.8, and 0.9.

                                    Figure 8: Comparing TSR results with participants of ICDAR 19 (Prasad et al.)

Conclusion

In this work, the authors proposed CascadeTabNet, an end-to-end deep-learning-based solution for the tasks of TD and TSR in document images. The training dataset was augmented using dilation and smudge transforms to create a more robust model. Both the TD and TSR models were evaluated on the ICDAR 19 (Track A and Track B2) test datasets using IoU thresholds of 0.6, 0.7, 0.8, and 0.9.

The results of CascadeTabNet present more research opportunities for the document layout analysis community, especially for the task of table structure recognition in document images.

Acknowledgement

I want to express my gratitude to Dr. Michael Nelson and Dr. Jian Wu for taking the time to review this blog post.


- Kehinde Ajayi (@KennyAJ)



Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 572-573.

