2023-08-10: A Study on Reproducibility and Replicability of Table Structure Recognition Methods
Introduction
Concerns about the reproducibility, replicability, and generalizability (RR&G) of research findings have gained substantial attention in the social and behavioral sciences over the past decade, with recognition in top-tier journals, and have recently extended into artificial intelligence (AI). Because terminology is used inconsistently across fields, we adopt the definitions of Goodman et al. (2016) [7]: reproducibility refers to obtaining consistent computational results under the same conditions, replicability to achieving consistent results on similar datasets, and generalizability to obtaining consistent results across different experimental contexts.
Reproducibility studies in AI have mostly targeted empirical and computational work, focusing on open datasets, code availability, and metadata documentation, while efforts to assess the replicability of AI research remain limited. In this blog post, we introduce a study that addresses this gap by examining the reproducibility and replicability of table structure recognition (TSR) methods, reported in our paper "A Study on Reproducibility and Replicability of Table Structure Recognition Methods", accepted at the 2023 International Conference on Document Analysis and Recognition (ICDAR 2023).
Dataset
In this study, we leverage three distinct datasets to assess the reproducibility and replicability of TSR methods:
- ICDAR 2013: This dataset, originally designed for the ICDAR 2013 table competition, contains 238 document pages extracted from government websites, of which 128 pages include tables. We modified the dataset to match the requirements of TSR models by cropping tables from the PDFs and adjusting annotations.
- ICDAR 2019: Created for the Competition on Table Detection and Recognition (cTDaR) organized by ICDAR 2019, this dataset features modern and historical tables. We focused on the modern table dataset from Track B2 (TB2), consisting of 100 samples sourced from scientific papers, forms, and financial documents.
- GenTSR: Our custom dataset, GenTSR, comprises 386 table images sourced from research papers across six scientific domains: Chemistry, Biology, Materials Science, Economics, Developmental Studies, and Business. We ensured consistency with the ICDAR 2019 format. Manual annotation was carried out by two independent graduate students using the VGG Image Annotator (VIA), who drew rectangular bounding boxes around the text content within cells. The annotators reached substantial agreement, with a Cohen's kappa score of 0.73 (a small sketch of this measure follows the list). The datasets and code are publicly available on Code Ocean.
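To make the agreement measure concrete, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators. The labels, items, and the reduction of bounding-box annotations to categorical decisions are illustrative assumptions, not the exact protocol used for GenTSR.

```python
# Minimal sketch: Cohen's kappa for two annotators labeling the same items.
# Assumption (not from the paper): each annotator's bounding-box decisions are
# reduced to a categorical label (e.g., "cell" vs. "not-cell") per candidate region.

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

annotator_a = ["cell", "cell", "not-cell", "cell", "not-cell"]
annotator_b = ["cell", "not-cell", "not-cell", "cell", "not-cell"]
print(f"Cohen's kappa: {cohens_kappa(annotator_a, annotator_b):.2f}")  # ~0.62
```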
Methodology
To comprehensively evaluate the reproducibility and replicability of TSR methods, we established a rigorous methodology, illustrated in Figure 3. The key steps of our approach are outlined below:
- Sample Selection: We began by searching Google Scholar with the keyword "table structure recognition". Focusing on papers published after 2017 whose methods accept document or table images as input, we collected 25 TSR papers from conference websites and narrowed these to 16 candidate papers employing deep-learning-based methods.
- Meta-level Study: For each paper, we checked whether it provided URLs to its source code and datasets. If not, we conducted thorough web searches to locate accessible code and data repositories and bookmarked them for later reference.
- Local Deployment: We downloaded the data and source code, carefully following the instructions in the original papers or code repositories. Each TSR method was isolated in its own dedicated virtual environment to avoid compatibility issues.
- Reproducibility Tests: We executed the code with default settings and categorized each paper as one of the following (a small sketch of this rule appears after the list):
- Reproducible: Code executed without errors, and results were within 10% absolute F1-score deviation from reported results.
- Partially-reproducible: Code executed without errors, but the results exceeded the reported F1-score by more than 10%.
- Non-reproducible: Other cases.
- Replicability Tests on Similar Datasets: Executable TSR methods were evaluated on a different yet comparable benchmark dataset. A paper was labeled as "conditionally replicable" if its results on this dataset were within 10% F1-score deviation from reported results.
- Replicability Tests on New Dataset: We extended replicability assessments to a new dataset, GenTSR. If results exhibited a maximum of 10% F1-score deviation, the paper was considered "conditionally replicable" for this new dataset.
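As a concrete illustration of the categorization rule above, here is a small sketch, assuming F1-scores on a 0-1 scale and a 10-percentage-point tolerance. The function name and inputs are hypothetical, not taken from the study's code.

```python
# Hedged sketch of the reproducibility labels described above.
# Assumes F1-scores on a 0-1 scale, so "10%" is a 0.10 absolute tolerance.

def categorize_reproducibility(reported_f1, rerun_f1, code_ran, tol=0.10):
    if not code_ran or rerun_f1 is None:
        return "non-reproducible"           # code missing or failed to run
    if abs(rerun_f1 - reported_f1) <= tol:
        return "reproducible"               # within 10% absolute deviation
    if rerun_f1 > reported_f1 + tol:
        return "partially-reproducible"     # exceeds the reported score by >10%
    return "non-reproducible"               # other cases (e.g., >10% below the report)

print(categorize_reproducibility(0.90, 0.86, code_ran=True))   # reproducible
print(categorize_reproducibility(0.70, 0.85, code_ran=True))   # partially-reproducible
print(categorize_reproducibility(0.90, 0.60, code_ran=True))   # non-reproducible
```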
To ensure accuracy and consistency, we used the pre-trained models released by the authors and evaluated the TSR results using the Intersection over Union (IoU) metric.
For reproducibility tests, evaluations were conducted using the same datasets as the original paper (ICDAR 2013 or ICDAR 2019). In replicability tests on similar datasets, alternate datasets from the two benchmarks were used (e.g., ICDAR 2019 if the original paper used ICDAR 2013). For the second replicability test, the new GenTSR dataset was employed. F-scores were computed at IoU thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9.
Discrepancies (Δ) were defined as the absolute differences between the F1-scores obtained in the different tests, providing a uniform basis for comparing each method's reproducibility and replicability.
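To illustrate the evaluation, the sketch below computes a cell-level F1-score at a given IoU threshold and the discrepancy Δ between two F1-scores. The greedy one-to-one matching of predicted to ground-truth cells is an assumption for illustration; the study's exact matching procedure may differ.

```python
# Minimal sketch of IoU-based evaluation of TSR outputs, plus the Delta discrepancy.
# Boxes are (x1, y1, x2, y2); the greedy matching below is an illustrative assumption.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def f1_at_iou(pred_cells, gt_cells, threshold):
    """Precision/recall/F1 with each predicted cell matched to at most one ground-truth cell."""
    matched, tp = set(), 0
    for p in pred_cells:
        best = max((g for g in range(len(gt_cells)) if g not in matched),
                   key=lambda g: iou(p, gt_cells[g]), default=None)
        if best is not None and iou(p, gt_cells[best]) >= threshold:
            matched.add(best)
            tp += 1
    precision = tp / len(pred_cells) if pred_cells else 0.0
    recall = tp / len(gt_cells) if gt_cells else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def delta(f1_test, f1_reported):
    """Discrepancy: absolute difference between two F1-scores."""
    return abs(f1_test - f1_reported)

# Toy example: F1 is computed at each IoU threshold used in the study.
pred = [(0, 0, 10, 10), (20, 0, 30, 10)]
gt   = [(0, 0, 10, 10), (23, 0, 30, 10)]
for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(t, round(f1_at_iou(pred, gt, t), 2))
print("Delta vs. a reported 0.95:", round(delta(f1_at_iou(pred, gt, 0.8), 0.95), 2))
```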
Experiment Results
Reproducibility test
The majority of the papers considered in our study were categorized as non-reproducible because they lacked datasets, code, or executable code. In contrast, papers that did provide executable code predominantly met our criteria for reproducibility: of the 6 executable TSR methods examined, 4 were classified as reproducible, 1 as partially-reproducible, and 1 as non-reproducible.
Replicability test on similar datasets
The outcomes, presented in Figure 4, show notable reductions in F1-scores for several methods, depending on the IoU threshold. The decreases are particularly pronounced for Graph-based-TSR [2] and Multi-Type-TD-TSR [6], which show varying degrees of performance degradation, whereas CascadeTabNet [1] exhibits only a marginal decrease, indicating better replicability.
However, certain methods, such as ReS2TIM [3] and TGRNet [5], were excluded from the replication attempts because they do not allow inference on alternate datasets. Similarly, we could not determine the discrepancy for SPLERGE [4] because its evaluation data are unavailable. Consequently, among the 6 methods that were either executable or reproducible, only 2 papers (CascadeTabNet and Multi-Type-TD-TSR) demonstrated replicability under specific IoU conditions on similar datasets.
Replicability test on new dataset (GenTSR)
Figure 4. The comparison of the original (O), reproducibility (R0), replicability on similar data (R1), and replicability on GenTSR (R2). The F1-scores of R2 are obtained by averaging the F1-scores across all domains in GenTSR for each IoU.
On the new GenTSR dataset, none of the executable methods met our criterion for conditional replicability (see the R2 results in Figure 4, averaged across all six domains for each IoU threshold).
Conclusion
This study undertook a comprehensive examination of reproducibility and replicability within the context of table structure recognition, analyzing 16 recently-published papers. Direct reproduction of reported results, along with testing executable methods on alternative benchmark datasets, was pursued.
The results highlight substantial challenges in achieving reproducibility and replicability: only 2 of the 16 papers met the criteria for conditional replicability on a similar dataset, and none were deemed replicable on the new dataset. These findings underscore the difficulty of reproducing and replicating methods tailored to the TSR task, with implications for how reproducibility criteria are defined and for the data-dependence of replicability.
References
[1] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 572–573, 2020.
[2] Eunji Lee, Jaewoo Park, Hyung Il Koo, and Nam Ik Cho. "Deep-learning and graph-based approach to table structure recognition". Multimedia Tools and Applications, 81(4):5827–5848, 2022.
[3] Wenyuan Xue, Qingyong Li, and Dacheng Tao. "ReS2TIM: Reconstruct syntactic structures from table images". In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 749–755, 2019.
[4] Chris Tensmeyer, Vlad I. Morariu, Brian Price, Scott Cohen, and Tony Martinez. "Deep splitting and merging for table structure decomposition". In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 114–121, 2019.
[5] Wenyuan Xue, Baosheng Yu, Wen Wang, Dacheng Tao, and Qingyong Li. "TGRNet: A table graph reconstruction network for table structure recognition". In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1295–1304, 2021.
[6] Pascal Fischer, Alen Smajic, Giuseppe Abrami, and Alexander Mehler. "Multi-Type-TD-TSR – Extracting tables from document images using a multi-stage pipeline for table detection and table structure recognition: From OCR to structured table representations". In German Conference on Artificial Intelligence (Künstliche Intelligenz), pages 95–108. Springer, 2021.
[7] Steven N. Goodman, Daniele Fanelli, and John P. A. Ioannidis. "What does research reproducibility mean?". Science Translational Medicine, 8(341):341ps12–341ps12, 2016.