2022-05-19: Regular Expression Rule-Based Approach for Table and Figure Reference Extraction from Scientific Papers

Tables and figures are an essential part of a well-written scientific paper. Tables carry the bulk of the detailed information, such as results and their associations, while figures present many of the basic concepts, process flows, key trends, and key discoveries. In this blog post, I present a simple but effective rule-based approach that uses regular expressions (RegEx) to extract table and figure references from the text of scientific papers.

What is a table or figure reference?

In scientific papers, tables and figures are referred to in the body text to support claims. Below are two examples of such references.

As seen in Table 3, there are 3 cross-listed top 10 features identified by both ANOVA-F and MI (in blue text).
Figure 4 shows that evaluation results using the core features exhibit significantly different performances.

Overview of the rule-based approach

Prior to using the rule-based method for extracting table and figure references from the scientific papers, we need to convert the documents into text. Once the text is extracted from the paper, this rule-based approach uses RegEx to locate the table and figure references in the body text.

Workflow of the Rule-Based Approach using RegEx for Table and Figure Reference Extraction

Document Conversion

We first need to extract text from scientific papers in PDF format. To do that, we can use an open-source library such as GROBID (GeneRation Of BIbliographic Data). GROBID is a machine learning library that extracts, parses, and restructures raw documents such as PDFs into structured XML (TEI) documents, with a focus on scholarly papers.

Once the XML documents are generated, we need to use an XML parser to extract the text from XML. A Python library like BeautifulSoup can be used for pulling data out of XML files.
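As a rough illustration of this parsing step, here is a minimal sketch that pulls paragraph text out of a TEI document. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra dependencies. The sample XML is hand-written to resemble GROBID output, and the helper name extract_body_paragraphs is hypothetical; real GROBID output is richer and may need extra handling.

```python
import xml.etree.ElementTree as ET
from typing import List

# TEI namespace used by TEI-encoded XML documents.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# Hand-written sample for illustration only, not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Results</head>
        <p>As seen in <ref type="table">Table 3</ref>, there are 3 cross-listed features.</p>
        <p>Figure 4 shows the evaluation results.</p>
      </div>
    </body>
  </text>
</TEI>"""


def extract_body_paragraphs(tei_xml: str) -> List[str]:
    """Return the text of every <p> inside <body>, with whitespace normalized."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    paragraphs = []
    for p in body.iter("{%s}p" % TEI_URI):
        # itertext() also collects text nested in inline elements such as <ref>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


for paragraph in extract_body_paragraphs(SAMPLE_TEI):
    print(paragraph)
```

Joining the paragraphs with spaces gives a single string that can be passed to the RegEx step.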

Table and Figure Reference Extraction

Given the text extracted from the scientific papers, we can locate the table and figure references in the textual documents using RegEx.

In scientific publications, authors mention tables and figures in different styles. For example, a figure can be written as "fig" or "figure". Authors can number figures and tables with Arabic digits (Table 1, Table 2, etc.), Roman numerals (Table I, Table IV, etc.), or alphanumerics (Table 1A, Table 1B, etc.). Similarly, authors refer to more than one table or figure at a time by pluralizing the reference, as in "Tables 1 and 2" or "Tables 1 - 4". The tables below list the styles used in scientific papers to refer to tables and figures, along with the relevant RegEx for each style.

Different Styles for Tables

Style	RegEx Patterns
Table 1 | Table 1A	r'table [0-9]+[a-z]*'
Tables 1–4 (en dash)	r'tables [0-9]+–[0-9]+'
Tables 1-4 (hyphen)	r'tables [0-9]+-[0-9]+'
Tables 1 and 2 | Tables 1A and 2A	r'tables [0-9]+[a-z]* and [0-9]+[a-z]*'
Table IV	r'table (M{0,4})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})'
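To see how the alphanumeric table styles behave in practice, here is a minimal sketch that applies them with Python's re module. The names TABLE_PATTERNS and find_table_refs are illustrative and not part of any library. The leading \b and the merged en dash/hyphen character class are assumptions added here for convenience.

```python
import re

# Alphanumeric table styles, matched case-insensitively.
TABLE_PATTERNS = [
    r"tables [0-9]+[a-z]* and [0-9]+[a-z]*",  # Tables 1 and 2, Tables 1A and 2A
    r"tables [0-9]+[–-][0-9]+",               # Tables 1–4 (en dash) or Tables 1-4 (hyphen)
    r"table [0-9]+[a-z]*",                    # Table 1, Table 1A
]

# \b is an added safeguard so that words such as "vegetable 1" do not match.
TABLE_REF = re.compile(r"\b(?:" + "|".join(TABLE_PATTERNS) + r")", re.IGNORECASE)


def find_table_refs(text):
    """Return every alphanumeric table reference in order of appearance."""
    return [m.group(0) for m in TABLE_REF.finditer(text)]


sample = ("Table 1 lists the datasets, Tables 2 and 3 report accuracy, "
          "and Tables 4-6 give ablations (see also table 7a).")
print(find_table_refs(sample))
# ['Table 1', 'Tables 2 and 3', 'Tables 4-6', 'table 7a']
```

Roman-numeral references are left out of this sketch because they are safer to match case-sensitively.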

Different Styles for Figures

Style	RegEx Patterns
Fig 1 | Fig. 1 | Fig.1 | Figure 1 | Fig 1A | Fig. 1A | Fig.1A | Figure 1A	r'(fig(ure) ?|fig.( )?)([0-9]+[a-z]*)'
Figs 1–4 | Figs. 1–4 | Figures 1–4 (en dash)	r'(fig(ure)?s |figs. )([0-9]+–[0-9]+)'
Figs 1-4 | Figs. 1-4 | Figures 1-4 (hyphen)	r'(fig(ure)?s |figs. )([0-9]+-[0-9]+)'
Figs 1 and 2 | Figs. 1 and 2 | Figures 1 and 2 | Figs 1A and 1B | Figs. 1A and 1B | Figures 1A and 1B	r'(fig(ure)?s |figs. )([0-9]+[a-z]* and [0-9]+[a-z]*)'
Fig IV | Fig. IV | Fig.IV | Figure IV	r'(fig(ure) ?|fig.( )?)(M{0,4})(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})'
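The alphanumeric figure styles can be tried out in the same way. The sketch below is a consolidated variant written for illustration, not the library's own patterns. It escapes the dot after "fig" so that the dot is matched literally, and it folds the en dash, hyphen, and "and" forms into one plural pattern. The names SINGULAR, PLURAL, and find_fig_refs are hypothetical.

```python
import re

# Fig 1 | Fig. 1 | Fig.1 | Figure 1, with an optional letter suffix such as 1A.
SINGULAR = r"fig(?:ure|\.)? ?[0-9]+[a-z]*"

# Figs 1–4 | Figs. 1-4 | Figures 1 and 2, with optional letter suffixes.
PLURAL = r"fig(?:ure)?s\.? [0-9]+[a-z]*(?: and |[–-])[0-9]+[a-z]*"

# The plural form is tried first; \b keeps matches from starting mid-word.
FIG_REF = re.compile(r"\b(?:" + PLURAL + "|" + SINGULAR + r")", re.IGNORECASE)


def find_fig_refs(text):
    """Return every alphanumeric figure reference in order of appearance."""
    return [m.group(0) for m in FIG_REF.finditer(text)]


sample = ("Fig. 2 and Fig.3a show the setup; Figures 4 and 5 and Figs. 6–8 "
          "show results, while figure 9 is a summary.")
print(find_fig_refs(sample))
# ['Fig. 2', 'Fig.3a', 'Figures 4 and 5', 'Figs. 6–8', 'figure 9']
```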

We can use these RegEx patterns to locate the table and figure references in the text extracted from scientific papers. To do that, I present the TableAndFigureReferenceExtractor Python library, which provides functions to extract table and figure references from text using RegEx.

First, we need to install this library from GitHub using the following command.

pip3 install -e git+https://github.com/lamps-lab/TableAndFigureReferenceExtractor#egg=table_and_fig_ref_extraction

Then, by importing the TableAndFigureReferenceExtractor library into our Python code and calling its findTableRefs() and findFigRefs() methods, we can find the table and figure references in the text that match the RegEx patterns included in the library. Each method returns two lists: the references with alphanumeric numbering and the references with Roman numerals. A sample code snippet is given below.

from table_and_fig_ref_extraction.ExtractTableAndFigRefPatterns import findTableRefs, findFigRefs

# Sample body text containing one table reference and one figure reference
document_text = "As seen in Table 3, there are 3 cross-listed top 10 features identified by both ANOVA-F and MI (in blue text). Figure 4 shows that evaluation results using the core features exhibit significantly different performances."

# Each call returns the alphanumeric references and the Roman-numeral references
alphanumeric_table_refs, roman_number_table_refs = findTableRefs(document_text)

alphanumeric_fig_refs, roman_number_fig_refs = findFigRefs(document_text)

print(alphanumeric_table_refs, roman_number_table_refs)
print(alphanumeric_fig_refs, roman_number_fig_refs)

>> output:
[Table 3], []
[Figure 4], []

Moreover, the TableAndFigureReferenceExtractor library keeps track of these RegEx patterns and of whether each pattern is case-sensitive. If new RegEx patterns emerge, they can be added to the library using the following methods.

addNewTablePattern(isAlphanumericType, pattern, ignoreCase)

addNewFigPattern(isAlphanumericType, pattern, ignoreCase)

If the new pattern represents a table or figure with an alphanumeric number, we set isAlphanumericType to True; if it represents a table or figure with a Roman numeral, we set it to False. The new RegEx pattern is passed in the pattern parameter. We set ignoreCase to True if the new pattern is case-insensitive and to False if it is case-sensitive. The default for ignoreCase is True.
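To make the role of the case-sensitivity flag concrete, here is a minimal, self-contained sketch of a pattern registry. It is not the library's implementation. The class PatternRegistry, its methods, and the ROMAN pattern (a variant with an added lookahead and lookbehind-free end check so that an empty numeral cannot match) are all hypothetical and only mirror the parameters described above.

```python
import re
from typing import List, Tuple

# Roman numeral that must start with a Roman letter and must not run into
# another word character; both checks are additions made for this sketch.
ROMAN = (r"(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})"
         r"(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})(?!\w)")


class PatternRegistry:
    """Hypothetical registry that stores patterns with a per-pattern case flag."""

    def __init__(self) -> None:
        self._patterns = []  # list of (compiled regex, is_alphanumeric_type)

    def add_pattern(self, is_alphanumeric_type: bool, pattern: str,
                    ignore_case: bool = True) -> None:
        flags = re.IGNORECASE if ignore_case else 0
        self._patterns.append((re.compile(pattern, flags), is_alphanumeric_type))

    def find_refs(self, text: str) -> Tuple[List[str], List[str]]:
        """Return (alphanumeric references, Roman-numeral references)."""
        alphanumeric, roman = [], []
        for regex, is_alnum in self._patterns:
            target = alphanumeric if is_alnum else roman
            target.extend(m.group(0) for m in regex.finditer(text))
        return alphanumeric, roman


tables = PatternRegistry()
tables.add_pattern(True, r"\btable [0-9]+[a-z]*")                # case-insensitive
tables.add_pattern(False, r"\bTable " + ROMAN, ignore_case=False)  # case-sensitive

text = "Table 3 and Table IV list the features; the table is sorted by score."
print(tables.find_refs(text))
# (['Table 3'], ['Table IV'])
```

Matching the Roman-numeral pattern case-insensitively would let ordinary lowercase words that happen to spell a numeral, such as "mix" in "table mix", be reported as references, which is why a per-pattern flag is useful.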

In this blog post, I presented a simple rule-based approach that uses RegEx to extract table and figure references from scientific papers. I gave an overview of the workflow and showed RegEx patterns that match the different styles authors use to refer to tables and figures in the body text. I also presented TableAndFigureReferenceExtractor, a Python library I built to locate these references in text. The library keeps track of the RegEx patterns for all the table and figure reference styles I have identified in scientific papers, and users can add new patterns to it as they are identified.

The TableAndFigureReferenceExtractor library is available on GitHub and we welcome contributions.

-- Yasasi Abeysinghe (@Yasasi_Abey)

