2020-06-05: Math formula extraction from scholarly papers using ScanSSD

To detect mathematical expressions in PDF document images, we implemented ScanSSD. This document records in detail each step needed to run ScanSSD, including preparation, constructing the directory structure, retraining the model, applying the model, and visualizing the results. It also gives an example of a specific project (detecting mathematical expressions in SBS papers).

We download and use the ScanSSD model from Parag Mali’s GitHub. To retrain and test the ScanSSD model, we also download the testing data (some PDF files and utility Python programs) from here.

Installation of some Python modules

The server runs Ubuntu 18.04. The ScanSSD model is implemented in Python. We have Python 3.6.9 and pip 9.0.1 installed on the Ubuntu server. The python3 command starts Python 3.6.9, and the pip3 command runs pip to install modules for Python 3.6.

1. Install PyTorch

The following command uses pip3 to install the PyTorch 1.5 package for Python with CUDA 10.2 on Linux. It installs the numpy-1.18.4, torch-1.5.0, and torchvision-0.6.0 modules.

~$ pip3 install torch torchvision

To check if the GPU driver and CUDA are enabled and accessible by PyTorch, run the following commands to return whether or not the CUDA driver is enabled:

~$ python3
>>> import torch
>>> torch.cuda.is_available()
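
The same check can be wrapped in a small script, which is handy when logging in to a new server. This is only a sketch: the function name cuda_report is ours, and it returns an empty report instead of failing when PyTorch is not installed.

```python
import importlib.util


def cuda_report():
    """Return a small dict describing whether PyTorch and CUDA are usable."""
    report = {"torch_installed": False, "cuda_available": False, "gpu_count": 0}
    # find_spec avoids an ImportError when PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return report
    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = bool(torch.cuda.is_available())
    # device_count() returns 0 when no CUDA device is visible.
    report["gpu_count"] = int(torch.cuda.device_count())
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False although a GPU is present, the driver or the CUDA build of PyTorch is the first thing to check.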

2. Install Visdom

The following command uses pip3 to install the Visdom server and client for Python.

~$ pip3 install visdom

When train.py runs, it also asks for the opencv, matplotlib, and torchviz modules to be installed if they are not already present.

Construct the directory structures for the ScanSSD model

Before we retrain and test the ScanSSD model, we first need to download the package and arrange all programs and data into proper directory structures.

We created a ScanSSD home directory /data/pwang/data/GTDB on the Linux server, and then followed the directory structure provided by Parag at https://github.com/MaliParag/ScanSSD/blob/master/dir_struct to construct the directory structure for the ScanSSD running environment.

Under the home directory, we create the following sub-directories for the ScanSSD model and data:

ssd: includes the SSD model in ssd.py, ScanSSD training in train.py, and ScanSSD testing in test.py. All the training code is in the layers sub-directory. Hyper-parameters for training and testing can be specified on the command line and in the config.py file in the data directory. The data directory also contains the gtdb_new.py data reader, which uses sliding windows to generate sub-images of a page for training. All the scripts for stitching the sub-image-level detections are in the gtdb directory. Functions for data augmentation and for visualizing bounding boxes and heat maps are in the utils directory.

images: stores all PDF document images. Each PDF document has a directory that is used to save all PDF pages in PNG image format. Each page has a separate PNG image file.
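
To make the sliding-window idea mentioned for gtdb_new.py more concrete, here is a minimal sketch of how window positions over a page image can be generated. It is not the code from gtdb_new.py: the function name, the 512-pixel window, and the 256-pixel stride are illustrative assumptions.

```python
def sliding_windows(page_width, page_height, window=512, stride=256):
    """Yield (left, top, right, bottom) boxes that tile a page image."""

    def starts(length):
        # Window origins along one axis. A final window flush with the
        # page edge is added so that no strip of the page is left out.
        if length <= window:
            return [0]
        positions = list(range(0, length - window + 1, stride))
        if positions[-1] != length - window:
            positions.append(length - window)
        return positions

    for top in starts(page_height):
        for left in starts(page_width):
            yield (left, top,
                   min(left + window, page_width),
                   min(top + window, page_height))


if __name__ == "__main__":
    for box in sliding_windows(1024, 512):
        print(box)
```

Because neighbouring windows overlap, the same formula can be detected in more than one sub-image, which is why a stitching step is needed afterwards.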

For retraining, testing, and stitching, we need to create a training file, a testing file and a stitching file. All these files are text files and include image file names (or with page numbers). These files are directly saved under the home directory.

To prepare the PDF document images, we follow the testing data information and use the ODU library to find all PDF files, and then use the utility program convert_pdf_to_image.py (downloaded with the testing data package) to convert each PDF file into PNG image files, one per page. Finally, we copy all PDF image files (organized in one directory per document) to the images directory above.
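
Once the images directory is filled, a data file with one document_name/page_number entry per line can be generated instead of typed by hand. The sketch below assumes that each page image is stored as images/<document_name>/<page_number>.png. The function names are ours, and the layout should be checked against the output of convert_pdf_to_image.py before use.

```python
import os


def list_pages(images_dir):
    """Return 'document_name/page_number' entries for every PNG page image."""
    entries = []
    for doc in sorted(os.listdir(images_dir)):
        doc_dir = os.path.join(images_dir, doc)
        if not os.path.isdir(doc_dir):
            continue
        pages = [os.path.splitext(name)[0]
                 for name in os.listdir(doc_dir)
                 if name.lower().endswith(".png")]
        # Sort numerically when page names are plain numbers, otherwise by name.
        pages.sort(key=lambda p: (0, int(p), "") if p.isdigit() else (1, 0, p))
        entries.extend(f"{doc}/{page}" for page in pages)
    return entries


def write_data_file(images_dir, out_path):
    """Write the entries to out_path and return how many pages were listed."""
    entries = list_pages(images_dir)
    with open(out_path, "w") as handle:
        handle.write("\n".join(entries) + "\n")
    return len(entries)
```

For example, write_data_file("/data/pwang/data/GTDB/images", "/data/pwang/data/GTDB/my_test_data") would produce a file that can be passed by name to --test_data (my_test_data is a made-up name).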

The following is the file and directory structure under the home directory /data/pwang/data/GTDB:

	GTDB
	├── images
	│ ├── . . .
	├── ssd
	│ ├── base_weights
	│ │ └── vgg16_reducedfc.pth
	│ ├── data
	│ │ ├── config.py
	│ │ ├── gtdb_new.py
	│ │ ├── __init__.py
	│ │ └── __pycache__
	│ ├── eval
	│ ├── gtdb
	│ │ ├── adjust_boxes.py
	│ │ ├── box_utils.py
	│ │ ├── calculate_means.py
	│ │ ├── create_dataset.py
	│ │ ├── create_gt_math.py
	│ │ ├── create_segmentation_gt.py
	│ │ ├── diagnose.py
	│ │ ├── feature_extractor.py
	│ │ ├── fit_box.py
	│ │ ├── generate_subimages.py
	│ │ ├── gen_training_ids.py
	│ │ ├── __init__.py
	│ │ ├── __pycache__
	│ │ ├── remove_rect.py
	│ │ ├── resize_gt.py
	│ │ ├── scale_boxes.py
	│ │ ├── split_annotations_per_page.py
	│ │ ├── stitch_patches_page.py
	│ │ └── stitch_patches_pdf.py
	│ ├── images
	│ │ └── detailed_math512_arch.png
	│ ├── layers
	│ │ ├── box_utils.py
	│ │ ├── functions
	│ │ ├── __init__.py
	│ │ ├── modules
	│ │ └── __pycache__
	│ ├── LICENSE
	│ ├── logs
	│ ├── README.md
	│ ├── ssd.py
	│ ├── test.py
	│ ├── train.py
	│ ├── utils
	│ │ ├── augmentations.py
	│ │ ├── helpers.py
	│ │ ├── __init__.py
	│ │ ├── __pycache__
	│ │ └── visualize.py
	├── testing_data
	├── test_pdf
	├── training_data
	├── train_pdf
	└── validation_data

Retraining the ScanSSD model

To retrain the ScanSSD model, we use the training_data file provided by the GitHub package.

1. Start the Visdom server by running:

~$ python3 -m visdom.server

To run the Visdom server in the background, run:

~$ nohup python3 -m visdom.server &

2. Change to the ssd directory and run train.py with the given training_data file

~$ cd /home/pwang/data/GTDB/ssd
~$ python3 train.py \
    --dataset GTDB \
    --dataset_root /data/pwang/data/GTDB/ \
    --cuda True \
    --visdom True \
    --batch_size 16 \
    --num_workers 4 \
    --exp_name IOU512_pw18 \
    --model_type 512 \
    --training_data training_data \
    --cfg hboxes512 \
    --loss_fun ce \
    --kernel 1 5 \
    --padding 0 2 \
    --neg_mining True \
    --pos_thresh 0.75

To run the training in the background, wrap the above command with nohup.

The parameters are as follows:

--dataset specifies dataset name.

--dataset_root specifies dataset’s home/root directory.

--cuda specifies whether to use CUDA to train the model, True or False.

--visdom specifies whether to use Visdom for loss visualization, True or False.

--batch_size specifies batch size for training.

--num_workers specifies number of workers used in data loading.

--exp_name specifies experiment name. For example, with the value IOU512_pw18, the directory weights_IOU512_pw18 is created to store all weight files, and the final weight file is named IOU512_pw18GTDB.pth.

--model_type specifies type of SSD model: 300 (ssd300) or 512 (ssd512).

--training_data specifies training data to use. Each line specifies one page image with format as document_name/page_number.

--cfg specifies type of network, gtdb, math_gtdb_512 or hboxes512.

--loss_fun specifies type of loss: either fl (focal loss) or ce (cross entropy) .

--kernel specifies kernel size for feature layers: 3 3 or 1 5.

--padding specifies padding for feature layers: 1 1 or 0 2.

--neg_mining specifies whether or not to use hard negative mining with ratio 1:3, True or False.

--pos_thresh specifies the IoU threshold: all default boxes with IoU > pos_thresh are considered positive examples.
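
To illustrate what --pos_thresh means, the sketch below computes the IoU (intersection over union) of two boxes and applies the threshold. It is a plain-Python illustration, not the matching code in the layers directory. The function names and the (xmin, ymin, xmax, ymax) box format are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(default_box, gt_box, pos_thresh=0.75):
    """A default box is a positive example when its IoU exceeds pos_thresh."""
    return iou(default_box, gt_box) > pos_thresh


if __name__ == "__main__":
    formula = (0, 0, 100, 20)  # made-up ground-truth formula box
    print(iou((0, 0, 90, 20), formula), is_positive((0, 0, 90, 20), formula))
    print(iou((0, 0, 50, 20), formula), is_positive((0, 0, 50, 20), formula))
```

With pos_thresh set to 0.75, a default box covering 90% of the made-up formula box counts as positive (IoU 0.9), while one covering half of it does not (IoU 0.5).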

Note: on our Linux server, we need to change ssd_net = torch.nn.DataParallel(ssd_net) to ssd_net = torch.nn.DataParallel(ssd_net, device_ids=[0]) in train.py.

If CUDA 0 is not available, change [0] to an available device, for example, [1].

To run several sessions at the same time, I highly recommend preparing several copies of test.py, one per GPU on your machine, set to CUDA 0, CUDA 1, and so on. This saves a lot of time that would otherwise be spent modifying the code.
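
A possible alternative to keeping one copy of the script per GPU is the CUDA_VISIBLE_DEVICES environment variable, which limits the GPUs a process can see. We did not use it in this setup, so treat it as an untested suggestion. The selected GPU then appears as device 0 inside the process, so device_ids=[0] can stay unchanged. The helper name env_for_gpu is ours.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Copy an environment and restrict the CUDA devices a child process sees."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the listed physical GPU is visible; inside the process it is device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    # Hypothetical usage: start one session per GPU without editing the scripts.
    # import subprocess
    # subprocess.Popen(["python3", "test.py", "--cuda", "True"], env=env_for_gpu(1))
    print(env_for_gpu(1, {"PATH": "/usr/bin"}))
```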

3. Run test.py with the given validation_data file in the ssd directory

~$ python3 test.py \
    --dataset_root ../ \
    --trained_model IOU512_pw18GTDB.pth \
    --visual_threshold 0.25 \
    --cuda True \
    --exp_name test_MATH512_pw18 \
    --test_data validation_data \
    --suffix "_512" \
    --model_type 512 \
    --cfg hboxes512 \
    --padding 0 2 \
    --kernel 1 5 \
    --batch_size 4

The parameters are as follows:

--dataset_root specifies dataset’s home/root directory.

--trained_model specifies the trained weight file including path and file name.

--visual_threshold specifies final confidence threshold.

--cuda specifies whether to use CUDA, True or False.

--exp_name specifies experiment name that is used to create a directory to store all test outputs.

--test_data specifies the testing data file. Each line specifies one page image with format as document_name/page_number.

--suffix specifies suffix of directory of images for testing.

--model_type specifies type of SSD model: 300 (ssd300) or 512 (ssd512).

--cfg specifies type of network, gtdb, math_gtdb_512 or hboxes512.

--padding specifies padding for feature layers: 1 1 or 0 2.

--kernel specifies kernel size for feature layers: 3 3 or 1 5.

--batch_size specifies batch size (here, for testing).

The test output files (.csv) are saved to the test_MATH512_pw18 directory under the eval directory.

4. Run stitch_patches_pdf.py with the given train_pdf file in the ssd directory

~$ python3 gtdb/stitch_patches_pdf.py \
    --data_file ../train_pdf \
    --output_dir eval/stitched_MATH512_pw18/ \
    --math_dir eval/test_MATH512_pw18/ \
    --stitching_algo equal \
    --algo_threshold 30 \
    --num_workers 8 \
    --postprocess True \
    --home_images ../images/

The parameters are as follows:

--data_file specifies the data file to use. This is a list of file names, one per line.

--output_dir specifies a directory to store outputs.

--math_dir specifies the directory where the testing results are stored.

--stitching_algo specifies stitching algorithm to use.

--algo_threshold specifies stitching algorithm threshold.

--num_workers specifies number of workers used in data loading.

--postprocess specifies whether to fit math regions before pooling, True or False.

--home_images specifies the directory of the original document images.

For the above stitching, the stitching output files are saved to the stitched_MATH512_pw18 directory under the eval directory.

The stitching output files are still saved as .csv, and the coordinates are written as floating-point values. To visualize the results with visualize_annotations.py, we need to change the file extension to .math and change all floating-point values to integers (by deleting every .00).

We changed stitch_patches_pdf.py to save the output files as .math and to write the coordinate values as integers.
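
For readers who prefer not to modify stitch_patches_pdf.py, the same conversion can be done afterwards on the .csv files. The following is a sketch and not our modified script: it assumes every field in the file is numeric (no header row), and the function name csv_to_math is ours.

```python
import csv
import os


def csv_to_math(csv_path, math_path=None):
    """Write a .math copy of a stitched .csv file with integer values."""
    if math_path is None:
        math_path = os.path.splitext(csv_path)[0] + ".math"
    with open(csv_path, newline="") as src, open(math_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            # "12.00" becomes 12; rounding guards against values such as 11.9999.
            writer.writerow([int(round(float(value))) for value in row])
    return math_path
```

Calling csv_to_math on every .csv file in the stitched output directory leaves the originals untouched and writes the .math files next to them.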

5. Run visualize_annotations.py to visualize the results

~$ python3 /home/pwang/TFD-ICDAR2019/TFD-ICDAR2019v1/VisualizationTools/visualize_annotations.py \
    --img_dir /data/pwang/data/GTDB/images \
    --out_dir /data/pwang/data/GTDB/ssd/visual18/ \
    --math_dir /data/pwang/data/GTDB/ssd/eval/stitched_MATH512_pw18/

The following is an example. The retrained model draws many bounding boxes that do not enclose math equations.

Using the pre-trained model

The ScanSSD model provides a pre-trained weight file AMATH512_e1GTDB.pth.

If we run test.py directly with this pre-trained weight file, the results look much better than those from our retrained weight file.

1. Run test.py with the pre-trained weight file AMATH512_e1GTDB.pth and the given validation_data file in the ssd directory

~$ python3 test.py \
    --dataset_root ../ \
    --trained_model AMATH512_e1GTDB.pth \
    --visual_threshold 0.25 \
    --cuda True \
    --exp_name test_MATH512_pw \
    --test_data validation_data \
    --suffix "_512" \
    --model_type 512 \
    --cfg hboxes512 \
    --padding 0 2 \
    --kernel 1 5 \
    --batch_size 4

The test output files (.csv) are saved to the test_MATH512_pw directory under the eval directory.

2. Run stitch_patches_pdf.py with the given train_pdf file in the ssd directory

~$ python3 gtdb/stitch_patches_pdf.py \
    --data_file ../train_pdf \
    --output_dir eval/stitched_MATH512_pw/ \
    --math_dir eval/test_MATH512_pw/ \
    --stitching_algo equal \
    --algo_threshold 30 \
    --num_workers 8 \
    --postprocess True \
    --home_images ../images/

The stitch output files are saved to the stitched_MATH512_pw directory under the eval directory.

3. Run visualize_annotations.py to visualize the results

~$ python3 /home/pwang/TFD-ICDAR2019/TFD-ICDAR2019v1/VisualizationTools/visualize_annotations.py \
    --img_dir /data/pwang/data/GTDB/images \
    --out_dir /data/pwang/data/GTDB/ssd/visual/ \
    --math_dir /data/pwang/data/GTDB/ssd/eval/stitched_MATH512_pw/

The following is the result for the same page as above, which looks much better than the result from the retrained model.

So we decided to use the pre-trained weight file for our project.

The SBS (social and behavioral sciences) Project

In this project, we want to find all mathematical expressions in 793 SBS papers in PDF format.

1. Prepare document images

We create a directory pdfs under the home directory (/data/pwang/data/GTDB), and download and save all SBS PDF papers into this directory. If a PDF file name contains space characters, we replace them with underscores.
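
With 793 files, the renaming is easier to do with a short script. The sketch below is one way to do it. The function names are ours, and the default dry_run=True only reports what would be renamed, so the list can be checked first.

```python
import os


def underscore_name(filename):
    """Replace every space character in a file name with an underscore."""
    return filename.replace(" ", "_")


def rename_pdfs(pdf_dir, dry_run=True):
    """Return (old, new) name pairs; rename on disk only when dry_run is False."""
    changes = []
    for name in sorted(os.listdir(pdf_dir)):
        new_name = underscore_name(name)
        if name.lower().endswith(".pdf") and new_name != name:
            changes.append((name, new_name))
            if not dry_run:
                os.rename(os.path.join(pdf_dir, name),
                          os.path.join(pdf_dir, new_name))
    return changes
```

For example, rename_pdfs("/data/pwang/data/GTDB/pdfs") lists the planned changes, and adding dry_run=False applies them.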

We create another directory PDF under the home directory, and copy convert_pdf_to_image.py to this directory.

We delete all image files and their directories in the /data/pwang/data/GTDB/images directory.

Run the following command from the PDF directory to generate PDF document image files.

~$ python3 convert_pdf_to_image.py /data/pwang/data/GTDB/pdfs /data/pwang/data/GTDB/images

2. Prepare testing data files and stitching data files

The 793 SBS PDF papers have a total of 15,340 pages. With some testing, we found that test.py cannot process more than 2000 image files at a time. We therefore create 14 test data files and 14 stitch files, grouped by the first letter of the PDF file names, as follows.

No.  Test-File-Name      Pages  Stitch-File-Name      Documents
1    test_793A_data      563    test_793A_stitch      27
2    test_793B_data      1744   test_793B_stitch      77
3    test_793C_data      870    test_793C_stitch      42
4    test_793DE_data     1006   test_793DE_stitch     56
5    test_793FG_data     1224   test_793FG_stitch     59
6    test_793H_data      1172   test_793H_stitch      58
7    test_793IJK_data    1420   test_793IJK_stitch    66
8    test_793L_data      1034   test_793L_stitch      61
9    test_793M_data      967    test_793M_stitch      53
10   test_793NOP_data    1312   test_793NOP_stitch    71
11   test_793QR_data     806    test_793QR_stitch     49
12   test_793S_data      1186   test_793S_stitch      63
13   test_793TUV_data    816    test_793TUV_stitch    47
14   test_793WXYZ_data   1220   test_793WXYZ_stitch   64
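
The grouping above was done by hand. For a different collection, a grouping of this kind can be approximated with a greedy pass over the first letters. The sketch below is only an illustration and will not necessarily reproduce our 14 groups. The function name and the page_counts mapping (document name to number of pages) are assumptions, and a single letter with more pages than the limit still ends up as one oversized batch.

```python
from collections import OrderedDict


def group_by_letter(page_counts, max_pages=2000):
    """Greedily merge consecutive first letters into batches of at most max_pages.

    Returns a list of (label, [document names], total pages).
    """
    per_letter = OrderedDict()
    for doc in sorted(page_counts, key=str.upper):
        per_letter.setdefault(doc[0].upper(), []).append(doc)

    batches = []
    label, docs, total = "", [], 0
    for letter, letter_docs in per_letter.items():
        pages = sum(page_counts[d] for d in letter_docs)
        # Close the current batch when adding this letter would exceed the limit.
        if docs and total + pages > max_pages:
            batches.append((label, docs, total))
            label, docs, total = "", [], 0
        label += letter
        docs = docs + letter_docs
        total += pages
    if docs:
        batches.append((label, docs, total))
    return batches


if __name__ == "__main__":
    demo = {"Alpha": 900, "apple": 300, "Beta": 1000, "Cat": 500, "Dog": 400}
    for label, docs, total in group_by_letter(demo):
        print(label, total, docs)
```

Each label can then be used to name a test data file and a stitch file, in the same way as test_793DE_data and test_793DE_stitch.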

3. Run test.py with the pre-trained weight file AMATH512_e1GTDB.pth for each of the above test data files in the ssd directory. The following command is for the test_793A_data file.

~$ python3 test.py \
    --dataset_root ../ \
    --trained_model AMATH512_e1GTDB.pth \
    --visual_threshold 0.25 \
    --cuda True \
    --exp_name test_MATH512_793A \
    --test_data test_793A_data \
    --suffix "_512" \
    --model_type 512 \
    --cfg hboxes512 \
    --padding 0 2 \
    --kernel 1 5 \
    --batch_size 8

The test output files (.csv) are saved to the test_MATH512_793A directory under the eval directory.
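
Since the same command has to be repeated for all 14 test data files, it can help to generate the commands instead of editing them by hand. The sketch below only builds and prints the argument lists, using the flags and values from the command above. The function name build_test_command and the BATCHES list are ours.

```python
import shlex

# Batch labels taken from the 14 test data file names, e.g. test_793A_data.
BATCHES = ["793A", "793B", "793C", "793DE", "793FG", "793H", "793IJK",
           "793L", "793M", "793NOP", "793QR", "793S", "793TUV", "793WXYZ"]


def build_test_command(batch, weights="AMATH512_e1GTDB.pth", batch_size=8):
    """Build the test.py argument list for one batch label such as '793A'."""
    return [
        "python3", "test.py",
        "--dataset_root", "../",
        "--trained_model", weights,
        "--visual_threshold", "0.25",
        "--cuda", "True",
        "--exp_name", f"test_MATH512_{batch}",
        "--test_data", f"test_{batch}_data",
        "--suffix", "_512",
        "--model_type", "512",
        "--cfg", "hboxes512",
        "--padding", "0", "2",
        "--kernel", "1", "5",
        "--batch_size", str(batch_size),
    ]


if __name__ == "__main__":
    for batch in BATCHES:
        # Print a copy-and-paste-ready command line for each batch.
        print(" ".join(shlex.quote(arg) for arg in build_test_command(batch)))
```

Each list can also be passed to subprocess.run with the ssd directory as the working directory, so that the batches run one after another.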

4. Run stitch_patches_pdf.py with the test_793A_stitch file in the ssd directory

~$ python3 gtdb/stitch_patches_pdf.py \
    --data_file ../test_793A_stitch \
    --output_dir eval/stitched_MATH512_793A/ \
    --math_dir eval/test_MATH512_793A/ \
    --stitching_algo equal \
    --algo_threshold 30 \
    --num_workers 8 \
    --postprocess True \
    --home_images ../images/

The stitch output files contain the page and the coordinates (top-left and bottom-right corners) of each detected math formula. They are saved to the stitched_MATH512_793A directory under the eval directory.

5. Run visualize_annotations.py to visualize the results

~$ python3 /home/pwang/TFD-ICDAR2019/TFD-ICDAR2019v1/VisualizationTools/visualize_annotations.py \
    --img_dir /data/pwang/data/GTDB/images \
    --out_dir /data/pwang/data/GTDB/ssd/visual793A/ \
    --math_dir /data/pwang/data/GTDB/ssd/eval/stitched_MATH512_793A/

The following is a random page with found mathematical expressions.

False Positive Cases

ScanSSD is a great tool for detecting math formulas in papers. It gives users the location (page and coordinates) of each detected math formula, and the same results can be used for visualization. Since math formulas are an important feature of papers, further studies can build on these detections.

References

1. Parag Mali, Puneeth Kukkadapu, Mahshad Mahdavi, and Richard Zanibbi, ScanSSD: Scanning Single Shot Detector for Mathematical Formulas in PDF Document Images, Technical Report arXiv:2003.08005v1 [cs.CV] 18 Mar 2020

2. Parag Mali, https://github.com/MaliParag/ScanSSD/

-- Pei Wang


Web Science and Digital Libraries Research Group