2020-08-27: Summer Internship Report — Los Alamos National Laboratory

Considering the COVID-19 pandemic, when everything was uncertain, in early May 2020 I was accepted to the Applied Machine Learning (AML) Summer Research Fellowship Program at Los Alamos National Laboratory (LANL). I cannot be grateful enough to have had this excellent opportunity while the world was suffering from the pandemic caused by the novel coronavirus. Due to the rise of coronavirus cases across the United States, I was offered the option to work remotely. Thus, I started working from home as a Research Intern at LANL on June 1, 2020, under the supervision of Dr. Diane Oyen and Dr. Kari Sentz. The research problem I mainly focused on was Offline Handwritten Mathematical Expression Recognition, a subcomponent of a project called Scientific Image Analysis.


Remotely Working as a Research Intern

Los Alamos National Laboratory is a Federally Funded Research and Development Center (FFRDC). Its top priority is to address research problems that focus on national security and 21st-century science. Every year LANL invites approximately 2000 students to work on various scientific applications. These students get the flexibility to work on different research projects, which is beneficial to the Lab. In return, students get the opportunity to gain hands-on experience by solving various research problems using scientific techniques.

Los Alamos National Laboratory
(Source: https://www.flickr.com/photos/losalamosnatlab/)

AML Kickoff

The AML program at LANL is sponsored by the Information Science and Technology Institute (ISTI) and the Center for Space and Earth Sciences (CSES), both centers in LANL's educational outreach organization, the National Security Education Center (NSEC). The program is a 10-12 week summer school and internship whose mission is to help students build a solid foundation in modern machine learning techniques through applications of importance to the National Lab. This year the AML program accepted 16 students from different universities across the United States. Each student worked in a small collaboration, guided by mentors with scientific and computational expertise.

The program started with a kickoff meeting on June 1, 2020. The AML program offered four different projects, each very diverse and multi-disciplinary: Scientific Image Analysis, Subsurface Imaging, Atomistic Machine Learning, and Non-Negative Tensor Factorization. Four of the 16 students worked collaboratively on each project under different mentors and co-lead mentors. On the kickoff day, the mentors from the four projects introduced themselves and provided an overview of the projects. At the end of the meeting, the students introduced themselves.

Go-Figure Meeting and Weekly Meeting

During the project, I reported my tasks and next steps to both of my mentors daily via the Slack channel. We also had a Go-Figure meeting every Monday, where we shared our findings regarding the experiments carried out in the previous week. I was fully responsible for carrying out the project and getting the experiments done by reviewing scientific literature, reaching out to other co-lead mentors, and discussing problems with both of my mentors every Wednesday during our one-on-one meeting.

Workshops

I spent most of my time working on the project and actively took part in various seminars and workshops. I never felt overwhelmed by the work and extracurriculars. I loved working on every challenge, and my schedule was flexible enough to participate in a Wednesday meeting called Text Talk, organized by Dr. Kari Sentz. During Text Talk, I provided comments and feedback on a co-worker's text analysis work on Topic Modeling of the COVID-19 Open Research Dataset (CORD-19) using Non-Negative Matrix Factorization (NMF) and Ontologies. Participating in this meeting introduced me to Ontologies and NMF at a high level.
Weekly Workshop — Design Your Career Goal

I also participated in the weekly workshop called Design Your Career Goal (DYC). During this workshop, presenters from LANL shared their career experiences and provided excellent recommendations, strategies, and ideas for succeeding in one's career. Moreover, I attended a few more workshops at the beginning of the summer school, organized by various organizations at LANL. One of these, a PyTorch Tutorial, was especially beneficial for anyone implementing a deep neural network to train a model on a large dataset. Furthermore, I participated in a bi-weekly group presentation called Show and Tell, where students from different research groups presented and provided updates on their ongoing projects.

Show and Tell Webex Meeting

There were many opportunities to get exposure to ongoing research at LANL through weekly workshops and talks organized by different institutions. Although attending the weekly seminars was optional, they provided insight into how people in other fields utilize machine learning techniques to solve specific problems. I was honored to attend a few of them, considering my schedule and availability.

Work Accomplished

1) Introduction

At LANL, there are patent images and scientific archives that consist of handwritten mathematical expressions embedded with text and images. Knowing my work on scanned Electronic Theses and Dissertations, my mentors assigned me to read the math recognition literature, as the problem is challenging and there is little precedent of research in offline handwritten math expression (HME) recognition. Based on the literature study, I ran a couple of experiments, rendered online math expressions as offline images, proposed a few techniques to employ, and proposed a pipeline (Fig. 1) to implement the model.

Fig. 1: Offline Handwritten Math Expression Recognition Proposed Pipeline

In the beginning, we thought that we could use Tesseract-OCR or other OCR techniques, since math expressions have sequential structure and textual information. Accordingly, I applied Tesseract-OCR to offline HME, but the result was noisy and failed to recognize complex expressions, math symbols, and operators. The math recognition survey by Zanibbi and Blostein [1] describes several state-of-the-art tools (Fig. 2) for online HME recognition. When those tools were applied to offline HME, the researchers found that the expression recognition rate reached only about 65%, whereas the same state-of-the-art tools achieved 92-95% accuracy on online HME. Therefore, our goal for this summer project was to focus on offline HME recognition for the Easy Case (Fig. 1), which is to recognize the sequential pattern of offline HME using deep neural networks such as a CNN and an LSTM.
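To give a flavor of that first experiment, the snippet below is a minimal sketch of running Tesseract-OCR on an offline HME image via the pytesseract wrapper and OpenCV. The file name, binarization step, and page-segmentation mode are my own illustrative choices, not the exact settings used in the project.

```python
import cv2
import pytesseract

# Load a scanned handwritten expression (hypothetical file name) in grayscale
image = cv2.imread("hme_sample.png", cv2.IMREAD_GRAYSCALE)

# Otsu binarization: Tesseract generally behaves better on clean black-on-white input
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Treat the expression as a single text line (--psm 7); on handwritten math the
# output is typically noisy and misses symbols and operators
text = pytesseract.image_to_string(binary, config="--psm 7")
print(text)
```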

Fig. 2: Online Math Entry Systems


2) Challenges


Nazemi et al. [2] described that the challenges in recognizing offline HME involve two-dimensional layouts, subscript and superscript positioning, variable-sized characters, unusual math operators that depend on the area of mathematics, and finding spatial relationships in complex expressions. Also, the biggest challenge in offline HME recognition stems from the limitations of the datasets: the symbol classes available for labeling are limited, some classes are redundant, and the distribution of images across symbol classes is imbalanced.


In contrast, online HME recognition takes pen strokes, document images, and vector graphics (e.g., PDF) as input. Thus, with stroke information available, the input can be represented in Ink Markup Language (InkML), LaTeX, or MathML formats. This is an advantage for online HME recognition, since these formats can be converted to label graph data (Fig. 3), where the relationship information (i.e., the symbol layout tree) can be found for each expression.
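As context for how an online expression becomes an offline image, the sketch below rasterizes the pen strokes of a CROHME InkML file with matplotlib, assuming the standard InkML namespace and trace elements containing comma-separated points with space-separated coordinates. The file names and figure size are illustrative, not the exact rendering script used in the project.

```python
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt

def render_inkml(inkml_path, out_png):
    """Rasterize the pen strokes of an InkML file into an offline expression image."""
    ns = {"ink": "http://www.w3.org/2003/InkML"}
    root = ET.parse(inkml_path).getroot()
    fig, ax = plt.subplots(figsize=(4, 1.5))
    for trace in root.findall("ink:trace", ns):
        # Each trace holds points "x1 y1, x2 y2, ..."; keep only x and y
        points = [p.strip().split()[:2] for p in trace.text.strip().split(",")]
        xs = [float(x) for x, _ in points]
        ys = [-float(y) for _, y in points]  # flip y so the expression is upright
        ax.plot(xs, ys, color="black", linewidth=2)
    ax.axis("off")
    fig.savefig(out_png, dpi=150, bbox_inches="tight")
    plt.close(fig)

render_inkml("expression.inkml", "expression.png")
```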


Due to the complexity of this project, we divided the recognition problem into two sub-categories: the Easy Case and the Hard Case. In an Easy Case, the challenge lies in recognizing an expression with a sequential pattern. We call any complex expression a Hard Case, as illustrated in Fig. 6; it needs further segmentation techniques to determine spatial relationships.


Fig. 3: Label Graph for Online HME recognition


3) Datasets

a.) ICDAR 2019 CROHME

i.) Train set: 8835 online images

ii.) Test set from 2012 competition: 489 online images

iii.) Test set from 2013 competition: 672 online images


b.) Kaggle HME symbols datasets: we downloaded 82 classes of symbols for labeling.


Fig. 4: ICDAR 2019 CROHME Handwritten Math Expression Image Datasets


4) Feature Extraction and Labeling

We used two features: a.) contour extraction and b.) skeletonization. Fig. 5 shows the result of extracting the regions of interest (ROIs) of each isolated character from an expression using the contour extraction and skeletonization techniques, respectively. We found skeletonization to be superior to contour extraction: it gave the best results both when extracting the ROIs and when recognizing the connected components in a complex expression. Fig. 5 (a) shows that the first isolated character "I" gets two bounding rectangles when extracting contours. In contrast, skeletonization (Fig. 5 (b)) correctly recognized the connected components.


Fig. 5: Feature Extraction using Contour Extraction and Skeletonization


Fig. 6: Example of Contour Extraction of a Compound Expression
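For readers who want to see what the two techniques look like in code, here is a minimal sketch, assuming OpenCV for contour extraction and connected components and scikit-image for skeletonization; the file name and thresholding choices are illustrative rather than the project's exact pipeline.

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize

# Load and binarize an expression image (hypothetical file name); foreground becomes white
gray = cv2.imread("expression.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# a.) Contour extraction: one bounding rectangle per external contour, so a broken
# stroke such as the "I" in Fig. 5 (a) can produce more than one ROI
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contour_rois = [cv2.boundingRect(c) for c in contours]

# b.) Skeletonization: thin each stroke to a one-pixel skeleton, then group the
# skeleton pixels into connected components and take their bounding boxes as ROIs
skeleton = skeletonize(binary > 0).astype(np.uint8)
num_labels, _, stats, _ = cv2.connectedComponentsWithStats(skeleton, connectivity=8)
skeleton_rois = [tuple(stats[i, :4]) for i in range(1, num_labels)]  # label 0 is background

print(len(contour_rois), "contour ROIs vs.", len(skeleton_rois), "skeleton ROIs")
```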

We found 428,644 class images in the Kaggle datasets. We used these datasets for labeling and applied the contour extraction technique to each class symbol. Afterward, we cropped the bounding rectangle for each symbol, resized it to 28 by 28, and reshaped it to 784 by 1. Our datasets were ready to feed into the CNN after labeling.
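As a concrete sketch of this labeling and preprocessing step, assuming the Kaggle symbols are stored in one folder per class (the directory layout, threshold, and normalization below are my assumptions, not the exact project code):

```python
import os
import cv2
import numpy as np

def preprocess_symbols(root_dir, size=28):
    """Crop each symbol image to its bounding box, resize to size x size, flatten to size*size x 1."""
    samples, labels = [], []
    class_names = sorted(os.listdir(root_dir))  # one sub-folder per symbol class
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root_dir, class_name)
        for file_name in os.listdir(class_dir):
            gray = cv2.imread(os.path.join(class_dir, file_name), cv2.IMREAD_GRAYSCALE)
            if gray is None:
                continue
            _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
            contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            if not contours:
                continue
            # Bounding rectangle around all contours of the symbol, then crop and resize
            x, y, w, h = cv2.boundingRect(np.vstack(contours))
            crop = cv2.resize(binary[y:y + h, x:x + w], (size, size))
            samples.append(crop.reshape(size * size, 1) / 255.0)  # 784 x 1 vector
            labels.append(label)
    return np.array(samples), np.array(labels)

X, y = preprocess_symbols("kaggle_symbols/")  # hypothetical dataset directory
```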

5) Result

We used the LeNet-5 CNN architecture, which consists of two sets of convolutional and max-pooling layers, followed by a flattening convolutional layer, then two fully-connected layers, and finally a softmax classifier. Fig. 7 shows a graphical view of the architecture. There were a couple of challenges we anticipated while training the CNN: in the preprocessing steps, many symbols were not cropped properly; the model was overfitting due to the imbalanced dataset across the 82 symbol classes; and the train set was not randomly shuffled when splitting off a 10% or 20% validation set.
Fig. 7: LeNet-5 CNN Architecture
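For reference, below is a minimal LeNet-5 sketch for 28 x 28 symbol images and 82 classes, written in PyTorch. The framework choice, activation functions, and exact layer sizes are my assumptions, and the softmax is folded into the cross-entropy loss rather than added as an explicit layer.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Two conv + max-pool blocks, then fully-connected layers and an 82-way classifier."""
    def __init__(self, num_classes=82):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 1 x 28 x 28 -> 6 x 28 x 28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 6 x 14 x 14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 16 x 10 x 10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 16 x 5 x 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),  # class scores; softmax applied by the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(32, 1, 28, 28)).shape)  # torch.Size([32, 82])
```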

Nevertheless, we fixed these issues, and our model achieved 89% accuracy on the validation set and 86% accuracy on the train set when trained for 50 epochs with a batch size of 32. Fig. 8 illustrates the performance of the model.
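Continuing the same sketch, a basic training loop under those settings (50 epochs, batch size 32, a shuffled 90% / 10% train/validation split) might look as follows; X and y stand for the preprocessed images (reshaped to N x 1 x 28 x 28 float tensors) and integer labels from the earlier step and are assumptions on my part, not the project's actual variables.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader, random_split

dataset = TensorDataset(X, y)  # X: N x 1 x 28 x 28 floats, y: N class ids
n_val = int(0.1 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])  # random 90/10 split

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)  # reshuffle every epoch
val_loader = DataLoader(val_set, batch_size=32)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
    print(f"epoch {epoch + 1}: validation accuracy = {correct / len(val_set):.3f}")
```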


Fig. 8: Performance of the LeNet-5 CNN model

6) Conclusion and Future Work

We tried two features, contour extraction and skeletonization, and further applied contour extraction to the 82 classes of symbols. The LeNet-5 model achieved 89% accuracy on the validation set and 86% accuracy on the train set. We also visualized, analyzed, and fixed the overfitting issue. In the future, we would feed the output of the CNN into an LSTM to learn sequential expressions. Further work also involves segmenting compound expressions using vertical and horizontal symbol segmentation techniques.

References

[1] Zanibbi, R., & Blostein, D. (2012). Recognition and retrieval of mathematical expressions. International Journal on Document Analysis and Recognition (IJDAR), 15(4), 331-357.

[2] Nazemi, A., Tavakolian, N., Fitzpatrick, D., & Suen, C. Y. (2019). Offline handwritten mathematical symbol recognition utilizing deep learning. arXiv preprint arXiv:1910.07395.

Acknowledgments

I am grateful to my advisor Dr. Jian Wu who encouraged me to apply for the AML Fellowship program. I am also honored and thankful to work with Dr. Diane Oyen, Dr. Kari Sentz, and a diverse and multi-disciplinary team at Los Alamos National Laboratory as a Research Intern.

-- Muntabir Choudhury


1 Approved for unlimited release by Los Alamos National Laboratory LA-UR-20-26662.
