2020-06-07: Regular Expression — A Powerful Tool to Parse Text with Visually Identifiable Patterns

In the previous blog post, I discussed how tesseract-OCR performed on scanned Electronic Theses and Dissertations (ETDs). As described there, the process started by converting the cover page of each scanned ETD into an image. Tesseract-OCR was then applied, and the extracted text was saved into text files. We also saw that OpenCV OCR failed on scanned ETDs. We could have tried a widely used open-source tool such as GROBID, which is designed for scholarly papers. However, this article shows that GROBID is intended for extracting bibliographic metadata from born-digital academic papers. We therefore decided to apply tesseract-OCR to extract the text from the cover pages of scanned ETDs. Afterward, a series of regular expressions (RegEx) was applied to extract seven metadata fields: titles, authors, degrees, academic programs, institutions, advisors, and years. In this post, I will show how RegEx can be a powerful tool for quickly parsing text with visible patterns.

The work introduced in this blog post is a supplement to our poster accepted by the 2020 Joint Conference on Digital Libraries (JCDL 2020). In this research, I implemented a heuristic model using RegEx to quickly parse the metadata from the cover page of each ETD. ETD cover pages are an excellent use case for RegEx because the majority of ETDs follow similar templates and have similar patterns (e.g., in 'by John Doe', the author's name John Doe follows 'by'). We chose RegEx because it is generally fast and well suited to capturing evident patterns. To evaluate the parsing result, we downloaded the corresponding ground truth (GT-metadata) of each ETD and compared the parsing result against it. The GT-metadata were semi-structured (XML and JSON), so I wrote XML and JSON parsers to extract the metadata from them. The results reached up to 97% accuracy across the seven metadata fields. To the best of our knowledge, this is the first work attempting to extract metadata from the cover pages of scanned ETDs, and it provides a strong baseline for the further development of learning-based methods.
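To make the evaluation step concrete, here is a minimal sketch of how a per-field accuracy could be computed once both the parsed metadata and the GT-metadata are available as Python dictionaries. The post does not spell out the exact matching criterion, so the exact-match comparison after lowercasing and whitespace normalization is an assumption, and the record layout, the ETD identifiers, and the function names are all hypothetical.

```python
def normalize(value):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(value.lower().split())


def field_accuracy(parsed, ground_truth, field):
    """Share of ETDs whose parsed value equals the ground-truth value for one field.

    Assumption: a prediction counts as correct only on an exact match
    after normalization; the original evaluation may have used another rule.
    """
    if not ground_truth:
        return 0.0
    correct = 0
    for etd_id, gt_record in ground_truth.items():
        predicted = parsed.get(etd_id, {}).get(field, "")
        expected = gt_record.get(field, "")
        if normalize(predicted) == normalize(expected):
            correct += 1
    return correct / len(ground_truth)


# Hypothetical records keyed by a made-up ETD identifier.
ground_truth = {
    "etd-001": {"author": "Jane Doe", "year": "1985"},
    "etd-002": {"author": "John Smith", "year": "1990"},
}
parsed = {
    "etd-001": {"author": "jane  doe", "year": "1985"},
    "etd-002": {"author": "J. Smith", "year": "1990"},
}

for field in ("author", "year"):
    print(field, field_accuracy(parsed, ground_truth, field))
```

Computing the score per field, rather than one overall number, makes it easy to see which metadata fields the rules handle well and which need more work.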

Figure-1: Year extraction using RegEx from MIT and Virginia Tech ETD Cover Page

Figure-2: Degree extraction using RegEx from MIT and Virginia Tech ETD Cover Page

As we saw in the previous blog post, tesseract-OCR produced a good result with only a few misspellings. To avoid OCR-generated errors, I manually corrected the misspellings (TXT-clean). I then wrote the RegEx and created rules to extract the metadata from TXT-clean. Figure-1 and Figure-2 illustrate the extraction of the 'year' and 'degree' fields, respectively, from ETD cover pages using RegEx. In these figures, the sample text at the top is from an MIT ETD, and the sample text at the bottom is from a Virginia Tech ETD. Below, I describe the RegEx written to extract each metadata field from TXT-clean. For a better demonstration, the hyperlink for each metadata field opens an interactive online tool that lets us experiment with the dataset and the RegEx.
  • Title: in both samples (MIT and Virginia Tech), the title occupies the first 4-5 lines of each ETD and is followed by a new line that starts with 'by'. In this case, we can write a RegEx like below:
    • Regular Expression: r'((.*\n){1,5})(?=by)'
      • (.*\n): '.*' is greedy and matches any characters except a line terminator, so it consumes the rest of the current line. The '\n' then matches the line break, so this group matches one complete line.
      • {1,5}: this quantifier repeats the preceding group between one and five times, so the expression matches from one to five consecutive lines.
      • (?=): this expression is called a positive lookahead. (?=by) asserts that the characters 'by' immediately follow the current position, without including them in the match. The matched lines therefore end right before the line that starts with 'by'.
  • Author: in both samples (MIT and Virginia Tech), the author's name follows the 'by' keyword and starts on a new line. In the MIT samples, a few strings such as "Thesis Supervisor" and "Chairman" are also extracted because the lines before them end with 'by'. I left this case for the Python code to handle.
    • Regular Expression: r'(?<=by\n)\w.+'
      • (?<=): this expression is called a positive lookbehind. (?<=by\n) asserts that the characters 'by' followed by a line break immediately precede the current position. If the assertion succeeds, the rest of the expression is matched from the start of the next line.
      • \w: it matches any word character [a-zA-Z0-9_].
      • .+: it is greedy and matches one or more characters except a line terminator, i.e., the rest of the line.
  • Degree: in both samples (MIT and Virginia Tech), the degree field follows 'degree of', either on the same line or on a new line. So, we can write a RegEx as below:
    • Regular Expression: r'(degree(s)? of)(\n?|.+?|\n)(\w.+|\w?.+\n.+)'
      • (degree(s)? of): this is the first matching group. It looks for the exact string 'degree of' or 'degrees of'.
      • (\n?|.+?|\n): this is the 2nd matching group. It covers whatever separates 'degree of' or 'degrees of' from the degree name: an optional line break (\n?) when the degree starts on a new line, or a few characters on the same line (.+?, which is lazy and matches as few characters as possible).
      • (\w.+|\w?.+\n.+): this is the 3rd matching group (counting only the top-level groups), and it captures the degree itself. The first alternative matches a word character [a-zA-Z0-9_] followed by the rest of the line; the second alternative lets the degree span two lines. As mentioned above, '.+' is greedy and matches any characters except a line terminator.
  • Academic Program: in both samples (MIT and Virginia Tech), the program field is preceded by 'department of' or 'in' and either follows a space or starts on a new line. To extract this metadata field, we can write a RegEx as below:
    • Regular Expression: r'(department(s)? of |in\n)(\w+[ ]?\w+[^,]?\w+[^,]?\w+[^,]?\w+[ ]?\w+)'
      • (department(s)? of |in\n): this is the first matching group. It looks for the exact string 'department of ' or 'departments of ' (with a trailing space), or 'in' followed by a line break.
      • \n: placing the newline character after 'in' in the first matching group means that the program name is expected on the line after the keyword 'in'.
      • (\w+[ ]?\w+[^,]?\w+[^,]?\w+[^,]?\w+[ ]?\w+): '\w+' matches one or more word characters. '[ ]?' matches an optional space between words, and '[^,]?' matches at most one character that is not a comma, so the match cannot extend across a comma. I wrote this complex expression based on the analysis of each sample dataset.
  • Institution: in both samples (MIT and Virginia Tech), the university name follows 'at the', 'at', 'from the', 'faculty of the', or 'faculty of'. The field either starts on a new line or follows a space. So, we can write a RegEx as below:
    • Regular Expression: r'(at the[\n| ]|from the\n|at[\n| ]|faculty of the[\n| ]|faculty of\n)(\w.+[\n]?\w+[ ]?\w+[ ]?\w+[ ]?\w+[ ]?\w+[^(Jan(uary)?|Sep(tember)?(\s\d{4})?|(,)?|author?|(fulfillment)?|laminates]\w+)'
      • (at the[\n| ]|from the\n|at[\n| ]|faculty of the[\n| ]|faculty of\n): this is the first matching group. It looks for the exact string 'at the', 'from the', 'at', 'faculty of the', or 'faculty of'. The character class '[\n| ]' after some of these strings checks whether the metadata field starts on a new line or after a space (inside square brackets, '|' is a literal character rather than an alternation operator).
      • (\w.+[\n]?\w+[ ]?\w+[ ]?\w+[ ]?\w+[ ]?\w+[^(Jan(uary)?|Sep(tember)?(\s\d{4})?|(,)?|author?|(fulfillment)?|laminates]\w+): I wrote this complex regular expression based on the analysis of each sample dataset. As we already know from above, '\w' matches any word character and '.+' is greedy. The bracketed part that begins with '[^' appears intended to keep trailing text such as a month, a four-digit year, or words like 'author' out of the institution name. Note that, because it is a negated character class, it matches a single character that is not one of the characters listed inside the brackets rather than excluding those whole words.
  • Year: in both samples (MIT and Virginia Tech), the year is preceded by a month. To extract this field, we can write a RegEx as below:
    • Regular Expression: r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)(,?)(\s\d{4})'
      • (Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?): this is the first matching group. It simply searches for the month, either abbreviated or spelled out.
      • (,?): this is the 2nd matching group. It matches an optional comma after the month.
      • (\s\d{4}): this 3rd matching group matches a whitespace character followed by a four-digit number.
  • Advisor: in both samples (MIT and Virginia Tech), the advisor field follows 'certified by', 'approved:', or 'approved by:' and usually starts on a new line. Thus, we can write a RegEx like below:
    • Regular Expression: r'(certified by\n?|approved:\n?|approved by:\n?)(\w[^,?].+)'
      • (certified by\n?|approved:\n?|approved by:\n?): this is the first matching group. It looks for the exact string 'certified by', 'approved:', or 'approved by:'. The optional '\n?' at the end of each alternative lets the expression match whether or not a line break separates the keyword from the advisor's name.
      • (\w[^,?].+): this is the 2nd matching group. As we already know from above, '\w' matches any word character. '[^,?]' matches one character that is neither a comma nor a question mark (inside square brackets, '?' is a literal character). Lastly, '.+' is greedy and matches the rest of the line.
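The patterns above can also be tried out directly in Python. The following is a minimal sketch that applies five of them (title, author, degree, year, and advisor) with the standard `re` module. The sample cover-page text is made up for illustration and is not taken from the dataset. The `re.IGNORECASE` flag, the `extract` helper, and the rule of taking the first 'by' match as the author are assumptions on my side, since the post-processing code is not shown here. The academic program and institution patterns are left out because they are tuned to the analyzed samples and need extra clean-up.

```python
import re

# Hypothetical cover-page text, written only for this illustration.
SAMPLE = (
    "A STUDY OF EXAMPLE SYSTEMS\n"
    "IN HYPOTHETICAL SETTINGS\n"
    "by\n"
    "Jane Doe\n"
    "Submitted to the Department of Electrical Engineering\n"
    "in partial fulfillment of the requirements for the degree of\n"
    "Master of Science\n"
    "at the\n"
    "Massachusetts Institute of Technology\n"
    "June 1985\n"
    "Certified by\n"
    "John Smith\n"
    "Thesis Supervisor\n"
)

# Patterns copied from the post.
TITLE = r'((.*\n){1,5})(?=by)'
AUTHOR = r'(?<=by\n)\w.+'
DEGREE = r'(degree(s)? of)(\n?|.+?|\n)(\w.+|\w?.+\n.+)'
YEAR = (r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?'
        r'|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)(,?)(\s\d{4})')
ADVISOR = r'(certified by\n?|approved:\n?|approved by:\n?)(\w[^,?].+)'

FLAGS = re.IGNORECASE  # assumption: keywords are matched regardless of case


def extract(text):
    """Apply each pattern once and collect the fields that were found."""
    fields = {}

    match = re.search(TITLE, text, FLAGS)
    if match:
        # Join the matched title lines into a single line.
        fields["title"] = " ".join(match.group(1).split())

    # Every line that follows a line ending in 'by' is a candidate.
    candidates = re.findall(AUTHOR, text, FLAGS)
    fields["author_candidates"] = candidates
    if candidates:
        fields["author"] = candidates[0]  # assumption: first hit is the author

    match = re.search(DEGREE, text, FLAGS)
    if match:
        # Group 4 in Python's numbering, because (s)? is also a capturing group.
        fields["degree"] = match.group(4).strip()

    match = re.search(YEAR, text, FLAGS)
    if match:
        fields["year"] = match.group(0)[-4:]  # the match ends with the four digits

    match = re.search(ADVISOR, text, FLAGS)
    if match:
        fields["advisor"] = match.group(2).strip()

    return fields


result = extract(SAMPLE)
for name, value in result.items():
    print(f"{name}: {value}")
```

On this sample, the author pattern returns two candidates, the author and the name after 'Certified by', which is the kind of extra match that has to be filtered out in the Python code afterward.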
In this blog post, I provided examples of using RegEx to extract seven metadata fields from ETD cover pages. I also described the performance of the RegEx and showed how a RegEx can be written to match a specific pattern by following some logical rules. The explanations above also showed how complex expressions can be built from simple elements such as '\w' (word character) and '.+' (greedy match) to capture a specific string.

Thank you for taking the time to read my blog. I hope this post will serve as a useful reference for the scientific community.

-- Muntabir Choudhury
