Posts

Showing posts with the label Digital Library

2020-06-07: Regular Expression — A Powerful Tool to Parse Text with Visually Identifiable Patterns

In the previous blog post, I discussed how tesseract-OCR performed on scanned Electronic Theses and Dissertations (ETDs). As described there, the process started by converting the cover pages of scanned ETDs into images. Tesseract-OCR was then applied, and the extracted text was saved to text files. We also saw that OpenCV OCR failed on scanned ETDs. We could have tried a widely used open-source tool such as GROBID, which is designed for scholarly papers. However, this article shows that GROBID is intended for extracting bibliographic metadata from born-digital academic papers. We therefore decided to apply tesseract-OCR to extract the text from the cover pages of scanned ETDs. Afterward, a series of regular expressions (RegEx) was applied to extract seven metadata fields, including titles, authors, academic programs, institutions, advisors, and years. In this blog post, I will introduce how RegEx can be a powerful tool for quickly parsing text with patterns.
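To make the idea concrete, here is a minimal sketch of how a few regular expressions might pull fields out of OCR text from a cover page. The sample text, the pattern names, and the `extract_fields` helper are all hypothetical and chosen for illustration; they are not the patterns used in the post, and real tesseract-OCR output is usually noisier than this.

```python
import re

# Hypothetical OCR output for a cover page (an assumption for illustration).
SAMPLE_COVER_TEXT = """\
A STUDY OF METADATA EXTRACTION
FROM SCANNED DOCUMENTS

by
Jane Q. Example

A Dissertation Submitted to the Faculty of
Example State University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

Advisor: Dr. John Sample

May 1987
"""

# Illustrative patterns only; each relies on a visually identifiable cue.
# A line containing only "by", followed by the author's name on the next line.
AUTHOR_RE = re.compile(
    r"^[ \t]*by[ \t]*\n[ \t]*(?P<author>[^\n]+)",
    re.IGNORECASE | re.MULTILINE,
)
# A line that starts with "Advisor:" (or "Adviser:").
ADVISOR_RE = re.compile(
    r"^[ \t]*Advis[oe]r[ \t]*:[ \t]*(?P<advisor>[^\n]+)",
    re.IGNORECASE | re.MULTILINE,
)
# The first line that contains the word "University".
INSTITUTION_RE = re.compile(
    r"^[ \t]*(?P<institution>[^\n]*\bUniversity\b[^\n]*)$",
    re.MULTILINE,
)
# A four-digit year between 1800 and 2099.
YEAR_RE = re.compile(r"\b(?P<year>1[89]\d{2}|20\d{2})\b")


def extract_fields(text):
    """Return a dict of the fields that the patterns above can find."""
    fields = {}

    author_match = AUTHOR_RE.search(text)
    if author_match:
        fields["author"] = author_match.group("author").strip()
        # Assume everything before the "by" line is the title.
        title = " ".join(text[:author_match.start()].split())
        if title:
            fields["title"] = title

    for name, pattern in (
        ("advisor", ADVISOR_RE),
        ("institution", INSTITUTION_RE),
        ("year", YEAR_RE),
    ):
        match = pattern.search(text)
        if match:
            fields[name] = match.group(name).strip()

    return fields


if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE_COVER_TEXT).items():
        print(f"{key}: {value}")
```

Each pattern keys on a layout cue that a human reader would also use, such as a standalone "by" line or an "Advisor:" label. Fields whose cue is missing are simply left out of the result rather than guessed.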

2020-05-28: Richard Pates (Computer Science PhD Student)

Welcome to my profile on Blogger! My name is Richard Pates, and I joined the Web Sciences and Digital Libraries (WS-DL) research group in the Department of Computer Science (CS) at Old Dominion University (ODU) during the Summer of 2020 as a PhD student in CS advised by Dr. Jian Wu. I am a member of the research team in the Lab for Applied Machine Learning and Natural Language Processing Systems (LAMP-SYS) Group, working on the Mining Electronic Theses and Dissertations (METD) Project. After I earned the Masters of Science in Computer Science (MSCS) from ODU during the Fall of 2018, approval was granted to join the PhD program in CS during the Spring of 2019, jointly advised by Dr. Ravi Mukkamala and Dr. Cong Wong, with an interest in Artificial Intelligence (AI), Cybersecurity, and Systems. This year, my main goal in the PhD program will be to advance as a PhD Candidate during the Fall of 2020 (Current Academic Calendar) having made the Doctoral Dissert

2020-05-19: OCR Tools Experiment on Scanned Electronic Theses and Dissertations (ETDs)

A thesis or dissertation is a type of scholarly work showing that a student pursuing higher education has successfully met part of the requirements for a degree. An electronic thesis or dissertation can be found in either a university's electronic theses and dissertations (ETDs) digital library or ProQuest (a third-party ETD repository). ETDs contain rich metadata that can be used to search for ETDs in a repository. However, not all ETD metadata are available, so it is necessary to extract metadata from the ETDs themselves. Extracting metadata can be challenging, especially when the ETDs are scanned documents. Although many open-source tools perform satisfactorily on certain types of documents, experiments indicate that they tend to produce unacceptable errors or fail on scanned ETDs. In this blog post, I introduce one of the widely used optical character recognition (OCR) tools called tesseract-OCR and show how tesseract-OCR performs on scann