 |
Table 1 Soper et al.: Classification of common OCR error types, their descriptions, and representative examples. These categories highlight challenges in text recognition systems, such as over-segmentation of words into multiple parts, merging of separate words (under-segmentation), character misinterpretations, omissions, and the generation of spurious or nonsensical content. |
Optical Character Recognition (OCR) is pivotal in various applications, including online education, industrial automation, robotics, and more. The technology is designed to extract text from different image sources, ranging from scanned documents to text embedded in complex, real-world environments. However, despite its widespread adoption, OCR continues to face significant challenges due to inaccuracies in its output, as shown in Table 1 and Table 2.
 |
Table 2 Soper et al.: Detailed examples of OCR error types in text recognition. The table illustrates five categories of errors—over-segmentation, under-segmentation, misrecognized characters, missing characters, and hallucination—along with source text, the OCR prediction, and the intended target text. These examples emphasize the nuances of text recognition challenges and their impact on maintaining textual integrity in automated systems. |
In my IUI 2023 paper, "AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind Users", I addressed the challenge of extracting data from web pages. To achieve this, I utilized Mask R-CNN to identify regions of interest on web pages and capture corresponding image snapshots. These snapshots were then processed with the Tesseract OCR engine to extract text. However, due to the low quality of some snapshots, the OCR output often failed to match the ground-truth text accurately, despite the correct identification of regions of interest. To address this limitation, I explored potential solutions to mitigate these inaccuracies.
Exploring Approaches to Spelling Autocorrection and OCR Post-Correction
I began by delving into foundational methods for spelling correction, starting with Peter Norvig's seminal article, "How to Write a Spelling Corrector". This article provides an excellent introduction to spelling autocorrection, offering a Python-based implementation that combines a probabilistic language model (derived from word frequencies in a corpus) with an error model (based on edit distances) to predict the most likely corrections for misspelled words. With an accuracy of approximately 68–75% on test datasets, the approach balances simplicity and effectiveness. It also outlines potential enhancements, such as refining error models, incorporating context-sensitive corrections, and improving dictionary robustness. The article's Further Reading section offers many resources, forming a solid foundation for deeper exploration in this field.
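Norvig's corrector fits in a few dozen lines. The sketch below is adapted to a toy corpus (the corpus here is illustrative; his original trains on word frequencies from a large text file) and shows the core idea: generate all candidates one edit away, keep the ones that appear in the corpus, and pick the most frequent.

```python
import re
from collections import Counter

# Toy corpus; Norvig's original builds WORDS from a large text file.
CORPUS = "the quick brown fox jumps over the lazy dog the fox ran"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def P(word):
    """Unigram probability of a word under the corpus language model."""
    return WORDS[word] / sum(WORDS.values())

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of candidate strings that actually occur in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Most probable correction: prefer the word itself, then 1-edit candidates."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=P)
```

For example, `correction("teh")` recovers "the" via a transposition. Norvig's full version also considers candidates two edits away.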
 |
Figure 1 Hládek et al.: The process of correction-candidate generation and error correction. The diagram illustrates the flow from an incorrect word to candidate proposals, ranking of candidates, error correction, and alignment with the intended word and context, emphasizing the interaction between error production and correction mechanisms. |
To broaden my understanding, I also reviewed the survey paper "Survey of Automatic Spelling Correction", which comprehensively examines spelling correction research from 1991 to 2019. The survey categorizes methodologies into three primary groups: rule-based systems, context-aware systems, and systems utilizing learned error models. Grounded in Shannon’s noisy channel framework, it details the essential components of spelling correction systems, including dictionaries, error models, and context models. The survey highlights the iterative process of spelling correction, as shown in Figure 1, where error production leads to candidate proposals, ranking of candidates, and subsequent correction to align with the intended word and context. The paper also explores techniques such as edit distances, phonetic algorithms, and machine learning-based approaches while providing benchmarks for evaluating performance across languages and application domains.
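Shannon's noisy-channel framing, which the survey is organized around, scores each candidate c for an observed word x by combining a language model P(c) with an error model P(x | c) and picking the maximum. A minimal sketch, assuming a toy frequency table and a crude per-edit error penalty (both hypothetical):

```python
import math

# Toy word frequencies standing in for a language model (hypothetical counts).
FREQ = {"form": 50, "from": 200, "farm": 10}
TOTAL = sum(FREQ.values())

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def noisy_channel_score(observed, candidate, alpha=0.1):
    """log P(candidate) + log P(observed | candidate).
    Error model: each edit costs log(alpha) -- a deliberately crude stand-in
    for the learned error models the survey describes."""
    lm = math.log(FREQ[candidate] / TOTAL)
    err = levenshtein(observed, candidate) * math.log(alpha)
    return lm + err

def best_correction(observed):
    """Rank all dictionary words under the noisy-channel score."""
    return max(FREQ, key=lambda c: noisy_channel_score(observed, c))
```

Given "frm", all three candidates are one edit away, so the language model breaks the tie in favor of the more frequent "from" -- exactly the interaction between error model and context that Figure 1 depicts.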
 |
Figure 2 ICDAR2019: Overview of the OCR post-correction challenge. The task involves identifying and correcting errors in OCR-processed text to align it with a gold standard, using historical documents as the dataset. |
I also reviewed two research-level competitions organized by the International Conference on Document Analysis and Recognition (ICDAR) that focused on OCR post-correction. The first competition, held in 2017, highlighted the dominance of statistical and neural machine translation approaches: the winning team, Amrhein and Clematide, employed an ensemble of character-based statistical and neural machine translation models to achieve top performance. The second competition, conducted in 2019, showcased advancements in transformer-based language models. Clova AI, the winning team, leveraged BERT embeddings to train a CNN classifier for error detection, followed by a character-level sequence-to-sequence (biLSTM) model for correction. These competitions illustrate the evolving landscape of OCR post-correction methodologies, transitioning from traditional statistical models to transformer-enhanced frameworks.
 |
Figure 3 Whitelaw et al.: Overview of the spelling correction process and associated knowledge sources. |
Next, while exploring research papers in the general spelling-correction domain, I found an insightful, well-written paper titled "Using the Web for Language-Independent Spellchecking and Autocorrection" from Google. This paper introduces a language-agnostic spellchecking and autocorrection system that leverages the web as a noisy corpus, eliminating the need for manually annotated training data, as shown in Figure 3. The system detects and corrects errors by employing statistical error models, n-gram language models, and confidence classifiers trained on web frequency data and artificially misspelled news texts, including real-world substitutions. The approach is adaptable to multiple languages, demonstrating superior accuracy compared to traditional dictionary-based methods and promising results for English, German, Russian, and Arabic.
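The context models in this line of work can be approximated with a smoothed n-gram score: given the word to the left, rank the correction candidates by how well they fit. A minimal bigram sketch (the corpus, add-one smoothing, and vocabulary size are all illustrative, not the paper's setup):

```python
import math
from collections import Counter

# Toy corpus standing in for web-scale n-gram counts.
CORPUS = "the fox ran to the farm and the fox ran from the dog".split()
unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))

def bigram_logp(prev, word, vocab_size=1000):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def rank_in_context(prev_word, candidates):
    """Pick the candidate correction that best fits the left context."""
    return max(candidates, key=lambda c: bigram_logp(prev_word, c))
```

With "the" as left context, "fox" beats an equally edit-close candidate like "fix" because the bigram ("the", "fox") is attested -- the same context signal that lets such systems handle real-word errors a dictionary lookup would miss.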
 |
Figure 4 Rijhwani et al.: Examples of scanned documents in endangered languages accompanied by translations. (a) Ainu text with Japanese translation, (b) Griko text with Italian translation, (c) Yakkha text with translations in Nepali and English, and (d) handwritten Shangaji text with typed English glosses. |
Following this, I refined my search to focus on research papers tailored to OCR post-correction. One standout paper, "OCR Post-Correction for Endangered Language Texts", serves as an excellent starting point by addressing the challenges of extracting text from scanned books in endangered languages, as shown in Figure 4. The study introduces a benchmark dataset for three critically endangered languages—Ainu, Griko, and Yakkha—and evaluates OCR performance using Google Vision AI. The proposed multi-source encoder-decoder model incorporates innovations like diagonal attention loss, a copy mechanism, and pretraining on unannotated data. This approach significantly reduces the character error rate (CER) and word error rate (WER) by an average of 34% compared to baseline systems, showcasing its potential for preserving low-resource languages.
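CER and WER, the two metrics reported throughout this literature, are both edit distance normalized by reference length, computed over characters and over word tokens respectively. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```

A single wrong character in "hella world" gives a CER of 1/11 but a WER of 1/2, which is why papers usually report both.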
Video 1: Presentation at JCDL 2020 on a Post-OCR Correction Approach Using Neural Machine Translation and Contextual Language Model BERT.
To focus on English-specific OCR post-correction, I found the paper "Neural Machine Translation with BERT for Post-OCR Error Detection and Correction" particularly relevant; I later leveraged it in the AutoDesc pipeline. See Video 1 for a detailed presentation on the method and its application in improving OCR accuracy.
This study combines a pre-trained BERT model with character-level neural machine translation (NMT): contextual embeddings drive error detection, while augmented training data and techniques like length-difference filtering improve correction. Evaluated on the ICDAR 2017 and 2019 datasets, the approach demonstrates competitive performance in detecting and correcting both real-word and non-word errors, providing a robust solution for enhancing OCR quality in English-language text.
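Length-difference filtering amounts to a simple pruning step: discard any candidate correction whose length strays too far from the OCR token, since genuine OCR errors rarely change a word's length drastically. The threshold below is illustrative, not the paper's exact value:

```python
def length_filter(source_token, candidates, max_diff=2):
    """Keep only candidate corrections whose length differs from the
    OCR token by at most max_diff characters (threshold hypothetical)."""
    return [c for c in candidates
            if abs(len(c) - len(source_token)) <= max_diff]
```

For an OCR token like "recieve", this keeps plausible rewrites such as "receive" while discarding degenerate NMT outputs that are far too short or long.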
Finally, with the rise of Large Language Models post-2023, I see immense potential for accelerating this field. LLMs are exceptionally suited for OCR post-correction tasks because they can understand and generate coherent, context-aware text. For domain-specific documents like legal or scientific texts, fine-tuning LLMs on relevant datasets allows them to adapt to specialized terminology and correct errors that other systems might overlook. Additionally, LLMs can be seamlessly integrated into OCR workflows as intelligent post-processors, refining text outputs without requiring significant changes to existing pipelines. Their multilingual capabilities enhance their applicability, making them ideal for correcting diverse and mixed-language OCR outputs.
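As a sketch of that post-processor idea, the function below wraps raw OCR output in an instruction prompt that could be sent to any LLM; the wording and constraints are my own illustration, not taken from any of the papers above:

```python
def build_correction_prompt(ocr_text):
    """Assemble an instruction prompt for an LLM post-processor.
    The instructions constrain the model to fix recognition errors only,
    to limit the hallucination risk noted for generative correctors."""
    return (
        "Correct the OCR errors in the text below. Preserve the original "
        "wording, punctuation, and line breaks; fix only recognition errors. "
        "Return the corrected text and nothing else.\n\n"
        f"OCR text:\n{ocr_text}\n\nCorrected text:"
    )
```

The appeal of this design is that it slots into an existing pipeline as a pure post-processing stage: the OCR engine is untouched, and the prompt alone encodes the correction policy.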
 |
Table 3 Thomas et al.: Llama 2 13B Model Performance in Correcting Diverse OCR Error Types. |
 |
Table 4 Thomas et al.: Comparative Analysis of Model Performance Across OCR Error Type. |
An illustrative example of this potential is the 2024 paper "Leveraging LLMs for Post-OCR Correction of Historical Newspapers". This study explores the use of instruction-tuned Llama 2 for post-OCR correction of 19th-century British newspapers, achieving a 54.51% CER reduction compared to BART's 23.30%, as seen in Table 4. By employing a prompt-based approach, Llama 2 effectively corrects common OCR errors like substitutions and deletions while adapting to historical OCR data's noisy, context-dependent nature, as seen in Table 3. Remarkably, this efficiency is achieved with limited training data, contrasting with the data-hungry requirements of traditional sequence-to-sequence (seq2seq) models. This body of research underscores the transformative potential of combining traditional approaches with modern advancements like LLMs to push the boundaries of spelling autocorrection and OCR post-correction.
In conclusion, spelling autocorrection and OCR post-correction remain vital research areas, transitioning from foundational probabilistic models to state-of-the-art neural and transformer-based techniques. Recent progress with advanced LLMs like Llama 2 has significantly enhanced accuracy and scalability across various languages and domains. Integrating traditional OCR approaches with modern correction technologies enables developing more robust, context-aware, and adaptable systems.
References:
- Peter Norvig’s spelling corrector: https://norvig.com/spell-correct.html
- Hládek, D., Staš, J. and Pleva, M., 2020. Survey of automatic spelling correction. Electronics, 9(10), p.1670.
- Chiron, G., Doucet, A., Coustaty, M. and Moreux, J.P., 2017, November. ICDAR2017 competition on post-OCR text correction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (Vol. 1, pp. 1423-1428). IEEE.
- Rigaud, C., Doucet, A., Coustaty, M. and Moreux, J.P., 2019, September. ICDAR 2019 competition on post-OCR text correction. In 2019 international conference on document analysis and recognition (ICDAR) (pp. 1588-1593). IEEE.
- Whitelaw, C., Hutchinson, B., Chung, G. and Ellis, G., 2009, August. Using the web for language independent spellchecking and autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (pp. 890-899).
- Rijhwani, S., Anastasopoulos, A. and Neubig, G., 2020. OCR post correction for endangered language texts. arXiv preprint arXiv:2011.05402.
- Nguyen, T.T.H., Jatowt, A., Nguyen, N.V., Coustaty, M. and Doucet, A., 2020, August. Neural machine translation with BERT for post-OCR error detection and correction. In Proceedings of the ACM/IEEE joint conference on digital libraries in 2020 (pp. 333-336).
- Thomas, A., Gaizauskas, R. and Lu, H., 2024, May. Leveraging LLMs for Post-OCR Correction of Historical Newspapers. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA)@ LREC-COLING-2024 (pp. 116-121).
- Soper, E., Fujimoto, S. and Yu, Y.Y., 2021, November. BART for post-correction of OCR newspaper text. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021) (pp. 284-290).
I am a heavy user of OCR on low-quality scans of 19th c. books retrieved from archive.org and newspaper articles retrieved from the British Newspaper Archive. I have been using claude.ai to support text extraction, generally prompting it to retrieve the text "as faithfully as possible". Anecdotally: a) it seems to have improved over the last few months; b) it is still unreliable with names (places, people) and numerics (numbers, financial amounts, dates), i.e. fact-y things. I have even seen errors where a written number has been changed to another number and then written out (e.g. three became four, presumably because the genAI predictor was working in numeric space and predicted wrong?).