2021-12-31: Installing Several Open Source and Commercial Optical Character Recognition (OCR) Tools on a PC

Optical Character Recognition (OCR) tools are used for extracting text from images. There are many off-the-shelf OCR tools we can choose from. In a previous blogpost, I compared the performance of several open-source and commercial OCR tools. I’d like to go further and summarize the installation of these tools. In this blogpost, I will talk about the installation of Tesseract, Abbyy, Amazon Textract, and Google Cloud Vision.

Tesseract:

Tesseract is a free software package that accepts a wide range of image formats, such as JPEG, PNG, TIFF, and BMP. The installation on a Windows 10 system is as follows:

Step 1:

Download the Tesseract executable file tesseract.exe from this website. Double-click the file and it will guide you through the installation.

Step 2:

Download the language package into the tessdata folder inside the installation directory of the Tesseract executable. The package must be compatible with your version of tesseract.exe. This is the language package for Tesseract version 4.0. You can download tessdata as a whole or download a single language file, such as fra.traineddata.
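To confirm which language packages ended up in the installation, one option is to list the .traineddata files from Python. This is a minimal sketch; the tessdata path shown is an assumption (a common default on Windows), so adjust it to your own installation directory.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


if __name__ == "__main__":
    # Assumed location; change it to wherever tesseract.exe was installed.
    tessdata = Path(r"C:\Program Files\Tesseract-OCR\tessdata")
    print(installed_languages(tessdata))
```

If fra.traineddata was downloaded correctly, "fra" should appear in the printed list.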

Step 3:

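You can then run Tesseract from the Windows command prompt. The command is:

tesseract image_file output_base

Tesseract appends the .txt extension to the output name itself. For example, the following command writes the recognized text to output.txt:

tesseract USD0872981-20200121-D00003.TIF output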
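If you want to run the same command over many images, it can be wrapped in a small Python script. The sketch below only uses the standard library; the function names are my own, and it assumes tesseract.exe is on the PATH. The -l option selects a language package such as fra.

```python
import subprocess


def build_tesseract_command(image_file, output_base, lang=None):
    """Build the argument list for the tesseract command-line tool."""
    command = ["tesseract", str(image_file), str(output_base)]
    if lang:
        # -l selects the language package, e.g. "fra" for fra.traineddata
        command += ["-l", lang]
    return command


def run_tesseract(image_file, output_base, lang=None):
    """Run tesseract on one image; raises CalledProcessError on failure."""
    subprocess.run(build_tesseract_command(image_file, output_base, lang), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_tesseract(...) to actually execute it.
    print(build_tesseract_command("USD0872981-20200121-D00003.TIF", "output", lang="fra"))
```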

For more complicated applications, you can import tesseract as a module in Python. Install the package with this command:

pip install pytesseract

Because pytesseract is a wrapper around tesseract.exe, you need to choose a version of pytesseract that is compatible with the tesseract.exe you installed earlier. Check the versions carefully; a mismatch will cause errors.
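As a rough illustration of how pytesseract is typically called, here is a minimal sketch. The helper names are hypothetical, and it assumes that pytesseract and Pillow are installed; the tesseract_cmd argument is only needed when tesseract.exe is not on the PATH.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the tesseract executable, or None if it is not on the PATH."""
    return shutil.which(executable)


def image_to_text(image_path, lang="eng", tesseract_cmd=None):
    """Extract text from one image with pytesseract."""
    # Imported here so that find_tesseract works even without these packages.
    import pytesseract
    from PIL import Image

    if tesseract_cmd:
        # Point pytesseract at tesseract.exe when it is not on the PATH.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling image_to_text("USD0872981-20200121-D00003.TIF") would then return the recognized text as a string, provided find_tesseract() does not return None.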

Abbyy:

Abbyy FineReader is a commercial software package developed by the company ABBYY (not Adobe). It accepts many image formats, such as JPEG, PNG, TIFF, and GIF, as well as PDF documents. An incomplete list of accepted formats is shown below:

Figure 1: Formats accepted by Abbyy

Abbyy offers two versions, and neither of them is free. The first is the desktop OCR editor, which is part of FineReader 15 and is designed for working with images and PDF files. The desktop version has a free trial of 7 days or 100 pages.

The second version is an API named Abbyy Cloud OCR SDK. It offers a free trial of 90 days or 500 pages. The price of the API is shown below:

Figure 2: The price of the API for Abbyy

The installation of the Abbyy Cloud OCR SDK is straightforward, and you can follow the instructions in this tutorial.

If you can't see the tutorial, you need to sign up and sign in first.

Amazon Textract API:

The Amazon Textract API accepts PNG, JPEG, TIFF, and PDF as input formats. It offers a free trial as well: for the first three months after account sign-up, new customers can analyze up to 1,000 pages per month using the Detect Document Text API. Beyond that, or for other services, it is not free, and the price is as follows:

Figure 3: The price of Amazon Textract API

This tutorial explains how to use the Amazon Textract API.

You can also try the API in the demonstration console, which is more straightforward.
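For readers who prefer to call the API from code, the following is a minimal sketch using boto3. It assumes that boto3 is installed and that AWS credentials are already configured; the region name and the helper names are placeholders of mine, not values from the tutorial.

```python
def extract_lines(response):
    """Collect the text of every LINE block from a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_text(image_path, region_name="us-east-1"):
    """Send one local image to Amazon Textract and return its lines of text."""
    import boto3  # imported lazily; needs `pip install boto3` and AWS credentials

    client = boto3.client("textract", region_name=region_name)
    with open(image_path, "rb") as handle:
        response = client.detect_document_text(Document={"Bytes": handle.read()})
    return extract_lines(response)
```

The response also contains PAGE and WORD blocks; the sketch keeps only LINE blocks so that each returned item is one line of text.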

Google Cloud Vision:

Google Cloud Vision demonstrated the best performance among all the OCR tools I tried. It accepts several formats, such as PNG, JPEG, TIFF, and PDF.

The following three links give different ways to use this OCR module:

1. Detect text in images

2. Detect text in files (PDF/TIFF)

3. Optical Character Recognition (OCR translation)

All three applications use the “Google vision” module as the core function but are installed and implemented separately. They serve different purposes: “Detect text in images” is used for regular image files and “Detect text in files” is for PDF/TIFF files. "OCR translation" provides text detection and translation in one app. It accepts regular image files, such as PNG and JPEG.
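To give a feel for the simplest of the three, text detection on a regular image, here is a minimal sketch with the Python client library. It assumes the google-cloud-vision package is installed and that authentication is already set up; the helper names are hypothetical.

```python
def first_description(annotations):
    """Return the full detected text (the first annotation), or "" if there is none."""
    return annotations[0].description if annotations else ""


def detect_text_in_image(image_path):
    """Run text detection on a local image with Google Cloud Vision."""
    from google.cloud import vision  # needs `pip install google-cloud-vision`

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as handle:
        image = vision.Image(content=handle.read())
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return first_description(response.text_annotations)
```

The first annotation holds the whole detected text; the remaining annotations describe individual words and their bounding boxes.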

Installation instructions for the three applications are provided in the links above. Installing and using "OCR translation" is non-trivial for novice users, so I will cover it in detail below.

Installation of the Google Optical Character Recognition (OCR translation) API

In the rest of this blog, I will talk about the installation of the Optical Character Recognition API in detail. The example below builds on this tutorial.

Stage 1: Create Account and Prepare Google Cloud Storage

First, follow steps 1 through 5 in the “Before you begin” section of the tutorial linked above. Steps 1-4 are easy; just go through them one by one.

Then proceed to step 5. You have to link your credit card to the billing account on Google Cloud Platform.

This is how the account looks:

Then create the required Python environment.

Next, install the Cloud Client Libraries:

pip install --upgrade google-cloud-storage
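A quick way to check that the library landed in the Python environment you just created is to query its installed version. This sketch uses only the standard library (Python 3.8 or later); the function name is my own.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("google-cloud-storage:", installed_version("google-cloud-storage"))
```

If this prints None, the package was probably installed into a different environment than the one currently active.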

Stage 2: Install Cloud SDK and Initialize It

Next, install the Cloud SDK. Follow the tutorial on installing the Cloud SDK on Windows.

After the installation has completed and you click Finish, the installer starts a terminal window, where you need to enter this command:

gcloud init

In the future, each time you need to start using the SDK, enter the same command again.

Select “y”:

Pick “1”:

Stage 3: Set Up Authentication and Save Credential Files Locally

Next, you have to set up authentication.

There are various ways to authenticate, and in our case we just need the simplest one:

gcloud auth application-default login

Enter this in the command window and it will open a link in your browser. Proceed until you see this:
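Once the login flow has finished, the credentials are saved to a local JSON file that the client libraries pick up automatically. The sketch below shows where that file is expected by default and how to load it; the paths reflect the documented default locations as I understand them and are worth verifying for your SDK version. It ignores overrides such as the GOOGLE_APPLICATION_CREDENTIALS environment variable, and the function names are hypothetical.

```python
import ntpath
import os
import posixpath

ADC_FILE = "application_default_credentials.json"


def default_adc_path(os_name=None, environ=None):
    """Return the default location of the Application Default Credentials file."""
    os_name = os.name if os_name is None else os_name
    environ = os.environ if environ is None else environ
    if os_name == "nt":
        # Windows: %APPDATA%\gcloud\application_default_credentials.json
        return ntpath.join(environ.get("APPDATA", ""), "gcloud", ADC_FILE)
    # Linux/macOS: $HOME/.config/gcloud/application_default_credentials.json
    return posixpath.join(environ.get("HOME", "~"), ".config", "gcloud", ADC_FILE)


def load_default_credentials():
    """Load the credentials the client libraries would use, plus the project ID."""
    import google.auth  # installed as a dependency of google-cloud-storage

    credentials, project_id = google.auth.default()
    return credentials, project_id


if __name__ == "__main__":
    path = default_adc_path()
    print(path, "exists" if os.path.exists(path) else "is missing")
```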

Stage 4: Set up Google Pub/Sub and Define the Extraction Pipeline

Download the code in this GitHub directory to your local machine to build the Pub/Sub pipeline:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/functions/ocr/app/

Then change to the directory that contains the Cloud Functions sample code:

cd functions/ocr/app

Next, set up an input bucket and output bucket in Cloud Storage (this can be found in the 'Preparing the application' part):

gsutil mb gs://YOUR_IMAGE_BUCKET_NAME

gsutil mb gs://YOUR_RESULT_BUCKET_NAME

Then set up Pub/Sub topics:

gcloud pubsub topics create YOUR_TRANSLATE_TOPIC_NAME

gcloud pubsub topics create YOUR_RESULT_TOPIC_NAME

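Next, run the following commands one by one to deploy the functions and define the pipeline (this can be found in the 'Deploying the functions' part). Because the TO_LANG value itself contains commas, the first command uses gcloud's alternate-delimiter syntax (the ^:^ prefix) so that the environment variables are separated by colons instead of commas:

gcloud functions deploy ocr-extract --runtime python37 --trigger-bucket YOUR_IMAGE_BUCKET_NAME --entry-point process_image --set-env-vars "^:^TRANSLATE_TOPIC=YOUR_TRANSLATE_TOPIC_NAME:TO_LANG=es,en,fr,ja"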

gcloud functions deploy ocr-translate --runtime python37 --trigger-topic YOUR_TRANSLATE_TOPIC_NAME --entry-point translate_text --set-env-vars "RESULT_TOPIC=YOUR_RESULT_TOPIC_NAME"

gcloud functions deploy ocr-save --runtime python37 --trigger-topic YOUR_RESULT_TOPIC_NAME --entry-point save_result --set-env-vars "RESULT_BUCKET=YOUR_RESULT_BUCKET_NAME"
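The three deploy commands differ only in the function name, trigger, entry point, and environment variables, so they can be generated from one helper. The sketch below is illustrative and the helper names are my own. It formats the environment variables with gcloud's ^DELIM^ escaping syntax (see gcloud topic escaping), which keeps a value that contains commas, such as TO_LANG, from being split into separate variables.

```python
def format_env_vars(env_vars, delimiter=":"):
    """Format a dict for --set-env-vars, using gcloud's ^DELIM^ escaping syntax."""
    for key, value in env_vars.items():
        if delimiter in key or delimiter in str(value):
            raise ValueError(f"delimiter {delimiter!r} appears in {key}={value}")
    pairs = delimiter.join(f"{key}={value}" for key, value in env_vars.items())
    return f"^{delimiter}^{pairs}"


def build_deploy_command(name, entry_point, trigger_flag, trigger_value,
                         env_vars, runtime="python37"):
    """Build the argument list for one `gcloud functions deploy` call."""
    return [
        "gcloud", "functions", "deploy", name,
        "--runtime", runtime,
        trigger_flag, trigger_value,
        "--entry-point", entry_point,
        "--set-env-vars", format_env_vars(env_vars),
    ]


if __name__ == "__main__":
    command = build_deploy_command(
        "ocr-extract", "process_image",
        "--trigger-bucket", "YOUR_IMAGE_BUCKET_NAME",
        {"TRANSLATE_TOPIC": "YOUR_TRANSLATE_TOPIC_NAME", "TO_LANG": "es,en,fr,ja"},
    )
    print(" ".join(command))
```

The resulting list can be passed to subprocess.run, which avoids shell quoting issues on Windows.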

Use the following command to upload a figure to Google cloud storage:

gsutil cp figure_path Storage-path

For example:

gsutil cp D:\research\USD0872981-20200121-D00003.TIF gs://YOUR_IMAGE_BUCKET_NAME

You will see it appear in the first bucket ‘YOUR_IMAGE_BUCKET_NAME’. The second bucket ‘YOUR_RESULT_BUCKET_NAME’ is for the outputs. Once you upload the figure to the image bucket, the pipeline is triggered automatically and the output appears in the result bucket shortly afterwards.
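The upload and the retrieval of results can also be done from Python with the google-cloud-storage library installed in Stage 1. This is a minimal sketch that assumes authentication is already set up; the function names are hypothetical and the bucket names are the same placeholders used above.

```python
from pathlib import PureWindowsPath


def object_name_for(local_path):
    """Use the file name of a local path (Windows or POSIX style) as the object name."""
    return PureWindowsPath(local_path).name


def upload_image(local_path, image_bucket):
    """Upload one local image to the image bucket and return the object name."""
    from google.cloud import storage  # installed in Stage 1

    client = storage.Client()
    blob = client.bucket(image_bucket).blob(object_name_for(local_path))
    blob.upload_from_filename(local_path)
    return blob.name


def read_results(result_bucket):
    """Return a dict mapping each object in the result bucket to its text content."""
    from google.cloud import storage

    client = storage.Client()
    return {blob.name: blob.download_as_text() for blob in client.list_blobs(result_bucket)}
```

For example, upload_image(r"D:\research\USD0872981-20200121-D00003.TIF", "YOUR_IMAGE_BUCKET_NAME") followed, a little later, by read_results("YOUR_RESULT_BUCKET_NAME") would return whatever text files the pipeline has written so far.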

Finally, we can delete the functions as follows if we no longer need to use Google Cloud Vision:

gcloud functions delete ocr-extract

gcloud functions delete ocr-translate

gcloud functions delete ocr-save

If the functions are not deleted, you can upload a figure to Google storage at any time and obtain a result. Once they are deleted, this will not work.

-- Xin Wei


Web Science and Digital Libraries Research Group
