2020-12-02: Comparing Four OCR Tools on US Patent Figure Label Recognition

The task is to extract labels from US patent figures. Patent figures differ from natural images: they are usually drawings of an object or diagrams such as circuits. A figure file may contain one or more figures, each of which has a label. We need a software tool that can reliably identify figure labels. All figures are downloaded from the USPTO patent repository in TIF format.


In the following experiments, I use OCR tools to extract figure labels, feeding the whole figure file as input. The candidates I compare are Tesseract, Abbyy, the Amazon Textract API, and the Google Cloud Vision API. Below are figure samples and my comments.


Figure #1



Figure #1 is a standard type of figure with one drawing and one label.


Figure #2


Figure #2 represents figures with multiple drawings and labels. We need to extract both labels. The dotted lines at the bottom of the outsole may be mistaken for words.

Figure #3


Figure #3 represents more abstract drawings with numbers and letters in them.

Figure #4


Figure #4 also represents more abstract drawings with numbers and letters in them.

Figure #5


Figure #5 represents figures with multiple drawings and labels. Each label is close to its drawing.


Figure #6


Figure #6 represents figures with multiple drawings and labels. The labels fall inside the bounding boxes of the drawings.

Figure #7




Figure #7 represents figures with multiple drawings and labels.


Tesseract and Abbyy handle the figures in similar ways. They give correct figure labels for images with only one drawing and one figure label (Figure #1 is an example). But even this is not very robust: for a small portion of inputs in this format, they return empty output. For images with two or more figure labels (Figure #2 is an example), the output contains only one figure label or nothing at all. They also fail on the more abstract drawings with numbers or words in them (Figure #3 and Figure #4). Since images with two or more figure labels are the most important part of our task, this is not acceptable.
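For reference, below is a minimal sketch of how Tesseract can be run on a single figure file. It assumes the Tesseract engine plus the pytesseract and Pillow packages are installed; the file name fig1.tif is only a placeholder, not one of the actual samples above.

# Minimal sketch: run Tesseract on one patent figure (TIF) via pytesseract.
# Assumes the Tesseract binary and the pytesseract/Pillow packages are installed.
# "fig1.tif" is a placeholder file name.
from PIL import Image
import pytesseract

image = Image.open("fig1.tif")
text = pytesseract.image_to_string(image)
print(text)  # ideally this contains only the figure label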

The next candidate is the Amazon Textract API. It works much better than the previous two. The first improvement is that it correctly identifies almost all images with only one figure label, failing in only a few cases for no apparent reason. The second improvement is that it can recognize images with two or more figure labels (Figure #5 and Figure #6). But it still fails on some others (Figure #7), producing only noise or nothing at all; the reason for these failures is not clear. It also fails on the more abstract drawings with numbers or words in them (Figure #3 and Figure #4).
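A rough sketch of a Textract call is below, assuming boto3 with AWS credentials already configured and Pillow for converting the TIF to PNG in memory (Textract does not accept TIF); the file name and region are placeholders.

# Rough sketch: send one patent figure to Amazon Textract.
# Assumes AWS credentials are configured for boto3 and Pillow is installed.
# Textract does not accept TIF, so the figure is converted to PNG bytes first.
import io

import boto3
from PIL import Image

def textract_lines(tif_path, region="us-east-1"):
    # Convert the TIF figure to PNG bytes; Textract takes PNG/JPEG bytes here.
    buffer = io.BytesIO()
    Image.open(tif_path).convert("RGB").save(buffer, format="PNG")

    client = boto3.client("textract", region_name=region)
    response = client.detect_document_text(Document={"Bytes": buffer.getvalue()})

    # Keep only LINE blocks; WORD blocks repeat the same text at finer granularity.
    return [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]

print(textract_lines("fig5.tif"))  # "fig5.tif" is a placeholder path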


I tested 60 figures to measure performance. The accuracy for figures with only one label is 84%, 91%, and 95% for Tesseract, Abbyy, and Amazon, respectively. For figures with two or more figure labels, the numbers are 21%, 35%, and 75%, respectively.

Tesseract, Abbyy, and the Amazon API have one thing in common: there is noise in almost every output, which means they do not separate the label from the drawing cleanly. For example, Figure #6 is recognized as follows:




Some parts of the drawing are recognized as words, and the output is as follows:



Even though it captures the labels correctly, it also produces a lot of meaningless and irrelevant output. If we can find a way to separate the drawings from the labels, we can get better results. One option is to build our own algorithm on top of these tools to improve the output.


The Google API gives almost perfect results. All the figure labels are correctly identified, and the outputs contain the least noise of all the tools. Compared with Amazon's OCR, the Google API produces noise only for the abstract drawings (Figure #3 and Figure #4), which are challenging because they contain both words and numbers. Even in those cases, the irrelevant words in the output are separated by spaces, so it is easy to parse the labels out. For example, from output like “figure. 5A 310 320 382 380 14 343 330 341 - 340. 360.” the label “figure. 5A” can be extracted with a regular expression.
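A hedged sketch of that approach is below, assuming a recent google-cloud-vision client with credentials set via GOOGLE_APPLICATION_CREDENTIALS. The regular expression is only an illustrative pattern for labels such as “figure. 5A”, and the TIF-to-PNG conversion is a simplification so the basic image-annotation path can be used; the file name is a placeholder.

# Sketch: OCR one figure with the Google Cloud Vision API and parse the label out.
# Assumes google-cloud-vision (>= 2.0) is installed and credentials are configured.
import io
import re

from google.cloud import vision
from PIL import Image

def figure_labels(tif_path):
    # Convert TIF to PNG bytes so the simple image-annotation request can be used
    # (TIF otherwise goes through a different file-based request).
    buffer = io.BytesIO()
    Image.open(tif_path).convert("RGB").save(buffer, format="PNG")

    client = vision.ImageAnnotatorClient()
    response = client.text_detection(image=vision.Image(content=buffer.getvalue()))
    full_text = response.text_annotations[0].description if response.text_annotations else ""

    # Tokens are space-separated, so a label like "figure. 5A" can be pulled out
    # of noisy output such as "figure. 5A 310 320 382 380 14 343 330 341 - 340. 360."
    return re.findall(r"fig(?:ure)?\.?\s*\d+[A-Za-z]?", full_text, flags=re.IGNORECASE)

print(figure_labels("fig3.tif"))  # placeholder path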


Below is a comparison of the pros and cons of the tools:


Tesseract:
  • Pros:
    1. Free
    2. Accepts png/jpeg/tif/pdf...
  • Cons:
    1. Not good with multiple figure labels

Abbyy:
  • Pros:
    1. Accepts png/jpeg/tif/pdf...
    2. Good with PDF files
  • Cons:
    1. Not free
    2. Not good with multiple figure labels

Amazon Textract:
  • Pros:
    1. Good with multiple figure labels
  • Cons:
    1. Not free
    2. Does not accept TIF files

Google Cloud Vision:

  • Pros:
    1. Good with multiple figure labels
    2. Very robust
    3. Accepts png/jpeg/tif/pdf...
  • Cons:
    1. Not free
    2. Google tries to sell a lot of its products
    3. The API is complicated to use
    4. Requires gcloud commands
    5. PNG and TIF are handled by different functions

In conclusion, Google gives the best performance of all these tools; if you have sufficient budget, it is the best choice. However, its setup is much more complicated than Abbyy's or Amazon's, so if you want something easier to use, Amazon is a good choice. If the budget is limited and you need an open-source tool, Tesseract is a good choice, and you can build your own algorithm on top of it.

-- Xin Wei

