2020-12-02: Comparing Four OCR Tools on US Patent Figure Label Recognition
The task is to extract labels from US patent figures. Patent figures differ from natural images: they are usually drawings of an object or diagrams such as circuits. A figure file may contain one or more figures, each of which has a label. We need to find a software tool that can reliably identify figure labels. All the figures are in TIF format when they are downloaded from the USPTO patent repository.

In the following experiments, I use OCR tools to extract figure labels, passing the whole figure file as the input. The candidates I compare are Tesseract, ABBYY, the Amazon Textract API, and the Google Cloud Vision API. Below are sample figures and my comments.

Figure #1

Figure #1 is a standard type of figure with one drawing and one label.

Figure #2

Figure #2 represents figures with multiple drawings and labels. We need to extract both labels. The dotted lines at the bottom of the outsole may be mistaken for words.

Figure #3

Figure #3 represents more abstract drawings with numbe