Posts

2020-12-02: Comparing Four OCR Tools on US Patent Figure Label Recognition

The task is to extract labels from US patent figures. Patent figures differ from natural images: they are usually drawings of an object or diagrams such as circuits. A figure file may contain one or more figures, each of which has a label. We need a software tool that can reliably identify figure labels. All figures are in TIF format when downloaded from the USPTO patent repository. In the following experiments, I use OCR tools to extract figure labels, with the whole figure file as the input. The candidates I compare are Tesseract, ABBYY, the Amazon Textract API, and the Google Cloud Vision API. Below are figure samples and my comments. Figure #1 is a standard type of figure with one drawing and one label. Figure #2 represents figures with multiple drawings and labels. We need to extract both labels. The dotted lines at the bottom of the outsole may be mistaken for words. Figure #3 represents more abstract drawings with numbe

2020-11-18: Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane

Figure 1: Creating collection growth curves with a web page text derivative. Recently, I have been learning about Archives Unleashed Toolkit (AUT), Hypercane, and how these tools can be used together. AUT is one of the tools from the Archives Unleashed Project, which can be used to analyze web archive collections. When AUT is given WARC or ARC files for a web archive collection, it can create network derivatives and text derivatives. In a network derivative, the nodes are the domains in a collection, and a link between two nodes exists when one or more web pages in one domain contain a link to a web page in the other domain. AUT can also create text derivatives that include information about the web pages, images, PDFs, or other documents in the collection. Hypercane, a tool developed by WS-DL's Shawn Jones, can be used to create WARC files that are associated with a public Archive-It collection. The WARC files created by Hypercane can

2020-11-15: Sapien Labs Virtual Symposium on Mental Health Trip Report

The Sapien Labs Virtual Symposium on The Future of Mental Health: Measurement, Treatment and Therapies was held via Adobe Connect on 2-3 November 2020. The symposium consisted of two sessions each day. Each session began with multiple presentations, and the presenters then joined a moderated panel discussion in the second half of that session. Dr. Tara Thiagarajan, Founder and CEO of Sapien Labs, hosted the virtual symposium, introduced many of the speakers, and led multiple panel discussions. Symposium Day 1, Session 1: Dr. Eiko Fried from Leiden University started the first session's presentations with his talk on "Measure Matters: Challenges to Assessing Mental Health Problems Pose a Substantial Barrier to Clinical Progress." Dr. Fried stated that while proper measurement is critical, it is difficult, especially when attempting to measure individuals' internal states, including personality, cognition, and mental healt

2020-11-04: New Twitter UI: Replaying Archived Twitter Pages That Never Existed

Figure 1: Multiple Temporal Violations in an archived page with the new Twitter interface. When you visit web archives to go back in time and look at a web page, you naturally expect it to display the content exactly as it appeared on the live web at that particular datetime. That expectation assumes, of course, that all of the resources on the page were captured at or near the datetime displayed in the banner for the root HTML page. However, we noticed that this is not always the case, and problems with archiving Twitter's new UI can result in replaying Twitter profile pages that never existed on the live web. In our previous blog post, we talked about how difficult it is to archive Twitter's new UI, and in this blog post, we uncover how the new Twitter UI mementos in the Internet Archive are vulnerable to temporal violations. On Aug 18, 2020, we stumbled upon a recently archived memento (Figure 1) of Donald Trump’s Twitter profile page in the Inter