Thursday, September 25, 2014

2014-09-25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

The Internet Archive (IA) and Open Library offer over 6 million fully accessible public domain eBooks. While casually browsing the scanned book collection, I searched for the term "dictionary" to see how many dictionaries they have, and found several in various languages. I randomly picked A Dictionary of the English Language (1828) - Samuel Johnson, John Walker, Robert S. Jameson from the search results and opened it in fullscreen mode using IA's open-source online BookReader application. The BookReader has the common tools for browsing an image-based book, such as flipping pages, seeking to a page, zooming, and changing the layout. Its toolbar also has some interesting features, such as reading aloud and full-text searching. I wondered how it could possibly search the text of, and read aloud, a book made of scanned raster images. I peeked into the page source, which pointed me to some documentation pages, and learned that it uses an Optical Character Recognition (OCR) engine called ABBYY FineReader to power these features.

I was curious how an early 19th century dictionary defines the term "dictionary" itself, so I gave the "search inside" feature of IA's BookReader a try and searched for "dictionary". The search took about 40 seconds in a book of 850 pages and returned three results. Unfortunately, they pointed to the title and advertisement pages where the term appeared, not to the page where it was defined. After this failed OCR attempt, I flipped pages back and forth manually in the BookReader, the way words are looked up in printed dictionaries, until I reached the appropriate page. There I located the term, and its definition read, "A book containing the words of any language in alphabetical order, with explanations of their meaning; a lexicon; a vocabulary; a word-book." I thought I would give the "search inside" feature another try. According to the definition above, a dictionary is a book, hence I chose "book" as the next lookup term. This time the BookReader took about 50 seconds and returned 174 matches, highlighted throughout the book. These matches include derived words as well as definitions and examples of other words in which the term "book" appears. Although the OCR engine did work this time, the goal of finding the definition of the lookup term was still not achieved.

After experimenting with an English dictionary, I was tempted to try another language. When it comes to a non-Latin language, there is no better choice for me than Urdu. Urdu is a Right-to-Left (RTL) complex script language influenced by Arabic and Persian. It shares much of its vocabulary and grammar with Hindi, is spoken by more than 100 million people globally (the majority in Pakistan and India), and happens to be my mother tongue. I picked an old dictionary entitled Farhang-e-Asifia (1908) - Sayed Ahmad Dehlavi (four volumes). I searched for several terms one after the other, but every time the response was "No matches were found.", although I had verified that the terms exist in the book. It turns out that ABBYY FineReader claims OCR support for about 190 languages, but it does not support more than 60% of the world's 100 most popular languages, and its recognition accuracy for the supported languages is not reliable.

Dictionaries are condensed collections of the words and definitions of a language and capture the essence of the cultural vocabulary of the era in which they were prepared. Hence they have great archival value and are of equal interest to linguists and archivists. Improving the accessibility of preserved scanned dictionaries will make them more useful not only for linguists and archivists, but for general users too. Unlike general literature, dictionaries have some special characteristics: their entries are sorted to make word lookup easy, and lookup in a dictionary is fielded searching, in which the lookup term is matched against headwords, as opposed to full-text searching. These special properties can be leveraged when developing an application for accessing scanned dictionaries.

To solve the problem of exploring scanned dictionaries and looking up words in them, we chose a crowdsourced manual approach that works for every language, irrespective of how poorly it is supported by OCR engines. In our approach, the pages or words of each dictionary are indexed manually so that the application can load the pages that correspond to a lookup word. The indexing is progressive: usefulness and ease of lookup increase as more crowdsourced effort is put into the system. The base case is "Ordered Pages", which is at least as good as IA's current BookReader. In the next stage the dictionary moves to the "Sparse Index" state, in which the first lookup word of each page is indexed. This is sufficient to determine the page on which any arbitrary lookup word would be found if it exists in the dictionary. To further improve accessibility, an exhaustive "Full Index" can be prepared that maps every single lookup word in the dictionary to its page, as opposed to just the first lookup word of each page. This index is especially helpful for dictionaries in which the words are not sorted linearly. Finally, the "Location Index" records the exact place of a lookup word on its page, so that the place can be highlighted to draw the user's attention to it. Apart from indexing, we have introduced an annotation feature with which users can link various resources to words on dictionary pages. Users are encouraged to contribute to the indexes and annotations as they use the application. For a more detailed description of our approach, please read our technical report:
Sawood Alam, Fateh ud din B Mehmood, Michael L. Nelson. Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages. Technical Report arXiv:1409.1284, 2014.
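
To make the "Sparse Index" stage concrete, here is a minimal sketch of how a page could be resolved from the first lookup word of each page. It is an illustration under stated assumptions, not the Dictionary Explorer's actual code: the function name `find_page`, the list-of-pairs layout, and the sample words are all hypothetical, and plain Python string ordering stands in for the language-specific sort order a real dictionary (Urdu in particular) would need.

```python
from bisect import bisect_right

def find_page(sparse_index, word):
    """Return the page on which `word` would appear, or None.

    `sparse_index` is a list of (first_word, page_number) pairs, one per
    page, sorted by first_word. This layout is an assumption made for the
    sketch. Words are compared with Python's default string ordering.
    """
    first_words = [first for first, _ in sparse_index]
    # Position of the last page whose first word is <= the lookup word.
    i = bisect_right(first_words, word) - 1
    if i < 0:
        return None  # the word sorts before the first indexed page
    return sparse_index[i][1]

# Made-up sample data, not taken from any real dictionary.
sample_index = [("a", 1), ("bat", 2), ("book", 3), ("cat", 4)]
print(find_page(sample_index, "boat"))  # 2
```

The binary search only narrows the lookup down to a page. Whether the word is actually present on that page is something the sparse index alone cannot tell, which is where the "Full Index" and "Location Index" stages come in.
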
We have built an online application called "Dictionary Explorer" that uses the indexing described above and has an interface suited to dictionaries. The application serves as an explorer of various dictionaries in various languages, and at the same time presents context-aware feedback controls for contributing to the indexes and annotations. In the Dictionary Explorer the user selects a lookup language. This loads a tree-like word index for the selected language in the sidebar and various tabs in the center region, where each tab corresponds to one monolingual or multilingual dictionary that has indexes in the selected language. The user can then either type the lookup term directly into the search field or locate it in the sidebar by expanding the corresponding prefixes. Once the lookup is performed, all the tabs are loaded simultaneously with the pages corresponding to the lookup term in each dictionary. If the location index is available for the lookup word, a pin is placed on the page where the word appears, which allows interaction with the word and its annotations. A special tab accumulates all the related resources, such as user-contributed definitions, audio, video, images, examples, and resources from third-party online dictionaries and services.
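
The tree-like word index in the sidebar can be pictured as a prefix tree (trie) that is expanded one character at a time. The following is a minimal sketch of that idea, not the application's real data structure: the class names `TrieNode` and `WordTrie`, the methods `expand` and `words`, and the sample words are hypothetical, and children are ordered by Python's default character ordering rather than by a language-specific alphabet.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_word = False  # True if an indexed word ends at this node

class WordTrie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following `prefix`, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def expand(self, prefix):
        """Return the prefixes one character longer than `prefix`."""
        node = self._find(prefix)
        if node is None:
            return []
        return [prefix + ch for ch in sorted(node.children)]

    def words(self, prefix=""):
        """Return all indexed words that start with `prefix`, sorted."""
        node = self._find(prefix)
        if node is None:
            return []
        found = []
        stack = [(prefix, node)]
        while stack:
            text, current = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((text + ch, child))
        return sorted(found)

# Made-up sample words, not taken from any real index.
trie = WordTrie(["bat", "book", "bookish", "boot"])
print(trie.expand("bo"))   # ['boo']
print(trie.words("book"))  # ['book', 'bookish']
```

Expanding a node in the sidebar would correspond to a call like `expand`, while selecting a prefix and listing what lies beneath it would correspond to `words`.
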

Following are some feature highlights to summarize the Dictionary Explorer application:
  • Support for various indexing stages.
  • Indexes in multiple languages and multiple monolingual and multilingual dictionaries in each language.
  • Bidirectional (right-to-left and left-to-right) language support.
  • Multiple input methods such as keyboard input, on-screen keyboard, and word prefix tree.
  • Simultaneous lookup in multiple dictionaries.
  • Pagination and zoom controls.
  • Interactive location marker pins.
  • Context-aware user feedback and annotations.
  • Separate tab for related resources such as user contributions, related media, and third-party resources.
  • API for third-party applications.
We have developed a progressive indexing approach that enables lookup in scanned dictionaries of any language with very little initial effort and that improves over time as more people interact with the dictionaries. In the future we want to explore the specific challenges of indexing and interaction in other languages, such as Mandarin or Japanese, whose dictionaries are not simply sorted alphabetically because of their very large character sets. We also want to use the indexes that users have built up over time to predict pages for lookup terms in dictionaries that are not yet indexed or are only partially indexed. Our intuition is that, for an arbitrary dictionary, we can automatically predict the page of a lookup term with acceptable variance by aligning the dictionary's pages with one or more resources, such as indexes of other dictionaries in the same language, a corpus of the language, the most popular words in the language, and partial indexes of the dictionary itself.
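
As a rough illustration of the page prediction idea, the sketch below estimates a page in an unindexed dictionary from the sparse index of another dictionary in the same language, by scaling the term's relative position to the target's page count. This is only one naive way to do such an alignment and is not a method described in the technical report: the function `predict_page`, its parameters, and the sample data are hypothetical, and it assumes both dictionaries spread their words across pages in similar proportions.

```python
from bisect import bisect_right

def predict_page(reference_first_words, target_page_count, word):
    """Estimate the page of `word` in an unindexed target dictionary.

    `reference_first_words` holds the sorted first word of each page of
    an already indexed reference dictionary (hypothetical input). The
    relative position of `word` among those pages is scaled linearly to
    the page count of the target dictionary.
    """
    if not reference_first_words or target_page_count < 1:
        return None
    # Number of reference pages that start at or before the word.
    pos = bisect_right(reference_first_words, word)
    if pos == 0:
        return 1  # the word sorts before everything in the reference
    # Integer arithmetic avoids floating point rounding surprises.
    estimate = (pos - 1) * target_page_count // len(reference_first_words) + 1
    return min(target_page_count, estimate)

# Made-up reference index with five pages; the target has 100 pages.
reference = ["a", "c", "f", "m", "s"]
print(predict_page(reference, 100, "g"))  # 41
```

An estimate like this would only give a starting page. Partial indexes of the target dictionary, once users contribute them, could then be used to correct the prediction.
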


Sawood Alam
