2022-01-27: MAN, A New Tool For Normalizing Transliterations of Arabic Named Entities in Cross-Language Information Retrieval

MAN, A New Tool For Normalizing Transliterations of Arabic Names

Natural Language Processing (NLP):

Recent advances in Natural Language Processing (NLP) have allowed machines to process the massive amounts of data found on the internet and elsewhere. The data revolution isn't only about numbers: data also include words, images, videos, etc. Therefore, researchers are working on teaching machines how to process natural languages so they can interact with humans, summarize data, extract information, etc. The fact that machines now have to interpret human languages opens the door to new opportunities for NLP software that facilitates interactions between humans and computers. Named Entity Recognition and Classification (NERC) is one of the most important techniques in NLP. Names of persons, locations, and organizations extracted from a document help computers understand the content of the document.

Cross-Language Information Retrieval (CLIR):

The importance of Cross-Language Information Retrieval (CLIR) comes from the exceptional increase in internet usage worldwide, globalization, social media, eCommerce, and international news outlets. For CLIR to work, translation services must be used. Therefore, translation APIs (cloud-based translation services) are essential in CLIR. I proposed an approach for extracting named entities from Arabic text using a combination of two tools, Google Translate and Stanford NERC, and it produced results comparable to those of the Arabic Linguistic Pipeline (ALP). The implementation of my approach, GTS, is available on GitHub.

Named Entity Normalization (NEN):

One of the important steps in NERC is normalizing named entities. In CLIR, discrepancies between transliterations introduce a problem for entity linking. They occur very often in Arabic names because the Arabic language uses a different alphabet from English and the transliteration process is not standardized. Furthermore, multiple correct transliterations, typos, illiteracy, personal preferences, and cultural differences allow for different transliterations to occur even in the same document.

Different transliterations of the same Arabic name in the same document

In December 2020, I wrote a blog post outlining tools and libraries for matching Arabic names written in English. I ran multiple string matching and phonetic algorithms on a dataset I constructed from a list of Arabic given names on Wikipedia. Double Metaphone produced the best name-merging results on that dataset.

Merge Arabic Names (MAN), A New Tool That Solves The Problem of Entity Normalization in Cross-Language Information Retrieval:

In January 2022, I published a blog post about using Google Translate for matching transliterations of Arabic names. However, how is that useful for researchers without a tool that will take a list of transliterations, map them to their corresponding Arabic names, and group the transliterations by the Arabic name from which they were derived?

Sample Input:

List of transliterations:

Muhammed, Ahmad, Muhamed, Hamid, Muhamad, Husam, Mohamad, Mahmood, Hussam, Mahmud, Ahmed, Husam, Mohammed, Mohamed, Housam, Mahmod, Hameed, Muhammad, Houssam, Mohammad

Sample Output:

List of transliterations grouped by the Arabic name they were derived from:

محمد: Muhammed, Muhamed, Muhamad, Mohamad, Mohammed, Mohamed, Muhammad, Mohammad

أحمد: Ahmad, Ahmed

حسام: Husam, Hussam, Husam, Housam, Houssam

حميد: Hamid, Hameed

محمود: Mahmood, Mahmud, Mahmod

In this post, I present Merge Arabic Names (MAN), the new tool I developed to normalize/merge transliterations of Arabic named entities. This tool leverages Google Translate to do the heavy lifting of translating the names from English to Arabic. MAN can process a list of named entities with multiple transliterations each. It groups transliterations by their corresponding Arabic named entity and, most importantly, it solves the problem of the presence/absence of the article "The" (ال) in Arabic names, allowing all matching named entities to be merged (something Google Translate alone does not currently do). By default, the output is printed to STDOUT and saved to a JSON file.
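The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration and not MAN's actual code: `group_by_arabic_name`, `fake_translate`, and `SAMPLE_TRANSLATIONS` are hypothetical names, and the lookup table stands in for the Google Translate call so that the sketch runs offline. The pairs in the table are taken from the sample output above.

```python
from collections import defaultdict


def group_by_arabic_name(transliterations, translate):
    """Group transliterations by the Arabic string that `translate` returns.

    `translate` is any callable mapping an English spelling to Arabic text.
    """
    groups = defaultdict(list)
    for name in transliterations:
        groups[translate(name)].append(name)
    return dict(groups)


# Hypothetical stand-in for the translation service (assumption: the real
# tool would query Google Translate here instead of a fixed table).
SAMPLE_TRANSLATIONS = {
    "Ahmad": "أحمد", "Ahmed": "أحمد",
    "Hamid": "حميد", "Hameed": "حميد",
    "Mahmood": "محمود", "Mahmud": "محمود", "Mahmod": "محمود",
}


def fake_translate(name):
    return SAMPLE_TRANSLATIONS[name]


names = ["Ahmad", "Hamid", "Mahmood", "Mahmud", "Ahmed", "Mahmod", "Hameed"]
groups = group_by_arabic_name(names, fake_translate)
for arabic, variants in groups.items():
    print(f"{arabic}: {', '.join(variants)}")
```

Passing the translation function in as a parameter keeps the grouping logic independent of any particular translation service, which also makes it easy to test.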

The Article "The" in Arabic (ال):

I do not know what rule Google Translate uses to decide that "Miqdad" and "Maqdad" should get the article "The" (ال) as a prefix in the Arabic name "مقداد" while the rest of the transliterations do not! Some translations of the same entity, although correct, differ only in the presence/absence of the article "The" (ال).

Google Translate adding the unneeded article "The" in the Arabic translation

Adding/omitting the article "The" is common in Arabic names. The article (ال) is not present in most modern Arabic persons' first names. It is sometimes present in middle and last names. The presence/absence of the article "The" does not change the meaning of the name.

Examples:

Old Arabic first names with (ال):

العباس: Alabbas

الوليد: Alwaleed

الفاروق: Alfarouq

البراء: Albaraa

المعتصم: Almotasem

النعمان: Alnuman

الحسن: Alhasan

الحسين: Alhussein

المأمون: Almamoon

The same Arabic first names, used nowadays, without (ال):

عباس: Abbas

وليد: Waleed

فاروق: Farouq

براء: Baraa

معتصم: Motasem

نعمان: Numan

حسن: Hasan

حسين: Hussein

مأمون: Mamoon

MAN solves the problem by mapping all different variations of transliterations of the same name to the Arabic name without the article "The" (ال) enabling merging names regardless of their form (that could not be accomplished using only Google Translate).
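One way to picture this normalization is a function that drops a leading (ال) from the Arabic name and then merges groups that end up under the same key. The sketch below is an assumption about how such a step could look, not the rule MAN actually implements; `normalize_key`, `merge_groups`, and the minimum-length guard are hypothetical. The example names come from the lists above.

```python
ARTICLE = "ال"  # the Arabic definite article, alef followed by lam


def normalize_key(arabic_name):
    """Return the name without a leading article.

    The length guard (an assumption) avoids stripping very short strings.
    A name whose root genuinely begins with these two letters would still
    be stripped incorrectly; this sketch does not handle that case.
    """
    if arabic_name.startswith(ARTICLE) and len(arabic_name) - len(ARTICLE) >= 2:
        return arabic_name[len(ARTICLE):]
    return arabic_name


def merge_groups(groups):
    """Merge groups whose Arabic keys differ only by the leading article."""
    merged = {}
    for arabic, variants in groups.items():
        merged.setdefault(normalize_key(arabic), []).extend(variants)
    return merged


groups = {
    "العباس": ["Alabbas"],
    "عباس": ["Abbas"],
    "الوليد": ["Alwaleed"],
    "وليد": ["Waleed"],
}
merged = merge_groups(groups)
for arabic, variants in merged.items():
    print(f"{arabic}: {', '.join(variants)}")
```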

Biblical Names:

Google Translate, and therefore MAN, is able to map Biblical names and their transliterations to their Arabic counterparts. Biblical names are different since they are, generally, not Arabic. Unless these names are of Arabic origin, the Hebrew pronunciation of the name is used to generate the Arabic spelling.

For example, the Biblical name "Abraham" is "ابراهيم" in Arabic and is pronounced "Ebraheem". MAN can map Abraham, Ebrahim, Ibraheem, Ebraheem, and Ibrahim to the same Arabic name "ابراهيم".

Google Translate output for the Biblical name "Abraham" and its various transliterations from its Arabic counterpart

MAN Usage:

Input: A list of Arabic names written in English (transliterations) in a .txt file

Output: Transliterations grouped by their Arabic counterpart in a JSON format

Running MAN:

$ python3 merge_names.py <input_file> <output_file>

Example:

$ python3 merge_names.py path/to/input/file.txt path/to/output/file.json
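The input and output handling around that command could look roughly like the sketch below. The exact layout of the input file is not specified here, so the sketch assumes names may be separated by commas, newlines, or both; `read_names` and `write_groups` are hypothetical helper names, and the grouping itself is hard-coded to keep the example short.

```python
import json
import os
import tempfile


def read_names(path):
    """Read transliterations separated by commas and/or newlines (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    parts = text.replace("\n", ",").split(",")
    return [part.strip() for part in parts if part.strip()]


def write_groups(groups, path):
    """Save groups as JSON, keeping the Arabic keys readable in the file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, ensure_ascii=False, indent=2)


# Round trip in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as workdir:
    input_path = os.path.join(workdir, "names.txt")
    output_path = os.path.join(workdir, "groups.json")
    with open(input_path, "w", encoding="utf-8") as handle:
        handle.write("Ahmad, Ahmed\nHamid\n")

    names = read_names(input_path)
    write_groups({"أحمد": names[:2], "حميد": names[2:]}, output_path)

    with open(output_path, encoding="utf-8") as handle:
        saved = json.load(handle)

print(saved)
```

Writing with `ensure_ascii=False` keeps the Arabic names as Arabic text in the JSON file instead of `\u` escape sequences.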

Requirements:

Python 3.X

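The following Python libraries: googletrans (a third-party package that must be installed separately), plus collections, sys, os, io, and json, which ship with the Python standard library.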

Limitations and Shortcomings:

MAN is limited by the ability of Google Translate to translate the named entity transliterations, its performance, and the quality of its translation.

MAN also has availability limitations because it uses the Google Translate Ajax API. Google may ban your client IP address for sending too many requests. If you have a large amount of data to process, the best way to avoid being banned or receiving an HTTP 429 Too Many Requests response is to alter the code and include a delay between requests. To recover from an HTTP 429 Too Many Requests or an HTTP 503 Service Unavailable response, you can read the value of the Retry-After header in the response object and sleep for that long before retrying. For a stable tool, I encourage altering the code to use Google's official translation API. It costs money, but it is more reliable.
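The retry idea can be sketched as follows. This is a generic illustration, not code from MAN: `RateLimited`, `parse_retry_after`, and `call_with_retry` are hypothetical names, and how the status code and headers are obtained from the translation client is left as an assumption. The sketch only handles a Retry-After value given in seconds and falls back to a default delay otherwise.

```python
import time


class RateLimited(Exception):
    """Hypothetical error carrying the HTTP status and response headers."""

    def __init__(self, status, headers):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.headers = headers


def parse_retry_after(headers, default_seconds=5.0):
    """Return the Retry-After delay in seconds, or a default if unusable."""
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date rather than seconds.
        return default_seconds


def call_with_retry(request, max_attempts=3, sleep=time.sleep):
    """Call `request`, sleeping and retrying on HTTP 429 or 503."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimited as error:
            if error.status not in (429, 503) or attempt == max_attempts:
                raise
            sleep(parse_retry_after(error.headers))


# Demonstration with a fake request that is rate limited once, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RateLimited(429, {"Retry-After": "2"})
    return "محمد"


waits = []  # records requested delays instead of actually sleeping
result = call_with_retry(flaky_request, sleep=waits.append)
print(result, waits)
```

In real use the `sleep` argument would be left at its default so the program actually pauses; it is replaced here only so the demonstration finishes instantly.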

Improvements:

MAN is a tiny effort towards merging Arabic named entities in English documents. It relies on Google Translate and I assume that MAN will only get better since Google Translate will only get better. I will continue to work on improving MAN and may include other Translation APIs as options/alternatives to Google Translate. Feedback, suggestions, and pull requests are always welcome!

-- Hussam Hallak

Web Science and Digital Libraries Research Group