2022-01-27: MAN, A New Tool For Normalizing Transliterations of Arabic Named Entities in Cross-Language Information Retrieval
Natural Language Processing (NLP):
The recent advancement in Natural Language Processing (NLP) has allowed machines to process massive amounts of data found on the internet and elsewhere. The data revolution isn’t only about numbers because, in addition to numbers, data include words, images, videos, etc. Therefore, researchers are working on teaching machines how to process natural languages to interact with humans, summarize data, extract information, etc. The fact that machines now have to interpret human languages opens the door to new opportunities for NLP software that facilitates interactions between humans and computers. Named Entity Recognition and Classification (NERC) is one of the most important techniques in NLP. Names of persons, locations, and organizations extracted from a document enable computers to understand the content of the document.
Cross-Language Information Retrieval (CLIR):
The importance of Cross-Language Information Retrieval (CLIR) comes from the exceptional increase in internet usage worldwide, globalization, social media, eCommerce, and international news outlets. For CLIR to work, translation services must be used. Therefore, translation APIs (cloud-based translation services) are essential in CLIR. I proposed an approach for extracting named entities from Arabic text using a combination of two tools, Google Translate and Stanford NERC, and it produced results comparable to those of the Arabic Linguistic Pipeline (ALP). The implementation of my approach, GTS, is available on GitHub.
Named Entity Normalization (NEN):
One of the important steps in NERC is normalizing named entities. In CLIR, discrepancies between transliterations introduce a problem for entity linking. They occur very often in Arabic names because the Arabic language uses a different alphabet from English and the transliteration process is not standardized. Furthermore, multiple correct transliterations, typos, illiteracy, personal preferences, and cultural differences allow for different transliterations to occur even in the same document.
In December 2020, I wrote a blog post outlining tools and libraries for matching Arabic names written in English. I ran multiple string matching and phonetic algorithms on a dataset I constructed from a list of Arabic given names on Wikipedia, and Double Metaphone produced the best name-merging results.
Merge Arabic Names (MAN), A New Tool That Solves The Problem of Entity Normalization in Cross-Language Information Retrieval:
In January 2022, I published a blog post about using Google Translate to match transliterations of Arabic names. However, how is that useful for researchers without a tool that takes a list of transliterations, maps them to their corresponding Arabic names, and groups the transliterations by the Arabic name from which they were derived?
Sample Input:
List of transliterations:
Muhammed, Ahmad, Muhamed, Hamid, Muhamad, Husam, Mohamad, Mahmood, Hussam, Mahmud, Ahmed, Husam, Mohammed, Mohamed, Housam, Mahmod, Hameed, Muhammad, Houssam, Mohammad
Sample Output:
List of transliterations grouped by the Arabic name they were derived from:
محمد: Muhammed, Muhamed, Muhamad, Mohamad, Mohammed, Mohamed, Muhammad, Mohammad
أحمد: Ahmad, Ahmed
حسام: Husam, Hussam, Husam, Housam, Houssam
حميد: Hamid, Hameed
محمود: Mahmood, Mahmud, Mahmod
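The grouping step behind this sample can be sketched in a few lines of Python. This is only an illustration, not MAN's actual code: `to_arabic` is a hypothetical lookup-table stand-in for the Google Translate call, and its entries are copied from the sample output above rather than produced by any API.

```python
from collections import defaultdict

# Hypothetical stand-in for a translation call (MAN itself relies on Google
# Translate). The entries are copied from the sample output above.
SAMPLE_TRANSLATIONS = {
    "Ahmad": "أحمد",
    "Ahmed": "أحمد",
    "Hamid": "حميد",
    "Hameed": "حميد",
    "Mahmood": "محمود",
    "Mahmud": "محمود",
}


def to_arabic(name):
    """Return the Arabic form of a transliteration (lookup-table stub)."""
    return SAMPLE_TRANSLATIONS[name]


def group_by_arabic_name(transliterations, translate=to_arabic):
    """Group transliterations under the Arabic name each one maps to."""
    groups = defaultdict(list)
    for name in transliterations:
        groups[translate(name)].append(name)
    return dict(groups)


groups = group_by_arabic_name(
    ["Ahmad", "Hamid", "Mahmood", "Ahmed", "Hameed", "Mahmud"]
)
# groups now maps each Arabic name to the transliterations seen for it,
# in input order, e.g. "أحمد" -> ["Ahmad", "Ahmed"].
```

Passing a different `translate` callable (for example, one that wraps a real translation client) would leave the grouping logic unchanged.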
In this post, I present Merge Arabic Names (MAN), the new tool I developed to normalize/merge transliterations of Arabic named entities. This tool leverages Google Translate to do the heavy lifting of translating the names from English to Arabic. MAN can process a list of named entities with multiple transliterations each. It groups transliterations by their corresponding Arabic named entity and, most importantly, it solves the problem of the presence/absence of the article "The" (ال) in Arabic names, allowing all matching named entities to be merged (something not currently supported by Google Translate). By default, the output is printed to STDOUT and saved to a JSON file.
The Article "The" in Arabic (ال):
I do not know what rule Google Translate uses to decide that "Miqdad" and "Maqdad" should get the article "The" (ال) as a prefix to the Arabic name "مقداد" while the rest of the transliterations do not! Some translations of the same entity, although correct, nevertheless differ in the presence/absence of the article "The" (ال).
Adding/omitting the article "The" is common in Arabic names. The article (ال) is not present in most modern Arabic first names. It is sometimes present in middle and last names. The presence/absence of the article "The" does not change the meaning of the name.
Examples:
Old Arabic first names with (ال):
العباس: Alabbas
الوليد: Alwaleed
الفاروق: Alfarouq
البراء: Albaraa
المعتصم: Almotasem
النعمان: Alnuman
الحسن: Alhasan
الحسين: Alhussein
المأمون: Almamoon
The same Arabic first names, used nowadays, without (ال):
عباس: Abbas
وليد: Waleed
فاروق: Farouq
براء: Baraa
معتصم: Motasem
نعمان: Numan
حسن: Hasan
حسين: Hussein
مأمون: Mamoon
MAN solves the problem by mapping all variations of transliterations of the same name to the Arabic name without the article "The" (ال), which enables merging names regardless of their form (this could not be accomplished using Google Translate alone).
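A minimal sketch of that normalization, assuming the article is handled as a plain string prefix (MAN's own implementation may differ). The function name `strip_article` is made up for this example. Note that a simple prefix check like this one would also trim any name in which ال happens to be part of the stem, so a real tool may need extra rules.

```python
ARTICLE = "ال"  # the Arabic definite article (alef + lam)


def strip_article(arabic_name):
    """Drop a leading article so that الحسن and حسن share one key.

    Assumption for this sketch: the article is treated as a plain string
    prefix, and a string consisting only of the article is left unchanged.
    """
    if arabic_name.startswith(ARTICLE) and len(arabic_name) > len(ARTICLE):
        return arabic_name[len(ARTICLE):]
    return arabic_name


# Both forms of the same name collapse to the same key.
same_key = strip_article("الحسن") == strip_article("حسن")
```

Applying `strip_article` to each translated name before grouping is one way to make the old and modern forms listed above land in the same group.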
Biblical Names:
Google Translate, and therefore MAN, is able to map Biblical names and their transliterations to their Arabic counterparts.
Biblical names are different since they are, generally, not Arabic. Unless these names are of Arabic origin, the Hebrew pronunciation of the name is used to generate the Arabic spelling.
MAN Usage:
Input: A list of Arabic names written in English (transliterations) in a .txt file
Output: Transliterations grouped by their Arabic counterpart, in JSON format
Running MAN:
$ python3 merge_names.py <input_file> <output_file>
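The exact layout of MAN's output file is not shown in this post. As a rough sketch, assuming the output is a flat JSON object that maps each Arabic name to its list of transliterations, the result could be written and read back like this. The helper names `save_groups` and `load_groups` are made up for the example.

```python
import json


def save_groups(groups, output_path):
    """Save grouped names as UTF-8 JSON.

    Assumed layout: {"<Arabic name>": ["<transliteration>", ...], ...}
    """
    with open(output_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps the Arabic keys readable in the file
        # instead of writing them as \uXXXX escape sequences.
        json.dump(groups, handle, ensure_ascii=False, indent=2)


def load_groups(input_path):
    """Read the grouped names back into a dictionary."""
    with open(input_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Opening the file with an explicit `encoding="utf-8"` avoids depending on the platform's default encoding, which matters as soon as the keys are Arabic.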
Requirements:
Limitations and Shortcomings:
MAN is limited by the ability of Google Translate to translate the named entity transliterations, its performance, and the quality of its translation.
MAN also has availability limitations because it uses the Google Translate Ajax API. Google may ban your client IP address for making too many requests. If you have a large amount of data to process, the best way to avoid being banned or getting an HTTP 429 Too Many Requests response code is to alter the code and include a delay between requests. To recover from an HTTP 429 Too Many Requests or an HTTP 503 Service Unavailable response, you can read the value of the Retry-After header in the response object and sleep for that long before retrying. For a stable tool, I encourage altering the code to use Google's official translation API. It costs money, but it is more reliable.
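The Retry-After idea can be sketched as a small wrapper. This is not part of MAN: `send_with_retry`, `max_attempts`, and `default_delay` are names invented for this example, and `send` is assumed to be any zero-argument callable that returns an object with `.status_code` and `.headers`, which is the shape of a response from the `requests` library.

```python
import time

RETRYABLE_STATUS = {429, 503}


def send_with_retry(send, max_attempts=5, default_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a response that is not 429 or 503.

    Returns the last response received, even if it is still a 429/503
    after max_attempts tries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in RETRYABLE_STATUS:
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep again
        header = response.headers.get("Retry-After")
        # Retry-After can also be an HTTP date. This sketch only handles the
        # delay-in-seconds form and falls back to default_delay otherwise.
        delay = float(header) if header and header.isdigit() else default_delay
        sleep(delay)
    return response
```

With `requests`, the call being retried could be wrapped in a lambda, for example `send_with_retry(lambda: requests.get(url, params=params))`, where `url` and `params` are whatever the translation request needs. The `sleep` parameter exists so the waiting can be replaced in tests.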
Improvements:
MAN is a tiny effort towards merging Arabic named entities in English documents. It relies on Google Translate, so I expect MAN to get better as Google Translate gets better. I will continue to work on improving MAN and may include other translation APIs as options/alternatives to Google Translate. Feedback, suggestions, and pull requests are always welcome!