2022-01-19: Leveraging Google Translate for matching Arabic names written in English



Introduction:

There is a significant body of research and tooling for Named-Entity Recognition (NER); however, only a small portion of it addresses Arabic text, and even fewer tools extract named entities from documents written in Arabic. In August 2020, I proposed an approach for extracting named entities from Arabic text using a combination of two tools, Google Translate and Stanford NERC, and it produced results comparable to those of the Arabic Linguistic Pipeline (ALP). The implementation of my approach, GTS, is available on GitHub. In December 2020, I wrote a blog post outlining tools and libraries for matching Arabic names written in English, which is important for Entity Linking, a subtask of Natural Language Processing (NLP).

While discussing the importance of Entity Linking is beyond the scope of this post, merging Arabic named entities written in English is the first step of Entity Linking when processing English documents. This is because the same Arabic name is very often spelled in different ways in English. One would think there are spelling standards in place, but that is far from true. Furthermore, typos, illiteracy, personal preferences, and cultural differences make the problem even worse. Out of all the string matching and phonetic algorithms I experimented with in 2020, Double Metaphone produced the best results on the dataset I constructed from a list of Arabic given names on Wikipedia.

Names Matching is a Real-World Problem:

Spelling differences in Arabic names written in English are so common that even the most popular name, Muhammed, has at least 14 different spellings in English but only one spelling in Arabic. This is just one part of the problem. Things get worse when you find out that these spelling differences are not limited to texts from different sources written by different people. They can be found within the same document, written by the same person!

Faisal Mekdad, the Syrian Foreign Minister, is Mr. Mekdad in Arab News
Faisal Mekdad spelling in Arab News

He is Mr. Mukdad in Islam Web
Faisal Mekdad spelling in Islam Web

He is Mr. Mokdad in Almanar
Faisal Mekdad spelling in Almanar

He is Mr. Miqdad in BBC News
Faisal Mekdad spelling in BBC News

He can be Mr. Maqdad or Mr. Mokdad in the same story in Aljazeera English.
Faisal Mekdad spelling differences in Aljazeera
These are just a few spelling variations of "Mekdad" that I found on the web in less than a minute. There are, at least, a dozen more variations.
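
To give a rough feel for how far apart these spellings are as plain strings, here is a minimal sketch that scores every pair of the five surname spellings quoted above. It uses `difflib.SequenceMatcher` from the Python standard library, which is an assumption made for convenience: it is not one of the algorithms I evaluated in 2020, and the helper names `similarity` and `pairwise_scores` are made up for this illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Surname spellings quoted in this post (Arab News, Islam Web,
# Almanar, BBC News, Aljazeera English).
VARIANTS = ["Mekdad", "Mukdad", "Mokdad", "Miqdad", "Maqdad"]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_scores(names):
    """Score every unordered pair of names."""
    return {(a, b): similarity(a, b) for a, b in combinations(names, 2)}


if __name__ == "__main__":
    # Print the pairs from least to most similar.
    ranked = sorted(pairwise_scores(VARIANTS).items(), key=lambda kv: kv[1])
    for (a, b), score in ranked:
        print(f"{a:<8}{b:<8}{score:.2f}")
```

None of the pairs scores a perfect 1.0, and the scores are not all equal, so any fixed cut-off for "same name" is a judgment call that has to be tuned against real data.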

Google Translate:

We all know that Google Translate was built mainly to translate text from one language to another, but that is not the only task it can be used for. Google Translate is also an effective tool for matching names written in foreign languages. I am not sure whether it was built with this purpose in mind or which algorithms it uses to achieve this, but I found that it works.

Google Translate was able to map all different spellings of "Faisal Mekdad", mentioned earlier, to the same Arabic name "فيصل مقداد".
Google Translate mapping different spellings of Mekdad


The last two variations of "Mekdad" are mapped to the same word as the first three variations, but with the article "The" added at the beginning ("الـ" is the Arabic equivalent of the article "The" in English). Adding or omitting the article "The" is common in Arabic names. In fact, my own name, Hussam Aldeen Hallak, has it in "Aldeen"; it translates to "Faith" or "The Faith".
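
The idea of using the Arabic translation as a matching key, while ignoring the optional article, can be sketched as follows. This is only an illustration under stated assumptions: `translate` is any callable that turns an English spelling into Arabic (for example, a wrapper around a translation service), `fake_translate` and its lookup table are hypothetical stand-ins so that the example runs offline, and the table entries are not captured Google Translate output.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

ARTICLE = "ال"  # the Arabic definite article ("the")


def strip_article(arabic_name: str) -> str:
    """Drop a leading definite article from each word of an Arabic name."""
    words = []
    for word in arabic_name.split():
        # Strip only when a real word remains after the article.
        if word.startswith(ARTICLE) and len(word) > len(ARTICLE) + 1:
            word = word[len(ARTICLE):]
        words.append(word)
    return " ".join(words)


def group_by_arabic(
    names: Iterable[str], translate: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group English spellings by their article-free Arabic translation."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[strip_article(translate(name))].append(name)
    return dict(groups)


# Hypothetical lookup table standing in for a real translation call.
# Which spelling gets the article here is invented for illustration.
FAKE_TRANSLATIONS = {
    "Faisal Mekdad": "فيصل مقداد",
    "Faisal Mokdad": "فيصل مقداد",
    "Faisal Miqdad": "فيصل المقداد",
}


def fake_translate(name: str) -> str:
    return FAKE_TRANSLATIONS[name]


if __name__ == "__main__":
    for arabic, spellings in group_by_arabic(FAKE_TRANSLATIONS, fake_translate).items():
        print(arabic, "<-", spellings)
```

With the article stripped, all three spellings fall into a single group. One caveat of this simple rule: a name whose first two letters happen to be "ال" without being the article would be shortened incorrectly, so a real pipeline would need a list of exceptions.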

Biblical Names:

Mapping Biblical names to their Arabic counterparts using string matching and/or phonetic algorithms is even harder because these names are not originally Arabic. There are Arabs named after Biblical characters, but the Arabic pronunciation of their names follows the Hebrew pronunciation, which is very different from the English one. The spelling of these names in Arabic is often dictated by their Hebrew pronunciation.

For example, the Biblical name "Michael" is "ميخائيل" in Arabic and is pronounced "mikhayiyl". Google Translate was able to map "Michael" and its various transliterations to the same Arabic name, "ميخائيل". Mapping such a mixture of variations to one name is impressive.
Michael and its various transliterations mapped to the same Biblical name in Arabic



Conclusion:

There are many applications for names matching, including Entity Linking, a subtask of Natural Language Processing (NLP). Names matching is mainly used in Information Retrieval and Extraction, Spam Detection, Database Normalization, etc. Matching Arabic names written in English is not an easy task, since they can be spelled in many different ways due to the lack of standards, typos, illiteracy, and different cultural/regional traditions. Google Translate is a powerful tool for solving this problem, and it is very effective in matching Biblical names to their Arabic counterparts.

