2022-02-25: Evaluating MAN, the Tool that Utilizes Google Translate to Normalize Arabic Names' Transliterations in Cross-Language Information Retrieval
Introduction:
The increased use of Natural Language Processing (NLP) techniques is fueled by the need to process massive amounts of data, the demand for clever chatbots, and other human-computer interaction tasks. Named Entity Recognition (NER) is one of the most important techniques in NLP. The extracted named entities offer computers a way to classify documents, perform semantic analysis on textual information, and so on. In other words, NLP allows machines to understand human languages. Speaking of languages, Cross-Language Information Retrieval (CLIR) has gained traction over the past two decades or so due to the unprecedented rise in globalization, transnational companies, international news outlets, social media, and internet use. CLIR requires a translation service because it deals with retrieving information written in a language different from that of the user's query. In August 2020, I proposed an approach for extracting named entities from Arabic text using a combination of two tools, Google Translate and Stanford NERC, and it produced results comparable to the Arabic Linguistic Pipeline (ALP). The implementation of my approach, GTS, is available on GitHub.
My Journey with Arabic Names Matching:
The research I am conducting is about Cross-Language Information Retrieval, specifically information retrieval from Arabic documents. Normalizing named entities is an essential task in CLIR, and it becomes harder when the named entities are Arabic. Transliteration of Arabic named entities is not standardized, and standardizing it is difficult because Arabic is written in a Semitic abjad, a script very different from the English alphabet. Thus, multiple correct transliterations of Arabic names are very common. Furthermore, personal preferences and cultural differences allow multiple transliterations to emerge and be accepted.
In December 2020, I wrote a blog post outlining tools and libraries for matching Arabic names written in English. I ran multiple string matching and phonetic algorithms on a dataset I constructed from a list of Arabic given names on Wikipedia. Double Metaphone produced the best results in merging names on that dataset.
In January 2022, I published a blog post about using Google Translate for matching Arabic names' transliterations.
Later that month, I developed a tool, Merge Arabic Names (MAN), which takes a list of transliterations, maps them to their corresponding Arabic names, and groups the transliterations by the Arabic name from which they were derived. In the blog post announcing the availability of MAN, I outlined the pros and cons of using Google Translate to normalize Arabic names, based on the few examples I tested while developing MAN. Now it is time for a methodical evaluation of Google Translate's ability to merge English transliterations of Arabic names. I will also investigate the cases where Google Translate failed to produce the desired output.
Sample Input:
List of transliterations:
Muhammed, Ahmad, Muhamed, Hamid, Muhamad, Husam, Mohamad, Mahmood, Hussam, Mahmud, Ahmed, Husam, Mohammed, Mohamed, Housam, Mahmod, Hameed, Muhammad, Houssam, Mohammad
Sample Output:
List of transliterations grouped by the Arabic name they were derived from:
محمد: Muhammed, Muhamed, Muhamad, Mohamad, Mohammed, Mohamed, Muhammad, Mohammad
أحمد: Ahmad, Ahmed
حسام: Husam, Hussam, Husam, Housam, Houssam
حميد: Hamid, Hameed
محمود: Mahmood, Mahmud, Mahmod
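The grouping step shown in the sample above can be sketched in a few lines of Python. This is only an illustration and not MAN's actual code: the function name `group_by_arabic_name` and the `to_arabic` callable are hypothetical, and the translation call is replaced here by a small lookup table copied from the sample output so that the sketch runs offline.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_arabic_name(
    transliterations: Iterable[str],
    to_arabic: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group transliterations by the Arabic string that to_arabic returns."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in transliterations:
        groups[to_arabic(name)].append(name)
    return dict(groups)


# Hypothetical stand-in for the translation step, taken from the sample output.
SAMPLE_LOOKUP = {
    "Ahmad": "أحمد", "Ahmed": "أحمد",
    "Hamid": "حميد", "Hameed": "حميد",
    "Mahmood": "محمود", "Mahmud": "محمود",
}

groups = group_by_arabic_name(
    ["Ahmad", "Hamid", "Mahmood", "Ahmed", "Mahmud", "Hameed"],
    SAMPLE_LOOKUP.__getitem__,
)
for arabic, names in groups.items():
    print(f"{arabic}: {', '.join(names)}")
```

Passing the translation step in as a callable keeps the grouping logic independent of whichever translation service is used.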
Dataset:
I tested MAN on the same dataset I used for testing multiple string matching and phonetic algorithms. I constructed this dataset from a list of Arabic given names on Wikipedia. All the English transliterations of each Arabic name were copied from the Wikipedia page linked to that name. For example, transliterations of ريّان were manually extracted from Rayan's Wikipedia page. The page listed two possible spellings in English, Rayan and Rayyan. Because there is no standard, the transliterations found on Wikipedia are subjective and reflect the opinions of the people who wrote them, but it is safe to say that they are acceptable to the average bilingual Arab.
Difficulties:
I faced difficulties running MAN on the entire dataset at once because MAN uses the Google Translate Ajax API. Google did not process the entire dataset in one run and returned an HTTP 429 Too Many Requests response status code instead. I had to break the dataset into smaller chunks, run MAN on each chunk separately with a break between consecutive chunks, and then combine the outputs. I would not have had this problem if I were using Google's official Translate API.
Results and Comparisons:
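The chunk-and-pause workaround described under Difficulties could look roughly like the sketch below. The helper names, the chunk size, and the pause length are assumptions made for illustration; the post does not state the values that were actually used, and `process_chunk` stands in for whatever runs MAN on one chunk.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_chunks(
    items: Sequence[str],
    process_chunk: Callable[[Sequence[str]], List[str]],
    chunk_size: int = 25,         # assumed value, not from the post
    pause_seconds: float = 30.0,  # assumed value, not from the post
) -> List[str]:
    """Run process_chunk on each chunk, sleeping between chunks, and combine the outputs."""
    results: List[str] = []
    for index, chunk in enumerate(chunked(items, chunk_size)):
        if index > 0:
            time.sleep(pause_seconds)  # give the rate limiter time to recover
        results.extend(process_chunk(chunk))
    return results


# Demo with a dummy processing step and no pause, so it finishes immediately.
demo = process_in_chunks(
    ["Rayan", "Rayyan", "Sara"],
    lambda chunk: [name.upper() for name in chunk],
    chunk_size=2,
    pause_seconds=0,
)
print(demo)
```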
Although Google Translate generated no false positives (Precision is 1.0), it generated enough false negatives to put it behind most phonetic algorithms in Recall. Since I compared all 250 names with each other, the number of true negatives is much higher than the other cases combined (true positives, false positives, and false negatives). For such imbalanced data, I rely on Balanced Accuracy rather than Accuracy. By comparison, Double Metaphone produced the same Accuracy, a better Balanced Accuracy, and a better F-measure.
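For readers less familiar with these measures, the sketch below shows how they follow from the four pairwise counts. The function name is hypothetical, and the counts in the example are made up for illustration; they are not the results of this evaluation.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def pairwise_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute the standard measures from true/false positive/negative counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)       # true positive rate
    specificity = _ratio(tn, tn + fp)  # true negative rate
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": _ratio(2 * precision * recall, precision + recall),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Made-up counts, for illustration only: many true negatives, no false positives.
example = pairwise_metrics(tp=8, fp=0, fn=2, tn=90)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With many true negatives, Accuracy stays close to 1.0 even when a noticeable share of matching pairs is missed, while Balanced Accuracy averages Recall with the true negative rate and therefore reflects the missed matches.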
Cases where Google Translate Failed to Match Different Transliterations of the Same Arabic Name:
1. Since Google Translate was built to perform translation rather than reverse transliterations, its performance drops when the transliteration is close, in spelling, to a dictionary word, a place name, etc.
Abeer was translated to "البيرة", which means "The Beer". It should be translated to عبير
Google assumed that there is a missing space after the "A" in "Abeer".
Also, Maysoon was translated to "لعله قريبا", which means "may soon". It should be translated to ميسون
Google assumed that there is a missing space after the "y" in "Maysoon".
Similar cases include Someya and Wail.
Another strange case was translating the name Sulaymaa to السليمانية, a city name. The two are not actually that close in spelling.
2. Google mapped English transliterations of some Arabic names to their Farsi/Persian spelling rather than the Arabic one. I assume that Google treats the Farsi (Persian) and Arabic spellings of a name as interchangeable. Not only are Arabic and Farsi different languages, they also belong to two different language groups: Semitic and Iranian. Furthermore, they are not even from the same language family. Arabic is from the Afro-Asiatic language family, while Farsi is from the Indo-European language family.
Google translated Jumana and Joumana to جومانا, a Persian/Iranian name. It is derived from the Arabic name جمانة. This is not to say that no Arabs give their daughters the Persian name جومانا, but that is not how the majority of Arabs spell the name.
The same issue affects the translation of Sarah and Sara. The Arabic name is ساره for both. However, Google translated Sarah to ساره and Sara to سارا, which is an Iranian name. Several celebrities are named سارا; all of them are originally from Iran.
3. Google either spelled some Arabic names incorrectly or came up with a very unpopular spelling. For example, the Arabic name يوسف is a very popular biblical name, equivalent to Joseph in English. The name has many different transliterations: Yusuf, Yosef, Yossef, Youcef, Yousaf, Yousef, Yousif, Yousof, Youssef, Youssif, Youssof, Youssouf, Youssuf, Yousuf, Yusaf, Yusef, Yuseff, Yusof, Yussef, and Yusuph. Google mapped all of them correctly except the last one, which was mapped to يوسوف. I have never seen the name spelled this way, and when I put the name in Google search, it asked me if I meant يوسف!
4. Google unnecessarily differentiated between names that are the same but could be spelled in multiple ways in Arabic. For example, the name نادية can be spelled as ناديا or ناديه but these are the same name. It is possible that one spelling is more correct than the other, but these names are the same and they need to be grouped together. This is the output from Google Translate:
ناديه: Nadia
نادية: Nadiya, Nadya
ناديا: Nadiia
5. Google was not able to map any of the transliteration variations for عَلِيَّة correctly. Its transliterations (Aliya, Aaliyah, Alia, and Aliyah) were mapped to two different names with three spellings in Arabic:
عالية
عاليه
علياء
عاليه
It is possible that Google is too sensitive to the presence of "h" at the end of the name like Sarah and Sara, but I am not sure if this is the problem. Aliyah is a popular name and I feel like it shouldn't have been a problem for Google Translate to map all transliteration variations to it. The name Aliyah could be Hebrew originally since it has a meaning in Judaism.
6. Google added the definite article "The" to some of the Arabic names, but since MAN handles this when it normalizes the names, the issue did not affect the results.
7. I am not sure if I should count the last case "for" or "against" Google Translate, since both could be true depending on the nature of the input and the intended use of the output. In short, Google is aware of some cases where an Arab person might have a non-Arabic name, mostly biblical names. For example, the biblical name Miriam has an Arabic equivalent, مريم; however, a few Arabs spell it as ميريام, which is the Arabic transliteration of Miriam. This is the output that Google Translate generated:
مريم: Maryam, Mariam
ميريام: Miriam, Miryam
This may seem like a good thing. However, it is not the output I want in my case because, on top of ميريام not being a popular Arabic name, I doubt that many Arabic documents contain both names. In my case, I want all variations to be mapped to either مريم or ميريام.
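Two of the cases above (the added article in case 6 and the alternative endings in case 4) lend themselves to simple post-processing of the Arabic string before grouping. The sketch below is one hypothetical way to do it and is not taken from MAN's code; the function names, the length guard, and the choice to treat a final ه or ا as ة are assumptions. The result is meant only as a grouping key, and rules like these may merge names that are genuinely different.

```python
DEFINITE_ARTICLE = "ال"
# Assumption: treat a final heh or alef as a variant of taa marbuta.
FINAL_VARIANTS = {"ه": "ة", "ا": "ة"}


def strip_article(name: str) -> str:
    """Drop a leading definite article if enough letters remain afterwards."""
    if name.startswith(DEFINITE_ARTICLE) and len(name) > len(DEFINITE_ARTICLE) + 1:
        return name[len(DEFINITE_ARTICLE):]
    return name


def unify_ending(name: str) -> str:
    """Map the final letter to a single canonical form, if it is a known variant."""
    if name and name[-1] in FINAL_VARIANTS:
        return name[:-1] + FINAL_VARIANTS[name[-1]]
    return name


def grouping_key(name: str) -> str:
    """Combine both rules into a key used only for grouping, not for display."""
    return unify_ending(strip_article(name))


# The three spellings from case 4 collapse to one key.
for spelling in ["ناديه", "نادية", "ناديا"]:
    print(spelling, "->", grouping_key(spelling))
```

Names that legitimately begin with ال or end in ا would be altered by these rules as well, so any such tweak needs to be checked against the dataset it is applied to.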
Conclusion:
It is possible to tweak the output produced by Google Translate and experiment with it to improve the results on this dataset, but there is no guarantee that such tweaks would not worsen the results on other datasets by introducing false positives. In any case, since I am not a Google employee and I face availability problems with the unofficial Google Translate API, I am afraid that any further attempt to improve this method would yield diminishing returns. I believe that it is more efficient to use Double Metaphone, which is included in the Python library Fuzzy and can be installed on any machine that has Python. It can process datasets of all sizes quickly and easily, without availability problems, and it produced better results.
There might be other Translation APIs that produce better results than Google Translate, but I have not tested any of them.