2020-12-29: Tools and libraries for matching Arabic names written in English

Hatice Cengiz (Jamal Khashoggi's fiancee) is "Khadija" in Saudi24News and "Khadijah" in Alarabiya
Tools and libraries for matching Arabic names written in English

Introduction:

While working on my research, I needed to find a way to scan a set of Arabic and non-Arabic names written in English to find matches. This is especially difficult because you cannot count on the spelling of the name being consistent and distinct when written in a foreign language. Discrepancies between spellings, of the same name, may be due to the lack of name spelling standards, typos, translations, illiteracy, personal preferences, cultural differences, or all of the above. In this post, I discuss different approaches to solving this problem by using string matching and/or phonetic algorithms. The latter set of algorithms enable us to compare two strings based on how they sound, rather than how they are spelled, which is what the former set does. The real-world applications of names matching include Information Retrieval, Entity Recognition and Extraction, Natural Language Processing, Machine Translation, Fraud and Spam Detection, and Database cleansing and Normalization. 

Matching Arabic Names Written in English is Tricky:

In Arabic, and all other languages, it is uncommon for a name to have more than one spelling when written in its original language. However, when a name is written in a language different from its original language, there are multiple ways to spell the same name. For example, there are, at least, 14 different spellings for the name "Muhammed". This is not including hyphenated names, such as Muhammad-Ali. Variations included Muhammad, Mohammed, Mohammad, Muhammed, Mohamed, Mohamad, Muhamad, Muhamed, Mohamud, Mohummad, Mohummed, Mouhamed, Mohammod and Mouhamad. I could also add Moohammad, Moohammed, Moohamed, Moohamad, and so on to the previous 14 variations. While it is easy for most people who speak Arabic and English to tell that these names are variations of one Arabic name, محمّد, it is certainly not easy for a computer program to do so. Also Hatice Cengiz, Jamal Khashoggi's fiancee, is "Khadija" in Saudi24News and "Khadijah" in Alarabiya. She wrote her name in Arabic on her Twitter account as خديجة. My advisor, Dr. Nelson, suggested using Soundex, a phonetic algorithm for indexing names by how they sound when pronounced in English. Granted that any automated method will produce some false positives, but, for the record, there are some false positives that are absolutely undetectable even by people fluent in both languages Arabic and English. For example, the name "Hala" can be هلا or حلا in Arabic. These are two different female names in Arabic and have totally different meanings. The former means "welcome" and the latter means "sweetness or beauty". One reading an English text that has "Hala" twice or more will have no way of telling that these two mentions are for the same female or two different ones without their last name. If the two are sisters or cousins and have the same last name, there is absolutely no way to distinguish between them.

String Matching:

First, I used the fuzzywuzzy string matching library in my Python program. This library uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package. There are multiple functions in this library, but Simple Ratio is what works best for our case because others, like Token Sort Ratio, might match two different names if the names are the same when you do a first and last name swap on one of them. For example, Ahmed Azmi and Azmi Ahmed will have a similarity score of 100 if we use Token Sort Ratio.

I tested the Library on a very small manually-created sample, 50 names. I intentionally tried to challenge this library in most of the names. Based on this sample, a threshold of 75 resulted in 88% accuracy. This is a sample of the output:

score ( Muhammed , Mohamed ) = 80 TP
score ( Muhammad , Mohammed ) = 75 TP
score ( Mohamad , Muhammed ) = 67 FN
score ( Abdel aziz , Abdul aziz ) = 90 TP
score ( Raqia , Rakia ) = 80 TP
score ( Hiba , Heba ) = 75 TP
score ( Ahmed , Ahmad ) = 80 TP
score ( Noor , Nour ) = 75 TP
score ( Hamid , Hameed ) = 73 FN
score ( Hamid , Hamed ) = 80 FP
score ( Fatima , Fatema ) = 83 TP
score ( Hussam , Hosam ) = 73 FN
score ( Husam , Houssam ) = 83 TP
score ( Hussam , Housam ) = 83 TP
score ( Hussam , Houssam ) = 92 TP
score ( Widad , Noor ) = 0 TN
score ( Widad , Wael ) = 44 TN
score ( Widad , Waleed ) = 55 TN
score ( Abdul aziz , Abdul rahman ) = 64 TN
score ( Esam , Husam ) = 67 TN
score ( Wesam , Husam ) = 60 TN
score ( Wesam , Esam ) = 89 FP
score ( Hussam , Hassan ) = 67 TN
score ( Hasan , Hassan ) = 91 FP
score ( Eman , Emad ) = 75 FP
score ( Hussam Hallak , Hussam Masri ) = 64 TN
score ( Muhammed , Muhanned ) = 75 FP

Matching By Sound:

There are multiple phonetic algorithms for matching strings by how they are pronounced. I tested the following algorithms:

Soundex:

Soundex is a phonetic algorithm for indexing names by sound as pronounced in English. Its implementation in Fuzzy is broken, but I found a slightly different implementation, LibIndic Soundex, and used it. LibIndic Soundex module implements Soundex algorithm for Engish as well as a modified version of Soundex algorithm for Indian languages. Shortly, the original Soundex returns a hash value for every string. A string hash value is calculated by using the first letter of the name and using a lookup table to convert the consonants in the rest of the name to digits. Vowels and duplicates are dropped, and the result is padded up by zeros, or truncated down, to four characters (this number varies by implementation). LibIndic Soundex does not pad the hash value with zeros, and the number of characters in the outputted hash value depends on the input. These are some examples:

"Hussam": H25

"Hussam Hallak": H2542

"HussamHallak": H2542

"Hussam Aldeen": H25435

"Hussam Aldeen Hallak": H2543542

"Ahmed": A53

"Sami": S5

LibIndic Soundex has two methods, soundex and compare.

  • Method: modules.Soundex.soundex
    • arg1 : the word
    • Return : The soundex code for the word
  • Method: modules.Soundex.compare
    • arg1 : first word
    • arg2 : second word
    • Return :
      0 if both strings are same
      1 if both strngs sounds alike and from same language
      2 if strings are from different langauges but sounds alike

These are few examples of false positives using LibIndic Soundex:

The names Hasan, Housny, Hussam, Housien, Hassan, Housneya, and Hassouna are different but they return the same Soundex hash value, H25.

Running the Soundex algorithm on these names in both directions, left-to-right and right-to-left, will not solve the problem for Housneya and Hassouna. It will also not solve the problem for Hasan, Housien, and Hassan.

Another implementation of Soundex is found in the Jellyfish library, a python library for doing approximate and phonetic matching of strings. This is the output of the same examples I used previously:

"Hussam": H250

"Hussam Hallak": H254

"HussamHallak": H254

"Hussam Aldeen": H254

"Hussam Aldeen Hallak": H254

"Ahmed": A530

"Sami": S500

The names Hasan, Housny, Hussam, Housien, Hassan, Housneya, and Hassouna are different but they return the same Soundex hash value, H250.

Running the Soundex algorithm on these names in both directions, left-to-right and right-to-left, will not solve the problem for Housneya and Hassouna. It will also not solve the problem for Hasan, Housien, and Hassan.

NYSIIS:

The New York State Identification and Intelligence System, NYSIIS phonetic algorithm features an accuracy increase of 2.7% over the traditional Soundex algorithm. The procedure for producing the hash value using the NYSIIS algorithm is different from that of Soundex, however, it also uses a lookup table. The hash values for the names used in the above sections are:

'Hasan': HASAN

'Housny': HASNY

'Hussam': HASAN

'Housien': HASAN

'Hassan': HASAN

'Housneya': HASNY

'Hassouna': HASAN

As we can see, NYSIIS eliminated some incorrect matches, but Hasan, Hussam, Housien, Hassan, and Hassouna all map to the same hash value, HASAN. Also Housny and Housneya map to the same hash value, HASNY.

Running the NYSIIS algorithm on these names in both directions, left-to-right and right-to-left, will eliminate the incorrect match between Housneya and Housny. It will also eliminate the incorrect match between Hassouna and the rest of the names. The names Hasan, Hussam, Housien, and Hassan will remain among the false positives.

Metaphone:

In 1990, Lawrence Philips published a new phonetic algorithm and called it Metaphone. It improves on the Soundex and NYSIIS algorithms by using information about spelling inconsistencies and looking at combinations of consonants as well as vowels to produce more accurate results. An updated version of Metaphone was developed by the same author in 2000 and was named Double Metaphone. It is called "Double" because it can return both a primary and a secondary code for a string. This algorithm goes even further by adding rules to take into account spelling peculiarities of a number of languages other than English. In addition to having a broader set of encoding rules than other phonetic algorithms, returning two alternate hashes for each name, if a name has two possible pronunciations, gives the ability to find matches with two levels of precision.

I tested the Double Metaphone on the set of names I tested earlier using other algorithms. All names Hasan, Housny, Hussam, Housien, Hassan, Housneya, and Hassouna returned the same set of hashes, [b'HSN', None]. Running the Double Metaphone on these names in both directions, left-to-right and right-to-left, eliminated the incorrect match between Hussam and the rest of the names. It solved other incorrect matches, but Housneya, Housny, and Hassouna were incorrectly matched and Hasan, Housien, and Hassan were also incorrectly matched. This is the output of running the names from right-to-left.

'Hasan': [b'NS', None]

'Housny': [b'ANS', None]

'Hussam': [b'MS', None]

'Housien': [b'NS', None]

'Hassan': [b'NS', None]

'Housneya': [b'ANS', None]

'Hassouna': [b'ANS', None]

The second value is always "None" because the algorithm did not produce a secondary hash value for each of these names.

I assume that the Double Metaphone did not do better than NYSIIS on these names because it did not produce a secondary code for them and the primary code was limited three letters. Changing the number passed as an argument did not have any effect on the output whatsoever.

Note: 

In 2009, Lawrence Philips released a third version, called Metaphone 3, which is said to increase the accuracy to 98% for names and words commonly used in American English. Metaphone 3 is not open source, which is why I decided not to use it. It is commercially sold by Anthropomorphic Software.

Match Rating Approach:

The Match Rating Approach (MRA) is a phonetic algorithm developed by Western Airlines in 1977 for the indexation and comparison of homophonous names. The hash values for the names used in the above sections are:

'Hasan': HSN

'Housny': HSNY

'Hussam': HSM

'Housien': HSN

'Hassan': HSN

'Housneya': HSNY

'Hassouna': HSN

As we can see, Match Rating Approach mapped each of the names Hasan, Housien, Hassan, and Hassouna to the same hash value, HSN. Also Housny and Housneya map to the same hash value, HSNY. The algorithm succeeded at distinguishing Hussam from the rest of the names. This was only possible in other algorithms when the names are run in reverse. I suspect that this is because this algorithm, by design, processes strings in both directions.  

Other Phonetic Algorithms: 

There are several other phonetic algorithms including Beider-Morse Phonetic Matching (BMPM) and Caverphone. I only tested the ones for which I was able to find an implementation that is free, working, and easy to use in the best language ever (Ahem, Python)! 

Evaluation:

I evaluated these algorithms:

Dataset:

I constructed a small dataset, 75 unique Arabic names, from a list of Arabic given names on Wikipedia. The names I included in the dataset were those that had multiple English spellings on their Wikipedia page. The total number of names and their spelling variations is 250. This means that I can examine each algorithm against 31,125 combinations.

Evaluation Methods:

I tested each one of the 31,125 combinations against 6 algorithms in three ways:
  • Using all algorithms to code names in one direction and compare the results.
  • Using all algorithms to code names in both directions and compare the results:
    1. Consider two names a match if their codes in both directions match.
    2. Consider two names a match if their codes in either of the two directions match.
I ended up with 16 output files that I annotated. Each file contains 31,125 combination that was labelled as TP, FP, TN, or FN. I should have had 18 output files, but since Levenshtein Distance produces the same score regardless of direction, it was not tested on both directions (and/or).

Results: 

There are multiple ways to look at the results. However, using Double Metaphone in one direction produced the best results overall. For best precision, using Match Rating Approach generated the best results. I pushed the code, dataset, annotated output, results, and graphs to this github repository.

 


Conclusion:

Matching non-English names written in English is not an easy task because most names can be spelled in many different ways due to the lack of spelling standards, spelling errors, translations, illiteracy, personal choice, and cultural/regional traditions. Phonetic algorithms are powerful in solving this problem. They are easy to implement and use. The nature of the data determines the best algorithm to use. Combining multiple algorithms and creating a "Hybrid" algorithm might be useful if performance is not an issue. There are so many applications for names matching including Information Retrieval, Entity Recognition and Tracking, Natural Language Processing, Machine Translation, Spam Detection, and Database Normalization. Double Metaphone produced the best results when I compared it against other phonetic algorithms on the dataset I constructed using a list of Arabic given names on Wikipedia.
 
 
 
 

Comments