2020-08-30: Google Translate + Stanford NERC produce comparable results to Arabic Linguistic Pipeline (ALP)

Arabic Named Entity Recognition and Classification

Named Entity Recognition and Classification (NERC) is very important for many text processing tasks. However, Arabic NERC research is not popular because only three percent of online content is in Arabic. Furthermore, NERC for Arabic is a challenging task due to Arabic's lack of capitalization, multiple types, different writing styles, complex morphology, ambiguity, and lack of resources. For the same reasons, few Arabic NERC systems are available. In this post, I evaluate the performance of ALP (Arabic Linguistic Pipeline), one of the Arabic NERC tools that provides state-of-the-art precision and recall [1], and compare the results to a new approach that relies on Google Translate and one of the rich and mature English NERC tools, Stanford Named Entity Recognizer and Classifier (NERC). I expected the combination of Google Translate and Stanford NERC to yield lower precision and recall than dedicated Arabic NERC software, but I was pleasantly surprised when the two approaches proved comparable.

Why Named Entity Recognition is Important

Named Entity Recognition and Classification (NERC) is an essential part of Information Extraction from unstructured text. Furthermore, many Natural Language Processing (NLP) tasks depend on NERC. Some of the research domains that can benefit from NERC include Information Retrieval, Questions Answering, Machine Translation, Data Mining, Text Classification, and Machine Learning.

Arabic Named Entity Recognition and Classification Research

While there have been significant contributions and many mature NERC systems for the English language, the progress made towards Arabic Named Entity Recognition and Classification is small and the tools are rare [2]. The demand for Arabic NERC systems is increasing because the growth in the number of Arabic-speaking users on the internet is, by far, the highest among all languages.

Why Arabic Named Entity Recognition and Classification is Challenging

Arabic is a Semitic language with rich morphology and complex syntax. Its characteristics and peculiarities make Arabic text processing extremely difficult [3]. Some of the key reasons that make dealing with Arabic textual information a daunting task include:

The absence of capitalization in Arabic makes obvious named entities very difficult to extract. This is not something that can be resolved by having a dictionary of proper nouns because the same word can be a proper noun or a verb (more on that later). For example: The word "أكرم" can be an Arabic male name in one sentence and a verb in another, meaning "he respected" or "he was generous towards".
There are three types of Arabic [4]:

Classical Arabic (CA), which is used in religious books and ancient Arabic poetry. CA is still used by Muslims in prayers and ritual ceremonies.
Modern Standard Arabic (MSA), which is used in education, government documents, today's newspapers, articles, and modern books. The main difference between CA and MSA is vocabulary [5].
Colloquial Arabic Dialects, which are used by Arabs in their daily informal communication. Dialects vary from one country to another and even from one city to another within the same country. Colloquial Arabic is a spoken language and can only be found in a written format in some social media posts and comments. Colloquial Arabic Dialects do not follow any rules and have no specific syntax. Most NERC research is geared towards handling text written in MSA. It would be very difficult to create an NERC system that can extract named entities from all three types of Arabic language simultaneously.

Complicated morphology (Agglutination): Arabic words can contain multiple prefixes and suffixes forming a very complex morphology. These additions to the root word can be pronouns, conjunctions, prepositions, or a combination of them and some of them can be invisible (understood from the context).
For instance, multiple mentions in one word when a pronoun appears as a suffix:
نحن نصوت لرئيسنا التونسي translates to: We vote for our Tunisian president.
While "Tunisian president" is easy to identify as a named entity for NERC tools processing English text, "لرئيسنا التونسي" is not easy for NERC tools processing Arabic text to identify because of the pronoun "نا" (our) which appears as a suffix to "رئيس" (president) and the preposition "لـ" (for) which appears as a prefix.
Optional Short Vowels: Arabic contains short vowels (marks appear above and/or under the letters). They are not letters, and they do not have to be written. In fact, they are only written in CA. These short vowels can change the meaning of the word and/or entirely change the part of speech for the word, which enables multiple words with multiple different meanings and parts of speech to have the same lexical form and sometimes the same short vowels too. The context is the only way to identify the meaning.
For example: The Arabic word "عقد" can have eight different meanings depending on the short vowels and context. It can have any of the following meanings: A decade (noun), he got married (verb), a contract (noun), he tied up (verb), a necklace (noun), he complicated (verb), knots (noun), he held - a meeting - (verb).
Ambiguity: In addition to the ambiguity challenges faced when extracting and classifying named entities in English, Arabic can have entirely different meanings to the same word or the same combination of words.
For example: الدار البيضاء which translates to "the white house" is not the White House at 1600 Pennsylvania Avenue NW in Washington, D.C. It is actually the Arabic name for the famous city in Morocco, Casablanca. It also means "the house of the white color".
This ambiguity problem is solved for extracting named entities from English text because of capitalization and the classification can be solved from the terms surrounding the entity. George Washington is a person, but is also a location (street name in Suffolk, VA) if followed by "St. Suffolk", as in "George Washington St. Suffolk".
Lack of standard writing style: This problem occurs in almost all words transliterated from other languages to Arabic.
A few examples include:
The word "Google" can be written as "غوغل" or "جوجل".
The word "Instagram" can be written as "انستاجرام", "انستاغرام", "إنستاجرام", or "إنستاغرام".
The word "Suffolk" can be written as "سافوك", "سافولك", "سوفولك", or "سوفوك".
The name "Laura" can be written as "لاورا" or "لورا".
Systematic spelling variations/mistakes: Writing styles differ from one region to another. Arabic linguistic scholars are in disagreement on how some Arabic words should be written.
A few examples include:
The word "responsible" is written as "مسؤول" by Arabs living in the Levant and the Gulf countries, but is written as "مسئول" by Arabs living in Egypt.
The word "Dubai" is written as "دبى" by Egyptians and as "دبي" by the rest of Arabs.
Lack of Resources: The number of available resources necessary for Arabic NERC research is very limited [6] and expensive to use [7]. Researchers often develop their own corpora (tagged documents) and gazetteers (predefined lists of typed NEs) which require the time-consuming human annotation and verification.
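To make the "optional short vowels" challenge above concrete, here is a minimal Python sketch. It is only an illustration and is not part of the evaluation in this post. The vocalized spellings used for "contract" and "necklace" are my own assumption about how those two readings of "عقد" are usually vowelled, and the helper name strip_short_vowels is hypothetical.

```python
# Arabic short-vowel marks (harakat) occupy U+064B through U+0652.
HARAKAT = {chr(code) for code in range(0x064B, 0x0653)}

FATHA, KASRA, SUKUN = "\u064E", "\u0650", "\u0652"


def strip_short_vowels(word: str) -> str:
    """Drop the optional short-vowel marks, as most MSA text does."""
    return "".join(ch for ch in word if ch not in HARAKAT)


# Assumed vocalizations of two of the readings listed above.
contract = "ع" + FATHA + "ق" + SUKUN + "د"
necklace = "ع" + KASRA + "ق" + SUKUN + "د"

# With the marks written, the two words are different strings.
print(contract != necklace)

# Without them, both collapse to the same surface form, so only the
# surrounding context can tell the readings apart.
print(strip_short_vowels(contract) == strip_short_vowels(necklace))
```

Both print statements report True: the written marks distinguish the words, and removing them leaves a single ambiguous form.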

Arabic NERC Systems

A Survey of Arabic Named Entity Recognition and Classification [8] that highlights most Arabic NERC challenges and the approaches used by researchers to overcome these challenges was published in 2014. A few Arabic NERC systems and NLP tools were evaluated (Rule-based, Machine Learning, and Hybrid). However, ALP (Arabic Linguistic Pipeline) is an entirely new system and was introduced in 2018. Unlike other systems, it solves the NLP tasks of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. The tool is available to use for free on the web. I tested the ALP tool on ten manually tagged Arabic news stories and articles from Aljazeera and Alarabiya. The results and links were uploaded to GitHub so that the evaluation in this post can be reproduced.

Evaluating ALP

Data Collection and Annotation

I downloaded ten Arabic documents from various categories from the Arabic news websites: Aljazeera and Alarabiya.
As a native speaker, I manually extracted and labeled all named entities for person, location, and organization from each document.

ALP NERC

I used ALP to extract all named entities for person, location, and organization from each document.

Google Translate + Stanford NERC

I used Google Translate to obtain the English translation for each Arabic document.
I used Stanford Named Entity Recognizer and Classifier (NERC) to extract all named entities for person, location, and organization from the translated text for each document.
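The two steps above can be sketched as a small pipeline. This is a hedged outline, not the exact procedure I ran: the function names group_entities and translate_then_tag are hypothetical, and the translation and tagging steps are passed in as plain callables because the sketch does not assume any particular client library for Google Translate or Stanford NERC. The only thing it relies on is that the tagger returns one (token, tag) pair per token, with "O" marking tokens that are not part of a named entity.

```python
from typing import Callable, List, Tuple

TaggedToken = Tuple[str, str]  # (token, tag); "O" means "not an entity"


def group_entities(tagged: List[TaggedToken]) -> List[Tuple[str, str]]:
    """Merge runs of adjacent tokens that share a non-"O" tag."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_tag = "O"
    for token, tag in tagged:
        if tag != current_tag:
            # The tag changed, so close the entity that was open, if any.
            if current_tag != "O":
                entities.append((" ".join(current_tokens), current_tag))
            current_tokens, current_tag = [], tag
        if tag != "O":
            current_tokens.append(token)
    if current_tag != "O":
        entities.append((" ".join(current_tokens), current_tag))
    return entities


def translate_then_tag(
    arabic_text: str,
    translate: Callable[[str], str],
    tag: Callable[[str], List[TaggedToken]],
) -> List[Tuple[str, str]]:
    """Translate Arabic text to English, tag it, and return (entity, class) pairs."""
    return group_entities(tag(translate(arabic_text)))
```

One limitation of grouping by tag alone: two different entities of the same class that appear directly next to each other are merged into one.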

Comparison

I calculated the precision, recall, and other evaluation metrics for each document as well as the average of all metrics across all ten documents. I performed this quick evaluation on ten Arabic documents I downloaded from news websites. The results from the two approaches were comparable. Some of the observed cases where one or both approaches failed to extract or classify named entities are analyzed and briefly discussed to spark possible improvements. Although this comparison is not performed on a large dataset, it is meant to serve as a proof of concept. I plan on performing this comparison on large manually-tagged datasets in the future.

I chose ten news stories and articles from Aljazeera and Alarabiya and tested ALP's named entity recognizer on them one by one. I extracted the named entities from the text using ALP and labeled them as True Positive (TP) or False Positive (FP). After that, I manually extracted all named entities from the text to find False Negatives (FN). The rest of the words are all considered True Negatives (TN). Named entities identified by ALP were considered True Positive even if the system failed to correctly classify them. Three classes were used in this test (Person, Location, and Organization). The ten documents vary in topic and length. The topics covered politics, general news, sports, science, culture, and celebrity news. The lengths of the articles range between 149 and 624 words. The average precision, recall, F-measure, True Negative Rate, Accuracy, and Balanced Accuracy were as follows:

Precision = 0.907

Recall = 0.863

True Negative Rate = 0.994

F-Measure = 0.876

Accuracy = 0.985

Balanced Accuracy = 0.93

ALP (Full credit given for mistakenly classified NEs)
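For readers who want to recompute these measures from the TP, FP, FN, and TN counts of a document, here is a minimal sketch using the standard definitions. The function names ner_metrics and macro_average are hypothetical, and the counts in the usage example are made up for illustration; they are not taken from my ten documents.

```python
from typing import Dict, List


def ner_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard evaluation measures for one document's counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    true_negative_rate = tn / (tn + fp) if (tn + fp) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    balanced_accuracy = (recall + true_negative_rate) / 2
    return {
        "precision": precision,
        "recall": recall,
        "true_negative_rate": true_negative_rate,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


def macro_average(per_document: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each measure across documents, giving every document equal weight."""
    return {
        name: sum(doc[name] for doc in per_document) / len(per_document)
        for name in per_document[0]
    }


# Illustrative counts only.
docs = [ner_metrics(tp=9, fp=1, fn=3, tn=187), ner_metrics(tp=20, fp=2, fn=2, tn=300)]
print(macro_average(docs))
```

Because the averages are taken per measure across documents, the average F-measure does not have to equal the harmonic mean of the average precision and average recall. For the ALP figures above, that harmonic mean would be about 0.884, while the reported average F-measure is 0.876.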

The same measures were recalculated but this time I only gave partial credit for named entities extracted but not correctly classified. The drop in the results was not significant:

Precision = 0.842

Recall = 0.851

True Negative Rate = 0.991

F-Measure = 0.835

Accuracy = 0.981

Balanced Accuracy = 0.923

ALP (Partial credit given for mistakenly classified NEs)

ALP Full vs Partial credit for mistakenly classified NEs

A New Approach

While the results I found using the ALP tool are outstanding, it goes without saying that Arabic NERC tools are not as mature as English NERC systems. This is because of the challenges I mentioned earlier pertaining to Arabic and because English is a dominant language in all sorts of communications, web content, services, news, science, entertainment, etc. For that reason, I decided to run the same Arabic documents through Google Translate to get the English translation of the text, then use Stanford Named Entity Recognizer and Classifier (NERC) to extract and classify named entities, and compare the results to those I found using the ALP tool. I extracted the named entities and labeled them as True Positive (TP) or False Positive (FP). After that, I manually extracted all named entities from the translated text to find False Negatives (FN). The rest of the words are all considered True Negatives (TN). Named entities identified by Stanford Named Entity Recognizer and Classifier (NERC) were considered True Positive even if the system failed to correctly identify their class. Three classes were used in this test (Person, Location, and Organization). It is worth mentioning that the translation, in a few cases, was missing, wrong, or inaccurate, which lowers the scores. However, since I am comparing one process to another rather than one tool to another, I kept the translation "as is" and did not manually fix it to get better results from the Stanford NERC system.

Surprisingly, the results were comparable to, and for some of the measures better than, those I got from using ALP:

Precision = 0.977

Recall = 0.816

True Negative Rate = 1

F-Measure = 0.881

Accuracy = 0.989

Balanced Accuracy = 0.91

Google Translate & Stanford NER (Full credit given for mistakenly classified NEs)

The same measures were recalculated but this time I only gave partial credit for named entities extracted but not correctly classified. The drop in the results was not significant:

Precision = 0.925

Recall = 0.805

True Negative Rate = 0.996

F-Measure = 0.849

Accuracy = 0.985

Balanced Accuracy = 0.903

Google Translate & Stanford NER (Partial credit given for mistakenly classified NEs)

GTS Full vs Partial credit for mistakenly classified NEs

These results would have increased with a better translation system.

For example:
This sentence "قررت نيابة قسم ثان الشيخ زايد، الخميس، سجن وزيري 4 أيام على ذمة التحقيق"
was translated as:
"On Thursday, the Prosecution of the second department, Sheikh Zayed, decided that a ministerial prison would be held for 4 days, pending investigation".
However, the correct translation is:
"On Thursday, the Prosecution of the second precinct, Sheikh Zayed, decided to keep Waziri in custody for 4 days, pending investigation".
This translation error resulted in Waziri not being identified as a named entity (person) and in Sheikh Zayed being mistakenly classified as a person, although in this context it represents a location.
This happened because the word "وزيري" - in this case a last name - was mistakenly translated as "ministerial" by Google Translate.
The full name "محمد وزيري" was correctly translated in two other mentions in the same text and they were correctly extracted and classified as (person) in both occurrences.
This problem could easily be solved if the translator used entity tracking (another difficult NLP task for Arabic text) [8].

Precision and Recall

After I plotted the averages of all metrics I calculated for ALP and the new approach, Google Translate + Stanford Named Entity Recognizer and Classifier (GTS), I found that the new approach (GTS) resulted in a higher average precision than ALP; the opposite is true for the average recall.

GTS vs ALP Full credit for mistakenly classified NEs

GTS vs ALP Partial credit for mistakenly classified NEs

Observed Shortcomings

Both Stanford NERC and ALP failed to classify named entities used indirectly to describe a different entity. For example, "Iran" was classified as a location while it indirectly described "The Iranian Government" (an organization in the context). Also "Paris" was classified as a location, but in the context it meant "the French government".

Both Stanford NERC and ALP either misclassified some locations named after persons or failed to identify them as named entities at all. For example: ALP failed to identify "مربع الشهداء" (Martyrs Square) as a named entity (location). Also, the Culture Palace in Algeria (named after the Algerian activist Moufdi Zakariaa) was not identified by Stanford NERC as a named entity (location).

I also found that ALP was able to extract and correctly classify some named entities in one place of the text, but it failed to extract the same named entities in another place in the same document. For example: "جمعية صيادي الأسماك" (Fishermen's Association) was correctly extracted and classified as an organization in two places, but ALP failed to identify it as a named entity in a third place in the same text.

Finally, ALP had some difficulties extracting persons' named entities if one of the names (first or last name) was missing. For example: "حمودة" (person's last name) was not identified as a named entity (person) in eight places. It was mistakenly classified as a location in one place. But "شعبان حمودة" (full name) was correctly classified as a person. Stanford NERC did not have this problem.
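The last two shortcomings suggest a simple document-level post-processing step: once an entity has been recognized somewhere in a document, label its other untagged occurrences too. The sketch below is a hypothetical illustration of that idea; neither ALP nor Stanford NERC is claimed to work this way, and I did not evaluate it. The function name propagate_mentions, the span format, and the rule that any single part of a recognized person name refers to that person are all assumptions.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label) over token indices


def propagate_mentions(tokens: List[str], spans: List[Span]) -> List[Span]:
    """Label untagged repeats of entities already found in the same document."""
    known: Dict[Tuple[str, ...], str] = {}
    for start, end, label in spans:
        surface = tuple(tokens[start:end])
        known.setdefault(surface, label)
        if label == "PERSON":
            # Assumption: a first or last name on its own refers to the
            # same person as the full name seen elsewhere.
            for token in surface:
                known.setdefault((token,), label)

    covered = {i for start, end, _ in spans for i in range(start, end)}
    added: List[Span] = []
    # Try longer surface forms first so full names win over their parts.
    for surface in sorted(known, key=len, reverse=True):
        size = len(surface)
        for i in range(len(tokens) - size + 1):
            window = range(i, i + size)
            if tuple(tokens[i:i + size]) == surface and covered.isdisjoint(window):
                added.append((i, i + size, known[surface]))
                covered.update(window)
    return sorted(spans + added)


tokens = ["Shaaban", "Hamouda", "said", "that", "Hamouda", "left"]
print(propagate_mentions(tokens, [(0, 2, "PERSON")]))
```

In this toy example the standalone "Hamouda" at token 4 is picked up as a person. The sketch only fills in missed mentions; it does not correct a mention that was already tagged with the wrong class, and exact string matching would still miss Arabic mentions that carry prefixes or suffixes.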

Conclusion

Information Extraction from unstructured text relies on NERC. The research around Arabic NERC has not been sufficient because Arabic is a very complex language and accounts for only three percent of the content on the internet. The lack of capitalization, multiple types (forms), complicated morphology, and different writing styles, combined with the lack of resources, make Arabic text very difficult to process. The few Arabic NERC systems that are available are not mature yet. The performance of ALP (Arabic Linguistic Pipeline), one of the Arabic NERC tools available on the web, was similar to that of using Google Translate to translate the Arabic text to English and then extracting the named entities using Stanford NERC. Improvements are possible, as some of the NEs that the ALP tool failed to extract and classify are the same NEs that it extracted in other places in the text.

References

[1] A. A. Freihat, G. Bella, H. Mubarak and F. Giunchiglia, "A single-model approach for Arabic segmentation POS tagging and named entity recognition", Proc. 2nd Int. Conf. Natural Lang. Speech Process., pp. 1-8, Apr. 2018.
[2] Wael Etaiwi, Arafat Awajan, Dima Suleiman, "Statistical Arabic name entity recognition approaches: A survey", Procedia Computer Science, 113 (2017), pp. 57-64

[3] A. Farghaly and K. Shaalan, "Arabic natural language processing: Challenges and solutions", ACM Transactions on Asian Language Information Processing (TALIP) 8, no. 4, 2009:14.

[4] Korayem M., Crandall D., Abdul-Mageed M. (2012) Subjectivity and Sentiment Analysis of Arabic: A Survey. In: Hassanien A.E., Salem AB.M., Ramadan R., Kim T. (eds) Advanced Machine Learning Technologies and Applications. AMLTA 2012. Communications in Computer and Information Science, vol 322. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35326-0_14
[5] Farber, Benjamin, Dayne Freitag, Nizar Habash, and Owen Rambow. 2008. Improving NER in Arabic using a morphological tagger. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pages 2,509–2,514, Marrakech
[6] Abouenour L., Bouzoubaa K. and Rosso P. (2010). On the Extension of Arabic Wordnet Named Entities and Its Impact on Question / Answering. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2010) ISBN 978-989-8425-29-4, pages 424-429. DOI: 10.5220/0003102004240429

[7] Bies, Ann, Denise DiPersio, and Mohamed Maamouri. 2012. Linguistic resources for Arabic machine translation: The Linguistic Data Consortium (LDC) catalog. In Abdelhadi Soudi, Ali Farghaly, Günter Neumann, and Rabih Zbib, editors, Challenges for Arabic Machine Translation, volume 322 of Natural Language Processing 9. John Benjamins Publishing Company, Amsterdam, pages 15–22.

[8] Shaalan, K.: A survey of arabic named entity recognition and classification. Comput. Linguist. 40(2), 469–510 (2014)

-- Hussam Hallak


Web Science and Digital Libraries Research Group