2020-08-30: Google Translate + Stanford NERC produce comparable results to Arabic Linguistic Pipeline (ALP)
Arabic Named Entity Recognition and Classification
Named Entity Recognition and Classification (NERC) is important for many text processing tasks. However, Arabic NERC is not a popular research area because only three percent of online content is in Arabic. Furthermore, NERC for Arabic is challenging due to Arabic's lack of capitalization, multiple language types, differing writing styles, complex morphology, ambiguity, and lack of resources. For the same reasons, few Arabic NERC systems are available. In this post, I evaluate the performance of ALP (Arabic Linguistic Pipeline), one of the Arabic NERC tools that provides state-of-the-art precision and recall [1], and compare its results to a new approach that relies on Google Translate and one of the rich and mature English NERC tools, Stanford Named Entity Recognizer and Classifier (NERC). I expected the combination of Google Translate and Stanford NERC to yield lower precision and recall than dedicated Arabic NERC software, but I was pleasantly surprised when the two approaches proved comparable.
Why Named Entity Recognition is Important
Named Entity Recognition and Classification (NERC) is an essential part of Information Extraction from unstructured text. Furthermore, many Natural Language Processing (NLP) tasks depend on NERC. Some of the research domains that can benefit from NERC include Information Retrieval, Question Answering, Machine Translation, Data Mining, Text Classification, and Machine Learning.
Arabic Named Entity Recognition and Classification Research
Why Arabic Named Entity Recognition and Classification is Challenging
- The absence of capitalization in Arabic makes obvious named entities very difficult to extract. This is not something that can be resolved by having a dictionary of proper nouns because the same word can be a proper noun or a verb (more on that later). For example: The word "أكرم" can be an Arabic name for a male in one sentence, and a verb in another, meaning (he-respected or he was generous towards).
- There are three types of Arabic [4]:
- Classical Arabic (CA), which is used in religious books and ancient Arabic poetry. CA is still used by Muslims in prayers and ritual ceremonies.
- Modern Standard Arabic (MSA), which is used in education, government documents, today's newspapers, articles, and modern books. The main difference between CA and MSA is vocabulary [5].
- Colloquial Arabic Dialects, which are used by Arabs in their daily informal communication. Dialects vary from one country to another and even from one city to another within the same country. Colloquial Arabic is a spoken language and can only be found in a written format in some social media posts and comments. Colloquial Arabic Dialects do not follow any rules and have no specific syntax. Most NERC research is geared towards handling text written in MSA. It would be very difficult to create an NERC system that can extract named entities from all three types of Arabic language simultaneously.
- Complicated morphology (Agglutination): Arabic words can contain multiple prefixes and suffixes forming a very complex morphology. These additions to the root word can be pronouns, conjunctions, prepositions, or a combination of them and some of them can be invisible (understood from the context).
For instance, multiple mentions in one word when a pronoun appears as a suffix:
نحن نصوت لرئيسنا التونسي translates to: We vote for our Tunisian president.
While "Tunisian president" is easy to identify as a named entity for NERC tools processing English text, "لرئيسنا التونسي" is not easy for NERC tools processing Arabic text to identify because of the pronoun "نا" (our), which appears as a suffix to "رئيس" (president), and the preposition "لـ" (for), which appears as a prefix.
- Optional Short Vowels: Arabic contains short vowels (marks that appear above and/or under the letters). They are not letters, and they do not have to be written. In fact, they are only written in CA. These short vowels can change the meaning of a word and/or entirely change its part of speech, so multiple words with different meanings and parts of speech can share the same lexical form, and sometimes the same short vowels too. The context is the only way to identify the meaning.
For example: The Arabic word "عقد" can have eight different meanings depending on the short vowels and context. It can have any of the following meanings: a decade (noun), he got married (verb), a contract (noun), he tied up (verb), a necklace (noun), he complicated (verb), knots (noun), he held (a meeting) (verb).
- Ambiguity: In addition to the ambiguity challenges faced when extracting and classifying named entities in English, Arabic can have entirely different meanings for the same word or the same combination of words.
For example: الدار البيضاء, which translates to "the white house", is not the White House at 1600 Pennsylvania Avenue NW in Washington, D.C. It is actually the Arabic name for the famous city in Morocco, Casablanca. It also means "the house of the white color".
In English text, capitalization largely resolves this ambiguity when extracting named entities, and the classification can be resolved from the terms surrounding the entity. George Washington is a person, but is also a location (a street name in Suffolk, VA) if followed by "St. Suffolk", as in "George Washington St. Suffolk".
- Lack of standard writing style: This problem occurs in almost all words transliterated from other languages to Arabic.
A few examples include:
The word "Google" can be written as "غوغل" or "جوجل".
The word "Instagram" can be written as "انستاجرام", "انستاغرام", "إنستاجرام", or "إنستاغرام".
The word "Suffolk" can be written as "سافوك", "سافولك", "سوفولك", or "سوفوك".
The name "Laura" can be written as "لاورا" or "لورا".
- Systematic spelling variations/mistakes: Writing styles differ from one region to another. Arabic linguistic scholars disagree on how some Arabic words should be written.
A few examples include:
The word "responsible" is written as "مسؤول" by Arabs living in the Levant and the Gulf countries, but is written as "مسئول" by Arabs living in Egypt.
The word "Dubai" is written as "دبى" by Egyptians and as "دبي" by the rest of Arabs.
- Lack of Resources: The number of available resources necessary for Arabic NERC research is very limited [6] and expensive to use [7]. Researchers often develop their own corpora (tagged documents) and gazetteers (predefined lists of typed NEs), which require time-consuming human annotation and verification.
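To make the orthographic challenges in the list above concrete, here is a minimal Python sketch of a lossy normalization step. It is not part of ALP or of the evaluation in this post. The folding table and the `normalize` function are assumptions chosen only to cover the examples given above. It strips optional short vowels and folds the regional spelling variants of "responsible" and "Dubai" into one form.

```python
import re

# Arabic short-vowel marks (tanween, fatha, damma, kasra, shadda, sukun),
# Unicode range U+064B..U+0652. These are optional in most written Arabic.
SHORT_VOWELS = re.compile("[\u064B-\u0652]")

# Illustrative (assumed) folding table covering only the examples in the post.
FOLD = str.maketrans({
    "\u0649": "\u064A",  # alef maqsura (ى) -> yeh (ي), e.g. دبى vs دبي
    "\u0624": "\u0621",  # hamza on waw (ؤ) -> bare hamza (ء)
    "\u0626": "\u0621",  # hamza on yeh (ئ) -> bare hamza (ء)
})


def normalize(word: str) -> str:
    """Strip short vowels, then fold a few regional spelling variants."""
    return SHORT_VOWELS.sub("", word).translate(FOLD)


if __name__ == "__main__":
    # The two regional spellings of "responsible" collapse to one form.
    print(normalize("مسؤول") == normalize("مسئول"))
    # The two spellings of "Dubai" collapse to one form.
    print(normalize("دبى") == normalize("دبي"))
```

A folding step like this only helps match spelling variants. It does not unify transliteration variants such as "غوغل" and "جوجل", and it does nothing to disambiguate a form like "عقد", whose meaning still depends on context.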
Arabic NERC Systems
Evaluating ALP
Data Collection and Annotation
- I downloaded ten Arabic documents from various categories from the Arabic news websites: Aljazeera and Alarabiya.
- As a native speaker, I manually extracted and labeled all named entities for person, location, and organization from each document.
ALP NERC
I used ALP to extract all named entities for person, location, and organization from each document.
Google Translate + Stanford NERC
- I used Google Translate to obtain the English translation for each Arabic document.
- I used Stanford Named Entity Recognizer and Classifier (NERC) to extract all named entities for person, location, and organization from the translated text for each document.
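As a rough illustration of the second step, the sketch below collects typed entities from tagged English text. It assumes the Stanford tagger was run with its `inlineXML` output format and a three-class (PERSON, LOCATION, ORGANIZATION) model. The post does not state how the tagger was invoked, so that setup is an assumption. The function name `parse_inline_xml` is hypothetical, and the sample string is hand-written for illustration, not actual tool output.

```python
import re

# Matches inline tags such as <PERSON>Sheikh Zayed</PERSON>.
# Only the three entity types evaluated in this post are considered.
ENTITY_PATTERN = re.compile(
    r"<(PERSON|LOCATION|ORGANIZATION)>(.+?)</\1>", re.DOTALL
)


def parse_inline_xml(tagged_text: str) -> list[tuple[str, str]]:
    """Return (entity text, entity type) pairs in order of appearance."""
    return [
        (match.group(2).strip(), match.group(1))
        for match in ENTITY_PATTERN.finditer(tagged_text)
    ]


if __name__ == "__main__":
    # Hand-written sample (an assumption), not real tagger output.
    sample = (
        "The delegation met <PERSON>Mohamed Waziri</PERSON> in "
        "<LOCATION>Cairo</LOCATION> on Thursday."
    )
    for text, label in parse_inline_xml(sample):
        print(label, text)
```

Keeping the output as (text, type) pairs makes it straightforward to compare against a manually labeled list for the same document.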
Comparison
I calculated the precision, recall, and other evaluation metrics for each document, as well as the average of each metric across all ten documents. I performed this quick evaluation on the ten Arabic documents I downloaded from news websites. The results from the two approaches were comparable. Some of the observed cases where one or both approaches failed to extract or classify named entities are analyzed and briefly discussed to spark possible improvements. Although this comparison was not performed on a large dataset, it is meant to serve as a proof of concept. I plan on performing this comparison on large manually tagged datasets in the future.
A New Approach
This sentence "قررت نيابة قسم ثان الشيخ زايد، الخميس، سجن وزيري 4 أيام على ذمة التحقيق"
was translated as:
"On Thursday, the Prosecution of the second department, Sheikh Zayed, decided that a ministerial prison would be held for 4 days, pending investigation".
However, the correct translation is:
"On Thursday, the Prosecution of the second precinct, Sheikh Zayed, decided to keep Waziri in custody for 4 days, pending investigation".
This translation error resulted in Waziri not being identified as a named entity (person) and in Sheikh Zayed being mistakenly classified as a person, although in this context it represents a location.
This happened because the word "وزيري" - in this case a last name - was mistakenly translated as "ministerial" by Google Translate.
The full name "محمد وزيري" was correctly translated in two other mentions in the same text and they were correctly extracted and classified as (person) in both occurrences.
This problem could easily be solved if the translator used entity tracking (another difficult NLP task for Arabic text) [8].
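The idea can be illustrated with a toy Python sketch that is far simpler than real entity tracking. Once a full name such as "محمد وزيري" is known to be a person in a document, any standalone occurrence of its last token in the Arabic source can be flagged as a likely mention of the same person, so that it is not treated as a common word. The function name and the assumption that known full names are available beforehand are hypothetical. Google Translate does not expose such a hook as far as this post is concerned.

```python
import re


def standalone_surname_mentions(arabic_text, known_full_names):
    """Find last-name tokens that appear without the rest of the name.

    Returns a list of (token index, surname, full name) tuples.
    Tokenization is a plain split on word characters, which ignores
    Arabic clitics, so this is only a rough approximation.
    """
    tokens = re.findall(r"\w+", arabic_text)
    hits = []
    for full_name in known_full_names:
        parts = full_name.split()
        if len(parts) < 2:
            continue  # a single-token name has no separate surname
        surname, preceding = parts[-1], parts[-2]
        for i, token in enumerate(tokens):
            # Flag the surname only when the preceding name part is absent.
            if token == surname and (i == 0 or tokens[i - 1] != preceding):
                hits.append((i, surname, full_name))
    return hits


if __name__ == "__main__":
    sentence = "قررت نيابة قسم ثان الشيخ زايد، الخميس، سجن وزيري 4 أيام على ذمة التحقيق"
    for index, surname, full_name in standalone_surname_mentions(
        sentence, ["محمد وزيري"]
    ):
        print(index, surname, full_name)
```

Exact token matching like this would miss a surname carrying a prefix or suffix, which is one reason entity tracking for Arabic remains difficult.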
Precision and Recall
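For readers who want to reproduce this kind of comparison, here is a minimal sketch of how per-document precision and recall, and their average across documents, could be computed. It assumes exact matching on (entity text, entity type) pairs after both systems' outputs have been mapped to a common form, and it adds F1 as one possible "other metric". Neither detail is specified in this post. The sample entities are illustrative values built around the Waziri example, not the measured results.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float


def score_document(gold, predicted) -> Scores:
    """Score one document using exact (text, type) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


def macro_average(scores) -> Scores:
    """Average each metric over documents (every document weighs the same)."""
    n = len(scores)
    return Scores(
        sum(s.precision for s in scores) / n,
        sum(s.recall for s in scores) / n,
        sum(s.f1 for s in scores) / n,
    )


if __name__ == "__main__":
    # Illustrative labels only, not the data behind this post.
    gold = {
        ("Mohamed Waziri", "PERSON"),
        ("Waziri", "PERSON"),
        ("Sheikh Zayed", "LOCATION"),
    }
    predicted = {("Mohamed Waziri", "PERSON"), ("Sheikh Zayed", "PERSON")}
    print(score_document(gold, predicted))
```

Under exact matching, an entity with the right span but the wrong type (such as "Sheikh Zayed" tagged as a person) counts against both precision and recall.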
Observed Shortcomings
Both Stanford NERC and ALP failed to correctly classify named entities used indirectly to refer to a different entity. For example, "Iran" was classified as a location although it indirectly referred to the Iranian government (an organization in that context). Similarly, "Paris" was classified as a location, but in context it meant the French government.
Both Stanford NERC and ALP either misclassified some locations named after persons or failed to identify them as named entities at all. For example, ALP failed to identify "مربع الشهداء" (Martyrs Square) as a named entity (location). Also, the Culture Palace in Algeria (named after the Algerian activist Moufdi Zakariaa) was not identified by Stanford NERC as a named entity (location).
Conclusion
References
[2] Wael Etaiwi, Arafat Awajan, Dima Suleiman. "Statistical Arabic name entity recognition approaches: A survey", Procedia Computer Science, 113 (2017), pp. 57-64
[5] Farber, Benjamin, Dayne Freitag, Nizar Habash, and Owen Rambow. 2008. Improving NER in Arabic using a morphological tagger. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pages 2,509–2,514, Marrakech
[6] Abouenour L., Bouzoubaa K. and Rosso P. (2010). On the Extension of Arabic Wordnet Named Entities and Its Impact on Question / Answering. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2010) ISBN 978-989-8425-29-4, pages 424-429. DOI: 10.5220/0003102004240429