Tuesday, March 22, 2016

2016-03-22: Language Detection: Where to start?

Language detection is not a simple task, and no method achieves 100% accuracy. Many packages for detecting languages are available online, and I have used several methods and tools to detect the language of websites and of plain text. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP language header and the HTML language tag, and I review several language detection packages: Guess-Language, Python-Language Detector, LangID, and the Google Language Detection API. Since Python is my favorite programming language, I searched for tools written in Python.

I found that a primary way to detect the language of a webpage is to use the HTTP language header and the HTML language tag. However, many pages include neither, and sometimes the detected language is affected by the browser setting. Guess-Language and Python-Language Detector are fast, but they are more accurate when given more text. You also have to strip the HTML tags before passing the text to these tools. LangID detects the language and gives you a confidence score; it is fast, works well with short texts, and is easy to install and use. The Google Language Detection API is also a powerful tool that can be downloaded for different programming languages, and it also returns a confidence score, but you need to sign up, and if you need more than the free quota of 5000 requests a day (1 MB/day), you must choose a paid plan.

HTTP Language Header:
If you want to detect the language of a website, a primary method is to look at the HTTP response header Content-Language. This header tells you which languages are present on the requested page. Its value is a two- or three-letter language code (such as ‘fr’ for French), sometimes followed by a country code (such as ‘fr-CA’ for French as spoken in Canada).

For example:

curl -I --silent http://bagergade-bogb.dk/ |grep -i "Content-Language"

Content-Language: da-DK,da-DK

In this example the webpage's language is Danish (Denmark).

In some cases a site offers content in multiple languages, but the Content-Language header specifies only one of them.

For example:

curl -I  --silent http://www.hotelrenania.it/ |grep -i "Content-Language"

Content-Language: it

In this example, the webpage viewed in a browser offers three languages (Italian, English, and Dutch), yet it states only Italian as its Content-Language.

Note that the Content-Language does not always match the language displayed in your browser, because the displayed language depends on the browser's language preference, which you can change.

For example:

curl -I --silent https://www.debian.org/ |grep -i "Content-Language"

Content-Language: en

This webpage offers its content in more than 37 different languages. I had my browser's language preference set to Arabic, yet the Content-Language returned was English.

In addition, in most cases the Content-Language header is not included at all. In a random sample of 10,000 English websites from DMOZ, I found that only 5.09% had the Content-Language header.

For example:

curl -I --silent http://www.odu.edu |grep -i "Content-Language"


In this example we see that the Content-Language header was not found.
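
The curl checks above can also be sketched in Python. This is only a minimal sketch using the standard library; the function names parse_content_language and fetch_content_language are my own for illustration, and the header is treated as a simple comma-separated list of language tags.

```python
from urllib.request import Request, urlopen


def parse_content_language(value):
    """Split a Content-Language header value into unique language tags."""
    if not value:
        return []  # header missing, as in the odu.edu example
    tags = []
    for part in value.split(","):
        tag = part.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def fetch_content_language(url, timeout=10):
    """Send a HEAD request (like curl -I) and return the parsed tags."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return parse_content_language(response.headers.get("Content-Language"))


# Parsing the header values seen in the examples above (no network needed).
print(parse_content_language("da-DK,da-DK"))  # ['da-DK']
print(parse_content_language(None))           # []
```

An empty list means the header was absent, which in my sample was the most common case.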

HTML Language:
Another indication of the language of a web page is the HTML language tag, that is, the lang attribute of the html element (such as <html lang='en'>….</html>). Using this method requires you to save the HTML code first and then search it for the language code.

For example:

curl --silent http://ksu.edu.sa/ > ksu.txt

grep "<html lang=" ksu.txt

<html lang="ar"" dir="rtl" class="no-js" >

However, in a random sample of 10,000 English websites from the DMOZ directory, I found that only 48.6% had the HTML language tag.
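
The grep step above matches only one exact spelling of the tag. As a rough alternative, here is a sketch that reads the lang attribute with Python's standard html.parser module; the class HtmlLangParser and the function html_lang are hypothetical names for this illustration, and the sample markup is a simplified version of the ksu.edu.sa output.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Remember the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def html_lang(markup):
    """Return the declared page language, or None if it is not declared."""
    parser = HtmlLangParser()
    parser.feed(markup)
    return parser.lang


page = '<html lang="ar" dir="rtl" class="no-js"><body></body></html>'
print(html_lang(page))                              # ar
print(html_lang("<html><body>Hi</body></html>"))    # None
```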

Guess-Language:
One tool to detect language in Python is Guess-Language. It attempts to determine the natural language of Unicode text, and it detects over 60 languages. However, two important points are that 1) this tool works better with more text, and 2) you should not include HTML tags in the text, or the result will be flawed. So if you want to check the language of a webpage, I recommend that you strip the tags using the Beautiful Soup package and then pass the remaining text to the tool.

For example:

curl --silent http://abelian.org/tssp/|grep "title"|sed -e 's/<[^>]*>//g'

Tesla Secondary Simulation Project

python
from guess_language import guessLanguage
guessLanguage("Tesla Secondary Simulation Project")
'fr'
guessLanguage("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

'en'

This example detects the language of the title of a randomly selected English webpage from DMOZ, http://abelian.org/tssp/. Given only the title, the Guess-Language package detects the language as French, which is wrong. However, when we pass more text, the result is English. To determine the language of short text you need to install PyEnchant and additional dictionaries. By default it supports only three languages: English, French, and Esperanto; you need to download any other language dictionary you may need.
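
Since HTML tags distort the result, the text has to be extracted first. I recommend Beautiful Soup for that, but as a dependency-free stand-in here is a sketch of the same idea with the standard html.parser module. The names TextExtractor and visible_text are made up for this example, and the sample markup is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup):
    """Return the page text with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Tesla Secondary Simulation Project</title>"
    "<style>p{}</style></head><body><p>Home Page</p></body></html>"
)
print(visible_text(sample))  # Tesla Secondary Simulation Project Home Page
```

The returned string is what you would then pass to guessLanguage or to any of the other tools below.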

Python-Language Detector (languageIdentifier):
Jeffrey Graves built a very lightweight tool in C++, based on language hashes and wrapped in Python, called Python-Language Detector. It is simple and effective, and it detects 58 languages.

For example:

python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("Tesla Secondary Simulation Project",300,300)
'fr'
languageIdentifier.identify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.",300,300)

'en'

Here, too, the length of the text affects the result. With the short text we falsely got French as the language; when we added more text from the webpage, the correct answer appeared.

In another example, we check the title of a Korean webpage that was selected randomly from the DMOZ Korean webpage directory.

For example:

curl --silent http://bada.ebn.co.kr/ | grep "title"|sed -e 's/<[^>]*>//g'

EBN 물류&조선 뉴스

python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("EBN 물류&조선 뉴스",300,300)

'ko'

Here the correct answer, Korean, was returned even though the title contains some English letters.

LangID:
The other tool is LangID. This tool can detect 97 different languages. As output it gives a confidence score for its prediction; the scores are re-normalized to produce a value in the 0-1 range. This tool is one of my favorite language detection tools because it is fast, handles short texts, and gives you a confidence score.

For example:

python
import langid
langid.classify("Tesla Secondary Simulation Project")

('en', 0.9916567142572572)

python
import langid
langid.classify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

('en', 1.0)

Using the same text as above, this tool identified the short text correctly with a confidence score of 0.99. When the full text was provided, the confidence score was 1.0.


For example:

python
import langid
langid.classify("السلام عليكم ورحمة الله وبركاته")

('ar', 0.9999999797315073)

Testing another language with an Arabic phrase, it returned Arabic with a confidence score above 0.99.
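
One way to put the confidence score to use is to accept a detection only when the score passes a threshold. The sketch below is only an illustration: confident_language and the 0.9 threshold are my own choices, and it assumes a classifier that returns a (language, score) pair with the score in the 0-1 range, as in the LangID output above. A stand-in classifier is used so the sketch runs without LangID installed.

```python
def confident_language(text, classify, min_confidence=0.9):
    """Return the detected language code, or None if the score is too low.

    `classify` is any callable returning a (language, score) pair with the
    score in the 0-1 range, for example langid.classify as shown above.
    """
    language, score = classify(text)
    return language if score >= min_confidence else None


# Stand-in classifier with made-up scores (an assumption for this sketch);
# with LangID installed you would pass langid.classify instead.
def fake_classify(text):
    return ("en", 0.99) if "Tesla" in text else ("fr", 0.42)


print(confident_language("Tesla Secondary Simulation Project", fake_classify))  # en
print(confident_language("???", fake_classify))                                 # None
```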

Google Language Detection API:
The Google Language Detection API detects 160 different languages. I have tried this tool and I think it is one of the strongest tools I found. Client libraries can be downloaded for different programming languages: Ruby, Java, Python, PHP, Crystal, and C#. To use this tool you have to sign up for an account and obtain an API key. A language test results in three outputs: isReliable (true, false), confidence (rate), and language (language code). The tool's website mentions that the confidence rate is not a range and can be higher than 100, but gives no further explanation of how this score is calculated. The API allows 5000 free requests a day (1 MB/day). If you need more than that, there are different paid plans you can sign up for. You can also detect the language of a text in an online demo. I recommend this tool if you have a small data set, but it takes time to set up and to figure out how it runs.

For example:

curl --silent http://moheet.com/ | grep "title" | sed -e 's/<[^>]*>//g' > moheet.txt

python
import detectlanguage
detectlanguage.configuration.api_key="Your key"
file1=open("moheet.txt","r")
detectlanguage.detect(file1.read().strip())
[{'isReliable': True, 'confidence': 7.73, 'language': 'ar'}]

In this example, I extract text from an Arabic webpage from the DMOZ Arabic directory. The tool detected its language as Arabic, with True reliability and a confidence of 7.73. Note that you have to remove the newline from the text; otherwise the API treats it as a batch detection and gives you a result for each line.
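
The newline cleanup and the reading of the result list can be sketched as two small helpers. Both single_line and best_reliable are hypothetical names of my own, not part of the detectlanguage package, and the sample result simply reuses the output shown above.

```python
def single_line(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return " ".join(text.split())


def best_reliable(results):
    """Return the highest-confidence result flagged reliable, or None."""
    reliable = [r for r in results if r.get("isReliable")]
    if not reliable:
        return None
    return max(reliable, key=lambda r: r["confidence"])


# Result list in the shape shown in the example above.
sample_results = [{"isReliable": True, "confidence": 7.73, "language": "ar"}]

print(single_line("first line\nsecond line\n"))    # first line second line
print(best_reliable(sample_results)["language"])   # ar
```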

In Conclusion:
So before you start looking for the right tool, you have to determine a few things first:
  • Are you trying to detect the language of a webpage or some text?
  • What is the length of the text? Usually more text is better and gives more accurate results (check this article on the effect of short texts on language detection: http://lab.hypotheses.org/1083)
  • What language do you want to detect (if it is known or expected)? Certain tools only detect certain languages.
  • What programming language do you want to use?

Here is a short summary of the language detection methods I reviewed, with a short description of each:

Method | Advantage | Disadvantage
HTTP language header and HTML language tag | can state the language | not always found and sometimes affected by browser setting
Guess-Language | fast, easy to use | works better on longer text
Python-Language Detector | fast, easy to use | works better on longer text
LangID | fast, gives you a confidence score, works on both long and short text | -
Google Language Detection API | gives you a confidence score, works on both long and short text | requires creating an account and setting up


--Lulwah M. Alkwai
