2016-03-22: Language Detection: Where to start?
Language detection is not a simple task, and no method is 100% accurate. Many packages are available online for detecting languages, and I have used several methods and tools to detect the language of websites and of plain text. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP language header and the HTML language tag, and I review several language detection packages: Guess-Language, Python-Language Detector, LangID and Google Language Detection API. Since Python is my favorite programming language, I searched for tools written in Python.
I found that a primary way to detect the language of a webpage is to use the HTTP language header and the HTML language tag. However, only a small percentage of pages include the language tag and sometimes the detected language is affected by the browser setting.
Guess-Language and Python-Language Detector are fast, but they are more accurate with more text. Also, you have to strip the HTML tags before passing the text to these tools.
LangID detects the language and gives you a confidence score. It is fast, works well with short texts, and is easy to install and use.
Google Language Detection API is also a powerful tool that can be downloaded for different programming languages. It also gives a confidence score, but you need to sign up, and if the dataset you need to process is large (more than 5000 requests a day, or 1 MB/day), you must choose a paid plan.
HTTP Language Header:
If you want to detect the language of a website, a primary method is to look at the HTTP response header Content-Language. The Content-Language header tells you which languages are present on the requested page. The value is a two- or three-letter language code (such as ‘fr’ for French), sometimes followed by a country code (such as ‘fr-CA’ for French spoken in Canada).
For example:
curl -I --silent http://bagergade-bogb.dk/ |grep -i "Content-Language"
Content-Language: da-DK,da-DK
Some sites offer content in multiple languages, yet the Content-Language header specifies only one of them.
For example:
curl -I --silent http://www.hotelrenania.it/ |grep -i "Content-Language"
Content-Language: it
For example:
curl -I --silent https://www.debian.org/ |grep -i "Content-Language"
Content-Language: en
In addition, in most cases the Content-Language header is not included at all. From a random sample of 10,000 English websites in DMOZ, I found that only 5.09% have the Content-Language header.
For example, the following request prints nothing because the header is absent:
curl -I --silent http://www.odu.edu |grep -i "Content-Language"
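Once you have the header value, you still need to split it into usable language codes. The following is a minimal sketch of how that could be done in Python. The function names parse_content_language and primary_subtags are my own for illustration, and the sketch assumes the value is a plain comma-separated list of tags like the ones shown above.

```python
def parse_content_language(value):
    """Split a Content-Language header value into unique language tags."""
    tags = []
    for part in value.split(","):
        tag = part.strip()
        # Skip empty pieces and case-insensitive duplicates such as "da-DK,da-DK".
        if tag and tag.lower() not in (t.lower() for t in tags):
            tags.append(tag)
    return tags


def primary_subtags(tags):
    """Return the lowercase language code of each tag, without the country part."""
    codes = []
    for tag in tags:
        code = tag.split("-")[0].lower()
        if code not in codes:
            codes.append(code)
    return codes


tags = parse_content_language("da-DK,da-DK")
print(tags)                   # ['da-DK']
print(primary_subtags(tags))  # ['da']
```

An empty result means the header was missing or blank, which, as noted above, is the common case.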
HTML Language:
Another indication of the language of a web page is the lang attribute of the HTML tag (such as <html lang='en'>….</html>). Using this method requires you to save the HTML code first and then search it for the language code.
For example:
curl --silent http://ksu.edu.sa/ > ksu.txt
grep "<html lang=" ksu.txt
<html lang="ar"" dir="rtl" class="no-js" >
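The same check can be done in Python without grep. This is only a sketch using the standard library's html.parser module. The class name HtmlLangParser and the function html_lang are hypothetical names I chose, and the sample page is a made-up string rather than the real page fetched above.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Record the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def html_lang(markup):
    """Return the lang attribute of the <html> tag, or None if it is absent."""
    parser = HtmlLangParser()
    parser.feed(markup)
    return parser.lang


# Illustrative page, not real site content.
page = '<!DOCTYPE html><html lang="ar" dir="rtl"><head><title>x</title></head></html>'
print(html_lang(page))  # ar
```

A return value of None covers the frequent case where the page does not declare a language at all.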
Guess-Language:
One tool to detect language in Python is Guess-Language. It detects the nature of Unicode text and supports over 60 languages. Two important notes: 1) this tool works better with more text, and 2) do not include HTML tags in the text or the result will be flawed. So if you want to check the language of a webpage, I recommend that you filter out the tags using the Beautiful Soup package and then pass the remaining text to the tool.
For example:
curl --silent http://abelian.org/tssp/|grep "title"|sed -e 's/<[^>]*>//g'
Tesla Secondary Simulation Project
python
from guess_language import guessLanguage
guessLanguage("Tesla Secondary Simulation Project")
'fr'
guessLanguage("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")
'en'
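Since the tags have to be removed before the text reaches the detector, here is a rough sketch of that step. I recommended Beautiful Soup above, but to keep this example free of extra installs it uses the standard library's html.parser as a stand-in. TextExtractor and strip_tags are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_tags(markup):
    """Return the text of an HTML document with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


sample = ("<html><head><title>Tesla Secondary Simulation Project</title>"
          "<style>p{color:red}</style></head>"
          "<body><p>Home <b>Page</b></p></body></html>")
print(strip_tags(sample))  # Tesla Secondary Simulation Project Home Page
```

The returned string is what you would pass to guessLanguage, ideally after collecting as much page text as possible, since the short title alone was misdetected above.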
Python-Language Detector (languageIdentifier):
Jeffrey Graves built a very lightweight tool in C++, based on language hashes and wrapped in Python, called Python-Language Detector. It is simple and effective, and it detects 58 languages.
For example:
python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("Tesla Secondary Simulation Project",300,300)
'fr'
languageIdentifier.identify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.",300,300)
'en'
Here is another example, where we check the title of a Korean webpage selected randomly from the DMOZ Korean webpage directory:
curl --silent http://bada.ebn.co.kr/ | grep "title"|sed -e 's/<[^>]*>//g'
EBN 물류&조선뉴스
python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("EBN 물류&조선뉴스",300,300)
'ko'
LangID:
The other tool is LangID. This tool can detect 97 different languages. As output it gives a confidence score for its prediction; the scores are re-normalized to fall in the 0-1 range. This is one of my favorite language detection tools because it is fast, handles short texts and gives you a confidence score.
For example:
python
import langid
langid.classify("Tesla Secondary Simulation Project")
('en', 0.9916567142572572)
python
import langid
langid.classify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")
('en', 1.0)
For example:
python
import langid
langid.classify("السلام عليكم ورحمة الله وبركاته")
('ar', 0.9999999797315073)
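To give a feel for what re-normalizing scores into the 0-1 range means, here is a small sketch of one common way to do it, a softmax over log scores. This is a general illustration and not LangID's actual code. I am assuming the raw scores behave like log-probabilities, the function name normalize_log_scores is hypothetical, and the numbers are made up.

```python
import math


def normalize_log_scores(log_scores):
    """Map {language: log score} to values in the 0-1 range that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid numeric underflow.
    exps = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Made-up raw scores, for illustration only.
scores = {"en": -12.0, "fr": -16.5, "de": -19.0}
probs = normalize_log_scores(scores)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

A normalized score close to 1, like the ones in the examples above, means the top language clearly outscored every other candidate.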
Google Language Detection API:
The Google Language Detection API detects 160 different languages. I have tried this tool and I think it is one of the strongest tools I found. It is available for different programming languages: Ruby, Java, Python, PHP, Crystal and C#. To use it you have to sign up for an account and obtain an API key. A language test returns three outputs: isReliable (true or false), confidence (a score) and language (a language code). The tool's website mentions that the confidence score is not bounded to a range and can be higher than 100, but it gives no further explanation of how the score is calculated. The API allows 5000 free requests a day (1 MB/day). If you need more than that, there are different paid plans you can sign up for. You can also detect the language of a text in an online demo. I recommend this tool if you have a small data set, but it takes time to set up and to figure out how it runs.
For example:
curl --silent http://moheet.com/ | grep "title"| sed -e 's/<[^>]*>//g' > moheet.txt
python
import detectlanguage
detectlanguage.configuration.api_key="Your key"
text=open("moheet.txt","r").read()
detectlanguage.detect(text)
[{'isReliable': True, 'confidence': 7.73, 'language': 'ar'}]
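The result is a list of dictionaries, so you usually want to pull one language code out of it. Here is a minimal sketch that works on a response shaped like the output above. The helper name best_reliable_language and its min_confidence parameter are my own and are not part of the API client.

```python
def best_reliable_language(results, min_confidence=0.0):
    """Return the language code of the highest-confidence reliable result, or None."""
    reliable = [
        r for r in results
        if r.get("isReliable") and r.get("confidence", 0.0) >= min_confidence
    ]
    if not reliable:
        return None
    return max(reliable, key=lambda r: r["confidence"])["language"]


# Response copied from the example output above.
response = [{"isReliable": True, "confidence": 7.73, "language": "ar"}]
print(best_reliable_language(response))  # ar
```

Because the confidence score is not on a fixed scale, any min_confidence value you choose would have to be tuned on your own data.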
In Conclusion:
So before you start looking for the right tool you have to determine a couple of things first:
- Are you trying to detect the language of a webpage or some text?
- What is the length of the text? Usually more text is better and gives more accurate results (check this article on the effect of short texts on language detection: http://lab.hypotheses.org/1083)
- What is the language you want to determine (if it is known or expected)? Certain tools only detect certain languages.
- What programming language do you want to use?
Here is a short summary of the language detection methods I reviewed, with a brief note on each:
Method | Advantage | Disadvantage |
HTTP language header and HTML language tag | can state language | not always found and sometimes affected by browser setting |
Guess-Language | fast, easy to use | works better on longer text |
Python-Language Detector | fast, easy to use | works better on longer text |
LangID | fast, gives you confidence score, works on both long and short text | none noted in this review |
Google Language Detection API | gives you confidence score, works on both long and short text | requires creating an account and setting up |
--Lulwah M. Alkwai