Tuesday, March 22, 2016

2016-03-22: Language Detection: Where to start?

Language detection is not a simple task, and no method achieves 100% accuracy. Many packages are available online for detecting different languages. I have used several methods and tools to detect the language of websites and of short texts. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP Content-Language header and the HTML language attribute. In addition, I review several language detection packages: Guess-Language, Python-Language Detector, LangID, and the Google Language Detection API. Since Python is my favorite coding language, I focused on tools written in Python.

I found that a primary way to detect the language of a webpage is to use the HTTP Content-Language header and the HTML language attribute. However, only a small percentage of pages include them, and sometimes the reported language is affected by the browser settings. Guess-Language and Python-Language Detector are fast, but they are more accurate with longer text, and you have to strip the HTML tags before passing the text to them. LangID detects the language and gives you a confidence score; it is fast, works well with short texts, and is easy to install and use. The Google Language Detection API is also a powerful tool with client libraries for several programming languages and a confidence score, but you need to sign up, and if your dataset is large (more than 5,000 requests or 1 MB per day), you must choose a paid plan.

HTTP Language Header:
If you want to detect the language of a website, a primary method is to look at the HTTP response header Content-Language. The Content-Language header states which languages are present on the requested page. Its value is a two- or three-letter language code (such as ‘fr’ for French), sometimes followed by a country code (such as ‘fr-CA’ for French as spoken in Canada).

For example:

curl -I --silent http://bagergade-bogb.dk/ |grep -i "Content-Language"

Content-Language: da-DK,da-DK

In this example the webpage's language is Danish (Denmark).
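The same check can be scripted in Python. Below is a minimal sketch using the requests library (an assumption on my part; any HTTP client would do), with the URL from the curl example above:

import requests

# Issue a HEAD request, mirroring `curl -I`, and follow any redirects.
response = requests.head("http://bagergade-bogb.dk/", allow_redirects=True)

# The header is absent on most sites, so fall back to a default message.
print(response.headers.get("Content-Language", "no Content-Language header"))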

In some cases a site offers content in multiple languages, and the Content-Language header specifies only one of them.

For example:

curl -I  --silent http://www.hotelrenania.it/ |grep -i "Content-Language"

Content-Language: it

In this example, when viewed in the browser the webpage offers three languages: Italian, English, and Dutch. Yet it states only Italian as its Content-Language.

Note that the Content-Language does not always match the language displayed in your browser, because the displayed language depends on the browser's language preference, which you can change.

For example:

curl -I --silent https://www.debian.org/ |grep -i "Content-Language"

Content-Language: en

This webpage offers its content in more than 37 different languages. Here I had my browser's language preference set to Arabic, yet the Content-Language found was English.

In addition, in most cases the Content-Language header is not included at all. From a random sample of 10,000 English websites in DMOZ, I found that only 5.09% have the Content-Language header.

For example:

curl -I --silent http://www.odu.edu |grep -i "Content-Language"


In this example we see that the Content-Language header was not found.

HTML Language:
Another indication of the language of a web page is the HTML language attribute (such as <html lang='en'>…</html>). Using this method requires you to save the HTML code first and then search for the lang attribute.

For example:

curl --silent http://ksu.edu.sa/ > ksu.txt

grep "<html lang=" ksu.txt

<html lang="ar"" dir="rtl" class="no-js" >

However, from a random sample of 10,000 English websites in the DMOZ directory, I found that only 48.6% have the HTML language attribute.

Guess-Language:
One tool to detect language in Python is Guess-Language. It detects the language of Unicode text and covers over 60 languages. Two important notes: 1) this tool works better with more text, and 2) do not include HTML tags in the text or the result will be skewed. So if you want to check the language of a webpage, I recommend filtering out the tags with the Beautiful Soup package and then passing the text to the tool.

For example:

curl --silent http://abelian.org/tssp/|grep "title"|sed -e 's/<[^>]*>//g'

Tesla Secondary Simulation Project

python
from guess_language import guessLanguage
guessLanguage("Tesla Secondary Simulation Project")
'fr'
guessLanguage("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

'en'

This example detects the title language of a randomly selected English webpage from DMOZ, http://abelian.org/tssp/. Guess-Language detects the title's language as French, which is wrong. However, when we extract more text, the result is English. To determine the language of short texts you need to install PyEnchant and additional dictionaries; by default only three languages are supported: English, French, and Esperanto. You need to download a dictionary for any additional language you may need.
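A minimal sketch of the workflow recommended above: strip the markup with Beautiful Soup first, then hand the visible text to Guess-Language (both packages are assumed to be installed; the URL is the example page above):

import requests
from bs4 import BeautifulSoup
from guess_language import guessLanguage

html = requests.get("http://abelian.org/tssp/").text

# Remove all tags and keep only the visible text.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ", strip=True)

# More text generally gives a more reliable guess.
print(guessLanguage(text))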

Python-Language Detector (languageIdentifier):
Jeffrey Graves built a very lightweight tool in C++, based on language hashes and wrapped in Python, called Python-Language Detector. It is simple and effective, and it detects 58 languages.

For example:

python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("Tesla Secondary Simulation Project",300,300)
'fr'
languageIdentifier.identify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.",300,300)

'en'

Here, we also notice that the length of the text affects the result. When the text was short we falsely got French as the language; when we added more text from the webpage, the correct answer appeared.

Another example checks the title of a Korean webpage, selected randomly from the DMOZ Korean webpage directory.

For example:

curl --silent http://bada.ebn.co.kr/ | grep "title"|sed -e 's/<[^>]*>//g'

EBN 물류&조선 뉴스

python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("EBN 물류&조선 뉴스",300,300)

'ko'

Here the correct answer, Korean, showed up, even though some English letters were in the title.

LangID:
The other tool is LangID. This tool can detect 97 different languages. As output it states a confidence score for the prediction; the scores are re-normalized to the 0-1 range. This is one of my favorite language detection tools because it is fast, handles short texts, and gives you a confidence score.

For example:

python
import langid
langid.classify("Tesla Secondary Simulation Project")

('en', 0.9916567142572572)

python
import langid
langid.classify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

('en', 1.0)

Using the same text as above, this tool identified the short text correctly with a confidence score of 0.99, and when the full text was provided the confidence score was 1.0.


For example:

python
import langid
langid.classify("السلام عليكم ورحمة الله وبركاته")

('ar', 0.9999999797315073)

Testing another language, an Arabic phrase, it returned Arabic with a confidence score of more than 0.99.
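If you already know which languages are plausible, langid.py can also be constrained to a subset and asked for the full ranking. A small sketch, assuming the langid package is installed and that your version exposes the module-level set_languages() and rank() helpers:

import langid

# Constrain the classifier to the languages we expect to see.
langid.set_languages(['en', 'ar', 'fr'])

print(langid.classify("Tesla Secondary Simulation Project"))

# rank() returns every candidate language with its score, best first.
for lang, score in langid.rank("Tesla Secondary Simulation Project")[:3]:
    print(lang, score)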

Google Language Detection API:
The Google Language Detection API detects 160 different languages. I have tried this tool and I think it is one of the strongest tools available. Client libraries are available for different programming languages: Ruby, Java, Python, PHP, Crystal, and C#. To use this tool you have to obtain an API key after creating an account and signing up. A language test returns three outputs: isReliable (true/false), confidence (a rate), and language (the language code). The tool's website mentions that the confidence rate is not bounded to a range and can be higher than 100, with no further explanation of how the score is calculated. The API allows 5,000 free requests a day (1 MB/day); if you need more, there are paid plans you can sign up for. You can also detect text language in an online demo. I recommend this tool if you have a small dataset, but it needs time to set up and to figure out how it runs.

For example:

curl --silent http://moheet.com/ | grep "title"| sed -e 's/<[^>]*>//g' > moheet.txt

python
file1=open("moheet.txt","r")
import detectlanguage
detectlanguage.configuration.api_key="Your key"
detectlanguage.detect(file1)
[{'isReliable': True, 'confidence': 7.73, 'language': 'ar'}]

In this example, I extract the title text from an Arabic webpage listed in the DMOZ Arabic directory. The tool detected its language as Arabic with True reliability and a confidence of 7.73. Note that you have to remove the newlines from the text, otherwise the tool treats it as a batch detection and gives you a result for each line.
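A small sketch of that note: read the file and collapse the newlines yourself before calling detect, so the client sees one piece of text instead of a batch (the API key below is a placeholder):

import detectlanguage

detectlanguage.configuration.api_key = "YOUR_API_KEY"  # placeholder key

# Join the lines into a single string; a list of lines would be
# treated as a batch request with one result per line.
with open("moheet.txt") as f:
    text = " ".join(f.read().splitlines())

print(detectlanguage.detect(text))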

In Conclusion:
So before you start looking for the right tool you have to determine a couple of things first:
  • Are you trying to detect the language of a webpage or of some text?
  • What is the length of the text? Usually more text gives more accurate results (check this article on the effect of short texts on language detection: http://lab.hypotheses.org/1083).
  • Which language do you expect to find (if it is known or expected)? Certain tools only support certain languages.
  • What programming language do you want to use?

Here is a short summary of the language detection methods I reviewed, with a brief description of each:

Method | Advantage | Disadvantage
HTTP Content-Language header and HTML language attribute | can state the language directly | not always present; sometimes affected by browser settings
Guess-Language | fast, easy to use | works better on longer text
Python-Language Detector | fast, easy to use | works better on longer text
LangID | fast, gives a confidence score, works on both long and short text | none noted
Google Language Detection API | gives a confidence score, works on both long and short text | requires creating an account and setting up


--Lulwah M. Alkwai

Thursday, March 10, 2016

2016-03-07: Archives Unleashed Web Archive Hackathon Trip Report (#hackarchives)

The Thomas Fisher Rare Book Library (University of Toronto)
Between March 3 and March 5, 2016, librarians, archivists, historians, computer scientists, and others came together for the Archives Unleashed Web Archive Hackathon at the University of Toronto Robarts Library, Toronto, Ontario, Canada. This event gave researchers the opportunity to collaboratively develop open-source tools for web archives. The event was organized by Ian Milligan (assistant professor of Canadian and digital history in the Department of History at the University of Waterloo), Nathalie Casemajor (assistant professor in communication studies in the Department of Social Sciences at the University of Québec in Outaouais (Canada)), Jimmy Lin (the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo), Matthew Weber (assistant professor in the School of Communication and Information at Rutgers University), and Nicholas Worby (the Government Information & Statistics Librarian at the University of Toronto's Robarts Library).

Additionally, the event was made possible by the support of the Social Sciences and Humanities Research Council of Canada, the National Science Foundation, the University of Waterloo, the University of Toronto, Rutgers University, the University of Québec in Outaouais, the Internet Archive, Library and Archives Canada, and Compute Canada. Sawood Alam, Mat Kelly, and I joined researchers from Europe and North America to exchange ideas in an effort to unleash our web archives. The event was split across three days.

DAY 1, THURSDAY MARCH 3, 2016

Ian Milligan kicked off the presentations with the agenda. Following this, he presented his current research effort -

HistoryCrawling with Warcbase (Ian Milligan, Jimmy Lin)

The presenters introduced Warcbase as a platform for exploring the past. Warcbase is an open-source tool for managing web archives, built on Hadoop and HBase. It was introduced through two case studies and datasets: Canadian Political Parties and Political Interest Groups (2005 - 2015), and GeoCities.



Put Hacks to Work: Archives in Research (Matthew Weber)

Following Ian Milligan's presentation, Matthew Weber emphasized some important ideas to guide the development of tools for web archives, such as considering the audience.




Archive Research Services Workshop (Jefferson Bailey, Vinay Goel)

Following Matthew Weber's presentation, Jefferson Bailey and Vinay Goel presented a comprehensive introduction workshop for researchers, developers, and general users. The workshop addressed data mining and computational tools and methods for working with web archives.




Embedded Metadata as Mobile Micro Archives (Nathalie Casemajor)

Following Jefferson Bailey and Vinay Goel's presentation, Nathalie Casemajor presented her research effort for tracking the evolution of images shared on the web. She talked about how embedded metadata in images helped track dissemination of images shared on the web.





Revitalization of the Web Archiving Program at LAC (Tom Smyth)

Following Nathalie Casemajor's presentation, Tom Smyth of the Library and Archives Canada presented their archiving activities such as the domain crawls of Federal sites, curation of thematic research collections, and preservation archiving of resources at risk. He also talked about their recent collections such as Federal Election 2015, First World War Commemoration, and the Truth and Reconciliation collections.

After the first five short presentations, Jimmy Lin gave a technical tutorial on Warcbase. Afterwards, Helge Holzmann presented ArchiveSpark: a framework built to make accessing web archives easier for researchers, enabling easy data extraction and derivation.


After a short break, there were five more presentations targeting Web Archiving and Textual Analysis Tools:

WordFish (Federico Nanni)

Federico Nanni presented Wordfish: an R program used to extract political positions from text documents. Wordfish is a scaling technique that does not need any anchoring documents to perform the analysis; it relies instead on a statistical model of word frequencies.


MemGator (Sawood Alam)

Following Federico Nanni's presentation, Sawood Alam presented a tool he developed called MemGator: a Memento aggregator CLI and server written in Go. Memento is a framework that adds the time dimension to the web; a timestamped copy of a resource is called a Memento, and a list/collection of such Mementos is called a TimeMap. MemGator can generate the TimeMap of a given URI or provide the Memento closest to a given time.



Topic Words in Context (Jonathan Armoza)

Following Sawood Alam's presentation, Jonathan Armoza presented a tool he developed, TWIC (Topic Words in Context), by demonstrating LDA topic modeling of Emily Dickinson's poetry. TWIC provides a hierarchical visualization of LDA topic models generated by the MALLET topic modeler.

Following Jonathan Armoza's presentation, Nick Ruest presented Twarc: a Python command line tool and library for archiving Tweet JSON data. Twarc runs in three modes: search, filter stream, and hydrate.

Following Nick Ruest's presentation, I presented Carbon Date: a tool originally developed by Hany SalahEldeen, which I currently maintain. Carbon Date estimates the creation date of a website by polling multiple sources for datetime evidence, and it returns a JSON response containing the estimated creation date.

After the five short presentations about web archiving and textual analysis tools, all participants engaged in a brainstorming session in which ideas were discussed, and clusters of researchers with common interests were iteratively developed. The brainstorming session led to the formation of seven groups, namely:
  1. I know words and images
  2. Searching, mining, everything
  3. Interplanetary WayBack
  4. Surveillance of First Nations
  5. Nuage
  6. Graph‐X‐Graphics
  7. Tracking Discourse in Social Media



Following the brainstorming and group formation activity, all participants were received at the Bedford Academy for a reception that went on through the late evening. 


DAY 2, FRIDAY MARCH 4, 2016



The second day of the Archives Unleashed Web Archive Hackathon began with breakfast, after which the groups formed on Day 1 met for about three hours to begin working on the ideas discussed the previous day. At noon, lunch was provided as more presentations took place:
Evan Light began the series of presentations by talking about the Snowden Archive-in-a-Box, a box he created that features a stand-alone wifi network and web server allowing researchers to work with the files leaked by Edward Snowden (and subsequently published by the media). The box, which serves as a portable archive, protects users from mass surveillance.

Mediacat (Alejandro Paz and Kim Pham)

Following Evan Light's presentation, Alejandro Paz and Kim Pham presented Mediacat: an open-source web crawler and archive application suite that enables ethnographic research into how digital news is disseminated and used across the web.

Data Mining the Canadian Media Public Sphere (Sylvain Rocheleau)

Following Alejandro Paz and Kim Pham's presentation, Sylvain Rocheleau talked about his research efforts to provide near real-time data mining of the Canadian news media. His research involves a mass crawl of about 700 Canadian news websites at 15-minute intervals, followed by data mining processes that include named entity recognition.

Tweet Analysis with Warcbase (Jimmy Lin)

Following Sylvain Rocheleau's presentation, Jimmy Lin gave another tutorial in which he showed how to extract information from tweets using the Warcbase platform.

A five-hour hackathon session followed, briefly suspended for a visit to the Thomas Fisher Rare Book Library.
After the visit, the hackathon session continued until the evening, after which all participants went to dinner at the University of Toronto Faculty Club.

DAY 3, SATURDAY MARCH 5, 2016



The third and final day of the Archives Unleashed Web Archive Hackathon began in a similar fashion to the second: first breakfast, then a three-hour hackathon session, then presentations over lunch:

Malach Collection (Petra Galuscakova)

Petra Galuscakova started the series of presentations by talking about the Czech Malach Cross-lingual Speech Retrieval Test Collection: a multimedia collection of testimonies of survivors and other witnesses of the Holocaust.

Waku (Kyle Parry)
Digital Arts and Humanities Initiatives at UH Mānoa (or how to do interesting things with few resources) (Richard Rath)

After the presentations, the hackathon session continued until 4:30 pm EST; thereafter, the group presentations began:

PRESENTATIONS

I know words and images (Kyle Parry, Niel Chah, Emily Maemura, and Kim Pham)

Inspired by John Oliver's #MakeDonaldDrumpfAgain, this team sought to research memes by processing words and images. They investigated what people say, how they use and modify the text and images of others, and how computers read text and classify images, etc.

Searching, mining, everything (Jaspreet Singh, Helge Holzmann, and Vinay Goel)

Interplanetary WayBack (Sawood Alam and Mat Kelly)

"Who will archive the archives?"

To answer this question, Sawood Alam and Mat Kelly presented an archiving and replay system called Interplanetary Wayback (ipwb). In a nutshell, during the indexing process ipwb consumes WARC files one record at a time, splits each record into headers and payload, pushes the two pieces into the IPFS (a peer-to-peer file system) network for persistent storage, and stores the references (digests) in a file format called CDXJ along with some other lookup keys and metadata. For replay, it finds the records in the index file and builds the response by assembling the headers and payload retrieved from the IPFS network and performing the necessary rewrites. The major benefits of this system include deduplication, redundancy, and shared open access.
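The indexing flow just described lends itself to a short sketch. This is not ipwb's actual code, only an illustration of the idea under stated assumptions: the warcio package to iterate WARC records, the ipfshttpclient package with a local IPFS daemon to store the pieces, and a made-up WARC filename.

# Illustrative sketch only -- not ipwb's implementation.
import json

import ipfshttpclient
from warcio.archiveiterator import ArchiveIterator

client = ipfshttpclient.connect()  # assumes a local IPFS daemon is running

with open("example.warc.gz", "rb") as stream:  # hypothetical WARC file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        timestamp = record.rec_headers.get_header("WARC-Date")

        # Serialize the HTTP headers and read the payload separately.
        headers = "\r\n".join(
            "{}: {}".format(k, v) for k, v in record.http_headers.headers
        ).encode()
        payload = record.content_stream().read()

        # add_bytes() returns the digest under which IPFS stored the data;
        # these references are what a CDXJ-style index line would keep.
        entry = {
            "header_digest": client.add_bytes(headers),
            "payload_digest": client.add_bytes(payload),
        }
        print(uri, timestamp, json.dumps(entry))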

Surveillance of First Nations (Evan Light, Katherine Cook, Todd Suomela, and Richard Rath)

Nuage (Petra Galuscakova, Neha Gupta, Rosa Iris R. Rovira, Nathalie Casemajor, Sylvain Rocheleau, Ryan Deschamps, and Ruqin Ren)

Graph‐X‐Graphics (Jeremy Wiebe, Eric Oosenbrug, and Shane Martin)

Tracking Discourse in Social Media (Tom Smyth, Allison Hegel, Alexander Nwala, Patrick Egan, Nick Ruest, Yu Xu, Kelsey Utne, Jonathan Armoza, and Federico Nanni)

This team processed ~11.2 million tweets and ~50 million Reddit comments that referenced the Charlie Hebdo and Bataclan attacks, in an effort to track the evolution of social media commentary about the attacks. The team sought to measure the attention span, the flow of information and misinformation, and the co-occurrence network of terms in order to understand the dynamics of commentary about these events.

The votes were tallied, the Nuage team got the most votes, and they were declared the winners. The event concluded after some closing remarks.

--Nwala

Monday, March 7, 2016

2016-03-07: Custom Missions in the COVE Tool

When I am not studying Web Sciences at ODU, I work as a software developer at Analytical Mechanics Associates. In general, my work there aims to make satellite data more accessible. As part of this mission, one of my primary projects is the COVE tool.

The COVE tool allows a user to view where a satellite could potentially take an image. The above image shows the ground swath of both Landsat 7 (red) and Landsat 8 (green) over a one day period. 
The CEOS Visualization Environment (COVE) tool is a browser-based system that leverages Cesium, an open-source JavaScript library for 3D globes and maps, in order to display satellite sensor coverage areas and identify coincidence scene locations. In other words, the COVE tool allows the user to see where a satellite could potentially take an image and where two or more satellite paths overlap during a specified time period. The Committee on Earth Observing Satellites (CEOS) is currently operating and planning hundreds of Earth observation satellites.  COVE initially began as a way to improve Standard Calibration and Validation (Cal/Val) exercises for these satellites. Cal/Val exercises need to compare near-simultaneous surface observations and identify corresponding image pairs in order to calibrate and validate the satellite's orbit. These tasks are time-consuming and labor-intensive. The COVE tool has been pivotal in making these Cal/Val exercises much easier and more efficient.

The COVE tool allows a user to see possible coincidences of two satellites. The above image shows the coincidences of ALOS-2 with Landsat 7 over a one week period.
In the past, the COVE tool only allowed this analysis to be done on historical, operational, or notional satellite missions with known orbit data, which COVE could then use to predict the propagation of the orbit accurately, within the bounds of the model's assumptions, for up to three months past the last known orbit data. This has proven extremely useful for missions whose orbit data is known; however, it was limited to those missions.

Mission planning is another task which includes the prediction of satellite orbits, a task the COVE tool was well equipped for. However, in mission planning exercises, the orbit data of the satellite is unknown. Based on this need, we wanted to extend COVE to include customized missions, in which the user could define the orbit parameters and the COVE tool would then predict the orbit of the customized mission through a numerical propagation. I had the opportunity to be the lead developer for this new feature, which recently went live and can be accessed through the Custom Missions tab on the right of the COVE tool, as shown in the video below. This is an important addition to the COVE tool, as it allows for better planning of potential future missions and will hopefully help to improve satellite coverage of Earth in the future.

video

Video Summary:
00:07:04 - The "Custom" Missions and Instruments tab shows a list of the current user's custom missions. Currently, we do not have any custom missions.
00:09:03 - To create a custom mission, choose "Custom Missions" on the right panel. First, we need to "Add Mission." Once we have a mission we can add additional instruments to the mission or delete the mission.
00:20:15 - After choosing a mission name, we need to decide if we want to use an existing mission's orbit or define a custom orbit. We want to create a custom orbit. Clicking on "Custom defined orbit" gives three more options. A circular orbit is the most basic and for the novice user. A repeating sun synchronous orbit is a subset of circular orbits that must cover each area around the same time. For example, if the satellite passes over Hampton, VA at 10:00 AM, its next pass over Hampton should also be at 10:00 AM. The advanced orbit is for the experienced user and allows full control over the orbital parameters. We will create a repeating sun synchronous orbit, similar to Landsat 8.
00:33:14 - When creating a repeating sun synchronous orbit, the altitude given is only an estimate, as only certain inclination/altitude pairs are able to repeat. Thus, the user has the option to calculate the inclination and altitude that will be used (a rough sketch of the relation behind this coupling appears after this list).
00:37:24 - The instrument and mode, along with the altitude of the orbit we just defined, determine the swath size of the potential images the satellite will be able to take.
00:49:23 - We need to define the "Field of View" and "Pointing Angle" of the instrument. We will also choose "Daylight only," so our custom mission will only take images during daylight hours. This is useful because many optical satellites, such as Landsat 8, are "Daylight only" since they cannot take good optical images at night.
01:02:06 - We will now choose a date range over which we will propagate the orbit to see what our satellite's path will look like.
01:21:18 - We can now see what path our satellite will take during the daylight hours, since we chose "Daylight only."
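For context on the inclination/altitude coupling mentioned at 00:33:14: a sun-synchronous orbit's ascending node must precess about 0.9856 degrees per day, and the J2 oblateness perturbation supplies exactly that rate only for matching inclination/altitude pairs. The sketch below is a generic illustration of that standard first-order relation, not COVE's actual solver; a circular orbit and standard Earth constants are assumed.

import math

MU = 398600.4418      # Earth's gravitational parameter, km^3/s^2
J2 = 1.08263e-3       # Earth's J2 oblateness coefficient
R_EARTH = 6378.137    # Earth's equatorial radius, km

# A sun-synchronous node must precess 360 degrees per tropical year.
SUN_SYNC_RATE = 2 * math.pi / (365.2422 * 86400)  # rad/s

def sun_sync_inclination(altitude_km):
    """Inclination (degrees) that makes a circular orbit at this altitude
    sun-synchronous, using the first-order J2 nodal-precession relation."""
    a = R_EARTH + altitude_km                 # semi-major axis, km
    n = math.sqrt(MU / a**3)                  # mean motion, rad/s
    # Nodal precession: d(RAAN)/dt = -1.5 * J2 * (R/a)^2 * n * cos(i)
    cos_i = -SUN_SYNC_RATE / (1.5 * J2 * (R_EARTH / a)**2 * n)
    return math.degrees(math.acos(cos_i))

# Landsat 8 flies at roughly 705 km; this prints an inclination near 98.2 deg.
print(round(sun_sync_inclination(705.0), 2))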

This project was only possible thanks to other key AMA associates involved, namely Shaun Deacon--project lead and aerospace engineer, Andrew Cherry--developer and ODU graduate, and Jesse Harrison--developer.

--Kayla