Web Science and Digital Libraries Research Group

Showing posts with the label HTML Language

2020-06-08: Who is that person in the picture? Or, how Python, and Haar can add value to an image.

By Chuck Cartledge - June 08, 2020

(Sung to the tune of "How Much is that Doggie in the Window")

Who is that person in the picture?
The one with the light brown hair.
Who is that person in the picture?
I do hope that someone would share.

Introduction

Often, when a group gets together, for whatever reason, there will be a group picture at the end to commemorate the good times had by all. If this "sea of faces" gets published, in hard or soft copy, there may be a one- or two-line caption giving the name of the group and perhaps where and when the image was created. Six months or a year later, the image has only marginal value to the people who were there, and almost no value to those who were not, because it is just a sea of faces. We are interested in finding a low-cost method (one that takes very little human time) of adding value to the soft copy of the image, so that the image will have greater value later. We have developed a Python script that uses a Haar facial detection cascade to create ...

2016-03-22: Language Detection: Where to start?

By Lulwah M. Alkwai - March 22, 2016

Language detection is not a simple task, and no method achieves 100% accuracy. You can find different packages online to detect different languages. I have used several methods and tools to detect the language of websites or of short texts. Here is a review of the methods I came across while working on my JCDL 2015 paper, "How Well are Arabic Websites Archived?" I discuss detecting a webpage's language using the HTTP language header and the HTML language tag. I also review several language detection packages, including Guess-Language, Python-Language Detector, LangID, and Google Language Detection API. Since Python is my favorite programming language, I searched for tools written in Python. I found that a primary way to detect the language of a webpage is to use the HTTP language header and the HTML language tag. However, only a small percentage of pages include the language tag and sometimes the detected language is affected by the browser setti...
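As a rough illustration of the header-and-tag check described in this excerpt, here is a minimal sketch using only the Python standard library. It assumes that the "HTTP language header" is `Content-Language` and that the "HTML language tag" is the `lang` attribute on the `<html>` element; the function name `declared_language` and the sample inputs are hypothetical and are not taken from the original post.

```python
from html.parser import HTMLParser


class _LangTagParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(headers, html_text):
    """Return (header_lang, tag_lang); either is None when not declared.

    Assumption: the relevant HTTP header is Content-Language.
    """
    header_lang = None
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several languages; keep the first one.
            header_lang = value.split(",")[0].strip().lower()
            break

    parser = _LangTagParser()
    parser.feed(html_text)
    return header_lang, parser.lang


# Hypothetical inputs standing in for a fetched response.
sample_headers = {"Content-Language": "ar, en"}
sample_html = '<html lang="AR"><head></head><body></body></html>'
print(declared_language(sample_headers, sample_html))
```

Returning both values separately, instead of merging them, makes it easy to see when a page declares neither or when the two disagree. In those cases a content-based detector such as the packages listed above would be the natural fallback.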

2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

By Shawn M. Jones - February 24, 2016

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives. These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work or Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one. We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies between web archives made these operations much more complex. We document our findings in a technical report entitled "Rules of Acquisition for Mementos and Their Content". Our technical report briefly covers the following key points: Special technique...