Tuesday, August 2, 2011

2011-07-26: Universal Access to All Knowledge

On July 26, 2011, the Web Science and Digital Library group at Old Dominion University hosted Kris Carpenter Negulescu, Director of the Web Group at the Internet Archive, who gave a talk entitled “Universal Access to All Knowledge”. The presentation started with an introduction to what the Internet Archive is. She then described the materials currently archived there: Text (+2.9M books), Moving Images (+542,500 items), Audio (+950,000 items), Television broadcast (+1M hours), and Web Pages (+150 billion pages). She also gave an overview of some of the special collections, such as K-12 students and NASA images.

After that, Kris explained the common collection strategies that the Internet Archive uses to crawl the web. It frequently runs broad surveys of wide-ranging domains such as .com, .net, and .org. It also considers how frequently these websites change and gives more support to sites without archiving capabilities of their own. The Internet Archive pays special attention to crawling a website exhaustively when its webmaster decides to shut it down and would like a final snapshot (for example, GeoCities). Another strategy is crawling by topic or specific collection, based on feedback from researchers or experts in that topic. In general, the key inputs for the seed URIs are nominations from various partners (e.g., domain experts, trusted directories, Wikipedia).

Kris explained the methods of access to the web archive. The default method is "host-based": browsing a website as it was at an earlier time, using the Wayback Machine. Other, newer techniques include full-text search and metadata/catalog look-ups. Work on APIs that mirror the UI-based access is also under way.
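As a small illustration of host-based access, the sketch below builds a Wayback Machine address for a page at a given point in time. It was not part of the talk. It assumes the public URL pattern `https://web.archive.org/web/<timestamp>/<original URL>` with a 14-digit `YYYYMMDDhhmmss` timestamp; the function name `wayback_url` and the example page are made up for illustration.

```python
from datetime import datetime

# Assumed prefix of the public Wayback Machine replay service.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Return a Wayback Machine address for original_url near the time `when`.

    Hypothetical helper: it only formats a string and makes no network request.
    """
    # The Wayback Machine addresses captures by a YYYYMMDDhhmmss timestamp.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Example: a page as it may have looked around the day of the talk.
    print(wayback_url("http://www.cs.odu.edu/", datetime(2011, 7, 26)))
```

Opening such an address in a browser normally lands on the capture closest to the requested time, if the page was archived at all.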

Finally, she gave an overview of some special projects, such as:
  • Data mining and extraction
  • Link domain of 20thCF or of an entire domain (e.g., .uk) from 1996-2010
  • Dynamic, on-demand archiving of video and Wikipedia annotation
  • Semantic data extraction, metadata services
The colloquium was recorded and is available below.

-- Ahmed AlSum

