2011-07-26: Universal Access to All Knowledge

After that, Kris explained the common collection strategies the Internet Archive uses to crawl the web. Frequently, they conduct a broad survey across a wide range of domains such as .com, .net, and .org. They also consider how frequently these websites change and give more support to sites without their own archiving capabilities. The Internet Archive takes a special interest in exhaustive crawls of websites whose webmasters have decided to shut them down and would like one last snapshot (for example, GeoCities). Another strategy is crawling by topic or building specific collections based on feedback from researchers or experts on that topic. In general, the key inputs for the seed URIs are nominations from various partners (e.g., domain experts, trusted directories, Wikipedia).
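To make the seed-nomination idea concrete, here is a hypothetical sketch of how nominations from several partners might be merged into a prioritized crawl frontier, with extra weight for frequently changing sites. The function names, sources, and weights are all illustrative assumptions, not the Internet Archive's actual pipeline.

```python
# Hypothetical sketch: merge nominated seed URIs from several sources
# and prioritize sites that change frequently. Names and weights are
# illustrative only.

from collections import defaultdict

def build_frontier(nominations, change_frequency):
    """nominations: {source_name: [seed_uri, ...]}
    change_frequency: {seed_uri: estimated_changes_per_month}"""
    scores = defaultdict(float)
    for source, seeds in nominations.items():
        for uri in seeds:
            scores[uri] += 1.0  # each nominating partner adds weight
    for uri in scores:
        scores[uri] += change_frequency.get(uri, 0.0)
    # crawl the highest-scoring seeds first
    return sorted(scores, key=scores.get, reverse=True)

frontier = build_frontier(
    nominations={
        "domain_experts": ["http://example.org/"],
        "wikipedia": ["http://example.org/", "http://example.net/"],
    },
    change_frequency={"http://example.org/": 4.0},
)
print(frontier)
```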
Kris then explained the methods of access to the web archive. The default method is host-based: browsing a website as it was, using the Wayback Machine. Novel techniques such as full-text search and metadata/catalog look-ups are available as well, and building APIs that mirror the UI-based access is underway (see the sketch after the list below). Kris also mentioned other services, including:
- Data mining and extraction
- Link domain of 20thCF or of an entire domain (e.g., .uk) from 1996-2010
- Dynamic, on-demand archiving of video and Wikipedia annotation
- Semantic data extraction, metadata services
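As one concrete example of programmatic access, here is a minimal sketch that queries the Wayback Machine's public availability endpoint (archive.org/wayback/available) for the archived snapshot closest to a given date. The talk did not name this particular endpoint; it is simply one real instance of the kind of API-based access described above, and the timestamp and URL below are illustrative.

```python
# Minimal sketch: programmatic access to the Wayback Machine via the
# public availability endpoint, rather than host-based browsing.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

def closest_snapshot(url, timestamp="20110726"):
    """Return metadata for the archived snapshot of `url` closest
    to `timestamp` (YYYYMMDD...), or None if none exists."""
    query = urlencode({"url": url, "timestamp": timestamp})
    with urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

snap = closest_snapshot("http://www.geocities.com/")
if snap:
    print(snap["timestamp"], snap["url"])
```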
-- Ahmed AlSum