2010-02-17: Using Web Page Titles to Rediscover Lost Web Pages

The object of my project was to glean from a web page's title whether the title could be used to find the resource within the yahoo search engines caches. Lost pages for this project are pages that return a 404. A 404 response code is an error message indicating that the client was able to communicate with the server but the server could not find what was requested. There are a multitude of possibilities why a page or an entire web site may disappear. These pages may reside only in the cache’s of search engines, or web archives, or just moved from one URI to another. In the context of this experiment Titles are denoted by the TITLE element within a web page. There can only be one title in a web page. The title may not contain anchors, highlighting, or paragraph marks. What would be most desirable for this experiment would be to take all URIs as our collection set. Regrettably, using the entire web as our test set is unrealistic. Capturing a representative sample set of web-sites