Posts

Showing posts with the label "rediscover missing web pages"

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they return a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress between 2004 and 2006 for web pages related to the topics of federal elections and terror. We created the corpus to test the performance of our methods for rediscovering missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome the 404 detriment to everyone's browsing experience in real time. To the best of our knowledge, the Book of the Dead is the first corpus of its kind. It is publicly available, and we are hopeful that fellow researchers can benefit from it in related work. The corpus can be downloaded…
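The 404 criterion above is easy to check programmatically. The following is a minimal sketch, not the actual Library of Congress crawl tooling; the helper names `http_status` and `is_dead_status` are hypothetical:

```python
import urllib.request
import urllib.error

DEAD_STATUS = 404  # the corpus counts a page as missing only on 404 "Page not Found"

def http_status(uri, timeout=10):
    """Return the HTTP status code a HEAD request to `uri` yields."""
    req = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses raise, but still carry the status code

def is_dead_status(code):
    """A URI qualifies for a corpus like the Book of the Dead iff it returns 404."""
    return code == DEAD_STATUS
```

Note that only a hard 404 qualifies; redirects (3xx) or server errors (5xx) would need separate handling in a real crawl.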

2011-06-10: Launching Synchronicity - A Firefox Add-on for Rediscovering Missing Web Pages in Real Time

Today we introduce Synchronicity, a Firefox extension that supports the user in rediscovering missing web pages. It triggers on the occurrence of 404 "Page not Found" errors and provides archived copies of the missing page as well as five methods to query search engines for the page's new location (in case it has moved) or for a good enough replacement page (in case the page is really gone). Synchronicity works in real time and helps overcome the detriment of link rot on the web. Installation: Download the add-on from https://addons.mozilla.org/en-US/firefox/addon/synchronicity and follow the installation instructions. After restarting Firefox you will notice Synchronicity's shrimp icon in the right corner of the status bar. Usage: Whenever a 404 "Page not Found" error occurs, the little icon will change color to notify the user that it has caught the error. Just click once on the red icon and the Synchronicity panel will load up…

2011-02-08: An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages

The final project for my master's degree focused on the problem of "missing" web pages: URIs that return an error when retrieved. When a web page is no longer available at a given URI, it may be available at a new URI, and this research proposes and demonstrates a new method for finding that new URI. Prior research has proposed using the lexical signature of a page as a search query to find the same or similar content at a new URI. A lexical signature (LS) is a small set of words that are used in that page much more often than in other pages on the Web, and so are thought to describe what the page is about. That LS is then used as a search query in the hope that the target page appears in the results. Previously proposed methods for using an LS to find a new URI required either that the page be analyzed before being lost (ref: P&W) or that cached or archived versions of the page be available for analysis…
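The link-neighborhood idea in the title can be sketched in a few lines: aggregate the text of pages that link to the missing page and keep the most frequent terms. This is an illustrative simplification, assuming plain whitespace tokenization and raw term frequency rather than the weighting the research actually evaluates:

```python
from collections import Counter

def neighborhood_signature(backlink_texts, k=5, limit=50):
    """Top-k most frequent terms across the texts of up to `limit` backlink pages."""
    counts = Counter()
    for text in backlink_texts[:limit]:
        counts.update(text.lower().split())
    return [term for term, _ in counts.most_common(k)]
```

A real implementation would also weight terms by inverse document frequency, so that ubiquitous words do not dominate the signature.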

2010-02-17: Using Web Page Titles to Rediscover Lost Web Pages

The objective of my project was to determine whether a web page's title could be used to find the resource within the Yahoo search engine's cache. Lost pages, for this project, are pages that return a 404. A 404 response code is an error message indicating that the client was able to communicate with the server but the server could not find what was requested. There is a multitude of reasons why a page or an entire web site may disappear. These pages may survive only in the caches of search engines or in web archives, or they may simply have moved from one URI to another. In the context of this experiment, titles are denoted by the TITLE element within a web page. There can be only one title in a web page, and the title may not contain anchors, highlighting, or paragraph marks. The most desirable approach for this experiment would be to take all URIs as our collection set; regrettably, using the entire web as our test set is unrealistic. Capturing a representative sample set of web sites…
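Extracting the single TITLE element is straightforward with the standard library. This is a hedged sketch (the project's actual tooling is not shown in the post) that pulls the title out of a page so it can be submitted as a search query:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text content of the (single) TITLE element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    """Return the page title, or an empty string if the page has none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```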

2009-07-17: Technical Report "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure"

This week I uploaded the technical report, co-authored with Michael L. Nelson, to the e-print service arxiv.org. The underlying idea of this research is to utilize the web infrastructure (search engines, their caches, the Internet Archive, etc.) to rediscover missing web pages - pages that return the 404 "Page not Found" error. We apply various methods to generate search engine queries based on the content of the web page and user-created annotations about the page. We then compare the retrieval performance of all methods and introduce a framework for combining such methods to achieve optimal retrieval performance. The applied methods are:
- 5- and 7-term lexical signatures of the page
- the title of the page
- tags users annotated the page with on delicious.com
- 5- and 7-term lexical signatures of the page's link neighborhood (up to 50 pages linking to the missing page)
We query the big three search engines (Google, Yahoo and MSN Live) with the output of all methods and analyze…
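The lexical-signature methods above rank a page's terms by TF-IDF and keep the top 5 or 7. A minimal sketch of that ranking, with made-up document frequencies standing in for real Web-scale statistics (this is the general technique, not the report's exact implementation):

```python
import math
from collections import Counter

def lexical_signature(page_terms, doc_freq, n_docs, k=5):
    """Return the k terms of the page with the highest TF-IDF scores."""
    tf = Counter(page_terms)
    scores = {
        term: tf[term] * math.log(n_docs / (1 + doc_freq.get(term, 0)))
        for term in tf
    }
    # Highest-scoring terms first; rare-but-frequent-on-the-page terms win.
    return [term for term, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

For example, a term that appears twice on the page but in only a handful of documents Web-wide will outscore a common word like "web" even if "web" appears more often on the page.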