2009-07-17: Technical Report "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure"



This week I uploaded the technical report, co-authored with Michael L. Nelson, to the e-print service arxiv.org. The underlying idea of this research is to use the web infrastructure (search engines, their caches, the Internet Archive, etc.) to rediscover missing web pages, that is, pages that return the 404 "Page not Found" error. We apply various methods to generate search engine queries based on the content of the web page and on user-created annotations about the page. We then compare the retrieval performance of all methods and introduce a framework for combining them to achieve the best retrieval performance.
The applied methods are:
  • 5- and 7-term lexical signatures of the page
  • the title of the page
  • tags users annotated the page with on delicious.com
  • 5- and 7-term lexical signatures of the page neighborhood (up to 50 pages linking to the missing page)
We query the big three search engines (Google, Yahoo and MSN Live) with the output of each method and analyze the result sets to measure retrieval performance.
We have shown in recent work (published at ECDL 2008) that lexical signatures perform very well for rediscovering missing web pages.
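
To make the notion of a lexical signature more concrete, here is a minimal sketch. It assumes the common approach of ranking a page's terms by TF-IDF and keeping the top k; the exact weighting used in the report is not spelled out in this post. The function names and the small background collection used for document frequencies are hypothetical (in practice, document frequencies would have to be estimated from a much larger source).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(page_text, background_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    background_texts is a hypothetical collection of other documents,
    used here only to estimate document frequencies.
    """
    term_freq = Counter(tokenize(page_text))
    background_sets = [set(tokenize(text)) for text in background_texts]
    n_docs = len(background_sets) + 1  # background documents plus the page itself

    scores = {}
    for term, freq in term_freq.items():
        # The page itself always contains the term, hence the leading 1.
        doc_freq = 1 + sum(term in terms for terms in background_sets)
        scores[term] = freq * math.log(n_docs / doc_freq)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

Calling lexical_signature(text, background, k=7) would give the 7-term variant; the resulting terms are then joined with spaces to form the search engine query.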

As shown on the left, we distinguish between four retrieval categories: top (the URL is returned top ranked), top10 (returned within the top 10 but not top ranked), top100 (returned between rank 11 and 100) and undiscovered (not returned in any of the categories above). Displayed here is the retrieval performance of 5- and 7-term lexical signatures. A somewhat binary pattern is visible: the vast majority of URLs are either returned within the top 10 or remain undiscovered.
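
The four categories can be expressed as a small function. This is only a sketch: it assumes the result set is an ordered list of URLs and that URLs are compared by exact string match, whereas a real evaluation would probably normalize URLs first. The function name is hypothetical.

```python
def retrieval_category(target_url, result_urls):
    """Map the rank of target_url in an ordered result list to a category."""
    # Only the first 100 results are considered.
    for rank, url in enumerate(result_urls[:100], start=1):
        if url == target_url:
            if rank == 1:
                return "top"
            if rank <= 10:
                return "top10"
            return "top100"
    return "undiscovered"
```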

However, in this study we found that the pages' titles perform equally well. We further found that neither tags about the pages nor lexical signatures based on the page neighborhood performed satisfactorily. We need to mention, though, that we were not able to obtain tags for all URLs in our data set.
Inspired by the performance of titles, we also conducted a small-scale analysis of how consistently titles perform, that is, how their properties relate to retrieval performance. We looked at the title length in terms of the number of terms and characters, as well as the number of stop words.
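
As an illustration of the three title properties just mentioned, here is a short sketch. The stop word list is a tiny hypothetical sample, not the list used in the report, and terms are simply split on whitespace.

```python
# Hypothetical, deliberately small stop word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "on", "for", "to", "is"})


def title_features(title):
    """Count terms, characters and stop words in a page title."""
    terms = title.split()
    stop_words = sum(
        term.lower().strip(".,:;!?\"'()") in STOP_WORDS for term in terms
    )
    return {
        "terms": len(terms),
        "characters": len(title),
        "stop_words": stop_words,
    }
```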

Since the title of a web page is much cheaper to obtain than a lexical signature, which requires a complex computation, we recommend using the title first and falling back to the lexical signature only for URLs that were not discovered in the first step. This experiment is, first, a follow-up to our work published at ECDL 2008 and, second, the basis for a larger-scale study in the future.
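
The recommended two-step strategy could look roughly like the sketch below. Everything here is an assumption made for illustration: search stands for any callable that takes a query string and returns an ordered list of result URLs, compute_signature is a callable that returns the signature terms, and the top 10 cutoff for counting a URL as discovered is a choice made for this example. The signature is passed as a callable so that its expensive computation only happens when the title query fails.

```python
def rediscover(target_url, title, compute_signature, search, max_rank=10):
    """Try the title first; fall back to the lexical signature if needed.

    Returns (method, rank) on success and None if both queries fail.
    """
    # Step 1: the cheap query, using the page title as is.
    results = search(title)[:max_rank]
    if target_url in results:
        return ("title", results.index(target_url) + 1)

    # Step 2: only now pay for computing the lexical signature.
    signature_query = " ".join(compute_signature())
    results = search(signature_query)[:max_rank]
    if target_url in results:
        return ("lexical signature", results.index(target_url) + 1)

    return None
```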
--
martin
