2011-06-17: The "Book of the Dead" Corpus
We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress between 2004 and 2006 for web pages related to the topics of federal elections and terror. We created the corpus to test the performance of our methods for rediscovering missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now have Synchronicity, a tool that can help overcome, in real time, the detriment that 404 errors cause to everyone's browsing experience. To the best of our knowledge, the Book of the Dead is the first corpus of its kind. It is publicly available, and we hope that fellow researchers can benefit from it in conducting related work. The corpus can be downloaded a