We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they return a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress for web pages related to the topics of federal elections and terror between 2004 and 2006.
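To make the notion of a "dead" URI concrete, here is a minimal Python sketch of how one might check whether a URI currently returns a 404. It is an illustration only, not the procedure used to build the corpus; the function names `http_status` and `is_dead` and the injectable `opener` parameter are assumptions introduced for this example.

```python
import urllib.error
import urllib.request


def http_status(uri, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for uri, or None if no HTTP response arrives.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen) so the
    function can be exercised without touching the network.
    """
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the error.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: no HTTP status at all.
        return None


def is_dead(uri, **kwargs):
    """Treat a URI as dead only when the server answers with a 404."""
    return http_status(uri, **kwargs) == 404
```

Note that this sketch only recognizes a literal 404 status. Servers that answer a missing page with a 200 and an error message in the body, or that reject HEAD requests, would need extra handling.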
We created the corpus to test the performance of our methods for rediscovering missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure" published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome, in real time, the harm that 404 errors do to everyone's browsing experience.
To the best of our knowledge the Book of the Dead is the first corpus of this kind. It is publicly available and we are hopeful that fellow researchers can benefit from it by conducting related work. The corpus can be downloaded at: http://bit.ly/Book-of-the-Dead
And one more thing... not only does the corpus include the missing URIs, it also contains a best guess of what each of the URIs used to be about. We used Amazon's Mechanical Turk and asked workers to guess what the content of the missing pages used to be. We provided only the URIs and the general topics, elections and terror. The workers were supposed to analyze just the URI and draw their conclusions from it. Sometimes this can be an easy task, for example the URI:
is clearly about an election event in 2004. One might also know that "lp" stands for Libertarian Party and "de" for Delaware. With that knowledge the URI makes real sense, and most likely "Morris" was a candidate running for office during the elections.
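The kind of reasoning the workers applied by hand can be roughly sketched in code: split a URI into tokens, then look for years and known abbreviations. The Python below is a toy illustration, not the method used to label the corpus. The function names and the `ABBREVIATIONS` table are assumptions made for this example (the table holds only the two expansions mentioned above), and the URI in the usage note is made up on example.org; it is not a URI from the corpus.

```python
import re
from urllib.parse import urlsplit

# Hypothetical lookup table holding only the two expansions named in the text.
# A real study would need a much larger, context-aware table.
ABBREVIATIONS = {"lp": "Libertarian Party", "de": "Delaware"}


def uri_tokens(uri):
    """Split the host and path of a URI into lowercase alphanumeric tokens."""
    parts = urlsplit(uri)
    text = f"{parts.netloc} {parts.path}"
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def uri_hints(uri):
    """Return human-readable hints (known abbreviations, years) found in a URI."""
    hints = []
    for token in uri_tokens(uri):
        if token in ABBREVIATIONS:
            hints.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"(19|20)\d{2}", token):
            hints.append(f"year {token}")
    return hints
```

For the made-up URI `http://lp.example.org/de/2004/`, `uri_hints` returns `["Libertarian Party", "Delaware", "year 2004"]`. A naive lookup like this is easily fooled, for instance "de" is also a country-code domain, which is one reason human judgment was useful for this task.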
Altogether, the Book of the Dead now offers missing URIs and their estimated "aboutness", which makes it a valuable dataset for retrieval and archival research.