Friday, June 17, 2011

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they return a 404 "Page not Found" response. The pages were collected during a crawl that the Library of Congress conducted between 2004 and 2006 for web pages related to the topics of federal elections and terror.
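For readers who want to re-check the status of the URIs themselves, here is a minimal sketch in Python. It is not the procedure we used to build the corpus; the function names `http_status` and `is_dead` are made up for illustration, and it assumes that a plain HEAD request is enough to observe the 404. It does not detect "soft 404s", where a server answers 200 with an error page.

```python
import urllib.error
import urllib.request


def http_status(uri, timeout=10.0):
    """Return the final HTTP status code for a HEAD request.

    Returns None if no HTTP response was received at all
    (DNS failure, refused connection, timeout, ...).
    """
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def is_dead(uri, status_of=http_status):
    """Treat a URI as dead only if it currently answers with 404.

    status_of is injectable so the check can be exercised without network access.
    """
    return status_of(uri) == 404
```

A URI whose host no longer resolves yields None here rather than 404, so it would need separate handling if you want to count it as missing.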

We created the corpus to test the performance of our methods for rediscovering missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome, in real time, the damage that 404 errors do to everyone's browsing experience.

To the best of our knowledge, the Book of the Dead is the first corpus of this kind. It is publicly available, and we hope that fellow researchers can benefit from it in their own related work. The corpus can be downloaded at: http://bit.ly/Book-of-the-Dead

And one more thing... not only does the corpus include the missing URIs, it also contains a best guess of what each of the URIs used to be about. We used Amazon's Mechanical Turk and asked workers to guess what the content of the missing pages used to be. We only provided the URIs and the general topics, elections and terror. The workers were supposed to just analyze the URI and draw their conclusions. Sometimes this can be an easy task, for example the URI:

http://www.de.lp.org/election2004/morris.html

is clearly about an election event in 2004. With some background knowledge, one might recognize that "lp" stands for Libertarian Party and "de" for Delaware. With that, the whole URI makes sense, and most likely "Morris" was a candidate running for office during the elections.
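The kind of reading described above can be roughly approximated in code by splitting a URI into its host labels and path terms. The sketch below is only an illustration, not the Mechanical Turk process; the function name `uri_tokens` is hypothetical, and interpreting labels such as "lp" or "de" still takes human background knowledge.

```python
import re
from urllib.parse import urlparse


def uri_tokens(uri):
    """Split a URI into host labels and alphabetic/numeric path terms."""
    parsed = urlparse(uri)
    # Host labels, e.g. "www.de.lp.org" -> ["www", "de", "lp", "org"]
    host_labels = parsed.hostname.split(".") if parsed.hostname else []
    # Runs of letters or digits, so "election2004" becomes "election", "2004"
    path_terms = re.findall(r"[A-Za-z]+|[0-9]+", parsed.path)
    return host_labels, path_terms


host, path = uri_tokens("http://www.de.lp.org/election2004/morris.html")
print(host)  # ['www', 'de', 'lp', 'org']
print(path)  # ['election', '2004', 'morris', 'html']
```

Splitting letters from digits is a design choice that separates the topic ("election") from the year ("2004"); it would also break apart tokens such as "web2", so it is not suitable for every URI.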

Altogether, the Book of the Dead now offers missing URIs and their estimated "aboutness", which makes it a valuable dataset for retrieval and archival research.
--
martin

2 comments:

  1. I suspect
    http://www.de.lp.org/election2004/morris.html

    contained this content:

    http://web.archive.org/web/20050213154403/http://www.de.lp.org/election2004/morris.html

    (One shouldn't overlook available tools for recovery.)

  2. Hi Brad,

    Thanks for the note. We're certainly aware of the Internet Archive's Wayback Machine (see for example our inter-archive work on Memento http://mementoweb.org/ and our prior work on measuring how much of the web is archived http://bit.ly/kBY9JN )

    The goal of this collection is to provide a list of URIs that are currently 404 so researchers can evaluate their systems for finding where the same or similar content moved (e.g., Morris running in 2012). Some of the URIs existed in archives and some did not; we applied the same Mechanical Turk process to all of them.

    regards,

    Michael
