Monday, November 9, 2009

2009-11-09: Eprint released for "Memento: Time Travel for the Web"

This is a follow-up to my post on October 5, where I mentioned the availability of the Memento project web site. Herbert's team and my team, working under an NDIIPP grant, have introduced a framework where you can browse the past web (i.e., old versions of web pages) in the same manner that you browse the current web. The framework uses HTTP content negotiation as a method for requesting the version of the page you want.

Most people know little about content negotiation, and the little they think they know is often wrong (see [1-3] for more information about CN). In a nutshell, CN allows you to link to a URI "foo" without specifying, for example, its format ("foo.html" vs. "foo.pdf") or language ("foo.html.en" vs. "foo.html.es"). Your browser automatically passes preferences to the server (e.g., "I slightly prefer HTML over PDF, and I greatly prefer English to Spanish") and the server tries to find its best representation of "foo" that matches your preferences. In fact, CN defines four dimensions along which the browser and server can negotiate the "best" representation: media type, language, character set, and content encoding (e.g., gzip vs. compress).
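To make the negotiation concrete, here is a minimal Python sketch of how a server might choose among variants of "foo" given the browser's preferences. It is a simplification, not Apache's or RFC 2295's actual algorithm: the variant list, the function names, and the rule of multiplying the type quality by the language quality are all assumptions made for illustration.

```python
def parse_accept(header):
    """Parse an Accept-style header into {value: q}.

    Simplified: only the q parameter is understood, and wildcards are ignored.
    """
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs[parts[0]] = q
    return prefs


def best_variant(variants, accept, accept_language):
    """Return the variant with the highest combined quality, or None."""
    if not variants:
        return None
    types = parse_accept(accept)
    langs = parse_accept(accept_language)

    def score(variant):
        # Anything the client did not list scores 0 and is never chosen.
        return types.get(variant["type"], 0.0) * langs.get(variant["lang"], 0.0)

    best = max(variants, key=score)
    return best if score(best) > 0 else None


# Hypothetical representations of the single URI "foo".
VARIANTS = [
    {"name": "foo.html.en", "type": "text/html", "lang": "en"},
    {"name": "foo.html.es", "type": "text/html", "lang": "es"},
    {"name": "foo.pdf.en", "type": "application/pdf", "lang": "en"},
]

# "I slightly prefer HTML over PDF, and I greatly prefer English to Spanish."
chosen = best_variant(
    VARIANTS,
    accept="text/html;q=1.0, application/pdf;q=0.9",
    accept_language="en;q=1.0, es;q=0.2",
)
```

With those preferences the English HTML variant wins. A client that lists only PDF gets the PDF, and one that accepts none of the available types gets nothing back.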

We define a fifth dimension for CN: Datetime. If you configure your browser to prefer to view the web as it existed at a particular time, say January 29, 2008, then you could click on:

http://en.wikipedia.org/wiki/The_Cribs

and not get the current version, but rather get an older version of the page (in this case, before Johnny Marr had joined the band).
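As a rough sketch of what such a request could look like, the Python below attaches the desired Datetime to an ordinary request for the same URI. The header name Accept-Datetime is an assumption here (it is the name later standardized in RFC 7089, and the prototype described in the eprint may spell it differently). The request is only built, never sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: header name as later standardized in RFC 7089.
DATETIME_HEADER = "Accept-Datetime"


def datetime_headers(wanted):
    """Return request headers asking for a page as of `wanted`.

    `wanted` must be a timezone-aware datetime in UTC, because HTTP dates
    are always written in GMT.
    """
    return {DATETIME_HEADER: format_datetime(wanted, usegmt=True)}


wanted = datetime(2008, 1, 29, tzinfo=timezone.utc)
headers = datetime_headers(wanted)

# Same URI as the current page; only the extra header differs.
request = Request("http://en.wikipedia.org/wiki/The_Cribs", headers=headers)
```

The point of the design is visible in the last line: the URI is unchanged, so existing links keep working, and the time preference travels in a header just like the type and language preferences do.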

There are two kinds of "tricks" that must be addressed to make this possible:

1. The client must be configured to specify the desired Datetime. Scott Ainsworth is currently developing a Firefox add-on for us and will be releasing it "real soon now" (tm). In the meantime, you can play with a browser-based client developed by LANL just to see how it works.

2. The server must know how to "do the right thing" (tm). There are several ways to do this. First, if the server is running a content management system that keeps track of prior versions, then the server can respond with the correct older version. For example, we have a plug-in for MediaWiki that maps the incoming Datetime requests to the prior versions.

Second, the production server can redirect the client to an archive where it knows its older pages are kept. For example, the following demo pages:

http://lanlsource.lanl.gov/hello
http://odusource.cs.odu.edu/hello

"know" about their corresponding transactional archives at http://mementoarchive.lanl.gov/ and http://mementoarchive.cs.odu.edu/, respectively, and will redirect clients to the correct archive.

Third, the server can redirect the client to an aggregator we've developed (see the simple mod_rewrite rules that perform this function). For example, this rule is installed at http://digitalpreservation.gov/; if the server there detects a Memento request, it will redirect the client to the aggregator, which will search the Internet Archive, Archive-It, and other public web archives for the best Datetime match.

Finally, if the server is not configured to do any of those things, the Firefox add-on attempts to detect the server's non-compliance and redirect the client to the aggregator (for the same effect as described above).
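Whichever of the routes above handles the request (CMS, transactional archive, or aggregator), something eventually has to map the requested Datetime onto one stored version. The Python sketch below shows two plausible policies. Neither is claimed to be what the MediaWiki plug-in or the aggregator actually does: the function names, the sample timestamps, and the definition of "best" as "nearest in time" are assumptions for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def version_in_effect(timestamps, wanted):
    """Latest stored version at or before `wanted`, or None if none is that old.

    `timestamps` must be sorted in ascending order. This suits a CMS, where
    a revision stays current until the next one replaces it.
    """
    i = bisect_right(timestamps, wanted)
    return timestamps[i - 1] if i else None


def nearest_version(timestamps, wanted):
    """Stored version closest in time to `wanted`, before or after it.

    This suits crawled archives, where snapshots are sparse samples.
    """
    if not timestamps:
        return None
    return min(timestamps, key=lambda t: abs(t - wanted))


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical capture times for one page.
snapshots = [utc(2007, 6, 1), utc(2008, 1, 15), utc(2008, 2, 1), utc(2009, 11, 1)]
wanted = utc(2008, 1, 29)

in_effect = version_in_effect(snapshots, wanted)  # the 2008-01-15 capture
nearest = nearest_version(snapshots, wanted)      # the 2008-02-01 capture
```

The two policies disagree on this sample: the capture in effect on January 29 is two weeks old, while the nearest capture is three days in the future. Which answer is "right" depends on whether the store records every change or only samples the page occasionally.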

The above is a short description of how Memento works. More details can be found in our eprint:

Herbert Van de Sompel, Michael L. Nelson, Robert Sanderson, Lyudmila L. Balakireva, Scott Ainsworth, Harihar Shankar, "Memento: Time Travel for the Web", arXiv 0911.1112, November 2009.

Also, we have a number of upcoming presentations where you can catch us explaining Memento in more detail.
We hope to see you at one of these meetings. Let us know if you have questions or comments.

--Michael


1. Transparent Content Negotiation in HTTP, RFC 2295.
2. Content Negotiation, Apache HTTP Server Documentation.
3. ODU CS 595 Week 10 Lecture.
