Friday, November 5, 2010

2010-11-05: Memento-Datetime is not Last-Modified


One of the key contributions of the Memento Framework is the HTTP response header "Memento-Datetime" (previously called "Content-Datetime" in our earlier publications & slides). Memento-Datetime is the sticky, intended datetime* for the representation returned when a URI is dereferenced. The presence of the Memento-Datetime HTTP response header is how the client realizes it has reached a Memento.
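As a minimal sketch of that last point, a client could test for the header as follows. The function name and the sample header dictionary are illustrative assumptions, not part of the Memento Framework itself.

```python
from email.utils import parsedate_to_datetime


def memento_datetime(headers):
    """Return the Memento-Datetime as an aware datetime, or None if absent.

    `headers` is any mapping of header names to values; header names are
    compared case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() == "memento-datetime":
            return parsedate_to_datetime(value)
    return None


# Hypothetical response headers, using the datetime from the examples in this post.
archived = {
    "Content-Type": "text/html",
    "Memento-Datetime": "Wed, 05 Mar 2008 20:16:49 GMT",
}
live = {"Content-Type": "text/html"}

print(memento_datetime(archived) is not None)  # the client has reached a Memento
print(memento_datetime(live) is not None)      # an ordinary resource
```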

Rather than formally explain what we mean by "sticky, intended datetime", it is easier to explain what it is not: it is neither the value of the HTTP response header Last-Modified nor the creation date of the resource (which has no corresponding HTTP header, for reasons that will become clear). For the examples below, we'll define the following abbreviations:
  • CD (Creation-Datetime) = the datetime the resource was created
  • MD (Memento-Datetime) = the datetime the representation was observed on the web
  • LM (Last-Modified) = the datetime the resource last changed state
Case 1: CD == MD == LM

We'll begin with a case in which all three datetime values could be the same. Consider the case of this index page at Archive-It.org:


http://wayback.archive-it.org/927/*/http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html

The index page has a link to a single Memento. For simplicity, we'll assume Archive-It.org created this index page and the Memento it references at the moment of the crawl, so the various datetimes of the Memento would all be equal:

Creation-Datetime: Wed, 05 Mar 2008 20:16:49 GMT
Memento-Datetime: Wed, 05 Mar 2008 20:16:49 GMT
Last-Modified: Wed, 05 Mar 2008 20:16:49 GMT

Case 2: CD == MD < LM

If we click on the Memento (http://wayback.archive-it.org/927/20080305201649/http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html), we see that it has a disclaimer banner ("You are viewing an archived web page...") that many archives employ to inform the reader that they are looking at a Memento and not the original resource. Although there are many techniques for inserting such a banner, the Archive-It example directly modifies the original HTML to insert this banner (as well as handle URI rewriting, etc.).

Now pretend the wording of the banner needs to be changed (for example, to address a new legal requirement). The CD and MD of the Memento are unchanged, but the LM must reflect when the wording of the banner changed:

Creation-Datetime: Wed, 05 Mar 2008 20:16:49 GMT
Memento-Datetime: Wed, 05 Mar 2008 20:16:49 GMT
Last-Modified: Fri, 05 Nov 2010 23:25:19 GMT

Both your lawyer and your HTTP cache consider this an important change, so you have to update LM. But it is also clear that the essence of the March 2008 observation of the Memento by Archive-It.org is unchanged by the wording change of the archive banner, so MD is not updated. And certainly the CD is unchanged by this modification.
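The cache's point of view can be sketched with a simplified If-Modified-Since check. The function name and the cache timestamp are assumptions made for illustration; a real server would also consider ETags and other validators.

```python
from email.utils import parsedate_to_datetime


def can_answer_304(if_modified_since, last_modified):
    """True if the resource has not changed since the cached copy was fetched."""
    return parsedate_to_datetime(last_modified) <= parsedate_to_datetime(if_modified_since)


md_value = "Wed, 05 Mar 2008 20:16:49 GMT"   # never changes in this scenario
lm_before = "Wed, 05 Mar 2008 20:16:49 GMT"  # LM before the banner edit
lm_after = "Fri, 05 Nov 2010 23:25:19 GMT"   # LM after the banner edit
cached_at = "Mon, 01 Nov 2010 00:00:00 GMT"  # hypothetical: when a cache stored its copy

# Before the banner change the cached copy is still valid; afterwards it is stale,
# even though the Memento-Datetime is the same in both responses.
print(can_answer_304(cached_at, lm_before))
print(can_answer_304(cached_at, lm_after))
```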

Case 3: MD < CD <= LM

Now pretend you are making a new web archive, and you are populating it by crawling other web archives such as Archive-It.org (simulated with the king of browsers in the image to the left). You are effectively copying:

http://wayback.archive-it.org/927/20080305201649/http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html

to:

http://archive.example.org/20101105232519/http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html

The presence of the Memento-Datetime header from Archive-It.org indicates that the resource is an encapsulation of the state of another resource, at the MD datetime value. The link between the Memento and the original resource is indicated with an HTTP Link response header:

Link: <http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html>; rel="original"
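A client that wants the URI of the original resource has to pull it out of that header. The following is a rough sketch, assuming the usual Web Linking layout of a URI in angle brackets followed by its parameters. The regular expression and the function name are illustrative, and the parser deliberately ignores corner cases such as commas inside quoted parameter values.

```python
import re

# One link-value: <URI> followed by zero or more ";name=value" parameters.
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def find_rel(link_header, rel):
    """Return the first URI in `link_header` whose rel contains `rel`, else None."""
    for match in LINK_RE.finditer(link_header):
        uri, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            # A rel value may hold several space-separated relation types.
            if name.lower() == "rel" and rel in value.strip('"').split():
                return uri
    return None


header = '<http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html>; rel="original"'
print(find_rel(header, "original"))
```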

Thus, MD is sticky in that the new Memento at example.org retains the MD value it observed from Archive-It.org. However, the CD and LM values reflect the datetime relative to example.org:

Creation-Datetime: Fri, 05 Nov 2010 23:25:19 GMT
Memento-Datetime: Wed, 05 Mar 2008 20:16:49 GMT
Last-Modified: Fri, 05 Nov 2010 23:25:19 GMT

The MD and LM datetimes can also vary for the example.org Memento as described in Case 2. (In the unlikely case that the intent of example.org was to create an archive of how resources were archived, the MD could be reset to 05 Nov 2010 and the Link header would point to the Archive-It.org resource as the original resource instead of the nyu.edu resource; however, this is not the point of this discussion.)
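The stickiness in Case 3 can be sketched as a function that builds the headers for the copy at example.org. The function name and the dictionary-based header model are assumptions for illustration only.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def headers_for_copy(source_headers, copied_at):
    """Build response headers for a Memento copied from another archive.

    Memento-Datetime and the link to the original resource are carried over
    unchanged; Last-Modified reflects when the copy was made locally.
    `copied_at` must be an aware datetime in UTC.
    """
    return {
        "Memento-Datetime": source_headers["Memento-Datetime"],
        "Link": source_headers["Link"],
        "Last-Modified": format_datetime(copied_at, usegmt=True),
    }


source = {
    "Memento-Datetime": "Wed, 05 Mar 2008 20:16:49 GMT",
    "Link": '<http://www.nyu.edu/fas/projects/vcb/case_911_FLASHcontent.html>; rel="original"',
}
copy = headers_for_copy(source, datetime(2010, 11, 5, 23, 25, 19, tzinfo=timezone.utc))
for name, value in copy.items():
    print(f"{name}: {value}")
```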

Case 4: CD < MD <= LM

This scenario is probably less common, but you could imagine situations in which CD is the earliest datetime value. This might happen in situations in which the resource was created with something akin to fork() & exec() semantics: the resource was technically created at a certain datetime, but it did not acquire its own state until a later datetime, reflected in the MD & LM values.

For example, a transactional archive might record as CD the first datetime at which a resource returns a 200 response, but might choose to delay archiving Mementos until the resource's state is something other than "Welcome to Apache". In this scenario, you could have:

Creation-Datetime: Wed, 05 Mar 2008 20:16:49 GMT
Memento-Datetime: Fri, 05 Nov 2010 23:25:19 GMT
Last-Modified: Fri, 05 Nov 2010 23:25:19 GMT

The MD and LM datetimes could also vary as described in Case 2.
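The four cases can be told apart mechanically once the three values are parsed. The sketch below uses a hypothetical classify() helper, and it treats Creation-Datetime as a value the archive happens to know, since no such HTTP header is defined.

```python
from email.utils import parsedate_to_datetime


def classify(cd, md, lm):
    """Return the case number (1-4) matching the ordering of CD, MD and LM.

    Arguments are HTTP-date strings. Returns None for any other ordering.
    """
    cd, md, lm = (parsedate_to_datetime(v) for v in (cd, md, lm))
    if cd == md == lm:
        return 1
    if cd == md < lm:
        return 2
    if md < cd <= lm:
        return 3
    if cd < md <= lm:
        return 4
    return None


MAR_2008 = "Wed, 05 Mar 2008 20:16:49 GMT"
NOV_2010 = "Fri, 05 Nov 2010 23:25:19 GMT"

print(classify(MAR_2008, MAR_2008, MAR_2008))  # Case 1
print(classify(MAR_2008, MAR_2008, NOV_2010))  # Case 2
print(classify(NOV_2010, MAR_2008, NOV_2010))  # Case 3
print(classify(MAR_2008, NOV_2010, NOV_2010))  # Case 4
```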

Creation Datetime Is Often Unavailable

To illustrate the differences between the various datetime concepts, the above examples have discussed Creation Datetime as if it is a commonly available value. However, this is most often not the case -- in fact, there is no defined HTTP response header that corresponds to Creation Datetime. This is due to the historical limitation of Unix inodes (i.e., metadata for files), which track three notions of time: atime (access time of the file), mtime (modification time of the file), and ctime (modification time of the inode). Modern content management systems might keep track of Creation Datetime, but it is not formally defined at the HTTP level.
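The three inode timestamps are easy to inspect. The sketch below uses a hypothetical inode_times() helper and assumes a Unix-like system; on Windows, st_ctime has historically meant something different, so the labels would not carry over.

```python
import os
import tempfile
from datetime import datetime, timezone


def inode_times(path):
    """Return the atime, mtime and ctime of `path` as aware UTC datetimes."""
    st = os.stat(path)
    return {
        "atime": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),  # last access
        "mtime": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),  # last content change
        "ctime": datetime.fromtimestamp(st.st_ctime, tz=timezone.utc),  # last inode change (Unix)
    }


# Create a throwaway file and show its timestamps; none of them is a creation time.
with tempfile.NamedTemporaryFile() as tmp:
    times = inode_times(tmp.name)
    for name, value in times.items():
        print(name, value.isoformat())
```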

Summary

The examples above illustrate how the three notions of datetime, although obviously related, have slightly different semantics. It should also be clear that a Memento's Memento-Datetime is not simply a Creation-Datetime or Last-Modified value inherited from the original resource for which it is a Memento. Rather than overload an existing HTTP response header (such as Last-Modified), we have introduced the Memento-Datetime (née Content-Datetime) response header. Additional information about Memento headers, Link rel types, and HTTP interactions can be found at mementoweb.org.

-- Michael


* Datetime = neologism of "date" & "time": the former is often understood to have a granularity of days, and the latter a granularity of seconds.
