2015-08-28 Original Header Replay Considered Coherent
Introduction
As web archives have advanced over time, their ability to capture and play back web content has grown. The Memento Protocol, defined in RFC 7089, is an HTTP extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have analyzed the temporal coherence of web archives. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives play back original headers.
Archive Headers and Original Headers
Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.
HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback
Try to answer the question "Was the representation provided by the web archive valid for Tue, 28 Apr 2015 12:00:00 GMT?" (i.e., the day before). The best answer possible is maybe. Because I have spent many hours using the Computer Science web site, I know the site changes infrequently. Given this knowledge, I might upgrade the answer from maybe to probably. The difficulty arises because the Last-Modified header reflects the date archived instead of the date the image itself was last modified. And, although it is true that the memento (archived copy) was indeed modified Wed, 29 Apr 2015 15:15:23 GMT, this merging of original resource Last-Modified and memento Last-Modified loses valuable information. (Read "Memento-Datetime is not Last-Modified" for more details.)
Now consider the headers (Figure 2) for another copy that was archived Sun, 14 Mar 2015 22:21:07 GMT. Take special note of the X-Archive-Orig-* headers. These are a playback of the original headers that were included in the response when the logo image was captured by the web archive.
HTTP/1.1 200 OK
Content-Type: image/gif
X-Archive-Orig-etag: "52d202fb-19db"
X-Archive-Orig-last-modified: Sun, 12 Jan 2014 02:50:35 GMT
X-Archive-Orig-expires: Sat, 19 Dec 2015 13:01:55 GMT
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-cache-control: max-age=31104000
X-Archive-Orig-connection: keep-alive
X-Archive-Orig-date: Wed, 24 Dec 2014 13:01:55 GMT
X-Archive-Orig-content-type: image/gif
X-Archive-Orig-server: nginx
X-Archive-Orig-content-length: 6619
Memento-Datetime: Sun, 14 Mar 2015 22:21:07 GMT
Figure 2. Original Header Playback
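Compare the Memento-Datetime (which is the archive datetime) and the X-Archive-Orig-last-modified headers while answering this question: "Was the representation provided by the web archive valid for Tue, 13 Mar 2015 12:00:00 GMT?". Clearly the answer is yes: that datetime falls between the original Last-Modified datetime (Sun, 12 Jan 2014 02:50:35 GMT) and the Memento-Datetime, and the image had not been modified between those two points.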
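The comparison above can be sketched in a few lines of Python. This is only an illustration, not any archive's implementation: the names `original_headers` and `valid_at` are made up for this post, and the sketch assumes that a representation is valid from its original Last-Modified datetime through its Memento-Datetime and that all datetimes are expressed in GMT.

```python
from email.utils import parsedate_to_datetime

PREFIX = "x-archive-orig-"


def original_headers(headers):
    """Collect replayed original headers, keyed by lowercase name without the prefix."""
    return {
        name.lower()[len(PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(PREFIX)
    }


def valid_at(headers, when):
    """Return True or False if the headers settle the question, None if they cannot."""
    archive = {name.lower(): value for name, value in headers.items()}
    original = original_headers(headers)
    if "last-modified" not in original or "memento-datetime" not in archive:
        return None  # without both datetimes the best answer is "maybe"
    last_modified = parsedate_to_datetime(original["last-modified"])
    archived = parsedate_to_datetime(archive["memento-datetime"])
    # Assumption: the resource was unchanged from Last-Modified until capture.
    return last_modified <= parsedate_to_datetime(when) <= archived


figure_2 = {
    "Content-Type": "image/gif",
    "X-Archive-Orig-last-modified": "Sun, 12 Jan 2014 02:50:35 GMT",
    "Memento-Datetime": "Sun, 14 Mar 2015 22:21:07 GMT",
}
print(valid_at(figure_2, "Tue, 13 Mar 2015 12:00:00 GMT"))  # True
```

Given only the Figure 1 headers, the same function returns `None`, which corresponds to the "maybe" answer discussed earlier.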
Why This Matters
For the casual web archive user, the previous example may seem like much nit-picky detail. Still, consider the Weather Underground page archived on Thu, 09 Dec 2004 19:09:26 GMT and shown in Figure 3.
Figure 3. Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
HTTP/1.1 200 OK
Memento-Datetime: Mon, 12 Sep 2005 22:34:45 GMT
X-Archive-Orig-last-modified: Mon, 12 Sep 2005 22:32:24 GMT
Figure 4. Prima Facie Coherence Headers
Figure 4 shows the headers returned for the radar image embedded in that page. Clearly the radar image was captured much later than the web page, over 9 months later in fact! But this alone does not prove the radar image is the incorrect image (perhaps Weather Underground radar images were broken on 09 Dec 2004). However, the Memento-Datetime and X-Archive-Orig-last-modified headers tell the full story, showing that not only was the radar image captured well after the web page was archived, but also that the radar image was modified well after the web page was archived. Thus, together Memento-Datetime and X-Archive-Orig-Last-Modified are prima facie evidence that the radar image is temporally violative with respect to the archived web page in which it is displayed. Figure 5 illustrates this pattern. The black left-to-right arrow is time. The black diamond and text represent the web page; the green represents the radar image. The green line shows that the radar image X-Archive-Orig-Last-Modified and Memento-Datetime bracket the web page archival time. This pattern and others are detailed in our framework technical report.
Figure 5. Prima Facie Coherence
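The reasoning above can be written down as a small classifier. This is a simplified sketch and not the full framework from the technical report: the function name `classify` is hypothetical, and the "indeterminate" label is a placeholder for every case that the two headers alone do not settle.

```python
from datetime import datetime, timezone


def classify(page_archived, last_modified, memento_datetime):
    """Classify an embedded memento relative to the archived page that displays it."""
    if last_modified is None:
        return "indeterminate"  # no original Last-Modified was played back
    if last_modified <= page_archived <= memento_datetime:
        # The embedded resource's lifetime brackets the page's archival time.
        return "prima facie coherent"
    if last_modified > page_archived:
        # The embedded resource was modified after the page was archived.
        return "prima facie violative"
    return "indeterminate"  # captured before the page; later changes are unknown


# Weather Underground example: page from Figure 3, radar image headers from Figure 4.
page = datetime(2004, 12, 9, 19, 9, 26, tzinfo=timezone.utc)
radar_last_modified = datetime(2005, 9, 12, 22, 32, 24, tzinfo=timezone.utc)
radar_memento = datetime(2005, 9, 12, 22, 34, 45, tzinfo=timezone.utc)
print(classify(page, radar_last_modified, radar_memento))  # prima facie violative
```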
But Does Header Playback Really Matter?
Of course, if few archived images and other embedded resources include Last-Modified headers, the overall effect could be inconsequential. However, the results to be published at Hypertext'15 show that using the Last-Modified header makes a significant coherence improvement: using Last-Modified to select embedded resources increased mean prima facie coherence from ~41% to ~55% compared to using just Memento-Datetime. And, at the time the research was conducted, only the Internet Archive included Last-Modified playback. If the other 14 web archives included in the study also implemented header playback, we project that mean prima facie coherence would have been about 80%!
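To make "using Last-Modified to select embedded resources" concrete, here is one possible selection rule, sketched in Python. It is an assumption for illustration only and is not the procedure used in the Hypertext'15 paper: the `Candidate` type and `select` function are hypothetical names. The rule prefers a memento whose original Last-Modified and Memento-Datetime bracket the page's archival time, and otherwise falls back to the memento whose Memento-Datetime is nearest.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    memento_datetime: datetime
    last_modified: Optional[datetime] = None  # None when no header was played back


def brackets(candidate, page_archived):
    """True when Last-Modified and Memento-Datetime surround the page's archival time."""
    return (
        candidate.last_modified is not None
        and candidate.last_modified <= page_archived <= candidate.memento_datetime
    )


def select(candidates, page_archived):
    """Pick a bracketing memento if one exists, else the nearest by Memento-Datetime."""
    if not candidates:
        return None
    bracketing = [c for c in candidates if brackets(c, page_archived)]
    pool = bracketing or candidates
    return min(pool, key=lambda c: abs(c.memento_datetime - page_archived))
```

With only Memento-Datetime available, the rule reduces to nearest-datetime selection, which is why archives that do not play back Last-Modified cannot benefit from it.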
Web Archives Today
When the research leading to the Hypertext'15 paper was conducted, only the Internet Archive included Last-Modified playback. This limited prima facie coherence determination to only embedded resources retrieved from the Internet Archive. As shown in Table 1, additional web archives now play back original headers. The table also shows which archives implement the Memento Protocol (and are therefore RFC 7089 compliant) and which archives use OpenWayback, which already implements header replay. Although header playback is a long way from universal, progress is happening. We look forward to continuing coherence improvement as additional web archives implement header playback and the Memento Protocol.
Table 1. Current Web Archive Status
Web Archive | Header Playback? | Memento Compliant? | OpenWayback? |
---|---|---|---|
Archive-It | Yes | Yes | Yes |
archive.today | No | Yes | No |
arXiv | No | No | No |
Bibliotheca Alexandrina Web Archive | Unknown¹ | Yes | Yes |
Canadian Government Web Archive | No | Proxy | No |
Croatian Web Archive | No | Proxy | No |
DBPedia Archive | No | Yes | No |
Estonian Web Archive | No | Proxy | No |
GitHub | No | Proxy | No |
Icelandic Web Archive | Yes | Yes | Yes |
Internet Archive | Yes | Yes | Yes |
Library of Congress Web Archive | Yes | Yes | Yes |
NARA Web Archive | No | Proxy | Yes |
Orain | No | Proxy | No |
PastPages Web Archive | No | Yes | No |
Portuguese Web Archive | No | Proxy | No |
PRONI Web Archive | No | Yes | Yes |
Slovenian Web Archive | No | Proxy | No |
Stanford Web Archive | Yes | Yes | Yes |
UK Government Web Archive | No | Yes | Yes |
UK Parliament's Web Archive | No | Yes | Yes |
UK Web Archive | Yes | Yes | Yes |
Web Archive Singapore | No | Proxy | No |
WebCite | No | Proxy | No |
Wikipedia | No | Proxy | No |
¹ Unavailable at the time this post was written.
Wrap Up
Web archives featuring both the capture and replay of original headers show significantly better temporal coherence in recomposed web pages. Currently, web archives using Heritrix and OpenWayback implement these features; no archives using other software are known to do so. Implementing original header capture and replay is highly recommended, as it will allow implementation of improved recomposition heuristics (which is a topic for another day and another post).
— Scott G. Ainsworth