Friday, August 28, 2015

2015-08-28 Original Header Replay Considered Coherent


As web archives have advanced over time, their ability to capture and play back web content has grown. The Memento Protocol, specified in RFC 7089, is an HTTP extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have analyzed the temporal coherence of web archives. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives play back original headers.
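For readers new to Memento, here is a minimal sketch of what time-based content negotiation looks like from the client side. Under RFC 7089 the client sends an Accept-Datetime header to a TimeGate, and the archive's response identifies the selected memento with Memento-Datetime. The TimeGate endpoint, the target URI, and the helper name `timegate_request` are assumptions chosen for illustration; the request is built but not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumed TimeGate endpoint; substitute the TimeGate of the archive being queried.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri, when):
    """Build (but do not send) a TimeGate request for `uri` as of `when`.

    `when` must be a timezone-aware datetime in UTC.
    """
    req = Request(TIMEGATE + uri, method="HEAD")
    # RFC 7089: the desired datetime travels in the Accept-Datetime header.
    req.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return req


# Hypothetical target URI, used only for illustration.
req = timegate_request(
    "http://www.cs.odu.edu/",
    datetime(2015, 4, 28, 12, 0, 0, tzinfo=timezone.utc),
)
print(req.full_url)
# urllib stores header names in capitalized form, hence "Accept-datetime".
print(req.get_header("Accept-datetime"))
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.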

Archive Headers and Original Headers

Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.

HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback

Try to answer the question "Was the representation provided by the web archive valid for Tue, 28 Apr 2015 12:00:00 GMT?" (i.e., the day before). The best answer possible is maybe. Because I have spent many hours using the Computer Science web site, I know the site changes infrequently. Given this knowledge, I might upgrade the answer from maybe to probably. The difficulty arises because the Last-Modified header reflects the date archived instead of the date the image itself was last modified. And, although it is true that the memento (archived copy) was indeed modified Wed, 29 Apr 2015 15:15:23 GMT, this merging of original resource Last-Modified and memento Last-Modified loses valuable information. (Read Memento-Datetime is not Last-Modified for more details.)

Now consider the headers (Figure 2) for another copy that was archived Sun, 14 Mar 2015 22:21:07 GMT. Take special note of the X-Archive-Orig-* headers. These are a playback of the original headers that were included in the response when the logo image was captured by the web archive.

HTTP/1.1 200 OK
Content-Type: image/gif
X-Archive-Orig-etag: "52d202fb-19db"
X-Archive-Orig-last-modified: Sun, 12 Jan 2014 02:50:35 GMT
X-Archive-Orig-expires: Sat, 19 Dec 2015 13:01:55 GMT
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-cache-control: max-age=31104000
X-Archive-Orig-connection: keep-alive
X-Archive-Orig-date: Wed, 24 Dec 2014 13:01:55 GMT
X-Archive-Orig-content-type: image/gif
X-Archive-Orig-server: nginx
X-Archive-Orig-content-length: 6619
Memento-Datetime: Sun, 14 Mar 2015 22:21:07 GMT
Figure 2. Original Header Playback

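Compare the Memento-Datetime (which is the archive datetime) and the X-Archive-Orig-last-modified headers while answering this question: "Was the representation provided by the web archive valid for Tue, 13 Mar 2015 12:00:00 GMT?" Clearly the answer is yes: when the image was captured, the original server reported that it had not been modified since 12 Jan 2014, so the same representation was current for the whole interval between that date and the capture.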
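The reasoning behind that answer can be sketched in a few lines of Python. This is an illustration only: the helper names `original_headers` and `known_valid_at` are hypothetical, and the header values are copied from Figure 2. A result of False means only that validity cannot be demonstrated from the headers (the "maybe" case), not that the representation was invalid.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

ORIG_PREFIX = "x-archive-orig-"

# Header values copied from Figure 2 (only the ones used here).
FIGURE_2 = {
    "X-Archive-Orig-last-modified": "Sun, 12 Jan 2014 02:50:35 GMT",
    "Memento-Datetime": "Sun, 14 Mar 2015 22:21:07 GMT",
}


def original_headers(headers):
    """Return the replayed original headers with the archive prefix removed."""
    return {
        name.lower()[len(ORIG_PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(ORIG_PREFIX)
    }


def known_valid_at(headers, when):
    """True if `when` lies between original Last-Modified and Memento-Datetime."""
    lowered = {name.lower(): value for name, value in headers.items()}
    orig = original_headers(headers)
    if "last-modified" not in orig or "memento-datetime" not in lowered:
        return False  # not enough evidence; the answer stays "maybe"
    modified = parsedate_to_datetime(orig["last-modified"])
    captured = parsedate_to_datetime(lowered["memento-datetime"])
    return modified <= when <= captured


target = datetime(2015, 3, 13, 12, 0, 0, tzinfo=timezone.utc)
print(known_valid_at(FIGURE_2, target))
```

With headers like those in Figure 1, which carry no X-Archive-Orig-last-modified, the function returns False because the original modification time is unknown.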

Why This Matters

For the casual web archive user, the previous example may seem like just nit-picky detail. Still, consider the Weather Underground page archived on Thu, 09 Dec 2004 19:09:26 GMT and shown in Figure 3.

Figure 3. Weather Underground as archived Thu, 09 Dec 2004 19:09:26 GMT
The Weather Underground page (like most) is a composition of many resources including the web page itself, images, style sheets, and JavaScript. Note the conflict between the forecast of light drizzle and the completely clear radar image. Figure 4 shows the relevant headers returned for the radar image:

HTTP/1.1 200 OK
Memento-Datetime: Mon, 12 Sep 2005 22:34:45 GMT
X-Archive-Orig-last-modified: Mon, 12 Sep 2005 22:32:24 GMT
Figure 4. Prima Facie Coherence Headers

Clearly the radar image was captured much later than the web page, over 9 months later in fact! But this alone does not prove the radar image is the incorrect image (perhaps Weather Underground radar images were broken on 09 Dec 2004). However, the Memento-Datetime and X-Archive-Orig-last-modified headers tell the full story, showing that not only was the radar image captured well after the web page was archived, but also that the radar image was modified well after the web page was archived. Thus, together Memento-Datetime and X-Archive-Orig-last-modified are prima facie evidence that the radar image is temporally violative with respect to the archived web page in which it is displayed. Figure 5 illustrates this pattern. The black left-to-right arrow is time. The black diamond and text represent the web page; the green represents the radar image. The green line shows that the radar image X-Archive-Orig-last-modified and Memento-Datetime bracket the web page archival time. This pattern and others are detailed in our framework technical report.

Figure 5. Prima Facie Coherence
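
The comparison described above can be written down as a small classifier. This is a simplified sketch, not the full framework from the technical report: it covers only the two patterns discussed in this post, and the function name `classify` and the "undetermined" label are assumptions made for illustration.

```python
from email.utils import parsedate_to_datetime


def classify(root_memento_datetime, emb_last_modified, emb_memento_datetime):
    """Classify an embedded resource relative to the page that displays it.

    All arguments are HTTP date strings; `emb_last_modified` may be None
    when the archive does not play back the original Last-Modified header.
    """
    if emb_last_modified is None:
        return "undetermined"  # no original header, no prima facie evidence
    root = parsedate_to_datetime(root_memento_datetime)
    modified = parsedate_to_datetime(emb_last_modified)
    captured = parsedate_to_datetime(emb_memento_datetime)
    if modified > root:
        # The resource changed after the page was archived.
        return "prima facie violative"
    if modified <= root <= captured:
        # Last-Modified and Memento-Datetime bracket the page's archival time.
        return "prima facie coherent"
    return "undetermined"  # patterns not covered by this sketch


# Weather Underground page (Figure 3) and its radar image (Figure 4).
print(classify(
    "Thu, 09 Dec 2004 19:09:26 GMT",
    "Mon, 12 Sep 2005 22:32:24 GMT",
    "Mon, 12 Sep 2005 22:34:45 GMT",
))
```

For the radar image, the original Last-Modified falls after the page's Memento-Datetime, so the sketch reports it as prima facie violative, matching the analysis above.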

But Does Header Playback Really Matter?

Of course, if few archived images and other embedded resources include Last-Modified headers, the overall effect could be inconsequential. However, the results to be published at Hypertext'15 show that using the Last-Modified header makes a significant coherence improvement: using Last-Modified to select embedded resources increased mean prima facie coherence from ~41% to ~55% compared to using just Memento-Datetime. And, at the time the research was conducted, only the Internet Archive included Last-Modified playback. If the other 14 web archives included in the study also implemented header playback, we project that mean prima facie coherence would have been about 80%!

Web Archives Today

When the research leading to the Hypertext'15 paper was conducted, only the Internet Archive included Last-Modified playback. This limited prima facie coherence determination to embedded resources retrieved from the Internet Archive. As shown in Table 1, additional web archives now play back original headers. The table also shows which archives implement the Memento Protocol (and are therefore RFC 7089 compliant) and which archives use OpenWayback, which already implements header replay. Although header playback is a long way from universal, progress is happening. We look forward to continuing coherence improvement as additional web archives implement header playback and the Memento Protocol.

Table 1. Current Web Archive Status
Web Archive                          Header Playback?  Memento Compliant?  OpenWayback?
Archive-It                           Yes Yes Yes No Yes No
arXiv                                No                No                  No
Bibliotheca Alexandrina Web Archive  Unknown¹          Yes                 Yes
Canadian Government Web Archive      No                Proxy               No
Croatian Web Archive                 No                Proxy               No
DBpedia Archive                      No                Yes                 No
Estonian Web Archive                 No                Proxy               No
GitHub                               No                Proxy               No
Icelandic Web Archive                Yes               Yes                 Yes
Internet Archive                     Yes               Yes                 Yes
Library of Congress Web Archive      Yes               Yes                 Yes
NARA Web Archive                     No                Proxy               Yes
Orain                                No                Proxy               No
PastPages Web Archive                No                Yes                 No
Portuguese Web Archive               No                Proxy               No
PRONI Web Archive                    No                Yes                 Yes
Slovenian Web Archive                No                Proxy               No
Stanford Web Archive                 Yes               Yes                 Yes
UK Government Web Archive            No                Yes                 Yes
UK Parliament's Web Archive          No                Yes                 Yes
UK Web Archive                       Yes               Yes                 Yes
Web Archive Singapore                No                Proxy               No
WebCite                              No                Proxy               No
Wikipedia                            No                Proxy               No
¹ Unavailable at the time this post was written.

Wrap Up

Web archives featuring both the capture and replay of original headers show significantly better temporal coherence in recomposed web pages. Currently, web archives using Heritrix and OpenWayback implement these features; no archives using other software are known to do so. Implementing original header capture and replay is highly recommended, as it will allow implementation of improved recomposition heuristics (which is a topic for another day and another post).

— Scott G. Ainsworth
