What is a composite memento?
A Memento is an archived copy of a web resource (RFC 7089). The datetime when the copy was archived is called its Memento-Datetime. A composite memento is a root resource, such as an HTML web page, together with all of the embedded resources (images, CSS, etc.) required for a complete presentation. Composite mementos can be thought of as a tree structure: the root resource embeds other resources, which may themselves embed resources, and so on. The figure below shows this tree structure and a composite memento of the ODU Computer Science home page as archived by the Internet Archive on 2005-05-14 01:36:08 GMT. Or does it?
Hints of Temporal Incoherence
Consider the following weather report, which was captured 2004-12-09 19:09:26 GMT. The Memento-Datetime can be found in the URI, and the December 9, 2004 capture date is clearly visible near the upper right. Look closely at the description of Current Conditions and the radar image. How can there be no clouds on the radar when the current conditions are light drizzle? Something is wrong here. We have encountered temporal incoherence. This particular incoherence is caused by inherent delays in the capture process used by Heritrix and other crawler-based web archives. In this case, the radar image was captured much later (9 months!) than the web page itself. However, there is no indication of this mismatch.
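As a small illustration of reading the Memento-Datetime from a URI, here is a minimal sketch. It assumes a Wayback-style URI in which a 14-digit `YYYYMMDDhhmmss` timestamp follows `/web/`; the function name and the example URI are hypothetical and are not taken from the weather report above.

```python
import re
from datetime import datetime, timezone

# Assumption: Wayback-style URIs carry a 14-digit timestamp after "/web/".
_TIMESTAMP = re.compile(r"/web/(\d{14})")


def memento_datetime_from_uri(uri: str) -> datetime:
    """Return the Memento-Datetime embedded in a Wayback-style URI (as UTC)."""
    match = _TIMESTAMP.search(uri)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {uri!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical URI, used only to show the timestamp format.
example_uri = "http://web.archive.org/web/20041209190926/http://example.com/weather"
print(memento_datetime_from_uri(example_uri).isoformat())
```

Other archives may format their URIs differently, in which case the Memento-Datetime response header defined by RFC 7089 is the more reliable source.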
A Framework for Evaluating Temporal Coherence
To study the temporal coherence of composite mementos, we needed a framework. The framework details four coherence states and a series of patterns describing the relationships between root and embedded mementos. The four states and sample patterns are described below. The technical report describing the framework is available on arXiv.
Prima Facie Coherent: An embedded memento is prima facie coherent when evidence shows that it existed in its archived state at the time the root was captured. The figure below illustrates the most common case. Here the embedded memento was captured after the root but modified before the root. Why Last-Modified matters is discussed in my previous post on the importance of header replay.
Possibly Coherent: An embedded memento is possibly coherent when evidence shows that it might have existed in its archived state at the time the root was captured. The figure below illustrates this case. Here the embedded memento was captured before the root.
Probably Violative: An embedded memento is probably violative when evidence shows that it might not have existed in its archived state at the time the root was captured. The figure below illustrates this case. Here the embedded memento was captured after the root, but its Last-Modified datetime is unknown.
Prima Facie Violative: An embedded memento is prima facie violative when evidence shows that it did not exist in its archived state at the time the root was captured. The figure below illustrates this case. Here the embedded memento was captured after the root and was also modified after the root.
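The four sample patterns above can be sketched as a small classifier. This is only an illustration of the patterns described in this post, not the code used in the study, and the full framework contains more patterns than these. The names `Coherence` and `classify` are hypothetical, and the handling of equal timestamps is my assumption.

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class Coherence(Enum):
    PRIMA_FACIE_COHERENT = "prima facie coherent"
    POSSIBLY_COHERENT = "possibly coherent"
    PROBABLY_VIOLATIVE = "probably violative"
    PRIMA_FACIE_VIOLATIVE = "prima facie violative"


def classify(
    root_captured: datetime,
    embedded_captured: datetime,
    embedded_last_modified: Optional[datetime] = None,
) -> Coherence:
    """Map one embedded memento to a coherence state using the sample patterns."""
    if embedded_captured < root_captured:
        # Captured before the root: it might still have been current.
        return Coherence.POSSIBLY_COHERENT
    if embedded_captured == root_captured:
        # Assumption: a capture at the same instant as the root is coherent.
        return Coherence.PRIMA_FACIE_COHERENT
    # From here on, the embedded memento was captured after the root.
    if embedded_last_modified is None:
        return Coherence.PROBABLY_VIOLATIVE
    if embedded_last_modified <= root_captured:
        # Last-Modified and Memento-Datetime bracket the root's capture time.
        return Coherence.PRIMA_FACIE_COHERENT
    return Coherence.PRIMA_FACIE_VIOLATIVE


# Illustrative datetimes only; they are not measurements from the study.
root = datetime(2005, 5, 14, 1, 36, 8)
late_capture = datetime(2005, 6, 1, 0, 0, 0)
print(classify(root, late_capture, datetime(2005, 4, 1)).value)
print(classify(root, late_capture).value)
```

Returning an enum rather than a boolean keeps the distinction between "might not have existed" and "did not exist", which is the point of having four states.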
Only One in Five Archived Web Pages Existed as Presented
Using the framework, we evaluated the temporal coherence of 82,425 composite mementos. These contained 1,623,127 embedded URIs, of which 1,332,993 were available in a web archive. Composite mementos were recomposed using single and multiple archives and two heuristics: minimum distance and bracket.
Single and multiple archives: For single-archive recomposition, all embedded mementos were selected from the same archive as the root. For multiple-archive recomposition, embedded mementos were selected from any of the 15 archives included in the study.
Heuristics: The minimum distance (or nearest) heuristic selects between multiple captures for the same URI by choosing the memento whose Memento-Datetime is nearest to the root's Memento-Datetime, whether before or after it. The bracket heuristic also takes the Last-Modified datetime into account. When a memento's Last-Modified datetime and Memento-Datetime "bracket" the root's Memento-Datetime (as in Prima Facie Coherent above), it is selected even if it is not the closest.
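The two heuristics can be sketched roughly as follows. This is an illustrative sketch and not the study's implementation. The `Capture` class and both function names are hypothetical. I also assume that when several captures bracket the root the nearest of them is chosen, and that when none do the bracket heuristic falls back to minimum distance.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Capture:
    """One archived capture of an embedded URI (hypothetical structure)."""
    memento_datetime: datetime
    last_modified: Optional[datetime] = None


def minimum_distance(root_dt: datetime, captures: Sequence[Capture]) -> Capture:
    """Pick the capture nearest in time to the root, before or after it."""
    # On an exact tie, min() keeps the first capture in the sequence.
    return min(captures, key=lambda c: abs(c.memento_datetime - root_dt))


def bracket(root_dt: datetime, captures: Sequence[Capture]) -> Capture:
    """Prefer a capture whose Last-Modified and Memento-Datetime bracket the root."""
    bracketing = [
        c for c in captures
        if c.last_modified is not None
        and c.last_modified <= root_dt <= c.memento_datetime
    ]
    # Assumption: nearest bracketing capture wins; otherwise fall back to nearest.
    return minimum_distance(root_dt, bracketing or captures)


# Illustrative datetimes only.
root_dt = datetime(2005, 5, 14, 1, 36, 8)
near = Capture(datetime(2005, 5, 13, 0, 0, 0))
bracketing_capture = Capture(datetime(2005, 6, 1), last_modified=datetime(2005, 4, 1))
print(minimum_distance(root_dt, [near, bracketing_capture]) is near)
print(bracket(root_dt, [near, bracketing_capture]) is bracketing_capture)
```

In this example the nearer capture has no Last-Modified datetime, so the bracket heuristic passes over it in favor of the more distant capture that can be shown to have existed unchanged when the root was captured.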
We found that only 38.7% of web pages are temporally coherent and that only 17.9% (roughly 1 in 5) of web pages are temporally coherent and can be fully recomposed (i.e., they have no missing resources).
The paper can be downloaded from the ACM Digital Library or from my local copy. The slides from the Hypertext'15 talk follow.
One last thing: I would like to thank Ted Goranson for presenting the slides at Hypertext 2015 when we could not attend.
-- Scott G. Ainsworth