Tuesday, December 8, 2015

2015-12-08: Evaluating the Temporal Coherence of Composite Mementos

When an archived web page is viewed using the Wayback Machine, the archival datetime is easy to determine from the URI and the Wayback Machine's display.  The archival datetime of embedded resources (images, CSS, etc.) is another story.  And what stories their archival datetimes can tell.  These stories are the topic of my recent research and Hypertext 2015 publication.  This post introduces composite mementos, describes how their temporal (in-)coherence is evaluated, and provides an overview of my research results.

 

What is a composite memento?

 

A Memento is an archived copy of a web resource (RFC 7089).  The datetime when the copy was archived is called its Memento-Datetime.  A composite memento is a root resource such as an HTML web page and all of the embedded resources (images, CSS, etc.) required for a complete presentation.  Composite mementos can be thought of as a tree structure.  The root resource embeds other resources, which may themselves embed resources, etc.  The figure below shows this tree structure and a composite memento of the ODU Computer Science home page as archived by the Internet Archive on 2005-05-14 01:36:08 GMT.  Or does it?
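To make the tree structure concrete, here is a minimal Python sketch of a composite memento.  The class name, field names, URIs, and the embedded resources' datetimes are all hypothetical placeholders chosen for illustration; only the root's Memento-Datetime is taken from the example above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterator, Optional


@dataclass
class Memento:
    """One archived resource; a root memento plus its subtree is a composite memento."""
    uri: str                                   # URI of the original resource
    memento_datetime: datetime                 # when this copy was archived
    last_modified: Optional[datetime] = None   # Last-Modified header, if replayed
    embedded: list[Memento] = field(default_factory=list)

    def walk(self) -> Iterator[Memento]:
        """Yield this memento, then everything embedded beneath it (depth first)."""
        yield self
        for child in self.embedded:
            yield from child.walk()


def utc(*parts: int) -> datetime:
    """Shorthand for a timezone-aware UTC datetime."""
    return datetime(*parts, tzinfo=timezone.utc)


# Placeholder URIs and placeholder embedded datetimes (assumptions, not archive data).
root = Memento(
    "http://example.org/",
    utc(2005, 5, 14, 1, 36, 8),
    embedded=[
        Memento(
            "http://example.org/style.css",
            utc(2005, 5, 20, 0, 0, 0),
            embedded=[Memento("http://example.org/bg.gif", utc(2005, 6, 1, 0, 0, 0))],
        ),
        Memento("http://example.org/logo.gif", utc(2005, 5, 10, 0, 0, 0)),
    ],
)

for m in root.walk():
    print(m.memento_datetime.isoformat(), m.uri)
```

Walking the tree makes the point of the closing question visible: each node carries its own Memento-Datetime, and nothing forces those datetimes to match the root's.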


 

Hints of Temporal Incoherence

 

Consider the following weather report that was captured on 2004-12-09 19:09:26 GMT.  The Memento-Datetime can be found in the URI and the December 9, 2004 capture date is clearly visible near the upper right.  Look closely at the description of Current Conditions and the radar image.  How can there be no clouds on the radar when the current conditions are light drizzle?  Something is wrong here.  We have encountered temporal incoherence.  This particular incoherence is caused by inherent delays of the capture process used by Heritrix and other crawler-based web archives.  In this case, the radar image was captured much later (9 months!) than the web page itself.  However, the archived page gives no indication of this delay.
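Because the Wayback Machine encodes the Memento-Datetime as a 14-digit YYYYMMDDhhmmss timestamp in the URI path, it can be read programmatically.  The following Python sketch shows one way to do so; the function name and the example URI (with a placeholder weather.example.com host) are hypothetical, and the regular expression assumes the common /web/<timestamp>/ URI layout.

```python
import re
from datetime import datetime, timezone

# Assumes the usual Wayback layout: .../web/<14-digit timestamp>[modifier]/<original URI>
_WAYBACK_TIMESTAMP = re.compile(r"/web/(\d{14})")


def memento_datetime_from_uri(uri_m: str) -> datetime:
    """Extract the Memento-Datetime (UTC) from a Wayback-style memento URI."""
    match = _WAYBACK_TIMESTAMP.search(uri_m)
    if match is None:
        raise ValueError(f"no 14-digit Wayback timestamp found in {uri_m!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical URI using the capture datetime quoted in this post.
root_dt = memento_datetime_from_uri(
    "http://web.archive.org/web/20041209190926/http://weather.example.com/"
)
print(root_dt.isoformat())
```

Applying the same function to the URI of each embedded memento and subtracting the root's datetime gives the capture delay for that resource, which is how a months-long gap like the one in the radar image would show up.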



 

A Framework for Evaluating Temporal Coherence


Studying the temporal coherence of composite mementos required a framework.  The framework defines four coherence states and a series of patterns describing the relationships between root and embedded mementos.  The four states and sample patterns are described below.  The technical report describing the framework is available on arXiv.

 

Prima Facie Coherent

An embedded memento is prima facie coherent when evidence shows that it existed in its archived state at the time the root was captured.  The figure below illustrates the most common case.  Here the embedded memento was captured after the root but modified before the root.  The role of Last-Modified is discussed in my previous post on the importance of header replay.


 

Possibly Coherent

An embedded memento is possibly coherent when evidence shows that it might have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured before the root.


 

Probably Violative

An embedded memento is probably violative when evidence shows that it might not have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root, but its Last-Modified datetime is unknown.


Prima Facie Violative

An embedded memento is prima facie violative when evidence shows that it did not exist in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root and was also modified after the root.
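The four sample patterns above can be sketched as a small classification function.  This is a simplified illustration covering only the patterns described in this post, not the full set of patterns in the technical report; the names are hypothetical, and the handling of exact ties (an embedded memento captured or modified at the same instant the root was captured) is my assumption.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Coherence(Enum):
    PRIMA_FACIE_COHERENT = "prima facie coherent"
    POSSIBLY_COHERENT = "possibly coherent"
    PROBABLY_VIOLATIVE = "probably violative"
    PRIMA_FACIE_VIOLATIVE = "prima facie violative"


def classify(
    root_captured: datetime,
    embedded_captured: datetime,
    embedded_last_modified: Optional[datetime] = None,
) -> Coherence:
    """Classify one embedded memento against its root (sample patterns only)."""
    if embedded_captured == root_captured:
        # Assumption: a simultaneous capture existed as archived at root time.
        return Coherence.PRIMA_FACIE_COHERENT
    if embedded_captured < root_captured:
        # Captured before the root: it might have changed before the root was captured.
        return Coherence.POSSIBLY_COHERENT
    # From here on, the embedded memento was captured after the root.
    if embedded_last_modified is None:
        return Coherence.PROBABLY_VIOLATIVE
    if embedded_last_modified <= root_captured:  # tie treated as coherent (assumption)
        return Coherence.PRIMA_FACIE_COHERENT
    return Coherence.PRIMA_FACIE_VIOLATIVE


# Root datetime from the weather example; the embedded datetime is a placeholder.
root_dt = datetime(2004, 12, 9, 19, 9, 26, tzinfo=timezone.utc)
late_capture = datetime(2005, 9, 1, 0, 0, 0, tzinfo=timezone.utc)
print(classify(root_dt, late_capture).value)
```

A late capture with no Last-Modified datetime lands in the probably violative state, while the same capture with a Last-Modified datetime earlier than the root's capture would be prima facie coherent.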

 

 

 

Only One in Five Archived Web Pages Existed as Presented


Using the framework, we evaluated the temporal coherence of 82,425 composite mementos. These contained 1,623,127 embedded URIs, of which 1,332,993 (about 82%) were available in a web archive.  Composite mementos were recomposed using single and multiple archives and two heuristics: minimum distance and bracket.

Single and multiple archives: For single-archive recomposition, all embedded mementos were selected from the same archive as the root. For multiple-archive recomposition, embedded mementos were selected from any of the 15 archives included in the study.

Heuristics:  The minimum distance (or nearest) heuristic selects among multiple captures of the same URI by choosing the memento whose Memento-Datetime is nearest to the root's Memento-Datetime, whether before or after it. The bracket heuristic also takes the Last-Modified datetime into account. When a memento's Last-Modified datetime and Memento-Datetime "bracket" the root's Memento-Datetime (as in Prima Facie Coherent above), it is selected even if it is not the closest.
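The difference between the two heuristics is easiest to see in code.  The Python sketch below is an illustration under stated assumptions, not the implementation used in the study: the names are hypothetical, the capture datetimes are placeholders, and two details are my own choices (when several captures bracket the root, the nearest of them is picked; when none do, the bracket heuristic falls back to minimum distance).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Capture:
    memento_datetime: datetime
    last_modified: Optional[datetime] = None


def minimum_distance(root_dt: datetime, captures: Sequence[Capture]) -> Capture:
    """Pick the capture nearest to the root, before or after (needs a non-empty list)."""
    return min(captures, key=lambda c: abs(c.memento_datetime - root_dt))


def bracket(root_dt: datetime, captures: Sequence[Capture]) -> Capture:
    """Prefer captures whose Last-Modified and Memento-Datetime bracket the root."""
    bracketing = [
        c for c in captures
        if c.last_modified is not None
        and c.last_modified <= root_dt <= c.memento_datetime
    ]
    # Assumptions: nearest bracketing capture wins; otherwise use minimum distance.
    return minimum_distance(root_dt, bracketing or captures)


root_dt = datetime(2005, 5, 14, 1, 36, 8, tzinfo=timezone.utc)
near = Capture(datetime(2005, 5, 13, 0, 0, 0, tzinfo=timezone.utc))
far_but_bracketing = Capture(
    memento_datetime=datetime(2005, 6, 1, 0, 0, 0, tzinfo=timezone.utc),
    last_modified=datetime(2005, 4, 1, 0, 0, 0, tzinfo=timezone.utc),
)
candidates = [near, far_but_bracketing]

print(minimum_distance(root_dt, candidates) is near)           # nearest capture
print(bracket(root_dt, candidates) is far_but_bracketing)      # bracketing capture
```

In this example the nearest capture is only possibly coherent, while the more distant capture is prima facie coherent, which is why the bracket heuristic prefers it.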

We found that only 38.7% of web pages are temporally coherent and that only 17.9% (roughly 1 in 5) of web pages are temporally coherent and can be fully recomposed (i.e., they have no missing resources).

The paper can be downloaded from the ACM Digital Library or from my local copy.  The slides from the Hypertext'15 talk follow.




One last thing: I would like to thank Ted Goranson for presenting the slides at Hypertext 2015 when we could not attend.

-- Scott G. Ainsworth
