Tuesday, April 1, 2014

2014-04-01: Yesterday's (Wiki) Page, Today's Image?

Web pages, being complex documents, contain embedded resources like images.  As practitioners of digital preservation well know, ensuring that the correct embedded resource is captured when the main page is preserved presents a very difficult problem.  In A Framework for Evaluation of Composite Memento Temporal Coherence, Scott Ainsworth, Michael L. Nelson, and Herbert Van de Sompel explore this very concept.

Figure 1: Web Archive Weather Underground Example Showing the Different Ages of Embedded Resources
In Figure 1, borrowed from that paper, we see a screenshot of the Web Archive's December 9, 2004 memento from Weather Underground.  Even though the ages of most of these embedded images differ greatly from that of the main page, they don't really affect its meaning.  Of interest is the weather map, which differs by 9 months and shows clear skies even though the forecast on the main page calls for clouds and light rain.

The Web Archive, as a service external to the resource that it is trying to preserve, only has access to resources that exist at the time it can make a crawl, leading to inconsistencies.  Wikis, on the other hand, have access to all resources under their control, embedded or otherwise.

This is why it is surprising that MediaWiki, even though it allows access to all previous revisions of a given page, does not tie the datetime of a page's embedded resources to the datetime of the page revision being viewed.

A pertinent example is that of the Wikipedia article Same-sex marriage law in the United States by state.

Figure 2:  Screenshot of Wikipedia article on Same-sex marriage law in the United States by state
Figure 2 shows the current (as of this writing) version of this article, complete with a color-coded map indicating the types of same-sex marriage laws applying to each state.  In this case, the correctness of the version of the embedded resource is pertinent to the understanding of the article.

Figure 3: Screenshot of the same Wikipedia page, but for a revision from June of 2013
Figure 3 shows a June 2013 revision of this article, with the same color-coded map.  This is a problem: an old revision of the article is displayed with the current version of the map.  When accessing the June 2013 revision of the article on Wikipedia, I get the March 2014 version of the embedded resource.  For this revision to make sense to the reader, the map from Figure 4 should be displayed with the article instead.  As Figure 5 shows, Wikipedia has all previous revisions of this resource.

Figure 4: The June 2013 revision of the embedded map resource
Figure 5:  Listing of all of the revisions of the map resource on Wikipedia
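
The selection rule implied here (show the newest revision of the embedded resource that existed when the article revision was saved) can be sketched in a few lines.  This is only an illustration in Python, not MediaWiki code: the function name select_file_revision and the upload timestamps are hypothetical, and the timestamps are assumed to be 14-digit strings so that string order matches chronological order.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def select_file_revision(file_timestamps: Sequence[str],
                         page_timestamp: str) -> Optional[str]:
    """Return the newest file revision uploaded at or before page_timestamp.

    Timestamps are fixed-width 14-digit strings (year, month, day, hour,
    minute, second), so plain string comparison orders them chronologically.
    """
    ordered = sorted(file_timestamps)
    index = bisect_right(ordered, page_timestamp)
    if index == 0:
        # The file did not exist yet when the page revision was saved.
        return None
    return ordered[index - 1]


# Hypothetical upload history for the map, for illustration only.
map_revisions = [
    "20130301120000",
    "20130610080000",
    "20131115093000",
    "20140320170000",
]

# An article revision saved in late June 2013 should get the June 2013 map.
print(select_file_revision(map_revisions, "20130626000000"))  # 20130610080000
```

With this rule, a reader viewing the June 2013 article revision would see the June 2013 map rather than the March 2014 one.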

For this particular topic, any historian (or paralegal) attempting to trace the changes in these laws will be confused when presented with a map that does not match the text, and may question the validity of the resource as a whole.

We tried to address this issue with the Memento MediaWiki extension.  MediaWiki provides the ImageBeforeProduceHTML hook, which appears to do what we want.  It provides a $file argument, giving access to the LocalFile object for the image.  It also provides a $time argument that signifies the timestamp of the file in 'YYYYMMDDHHIISS' string form, or false for the current version.
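
To make the documented $time format concrete, here is a minimal Python sketch of how such a value could be interpreted.  It is not part of MediaWiki or of our extension; the function name parse_hook_time is hypothetical, and we assume that "II" stands for minutes, following the "i" format character of PHP's date() function.

```python
from datetime import datetime
from typing import Optional, Union


def parse_hook_time(time_arg: Union[str, bool]) -> Optional[datetime]:
    """Interpret a $time-style value.

    A 14-digit 'YYYYMMDDHHIISS' string becomes a datetime; False (meaning
    "current version") becomes None.
    """
    if not time_arg:
        return None
    # %M is minutes, matching the assumed meaning of "II" in the format.
    return datetime.strptime(time_arg, "%Y%m%d%H%M%S")


print(parse_hook_time("20130626153000"))  # 2013-06-26 15:30:00
print(parse_hook_time(False))             # None
```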

We were perplexed when the hook did not perform as expected, so we examined the source of MediaWiki version 1.22.5.  Below we see the makeImageLink function that calls the hook on line 569 of Linker.php.

We see that later on, inside this conditional block, $time is used on line 655 as an argument to the makeThumbLink2 function (bottom of code snippet).
And, within the makeThumbLink2 function, it gets used to make a boolean argument for a call to the function makeBrokenImageLinkObj on line 861.
Back inside the makeImageLink function, we see a second opportunity to use the $time value on line 675, but again it is used to create a boolean argument to the same function.
Note that its timestamp value in 'YYYYMMDDHHIISS' string form is never actually used as prescribed.  So, the documentation for the ImageBeforeProduceHTML hook is incorrect on the use of this $time argument.  In fact, the hook was introduced in MediaWiki version 1.13.0 and this code doesn't appear to have changed much since that time.  It is possible that the $time functionality is intended to be implemented in a future version.
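
The effect of reducing $time to a boolean can be shown with a small Python sketch.  It only mimics the behavior described above and is not the MediaWiki source: the function time_as_used is hypothetical, and Python's bool() stands in for PHP's loose comparison with true, which likewise treats any non-empty timestamp string as true.

```python
def time_as_used(time_arg):
    """Keep only the truthiness of the value, discarding the timestamp itself."""
    return bool(time_arg)


june_2013 = "20130626000000"
march_2014 = "20140320000000"

# Two different timestamps collapse to the same value, so the code that
# receives the result cannot tell which revision was requested.
print(time_as_used(june_2013))   # True
print(time_as_used(march_2014))  # True
print(time_as_used(False))       # False
```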

Alternatively, we considered using the &$res argument from that hook to replace the HTML with the images of our choosing, but we would still need to use the object provided by the $file argument, which has no ready-made way to select a specific revision of the embedded resource.

At this point, in spite of having all of the data needed to solve this problem, MediaWiki, and transitively Wikipedia, does not currently support rendering old revisions of articles as they truly looked in the past.

--Shawn M. Jones

