Monday, July 14, 2014

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc.  You can also describe it as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages.  Most of the time Javascript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident.

First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received.  This is occasionally useful (like loading a file from local disk), but more often that not seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME type sniffing.  In general, having data formats specify protocol behavior is a bad idea (see the discussion about orthogonality in the W3C Web Architecture), but few can resist the temptation.  Specifically, http-equiv="refresh" makes things even worse, since the HTTP header "Refresh" never officially existed, and it was eventually dropped from the HTML specification as well.

However, it is a nice illustration of a common but non-standard HTML/fake-HTTP extension that nearly everyone supports.  Here's how it works, using www.cnn.com as an example:



This line:

<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>

tells the client to wait 30 minutes (1800 seconds) and reload the current page with the value specified in the optional url= argument (if no URL is provided, the client uses the current page's URL).  CNN has used this "wait 30 minutes and reload" functionality for many years, and it is certainly desirable for a news site to cause the client to periodically reload its front page.   The problem comes when a page is archived, but the refresh capability is 1) not removed or 2) the URL argument is not (correctly) rewritten.

Last week I had loaded a memento of cnn.com from WebCitation, specifically: http://webcitation.org/5lRYaE8eZ, that shows the page as it existed at 2009-11-21:


I hid that page, did some work, and then when I came back I noticed that it had reloaded to the page as of 2014-07-11, even though the URL and the archival banner at the top remained unchanged:


The problem is that WebCitation leaves the meta refresh tag as is, causing the page to reload from the live web after 30 minutes.  I had never noticed this behavior before, so I decided to check how some other archives handle it.

The Internet Archive rewrites the URL, so although the client still refreshes the page, it gets an archived page.  Checking:

http://web.archive.org/web/20091121211700/http://www.cnn.com/


we find:

<meta http-equiv="refresh" content="1800;url=/web/20091121211700/http://www.cnn.com/?refresh=1">


But since the IA doesn't know to canonicalize www.cnn.com/?refresh=1 to www.cnn.com, you actually get a different archived page:



Instead of ending up on 2009-11-21, we end up two days in the past at 2009-11-19:


To be fair, ignoring "?refresh=1" is not a standard canonicalization rule but could be added (standard caveats apply).  And although this is not quite a zombie, it is potentially unsettling since the original memento (2009-11-21) is silently exchanged for another memento (2009-11-19; future refreshes will stay on the 2009-11-19 version).  Presumably other Wayback-based archives behave similarly.  Checking the British Library I saw:

http://www.webarchive.org.uk/wayback/archive/20090914012158/http://www.cnn.com/

redirect to:

http://www.webarchive.org.uk/wayback/archive/20090402030800/http://www.cnn.com/?refresh=1

In this case the jump is more noticable (five months: 2009-09-14 vs. 2009-04-02) since the BL's archives of cnn.com are more sparse. 

Perma.cc behaves similarly to the Internet Archive (i.e., rewriting but not canonicalizing), but presumably because it is a newer archive, it does not yet have a "?refresh=1" version of cnn.com archived.  It is possible that Perma.cc has a Wayback backend, but I'm not sure.  I had to push a 2014-07-11 version into Perma.cc (i.e., it did not already have cnn.com archived).  Checking:

http://perma.cc/89QJ-Y632?type=source


we see:

<meta http-equiv="refresh" content="1800;url=/warc/89QJ-Y632/http://www.cnn.com/?refresh=1"/>

And after 30 minutes it will refresh to a framed 404 because cnn.com/?refresh=1 is not archived:


As Perma.cc becomes more populated, the 404 behavior will likely disappear and be replaced with something like the Internet Archive and British Library examples.

Archive.today is the only archive that correctly handled this situation.  Loading:

https://archive.today/Zn6HS

produces:


A check of the HTML source reveals that they simply strip out the meta refresh tag altogether, so this memento will stay parked on 2013-06-27 no matter how long it stays in the client.

In summary:

  • WebCitation did not rewrite the URI and thus created a zombie
  • Internet Archive (and other Wayback archives) rewrites the URI, but because of site-specific canonicalization, it violates the user's expectations with a single time jump (the distance of which is dependent on the sparsity of the archive)
  • Perma.cc rewrites the URI, but in this case, because it is a new archive, produces a 404 instead of a time jump
  • Archive.today strips the meta refresh tag and avoids the behavior altogether

--Michael

No comments:

Post a Comment