2012-10-10: Zombies in the Archives
Image provided from http://www.taxhelpattorney.com/ |
In the examples below, the reach into the live Web is caused by URIs contained in JavaScript that are not rewritten to be relative to the Web archive. Instead of pulling archived content from the past, the page in the archive is "reaching out" (zombie-style) from the archive to the live Web.
We provide two examples with humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past will include advertisements from the present Web.
2008 memento of cnn.com from the Wayback Machine |
Current cnn.com homepage as observed in 2012 |
A second case study comes from the IMDB site. We observed the July 28th, 2011 memento of the IMDB homepage at http://web.archive.org/web/20110728165802/http://www.imdb.com/. This memento advertises the movie Cowboys and Aliens, which is set to start "tomorrow" according to our observed July 28th, 2011 memento. Additionally, we see that the current feature movie is Captain America.
2011 memento of IMDB.com from the Wayback Machine |
According to the currently observed IMDB site, Cowboys and Aliens was released in 2011 and Captain America was released in 2011, in keeping with our observed memento. However, the ad included on the IMDB memento promotes the movie "Won't Back Down." According to IMDB, this movie won't be released until 2012. Again, we can observe a memento with references to present-day events.
Cowboys and Aliens was released in 2011 |
Captain America was released in 2011 |
Won't Back Down is scheduled to be released in 2012 |
When we observe the HTTP requests that are made when loading the mementos, there is evidence of reach into the current Web. We stored all HTTP headers from loading the archived pages in a text file for analysis. All of the requests should be to archive.org resources. However, filtering out the archive.org hosts reveals requests for live-Web resources:
$ grep Host: headers.txt | grep -v archive.org
Host: ocsp.incommon.org
Host: ocsp.usertrust.com
Host: exchange.cs.odu.edu
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: ad.doubleclick.net
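The same filtering can be sketched in Python, which also makes it easy to tally how often each live-Web host is contacted. This is a minimal sketch, not the tooling used for this post: the function name `live_web_hosts` and the sample input are assumptions made for illustration. Unlike `grep -v archive.org`, which drops any line containing that substring, the sketch checks whether the host name itself is archive.org or one of its subdomains.

```python
from collections import Counter


def live_web_hosts(header_text, archive_domain="archive.org"):
    """Tally Host headers that do not belong to the archive's domain."""
    counts = Counter()
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")
        # Skip anything that is not a Host header line.
        if not sep or name.strip().lower() != "host":
            continue
        # Normalize the host and drop an optional port number.
        host = value.strip().lower().split(":", 1)[0]
        # Requests to the archive itself are expected, so ignore them.
        if host == archive_domain or host.endswith("." + archive_domain):
            continue
        counts[host] += 1
    return counts


# Illustrative input in the same shape as headers.txt (hypothetical sample).
sample = """\
Host: web.archive.org
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: ad.doubleclick.net
Host: core.insightexpressai.com
"""

for host, count in live_web_hosts(sample).most_common():
    print(count, host)
```

Any host that survives the filter is a candidate zombie request and is worth inspecting by hand.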
These requests from archives into the live Web are initiated by embedded JavaScript:
<iframe src="http://www.imdb.com/images/SF99c7f777fc74f1d954417f99b985a4af/a/ifb/doubleclick/expand.html#imdb2.consumer.homepage/;tile=5;sz=1008x60,1008x66,7x1;p=ns;ct=com
;[PASEGMENTS];u=[CLIENT_SIDE_ORD];ord=[CLIENT_SIDE_ORD]?" ... onload="ad_utils.on_ad_load(this)"></iframe>
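One way to picture why such requests escape the archive is a toy rewriter that only sees URIs written out literally in the markup. The sketch below is an assumption for illustration, not the Wayback Machine's actual rewriting logic; `rewrite_attributes`, the example.com and example.net hosts, and the sample page are all hypothetical. It prefixes absolute URIs found in `src` and `href` attributes with an archive path, but a URI that JavaScript assembles at run time never appears as a complete string in the page source, so it passes through untouched.

```python
import re

# Archive URI prefix, following the pattern of the IMDB memento URI above.
ARCHIVE_PREFIX = "http://web.archive.org/web/"

# Matches only complete, literal http(s) URIs inside src="..." or href="...".
ATTRIBUTE_URI = re.compile(r'\b(src|href)="(https?://[^"]+)"')


def rewrite_attributes(html, timestamp):
    """Point literal src/href URIs at the archive copy for the given timestamp."""
    def repl(match):
        attribute, uri = match.group(1), match.group(2)
        return f'{attribute}="{ARCHIVE_PREFIX}{timestamp}/{uri}"'
    return ATTRIBUTE_URI.sub(repl, html)


# Hypothetical page: one literal URI and one URI built by script at run time.
page = (
    '<img src="http://www.example.com/logo.png">\n'
    '<script>\n'
    'var host = ["ad", "example", "net"].join(".");\n'
    'var frame = document.createElement("iframe");\n'
    'frame.src = "http://" + host + "/ad";\n'
    'document.body.appendChild(frame);\n'
    '</script>\n'
)

print(rewrite_attributes(page, "20110728165802"))
```

In the output, the image URI points into the archive while the script still builds a live-Web address, which the browser will request when the memento is rendered.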
During our investigation of these zombie resources, we observed that this leakage of live content into archived resources is not consistent. We noticed that some versions of some browsers would not produce the leakage; this is potentially due to the browsers' different methods of handling JavaScript and Ajax calls. In our experience, older browsers have a higher percentage of leakage, while the newer browsers demonstrate the leakage less frequently.
The CNN and IMDB mementos mentioned above were rendered in Mozilla Firefox version 3.6.3. Below are two examples of our CNN and IMDB mementos rendered in Mozilla Firefox 15.0.1. Note that the examples below attempt to load the advertisements but produce a "Not Found In Archive" message.
CNN.com memento rendered in a newer browser with no leakage. |
IMDB.com memento rendered in a newer browser with no leakage. |
When analyzing the headers with these new browsers, we get fewer requests for live content. More importantly, we get different requests than we saw in the other browsers:
$ grep Host: headers.txt | grep -v archive.org
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: b.scorecardresearch.com
Host: s0.2mdn.net
Host: s0.2mdn.net
Host: b.voicefive.com
Host: b.scorecardresearch.com
These mementos still attempted to load the wrong resources, albeit unsuccessfully. Essentially, these mementos are shown as incomplete instead of incorrect (and without our humorous results). The exact relationship between browsers, mementos, and zombie resources will require additional investigation before we can establish a cause and solution for these leakages.
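The difference between the two host listings above can be made explicit with set operations. This is a small sketch using only the host names that appear in the two listings; the variable names are assumptions for illustration.

```python
# Distinct live-Web hosts requested when rendering in Firefox 3.6.3.
old_browser_hosts = {
    "ocsp.incommon.org",
    "ocsp.usertrust.com",
    "exchange.cs.odu.edu",
    "ad.doubleclick.net",
    "ia.media-imdb.com",
    "core.insightexpressai.com",
}

# Distinct live-Web hosts requested when rendering in Firefox 15.0.1.
new_browser_hosts = {
    "ia.media-imdb.com",
    "b.scorecardresearch.com",
    "s0.2mdn.net",
    "b.voicefive.com",
}

# Hosts contacted by one browser but not the other, and by both.
only_old = sorted(old_browser_hosts - new_browser_hosts)
only_new = sorted(new_browser_hosts - old_browser_hosts)
shared = sorted(old_browser_hosts & new_browser_hosts)

print("only older browser:", only_old)
print("only newer browser:", only_new)
print("both browsers:", shared)
```

Only one host is common to both listings, which is consistent with the observation that the newer browser issues different requests and not merely fewer of them.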
The Internet Archive is not the only archive that contains these leakages. We found an example in the following WebCite memento of cnn.com, archived on 2012-09-09.
WebCite memento of cnn.com. |
The "Popular on Facebook" section of the page has activity from two of my "friends." The page that was shared was the 10 questions for Obama to answer page, which was published on October 1st, 2012 and is shown below. It should be obvious that my "friends" shouldn't have been able to share a page that hadn't been published yet (2012-09-09 occurs before 2012-10-01). So, the WebCite page allows live-Web leakage in the cnn.com memento.
Live cnn.com resource |
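The reasoning behind this example is a simple date comparison: anything shown in a memento that was published after the capture date must have come from the live Web. A minimal sketch of that check follows; the function name `is_temporal_violation` is an assumption for illustration, and the two dates are the ones given above.

```python
from datetime import date


def is_temporal_violation(memento_date, resource_date):
    """Return True when a resource postdates the memento that displays it."""
    return resource_date > memento_date


capture = date(2012, 9, 9)     # WebCite capture date of the cnn.com memento
published = date(2012, 10, 1)  # publication date of the shared CNN page

print(is_temporal_violation(capture, published))
print((published - capture).days, "days after the capture")
```

A check like this only flags content that is provably from the future; live content that predates the capture would pass unnoticed.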
Such occurrences of leakage and zombie resources are not uncommon in today's archives. Current Web technologies such as JavaScript make a pure, unchanging capture difficult in the modern Web. However, it is useful for us as Web users and Web scientists to understand that zombies do exist in our archives.
--Justin F. Brunelle