Wednesday, October 10, 2012

2012-10-10: Zombies in the Archives

Image provided from http://www.taxhelpattorney.com/
In our current research, the WS-DL group has observed leakage in archived sites. Leakage occurs when archived resources include current content. I enjoy referring to such occurrences as "zombie" resources (which is appropriate given the upcoming Halloween holiday). That is to say, these resources are expected to be archived ("dead") but still reach into the current Web.

In the examples below, this reach into the live Web is caused by URIs contained in JavaScript not being rewritten to be relative to the Web archive; the page in the archive is not pulling from the past archived content but is "reaching out" (zombie-style) from the archive to the live Web. 
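To make the failure mode concrete, here is a minimal sketch of archive-style URI rewriting. It is not the Wayback Machine's actual rewriter: the regular expression, the `rewrite_html` function, and the sample page with its example.com and example.net hosts are all assumptions made for illustration. The sketch prefixes absolute URIs found in `src` and `href` attributes, but a URI that a script assembles at run time never appears as a complete string in the markup, so it is left pointing at the live Web.

```python
import re

# Assumed Wayback-style prefix: archive base plus a 14-digit capture timestamp.
ARCHIVE_PREFIX = "http://web.archive.org/web/20080903204222/"

# Matches src="http://..." or href="http://..." attributes only.
# This is a deliberate simplification of what a real rewriter does.
ATTR_URI = re.compile(r'\b(src|href)="(https?://[^"]+)"')


def rewrite_html(html: str) -> str:
    """Prefix absolute URIs found in src/href attributes with the archive prefix."""
    return ATTR_URI.sub(
        lambda m: f'{m.group(1)}="{ARCHIVE_PREFIX}{m.group(2)}"', html
    )


# Hypothetical page: one static image, and one ad whose URI is built by a script.
page = (
    '<img src="http://www.example.com/logo.png">\n'
    '<script>\n'
    'var host = "ad.example.net";\n'
    'document.getElementById("ad").src = "http://" + host + "/banner.js";\n'
    '</script>'
)

rewritten = rewrite_html(page)
print(rewritten)
```

In the printed result the static image now points into the archive, while the script still builds an `http://ad.example.net/...` URI in the browser. That leftover live-Web request is the zombie.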

We provide two examples with humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past can include advertisements from the present Web.


2008 memento of cnn.com from the Wayback Machine
First, we look at cnn.com. We can observe an archived resource from the Wayback Machine at http://web.archive.org/web/20080903204222/http://www.cnn.com/. This memento from September 16th, 2008 includes links to the 2008 presidential race between McCain-Palin and Obama-Biden. However, this memento was observed on September 28, 2012 -- during the 2012 presidential race between Romney-Ryan and Obama-Biden. The memento includes embedded JavaScript that pulls advertisements from the live Web. The advertisement included in the memento is a zombie resource that promotes the 2012 presidential debate between Romney and Obama. This drift from the expected archived time seems to provide a prophetic look at the 2012 presidential candidates in a 2008 resource. The current cnn.com homepage gives the same advertisement as the archived version.


Current cnn.com homepage as observed in 2012

A second case study comes from the IMDB movie database. We observed the July 28th, 2011 memento of the IMDB homepage at http://web.archive.org/web/20110728165802/http://www.imdb.com/. This memento advertises the movie Cowboys and Aliens, which is set to open "tomorrow" according to the memento. Additionally, we see that the featured movie is Captain America.

2011 memento of IMDB.com from the Wayback Machine 
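The capture time of a Wayback Machine memento can be read from the 14-digit timestamp in its URI, which is useful when comparing a memento against the dates of the content it displays. The following is a small sketch under that assumption; the `memento_datetime` helper and its regular expression are illustrative names, not part of any archive's API.

```python
import re
from datetime import datetime

# Wayback-style URI: .../web/<14-digit timestamp>/<original URI>
WAYBACK_URI = re.compile(r"/web/(\d{14})/(.+)$")


def memento_datetime(uri: str) -> datetime:
    """Return the capture time encoded in a Wayback-style memento URI."""
    match = WAYBACK_URI.search(uri)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {uri!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


imdb_memento = "http://web.archive.org/web/20110728165802/http://www.imdb.com/"
print(memento_datetime(imdb_memento))  # 2011-07-28 16:58:02
```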

According to the currently observed IMDB site, Cowboys and Aliens was released in 2011 and Captain America was released in 2011, in keeping with our observed memento. However, the ad included on the IMDB memento promotes the movie "Won't Back Down." According to IMDB, this movie won't be released until 2012. Again, we can observe a memento that references present-day events.

Cowboys and Aliens was released in 2011
Captain America was released in 2011
Won't Back Down is scheduled to be released in 2012

When we examine the HTTP requests made while loading the mementos, there is evidence of reach into the current Web. We stored all HTTP headers from the archive in a text file for analysis. The requests should all go to other archive.org resources. However, filtering out the archive.org hosts reveals requests for live-Web resources:

$ grep Host: headers.txt | grep -v archive.org
Host: ocsp.incommon.org
Host: ocsp.usertrust.com
Host: exchange.cs.odu.edu
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: ad.doubleclick.net
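The same filter can be written as a short script that also tallies how often each live-Web host is contacted. This is only a sketch: it assumes headers.txt holds raw header lines such as `Host: example.org`, and the `live_web_hosts` function and the shortened sample are illustrative rather than the tooling we used.

```python
from collections import Counter


def live_web_hosts(header_text: str, archive_domain: str = "archive.org") -> Counter:
    """Count Host headers that do not belong to the archive's own domain."""
    counts = Counter()
    for line in header_text.splitlines():
        if not line.startswith("Host:"):
            continue
        host = line.split(":", 1)[1].strip()
        # Compare on a label boundary so a host that merely contains
        # the string "archive.org" is not mistaken for the archive.
        if host == archive_domain or host.endswith("." + archive_domain):
            continue
        counts[host] += 1
    return counts


# A shortened sample in the assumed shape of headers.txt (only Host lines shown).
sample = """\
Host: web.archive.org
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: ad.doubleclick.net
Host: core.insightexpressai.com
"""

for host, count in live_web_hosts(sample).most_common():
    print(f"{count:3d}  {host}")
```

Matching on the domain suffix is slightly stricter than `grep -v archive.org`, which drops any line containing that pattern anywhere.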


These requests from archives into the live Web are initiated by embedded JavaScript:

<iframe src="http://www.imdb.com/images/SF99c7f777fc74f1d954417f99b985a4af/a/ifb/doubleclick/expand.html#imdb2.consumer.homepage/;tile=5;sz=1008x60,1008x66,7x1;p=ns;ct=com;[PASEGMENTS];u=[CLIENT_SIDE_ORD];ord=[CLIENT_SIDE_ORD]?" ... onload="ad_utils.on_ad_load(this)"></iframe>


During our investigation of these zombie resources, we observed that this leakage of live content into archived resources is not consistent. Some versions of some browsers did not produce the leakage; this is potentially due to the browsers' different methods of handling JavaScript and Ajax calls. In our experience, older browsers exhibit the leakage more often than newer browsers.

The CNN and IMDB mementos mentioned above were rendered in Mozilla Firefox version 3.6.3. Below are two examples of our CNN and IMDB mementos rendered in Mozilla Firefox 15.0.1. Note that the examples below attempt to load the advertisements but produce a "Not Found In Archive" message.

CNN.com memento rendered in a newer browser with no leakage.

IMDB.com memento rendered in a newer browser with no leakage.

When analyzing the headers with these new browsers, we get fewer requests for live content. More importantly, we get different requests than we saw in the other browsers:

$ grep Host: headers.txt | grep -v archive.org
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: b.scorecardresearch.com
Host: s0.2mdn.net
Host: s0.2mdn.net
Host: b.voicefive.com
Host: b.scorecardresearch.com
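A quick way to see how the two listings differ is to compare their distinct hosts as sets. The sketch below copies the host names from the two grep listings in this post; the variable names are arbitrary.

```python
# Distinct non-archive hosts from the first grep listing in this post.
first_listing = {
    "ocsp.incommon.org",
    "ocsp.usertrust.com",
    "exchange.cs.odu.edu",
    "ad.doubleclick.net",
    "ia.media-imdb.com",
    "core.insightexpressai.com",
}

# Distinct non-archive hosts from the second grep listing.
second_listing = {
    "ia.media-imdb.com",
    "b.scorecardresearch.com",
    "s0.2mdn.net",
    "b.voicefive.com",
}

only_first = sorted(first_listing - second_listing)
only_second = sorted(second_listing - first_listing)
shared = sorted(first_listing & second_listing)

print("only in first listing: ", only_first)
print("only in second listing:", only_second)
print("in both listings:      ", shared)
```

Only ia.media-imdb.com appears in both listings, and ad.doubleclick.net appears only in the first.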


These mementos still attempted to load the wrong resources, albeit unsuccessfully. Essentially, these mementos are shown as incomplete instead of incorrect (and without our humorous results). The exact relationship between browsers, mementos, and zombie resources will require additional investigation before we can establish a cause and solution for these leakages.

The Internet Archive is not the only archive that contains these leakages. We found an example in the following WebCite memento of cnn.com, archived on 2012-09-09.

WebCite memento of cnn.com.

The "Popular on Facebook" section of the page has activity from two of my "friends." The page that was shared was the "10 questions for Obama to answer" page, which was published on October 1st, 2012 and is shown below. It should be obvious that my "friends" shouldn't have been able to share a page that had not yet been published (2012-09-09 occurs before 2012-10-01). So, the WebCite page allows live-Web leakage in the cnn.com memento.

Live cnn.com resource
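The reasoning in the WebCite example reduces to a date comparison: content published after a memento was captured cannot legitimately be part of that memento. A minimal sketch of that check follows; the `is_zombie` function is a hypothetical helper, and the two dates are the ones given above.

```python
from datetime import date


def is_zombie(memento_date: date, content_published: date) -> bool:
    """True when embedded content is newer than the capture that displays it."""
    return content_published > memento_date


webcite_capture = date(2012, 9, 9)     # archive date of the WebCite memento
article_published = date(2012, 10, 1)  # publication date of the shared CNN page

print(is_zombie(webcite_capture, article_published))  # True
print((article_published - webcite_capture).days)     # 22
```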

Such occurrences of leakage and zombie resources are not uncommon in today's archives. Current Web technologies such as JavaScript make a pure, unchanging capture difficult in the modern Web. However, it is useful for us as Web users and Web scientists to understand that zombies do exist in our archives.

--Justin F. Brunelle
