2013-11-28: Replaying the SOPA Protest
In an attempt to limit online piracy and theft of intellectual property, the U.S. Government proposed the Stop Online Piracy Act (SOPA). This act was widely unpopular. On January 18th, 2012, many prominent websites (e.g., XKCD) organized a world-wide blackout of their websites in protest of SOPA.
While the attempted passing of SOPA may end up being a mere footnote in history, the overwhelming protest in response is significant. This event is worth observing and should be preserved in our Web archives. However, some methods of implementing the protest (such as JavaScript and Ajax) made the resulting representations unarchivable by archival services at the time. As a case study, we will examine the Washington, D.C. Craigslist site and the English Wikipedia page. All screenshots of the live protests were taken during the protest on January 18th, 2012. The screenshots of the mementos were taken on November 27th, 2013.
Craigslist put up a blackout page that provided access to the site only through a link that appeared after a timeout. To preserve the SOPA splash page on the Craigslist site, we submitted the URI-R for the Washington, D.C. Craigslist page to WebCite for archiving, producing a memento for the SOPA screen:
http://webcitation.org/query?id=1326900022520273
At the bottom of the SOPA splash page, JavaScript counts down from 10 to 1 and then provides a link to enter the site. The countdown operates properly in the memento, providing an accurate capture of the resource as it existed on January 18th, 2012.
The countdown on the page is created with JavaScript that is included in the HTML:
The Heritrix crawler archived the Craigslist page on January 18th, 2012. The Internet Archive contains a memento for the protest:
as does Archive-It:
The Internet Archive memento, which was captured with the Heritrix crawler, has the same splash page and countdown as the WebCite memento. The link on the Internet Archive memento leads to a memento of the Craigslist page rather than the live version, albeit with archival timestamps 1 day, 13 hours, 30 minutes, and 44 seconds apart (2012-01-19 18:34:32 vs. 2012-01-18 05:03:48).
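The gap between the two archival timestamps can be checked with a few lines of Python. This is only a worked-arithmetic sketch: the variable names are illustrative, and the two values are the timestamps quoted above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# The two archival timestamps quoted in the text.
later = datetime.strptime("2012-01-19 18:34:32", FMT)
earlier = datetime.strptime("2012-01-18 05:03:48", FMT)

# Subtracting datetimes yields a timedelta with whole days plus leftover seconds.
gap = later - earlier
hours, remainder = divmod(gap.seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"{gap.days} day, {hours} h, {minutes} min, {seconds} s")
# prints "1 day, 13 h, 30 min, 44 s"
```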
The Internet Archive converts embedded links to be relative to the archive rather than target the live web. Since Heritrix also crawled the linked page, the embedded link dereferences to the proper memento with a note embedded in the HTML protesting SOPA.
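The link conversion described here can be approximated as follows. This is a minimal sketch, not the Wayback Machine's actual rewriting code: the `rewrite_link` helper, the `example.org` page, and the 14-digit timestamp are assumptions made for illustration, and only the `http://web.archive.org/web/<timestamp>/<URI>` pattern is taken from the memento URIs quoted in this post.

```python
from urllib.parse import urljoin

# Prefix observed in the memento URIs quoted in this post.
WAYBACK_PREFIX = "http://web.archive.org/web"


def rewrite_link(href: str, page_uri: str, timestamp: str) -> str:
    """Point an embedded link at the archive instead of the live web (hypothetical helper)."""
    # Resolve relative links against the URI of the page that contained them.
    absolute = urljoin(page_uri, href)
    # Prepend the archive prefix and the capture timestamp.
    return f"{WAYBACK_PREFIX}/{timestamp}/{absolute}"


# Illustrative values only; example.org stands in for a crawled site.
print(rewrite_link("/about/", "http://example.org/index.html", "20120118050348"))
# prints "http://web.archive.org/web/20120118050348/http://example.org/about/"
```

Because the rewritten link carries a timestamp, the archive can resolve it to a nearby memento of the target, which may have been captured at a different time than the page that links to it.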
The Craigslist protest was readily archived by WebCite, Archive-It, and the Internet Archive. Policies within each archival institution impacted how the Craigslist homepage (past the protest splash screen) is referenced and accessed by archive users. This differs from the Wikipedia protest, which was not readily archived.
Wikipedia displayed a splash screen protesting SOPA that blocked access to all content on the site. A version of this page is still available live on Wikipedia as of November 27th, 2013:
On January 18th, 2012, we submitted the page to WebCite, but the resulting memento did not capture the splash page. Instead, the memento has only the representation hidden by the splash page.
The mementos captured by Heritrix and presented through the Internet Archive's Wayback Machine and Archive-It are also missing the SOPA splash page.
To investigate the cause of the missing splash page further, we requested that WebCite archive the current version of the Wikipedia blackout page on November 27th, 2013. The new memento does not capture the splash page, either:
Heritrix also created a memento of the current blackout page on August 24th, 2013. This memento suffers from the same problem as the aforementioned mementos and does not capture the splash page:
When looking through the client-side DOM of the Wikipedia mementos we reference, there is no mention of the splash page protesting SOPA. This means the splash page was loaded by either Cascading Style Sheets (CSS) or JavaScript. Since clicking the browser's "Stop" button prevents the splash page from appearing, we hypothesize (and show) that JavaScript is responsible for loading the splash page. JavaScript loads the image needed for the splash page as a result of a client-side event. Since the archival tools cannot fire that event, they have no way of knowing to archive the image.
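To illustrate why a tool that does not execute JavaScript misses such a resource, here is a small sketch of static link extraction. The page markup, the `EmbeddedResourceFinder` class, and every URI in it are hypothetical; this is not Wikipedia's markup or any crawler's real extractor, only a demonstration that a URI assembled at run time never appears as a literal attribute value.

```python
from html.parser import HTMLParser


class EmbeddedResourceFinder(HTMLParser):
    """Collect URIs that appear literally in src/href attributes."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.found.add(value)


# Hypothetical page: the splash image URI only exists after the script runs.
PAGE = """
<html><head>
<script src="/static/banner.js"></script>
<script>
  var base = "//upload.example.org/images/";
  window.onload = function () {
    var img = new Image();
    img.src = base + "splash" + ".jpg";
  };
</script>
</head><body><img src="/static/logo.png"></body></html>
"""

finder = EmbeddedResourceFinder()
finder.feed(PAGE)
print(sorted(finder.found))
# prints "['/static/banner.js', '/static/logo.png']"
```

The script file and the logo are discovered, but the splash image is not, because its URI is only constructed when the onload handler runs in a browser.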
When we load the live blackout resource, we see that there are several files loaded by Wikimedia. Some of the JavaScript files return a 403 Forbidden response since they are blocked by the Wikipedia robots.txt file:
Specifically, the robots.txt file preventing these resources from being archived is:
http://bits.wikimedia.org/robots.txt
The robots.txt file is archived, as well:
http://web.archive.org/web/*/http://bits.wikimedia.org/robots.txt
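A crawler's robots.txt check can be sketched with Python's standard `urllib.robotparser`. The rules below are hypothetical and are not the actual contents of the bits.wikimedia.org robots.txt file; the `crawler_may_fetch` helper, the user-agent string, and the script path are likewise assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the real robots.txt contents are not reproduced here.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def crawler_may_fetch(robots_txt: str, user_agent: str, uri: str) -> bool:
    """Return True if the given rules allow user_agent to fetch uri."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, uri)


# The script path and crawler name are made up for the example.
print(crawler_may_fetch(HYPOTHETICAL_ROBOTS_TXT, "example-crawler",
                        "http://bits.wikimedia.org/some/script.js"))
# prints "False"
```

Under rules like these, a polite crawler never requests the script, so the archive has nothing to serve when a memento later asks for it.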
We will look at one specific HTTP request for a JavaScript file:
This JavaScript file contains code defining a function that adds CSS to the page, overlays an image as a splash page, and overlays the associated text on the image (I have added the line breaks for readability):
Without execution of the insertBanner function, the archival tools will not know to archive the image of the splash page (WP_SOPA_Splash_Full.jpg) or the overlaid text. In this example, Wikimedia is constructing the URI of the image and using Ajax to request the resource:
The blackout image is available in the Internet Archive, but the mementos in the Wayback Machine do not attempt to load it:
http://web.archive.org/web/20120118165255/http://upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg
Without execution of the client-side JavaScript and subsequent capture of the splash screen, the SOPA blackout protest is not seen by the archival service.
We have presented two different uses of JavaScript by two different web sites and their impact on the archivability of the sites' SOPA protests. The Craigslist mementos provide representations of the SOPA protest, although the archives may be missing associated content due to policy differences and intended use. The Wikipedia mementos do not provide a representation of the protest. While the constituent parts of the Wikipedia protest are not entirely lost, they are not properly reconstituted, making the representation unarchivable with the tools available on January 18th, 2012 and November 27th, 2013.
We have previously demonstrated that JavaScript in mementos can cause strange things to happen. This is another example of how technologies that normally improve a user's browsing experience can make a page more difficult, if not impossible, to archive.
--Justin F. Brunelle
Screenshot of the live Craigslist SOPA Protest from January 18th, 2012.
Screenshot of the Craigslist protest memento in WebCite.
In the WebCite memento, the countdown behavior is archived along with the page content because the JavaScript that creates the countdown is captured with the content and is available when the onload event fires on the client and the subsequent startCountDown code is executed. However, the link that appears at the bottom of the screen dereferences to the live version of Craigslist. Notice that the live Craigslist page has no reference to the SOPA protest. Since WebCite is a page-at-a-time archival service, it only archives the initial representation and all embedded resources, meaning the linked Craigslist page is missed during archiving.
Screenshot of the Craigslist homepage linked from the protest splash page. This is also the live version of the homepage.
Screenshot of the Craigslist protest splash page in the Wayback Machine.
Screenshot of the Craigslist homepage memento, linked from the protest splash screen.
Screenshot of the live Wikipedia SOPA Protest.
The live version of the Wikipedia SOPA Protest.
A screenshot of the WebCite memento of the Wikipedia SOPA Protest.
A screenshot of the Internet Archive memento of the Wikipedia SOPA Protest.
A screenshot of the Archive-It memento of the Wikipedia SOPA Protest.
A screenshot of the WebCite memento of the current Wikipedia blackout page.
A screenshot of the Internet Archive memento of the current Wikipedia blackout page.
Google Chrome's developer console showing the resources requested by http://web.archive.org/web/20130824022954/http://en.wikipedia.org/?banner=blackout and their associated response codes.
I archived the Wikipedia blackout page, and made the WARC available here: https://github.com/ukwa/warc-test-corpus/tree/master/wikipedia-sopa-blackout-2012
It is indeed very tricky to play back. It only works in proxy mode (and sadly not via Memento) because the jQuery library blocks the cross-site AJAX request from being attempted (even though with a browser plugin it would work out ok in the end).