Thursday, November 21, 2013

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

Circulating the web last week was the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below).  Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving its archived versions of the pages (see Note 2 below for a discussion of robots.txt, as well as for an update about availability in the Internet Archive).  But even though the Internet Archive allows site owners to redact pages from its archive, mementos of the pages likely exist in other archives.  Yes, the Internet Archive was the first web archive and is still by far the largest with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details).

Consider this randomly chosen 2009 speech:

http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx

Right now it produces a custom 404 page (see Note 3 below):


Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):




So it seems clear that this speech will not disappear down a memory hole.  But how do you discover these copies in these archives?  Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework.  If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.
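For readers curious about what a Memento client does when you ask it for "the past": it sends the desired date in an Accept-Datetime request header. The sketch below only builds that header; the helper name memento_headers and the chosen date are illustrative assumptions, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when):
    """Build the request headers a Memento client might send for datetime negotiation.

    `when` must be timezone-aware; it is converted to UTC and rendered
    in the HTTP date format (e.g. "Thu, 12 Nov 2009 00:00:00 GMT").
    """
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


# Illustrative date only: some time in the month the 2009 speech was given.
headers = memento_headers(datetime(2009, 11, 12, tzinfo=timezone.utc))
print(headers)
```

A client would attach these headers to a request for the original URI (or for a TimeGate for that URI) and follow the response to the closest archived copy.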

If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):



The nice thing about the multi-archive access of Memento is that as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one-at-a-time for a particular URI.
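To make the idea of a multi-archive TimeMap concrete, here is a minimal sketch that parses a TimeMap in link format and counts mementos per archive. The sample data is made up: the archive host names (archive-a.example and so on) and the datetimes are placeholders, not the real TimeMap for this speech, and the regular expressions handle only the simple quoted-attribute form used in the sample.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

URI_R = "http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx"

# Made-up TimeMap: placeholder archive hosts and datetimes, for illustration only.
SAMPLE_TIMEMAP = ",\n".join([
    f'<{URI_R}>; rel="original"',
    f'<http://archive-a.example/20091115000000/{URI_R}>; rel="first memento"; datetime="Sun, 15 Nov 2009 00:00:00 GMT"',
    f'<http://archive-b.example/20100405000000/{URI_R}>; rel="memento"; datetime="Mon, 05 Apr 2010 00:00:00 GMT"',
    f'<http://archive-b.example/20110503000000/{URI_R}>; rel="memento"; datetime="Tue, 03 May 2011 00:00:00 GMT"',
    f'<http://archive-c.example/20131113000000/{URI_R}>; rel="last memento"; datetime="Wed, 13 Nov 2013 00:00:00 GMT"',
])

# One entry: <uri> followed by zero or more ; name="value" attributes.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a date-sorted list of (datetime, memento URI) pairs."""
    mementos = []
    for uri, attr_text in ENTRY.findall(text):
        attrs = dict(ATTR.findall(attr_text))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


def count_by_archive(mementos):
    """Count mementos per archive host."""
    return Counter(urlsplit(uri).netloc for _, uri in mementos)


mementos = parse_timemap(SAMPLE_TIMEMAP)
for when, uri in mementos:
    print(when.isoformat(), uri)
print(count_by_archive(mementos))
```

When an aggregator adds an archive, new entries simply appear in the TimeMap, so a consumer like this one needs no changes.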

We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:
  • Internet Archive: crawl everything
  • Archive-It: collections defined by subscribers
  • UK Web Archive: archive all UK websites (conservatives.com is a UK web site even though it is not in the .uk domain)
  • Archive.is: archives individual pages on user request
Download and install the Chrome extension and all of these archives and more will be easily available to you.

-- Michael and Herbert

Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt.  For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013).  "We've always been at war with Eastasia."

Note 2: While we were writing this blog post, the Internet Archive stopped blocking access to this speech (and presumably the others).  Here is the raw HTTP response from when the speech was still blocked (the key is the line beginning with "X-Archive-Wayback-Runtime-Error:"):



But access was restored sometime in the space of three hours before I could generate a screen shot:



Why was it restored?  Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?).  The 08:36:36 version of robots.txt has:

...
Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/
... 

But the 18:10:19 version has:
...  
Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/
...  

These "Disallow" rules no longer match the URI of the original speech.  I guess the Internet Archive cached the disallow rule and it just now expired one week later.  See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.
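The prefix matching behind "Disallow" can be checked with Python's standard urllib.robotparser. This is a sketch under stated assumptions: the excerpts above omit their User-agent line, so "User-agent: *" is assumed; the agent name ia_archiver is the one commonly associated with the Internet Archive; and NEWS_STORY is a hypothetical URI used only to exercise one of the quoted rules. The earlier excerpt elides the rule that matched the speech itself, so that rule is not reconstructed here.

```python
from urllib.robotparser import RobotFileParser

SPEECH = "http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx"
# Hypothetical URI, used only to exercise one of the quoted rules.
NEWS_STORY = "http://www.conservatives.com/News/News_stories/2009/example.aspx"

# Assumption: the quoted excerpts do not show their User-agent line.
EARLY_EXCERPT = [
    "User-agent: *",
    "Disallow: /News/News_stories/2008/",
    "Disallow: /News/News_stories/2009/",
    "Disallow: /News/News_stories/2010/01/",
]
LATE_EXCERPT = [
    "User-agent: *",
    "Disallow: /News/Blogs.aspx",
    "Disallow: /News/Blogs/",
]


def allowed(robots_lines, uri, agent="ia_archiver"):
    """Return True if `agent` may fetch `uri` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, uri)


print("early excerpt, news story:", allowed(EARLY_EXCERPT, NEWS_STORY))
print("late excerpt, news story: ", allowed(LATE_EXCERPT, NEWS_STORY))
print("late excerpt, speech:     ", allowed(LATE_EXCERPT, SPEECH))
```

A path is blocked when it starts with a Disallow prefix, which is why narrowing the rules to /News/Blogs leaves the speech URI unmatched.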

The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:



Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate.  Instead of returning a 404 response at the original URI (as Apache does), the server issues a 302 redirect to another URI that returns the 404:



See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.
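The redirect-then-404 pattern from Note 3 can be recognized from a recorded chain of responses. The sketch below is a pure function over (URI, status) pairs, so it makes no network requests; the error-page URI is a hypothetical placeholder, not the real redirect target. For this particular pattern, it treats the first URI in the chain as the one to look up in the archives.

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}


def effective_status(hops):
    """Return the HTTP status of the final response in a redirect chain."""
    if not hops:
        raise ValueError("empty redirect chain")
    return hops[-1][1]


def is_redirected_404(hops):
    """True if the chain passes through a redirect and ends in a 404."""
    redirected = any(status in REDIRECT_CODES for _, status in hops[:-1])
    return redirected and effective_status(hops) == 404


def archive_lookup_key(hops):
    """URI to query archives with: the one originally requested."""
    if not hops:
        raise ValueError("empty redirect chain")
    return hops[0][0]


# Recorded chain; the second URI is a hypothetical placeholder.
chain = [
    ("http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx", 302),
    ("http://www.conservatives.com/hypothetical-error-page.aspx", 404),
]
print(is_redirected_404(chain), archive_lookup_key(chain))
```

A client that checked only the first response would see a 302 and might wrongly conclude the page still exists, which is why the final status matters.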

--Michael
