2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

.@Conservatives put speeches in Streisand's house: http://t.co/6aRiOsHwxO @UKWebArchive: http://t.co/BGD3tYavEx via @lljohnston @hhockx
— Michael L. Nelson (@phonedude_mln) November 13, 2013

Circulating the web last week was the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below). Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving its archived versions of the pages (see Note 2 below for a discussion of robots.txt, as well as an update about availability in the Internet Archive). But even though the Internet Archive allows site owners to redact pages from its archive, mementos of the pages likely exist in other archives. Yes, the Internet Archive was the first web archive and is still by far the largest with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details).

Consider this randomly chosen 2009 speech:

http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx

Right now it produces a custom 404 page (see Note 3 below):

Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):

So it seems clear that this speech will not disappear down a memory hole. But how do you discover these copies in these archives? Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework. If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.
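For readers who would rather script the lookup than right-click, here is a minimal sketch of the datetime negotiation a Memento client performs: it asks a TimeGate for the original URI and states the desired date in an Accept-Datetime request header. The TimeGate base URI is the LANL aggregator's, as it appears in the TimeMap in this post; the function names are ours and purely illustrative, and whether that endpoint still answers is something we are assuming.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request

# Aggregator TimeGate base, taken from the TimeMap listing in this post.
TIMEGATE = "http://mementoproxy.lanl.gov/aggr/timegate/"


def timegate_request(uri_r, when):
    """Return (url, headers) for a datetime-negotiation request.

    `when` must be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime carries an HTTP-style date.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE + uri_r, headers


def closest_memento(uri_r, when):
    """Send the request; the TimeGate is expected to redirect to a memento.

    Not called here because it needs network access to the aggregator.
    """
    url, headers = timegate_request(uri_r, when)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.geturl()  # URI we ended up at after redirects


speech = ("http://www.conservatives.com/News/Speeches/2009/11/"
          "David_Cameron_The_Big_Society.aspx")
url, headers = timegate_request(
    speech, datetime(2009, 12, 8, 2, 48, 37, tzinfo=timezone.utc))
print(url)
print(headers)
```

Building the request separately from sending it keeps the interesting part (the header and the TimeGate URI) inspectable without touching the network.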

If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):

curl -i http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx
HTTP/1.1 200 OK
Server: nginx/1.2.8
Date: Wed, 20 Nov 2013 23:37:46 GMT
Content-Type: application/link-format
Transfer-Encoding: chunked
Connection: keep-alive

<http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="original"
, <http://www.webarchive.org.uk:80/wayback/memento/20091208024837/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Tue, 08 Dec 2009 02:48:37 GMT"
, <http://wayback.archive-it.org/all/20100414195415/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 14 Apr 2010 19:54:15 GMT"
, <http://wayback.archive-it.org/all/20100430003736/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 30 Apr 2010 00:37:36 GMT"
, <http://archive.is/20130102062852/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 02 Jan 2013 06:28:52 GMT"
, <http://mementoproxy.lanl.gov/aggr/timegate/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="timegate"
, <http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="self"; type="application/link-format"; from ="Tue, 08 Dec 2009 02:48:37 GMT";until="Wed, 02 Jan 2013 06:28:52 GMT"
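If you want to process a TimeMap like the one above programmatically, the link-format body is simple enough to pick apart with two regular expressions. What follows is a rough sketch rather than a full link-format parser: it assumes every attribute value is double-quoted (as in the response above), and the function names are ours. Note that splitting on commas would not work, because the datetime values themselves contain commas.

```python
import re
from email.utils import parsedate_to_datetime

# One link: <uri> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return one dict per link, holding its URI and its attributes."""
    links = []
    for uri, attr_text in LINK_RE.findall(text):
        links.append({"uri": uri, **dict(ATTR_RE.findall(attr_text))})
    return links


def mementos(text):
    """Return (datetime, URI-M) pairs for rel="memento" links, oldest first."""
    found = []
    for link in parse_timemap(text):
        if "memento" in link.get("rel", "").split():
            found.append((parsedate_to_datetime(link["datetime"]), link["uri"]))
    return sorted(found)


# Abbreviated copy of the TimeMap above (two of the four mementos).
URI_R = ("http://www.conservatives.com/News/Speeches/2009/11/"
         "David_Cameron_The_Big_Society.aspx")
SAMPLE = f'''<{URI_R}>;rel="original"
, <http://www.webarchive.org.uk:80/wayback/memento/20091208024837/{URI_R}>;rel="memento"; datetime="Tue, 08 Dec 2009 02:48:37 GMT"
, <http://archive.is/20130102062852/{URI_R}>;rel="memento"; datetime="Wed, 02 Jan 2013 06:28:52 GMT"
, <http://mementoproxy.lanl.gov/aggr/timegate/{URI_R}>;rel="timegate"
'''

for when, uri_m in mementos(SAMPLE):
    print(when.isoformat(), uri_m)
```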

The nice thing about the multi-archive access of Memento is that as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one at a time for a particular URI.
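To make the aggregation idea concrete, here is a toy sketch of combining per-archive holdings into a single chronological list. This is not how the LANL aggregator is implemented (we make no claim about its internals); it only illustrates why adding an archive is cheap for the end-user. The timestamps and URI prefixes are copied from the TimeMaps in this post, and the helper names are hypothetical.

```python
import heapq

URI_R = ("http://www.conservatives.com/News/Speeches/2009/11/"
         "David_Cameron_The_Big_Society.aspx")


def holdings(prefix, timestamps):
    """Build a sorted list of (timestamp, URI-M) pairs for one archive."""
    return [(ts, f"{prefix}{ts}/{URI_R}") for ts in sorted(timestamps)]


def merge_holdings(*archives):
    """Merge already-sorted per-archive lists into one chronological list.

    14-digit timestamps sort correctly as plain strings.
    """
    return list(heapq.merge(*archives))


uk = holdings("http://www.webarchive.org.uk:80/wayback/memento/",
              ["20091208024837"])
archive_it = holdings("http://wayback.archive-it.org/all/",
                      ["20100414195415", "20100430003736"])
archive_is = holdings("http://archive.is/", ["20130102062852"])

merged = merge_holdings(uk, archive_it, archive_is)

# When another archive becomes reachable it is just one more input list.
ia = holdings("http://web.archive.org/web/",
              ["20091113144156", "20091115092325"])
merged_with_ia = merge_holdings(uk, archive_it, archive_is, ia)

for timestamp, uri_m in merged_with_ia:
    print(timestamp, uri_m)
```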

We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:

Internet Archive: crawl everything
Archive-It: collections defined by subscribers
UK Web Archive: archive all UK websites (conservatives.com is a UK website even though it is not in the .uk domain)
Archive.is: archives individual pages on user request

Download and install the Chrome extension and all of these archives and more will be easily available to you.

-- Michael and Herbert

Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt. For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013). "We've always been at war with Eastasia."

Note 2: While I was writing this post, the Internet Archive stopped blocking access to this speech (and presumably the others). Here is the raw HTTP of the speech being blocked (the key is the "X-Archive-Wayback-Runtime-Error:" line):

But access was restored sometime in the space of three hours before I could generate a screen shot:

Why was it restored? Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?). The 08:36:36 version of robots.txt has:

...

Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/

...

But the 18:10:19 version has:

...

Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/

...

These "Disallow" rules no longer match the URI of the original speech. My guess is that the Internet Archive cached the earlier disallow rules and the cached copy only now expired, one week later. See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.
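You can check prefix matching of this kind with Python's standard urllib.robotparser. The sketch below makes several assumptions: the excerpts above do not show a User-agent line, so we place the rules under "User-agent: *"; the user-agent name "ia_archiver" is the one historically associated with the Internet Archive's exclusion handling; and the rule that actually blocked the speech sits in the elided part of the earlier file, so "/News/Speeches/" here is a hypothetical stand-in, not a quote from the real robots.txt.

```python
from urllib.robotparser import RobotFileParser

SPEECH = ("http://www.conservatives.com/News/Speeches/2009/11/"
          "David_Cameron_The_Big_Society.aspx")


def allowed(disallow_paths, url, agent="ia_archiver"):
    """Check url against Disallow rules under an assumed 'User-agent: *'."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *"] +
                 [f"Disallow: {path}" for path in disallow_paths])
    return parser.can_fetch(agent, url)


# Rules quoted above from the 18:10:19 version of robots.txt.
later_rules = ["/News/Blogs.aspx", "/News/Blogs/"]

# Hypothetical rule for illustration only; the real blocking rule is in the
# elided portion of the 08:36:36 version and is not reproduced in this post.
hypothetical_rules = ["/News/Speeches/"]

print("later rules allow the speech:", allowed(later_rules, SPEECH))
print("hypothetical rule allows it: ", allowed(hypothetical_rules, SPEECH))
```

Disallow values are matched as path prefixes, which is why a rule for /News/Blogs/ has no effect on a URI under /News/Speeches/.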

The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:

curl -i http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx
HTTP/1.1 200 OK
Server: nginx/1.2.8
Date: Thu, 21 Nov 2013 04:54:59 GMT
Content-Type: application/link-format
Transfer-Encoding: chunked
Connection: keep-alive

<http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="original"
, <http://web.archive.org/web/20091113144156/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 13 Nov 2009 14:41:56 GMT"
, <http://web.archive.org/web/20091115092325/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Sun, 15 Nov 2009 09:23:25 GMT"
, <http://www.webarchive.org.uk:80/wayback/memento/20091208024837/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Tue, 08 Dec 2009 02:48:37 GMT"
, <http://web.archive.org/web/20100311204924/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Thu, 11 Mar 2010 20:49:24 GMT"
, <http://web.archive.org/web/20100414195415/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 14 Apr 2010 19:54:15 GMT"
, <http://web.archive.org/web/20100430003736/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 30 Apr 2010 00:37:36 GMT"
, <http://web.archive.org/web/20110504022406/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 04 May 2011 02:24:06 GMT"
, <http://web.archive.org/web/20110622062740/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 22 Jun 2011 06:27:40 GMT"
, <http://web.archive.org/web/20110629025346/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 29 Jun 2011 02:53:46 GMT"
, <http://web.archive.org/web/20111026224126/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 26 Oct 2011 22:41:26 GMT"
, <http://web.archive.org/web/20111028184131/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 28 Oct 2011 18:41:31 GMT"
, <http://web.archive.org/web/20111126195922/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Sat, 26 Nov 2011 19:59:22 GMT"
, <http://web.archive.org/web/20111227213740/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Tue, 27 Dec 2011 21:37:40 GMT"
, <http://web.archive.org/web/20120127204027/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 27 Jan 2012 20:40:27 GMT"
, <http://web.archive.org/web/20120208200018/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 08 Feb 2012 20:00:18 GMT"
, <http://web.archive.org/web/20120302180234/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 02 Mar 2012 18:02:34 GMT"
, <http://web.archive.org/web/20120306041605/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Tue, 06 Mar 2012 04:16:05 GMT"
, <http://web.archive.org/web/20120314083213/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 14 Mar 2012 08:32:13 GMT"
, <http://web.archive.org/web/20120510135248/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Thu, 10 May 2012 13:52:48 GMT"
, <http://web.archive.org/web/20120714070101/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Sat, 14 Jul 2012 07:01:01 GMT"
, <http://web.archive.org/web/20121020123016/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Sat, 20 Oct 2012 12:30:16 GMT"
, <http://web.archive.org/web/20121021101909/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Sun, 21 Oct 2012 10:19:09 GMT"
, <http://web.archive.org/web/20121125074732/http://www.conservatives.com/news/speeches/2009/11/david_cameron_the_big_society.aspx>;rel="memento"; datetime="Sun, 25 Nov 2012 07:47:32 GMT"
, <http://archive.is/20130102062852/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Wed, 02 Jan 2013 06:28:52 GMT"
, <http://web.archive.org/web/20130315134917/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Fri, 15 Mar 2013 13:49:17 GMT"
, <http://web.archive.org/web/20130321201457/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Thu, 21 Mar 2013 20:14:57 GMT"
, <http://web.archive.org/web/20130325012916/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Mon, 25 Mar 2013 01:29:16 GMT"
, <http://web.archive.org/web/20130402093938/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Tue, 02 Apr 2013 09:39:38 GMT"
, <http://web.archive.org/web/20130408042402/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Mon, 08 Apr 2013 04:24:02 GMT"
, <http://web.archive.org/web/20130827094525/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="memento"; datetime="Tue, 27 Aug 2013 09:45:25 GMT"
 , <http://mementoproxy.lanl.gov/aggr/timegate/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="timegate"
 , <http://mementoproxy.lanl.gov/aggr/timemap/link/1/http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx>;rel="self"; type="application/link-format"; from ="Fri, 13 Nov 2009 14:41:56 GMT";until="Tue, 27 Aug 2013 09:45:25 GMT"

Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate. Instead of returning a 404 response at the original URI (as Apache does), the server issues a 302 redirect to another URI, and that second URI returns the 404:

See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.
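A client that wants to notice this pattern has to follow the redirects itself and look at the whole chain, not just the first status code. Here is a minimal sketch under a few assumptions: plain http URIs only, a hop limit we picked arbitrarily, and function names of our own invention. The error-page URI used in the example chain is a hypothetical placeholder, since the actual redirect target is not reproduced in this post.

```python
from http.client import HTTPConnection
from urllib.parse import urljoin, urlsplit

REDIRECTS = (301, 302, 303, 307, 308)


def fetch_chain(url, max_hops=5):
    """Follow redirects by hand, returning [(url, status), ...] per hop.

    Not called below because it needs network access.
    """
    chain = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        connection = HTTPConnection(parts.netloc, timeout=10)
        connection.request("GET", path)
        response = connection.getresponse()
        location = response.getheader("Location")
        chain.append((url, response.status))
        connection.close()
        if response.status not in REDIRECTS or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return chain


def redirects_to_404(chain):
    """True when the first hop is a redirect and the last hop is a 404."""
    return (len(chain) > 1
            and chain[0][1] in REDIRECTS
            and chain[-1][1] == 404)


original = ("http://www.conservatives.com/News/Speeches/2009/11/"
            "David_Cameron_The_Big_Society.aspx")
# Hypothetical chain shaped like the behavior described in Note 3.
example = [(original, 302),
           ("http://www.conservatives.com/hypothetical-error-page", 404)]
print(redirects_to_404(example))
```

When the chain ends in a 404, the URI to use as the key for archive lookups is the first one in the chain, not the error page it redirected to.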

--Michael
