2011-07-28: Web Video Discussing Preservation Disappears After 24 Hours

One week ago (July 21, 2011) I was fortunate enough to be invited to speak about Web Archiving on Canada AM, sort of like the Today Show or Good Morning America in the US. I was asked to appear on the program in part because of the July 17, 2011 article in the Washington Post, which followed a July 6, 2011 blog post for the Chronicle of Higher Education, which was based on a June 23, 2011 blog post about our JCDL 2011 paper "How Much of the Web is Archived?". In other words, the process went like this: step 1 - get lucky & step 2 - let preferential attachment do its thing.

I was able to do the appearance in Washington DC, while attending the NDSA/NDIIPP 2011 Partner Meetup. The morning of July 21, I took a taxi to an ABC studio in DC, did the interview (about 4 minutes) and took a taxi back to the conference in time to make the morning session. I had not been on TV before and was both nervous and excited. The local and Canadian crew made the entire experience painless and the whole interview was over right as I started to get comfortable.

Given the short time, I tried to stress two topics: the first is that the ODU/LANL Memento project is not a new archive, but rather a way to leverage all existing web archives at once (this is a common misunderstanding we've experienced in the past). The other point I tried to make was that much of our cultural discourse occurs on the web and we should try to preserve as much of that as possible (including things like lolcats) because we (collectively) do a bad job at predicting what will be important in the future. Shortly after airing, the video segment was available online at:

http://www.ctv.ca/canadaam/?video=504307

As the URI suggests, this is the homepage for Canada AM (http://www.ctv.ca/canadaam/), but with an argument ("?video=504307") specifying which video segment (i.e., each individual story -- not the entire morning's show) to display. I shared the video URI with colleagues, friends, and family and was enjoying my 4 minutes of fame (I should still have 11 left in the bank). I had not made a local copy of the video because their web site obfuscated the actual URI of the streaming video, I had to finish the rest of the conference and drive back to Norfolk, and I thought I would have the time to figure it out after I returned.

So imagine my surprise on Friday at about lunch time when I reloaded the URI and did not see the video, but instead a newly redesigned Canada AM web page! The video of me making the point that we should save web resources lasted approximately 24 hours. I don't mean to seem ungrateful for the opportunity Canada AM afforded me, but as a professor I try to see everything as a teaching opportunity, so here it goes...

Sometime on Friday morning (July 22), the entire web site was redesigned and the old URIs no longer worked (cf. "Cool URIs Don't Change"). The video id was an argument and is now silently ignored, so even worse than a 404 you now get a "soft 404":

% curl -I http://www.ctv.ca/canadaam/\?video=504307
HTTP/1.1 200 OK
Server: Apache/2.2.14 (Ubuntu)
Content-Type: text/html
X-Varnish: 2550613724
Date: Thu, 28 Jul 2011 16:55:48 GMT
Connection: keep-alive

The soft 404 means people clicking on the original video link in Facebook, Twitter, email, etc. won't even see an error page -- they see the new site, but without the video or indication that the video is missing. The new site has a link titled "watch full shows", with the URI:

http://www.ctv.ca/canadaAMPlayer/index.html

Which is textually described as the "Canada AM Video Archive", but the archive begins on July 22, 2011 -- one day after my appearance! The new segments are available at URIs of the form:

http://www.ctv.ca/canadaAMPlayer/index.html?video=504933

The older videos are not available, not even as an argument to the new URI, which also returns a soft 404 (i.e., the video is not available despite the 200 response):

% curl -I http://www.ctv.ca/canadaAMPlayer/index.html\?video=504307
HTTP/1.1 200 OK
Server: Apache/2.2.14 (Ubuntu)
Content-Type: text/html
X-Varnish: 2550976182
Date: Thu, 28 Jul 2011 17:35:35 GMT
Connection: keep-alive


The video ids seem to be continuous (i.e., they did not appear to start over with "1"), so URL rewriting could easily make all the old video URIs continue to work, unless whatever CMS hosted those videos has been retired with no migration path forward.
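
To make the URL rewriting idea concrete, here is a hedged sketch of the mapping a server could apply before issuing a 301 redirect. It is purely hypothetical: the function name is invented, and it assumes the old videos had been migrated so that the new player could serve their ids, which the soft 404 above shows was not the case.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

OLD_PATH = "/canadaam/"                   # path used before the redesign
NEW_PATH = "/canadaAMPlayer/index.html"   # path used after the redesign


def rewrite_old_video_uri(uri: str) -> Optional[str]:
    """Map an old-style video URI to the new player URI.

    Returns None when the URI is not an old-style video URI, so the caller
    can fall through to normal handling instead of redirecting.
    """
    parts = urlsplit(uri)
    if parts.path != OLD_PATH:
        return None
    video_ids = parse_qs(parts.query).get("video")
    if not video_ids or not video_ids[0].isdigit():
        return None
    query = urlencode({"video": video_ids[0]})
    return urlunsplit((parts.scheme, parts.netloc, NEW_PATH, query, ""))


if __name__ == "__main__":
    print(rewrite_old_video_uri("http://www.ctv.ca/canadaam/?video=504307"))
```

In practice this kind of rule usually lives in the web server configuration rather than in application code; the Python version is only meant to show how little logic is involved.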

Here are some screen shots of the newly redesigned home page (left) and the video archive page (right) from July 22:

Of course, I did not think to make a screen shot of the original home page, or the page of my video because I thought it would live longer than 24 hours! I was able to find a recent (December 8, 2010) copy in the Internet Archive's Wayback Machine:

http://web.archive.org/web/20101208084455/http://www.ctv.ca/canadaam/

And I also pushed the two pages above to WebCite, which nicely contrasts two styles of giving URIs for archived pages (URI-M in Memento parlance):


http://www.webcitation.org/60NizRC0o
http://www.webcitation.org/60Nj60H8D

The IA's URIs violate the W3C "good practice" of URI opacity, but they sure are handy for humans. WebCite actually offers both styles of URIs; for example, the latter of the two URIs above is equivalent to:

http://www.webcitation.org/query?url=http%3A%2F%2Fwww.ctv.ca%2FcanadaAMPlayer%2Findex.html&date=2011-07-22

But the resulting URI encoding, while technically correct, is not conducive to easy memorizing and exploration by humans. Different styles of using a URI as an argument to another URI will be explored in a future blog post.
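
The two human-readable styles discussed above differ in how the original URI is embedded: the Wayback Machine appends it unencoded as a path suffix, while the WebCite query form passes it as a percent-encoded argument. The sketch below simply reconstructs the two example URIs from this post. The function names are invented for illustration, and the URI templates are inferred only from the examples shown here, not from either archive's documentation.

```python
from urllib.parse import urlencode


def wayback_uri(timestamp: str, uri_r: str) -> str:
    """Wayback Machine style: 14-digit timestamp, then the original URI as-is."""
    return f"http://web.archive.org/web/{timestamp}/{uri_r}"


def webcite_query_uri(uri_r: str, date: str) -> str:
    """WebCite query style: the original URI is percent-encoded as an argument."""
    return "http://www.webcitation.org/query?" + urlencode(
        {"url": uri_r, "date": date}
    )


if __name__ == "__main__":
    print(wayback_uri("20101208084455", "http://www.ctv.ca/canadaam/"))
    print(webcite_query_uri("http://www.ctv.ca/canadaAMPlayer/index.html",
                            "2011-07-22"))
```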

Fortunately I was given a DVD of the session, from which I was able to rip a copy and upload it to YouTube, provided below with the dual interests of vanity and pedagogy. I'm not sure about its status with respect to copyright, so it might disappear in the future as well. It should be covered under fair use, but I would not count on it. However, that is also a topic for another blog post...



--Michael

2012-05-30 Update: Apparently Canada AM did create a new page about the video, including a nice, anonymously authored summary of the material with direct quotes from me:


It appears to have been authored on July 24, 2011, judging not just by the byline but by the HTTP response headers as well. For example, look at the "Last-Modified" header for this image that appears in the page:

% curl -I http://images.ctv.ca/archives/CTVNews/img2/20110721/470_professor_nelson_110721_225128.jpg
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2
Last-Modified: Sun, 24 Jul 2011 10:52:59 GMT
ETag: "a9e08e-51a4-807938c0"
Accept-Ranges: bytes
Content-Length: 20900
Content-Type: image/jpeg
Date: Wed, 30 May 2012 14:02:21 GMT
Connection: keep-alive
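
The "Last-Modified" and "Date" values above are ordinary HTTP dates, so they can be parsed and compared mechanically. A small sketch using Python's standard library follows; the function name is made up for illustration. Keep in mind that Last-Modified is whatever the server chooses to report for the image, so it is evidence for, not proof of, when the page itself was written.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def header_age_days(last_modified: str, date: str) -> int:
    """Whole days between a Last-Modified header and the response Date header."""
    modified_at: datetime = parsedate_to_datetime(last_modified)
    served_at: datetime = parsedate_to_datetime(date)
    return (served_at - modified_at).days


if __name__ == "__main__":
    last_modified = "Sun, 24 Jul 2011 10:52:59 GMT"   # from the response above
    date = "Wed, 30 May 2012 14:02:21 GMT"            # from the response above
    print(parsedate_to_datetime(last_modified).date().isoformat())
    print(header_age_days(last_modified, date))
```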

I originally wrote the above article on July 28, 2011 and I was unable to find any trace of my appearance on their site.  Perhaps I just missed it, or perhaps it was written but not yet linked.  This nicely illustrates the premise behind Martin Klein's PhD research: things rarely disappear completely, they just move to a new location; the trick is finding them.

2017-07-24 Update: Apparently Canada AM was cancelled in 2016 (just over a year ago).  Unfortunately, ctv.ca has removed the Canada AM material from the live web, and is using robots.txt to block access in the Internet Archive.


$ curl -I http://www.ctv.ca/CTVNews/SciTech/20110722/internet-website-archive-memento-project-110724/
HTTP/1.1 500 Internal Server Error
Content-Length: 38660
Content-Type: text/html
Cache-Control: max-age=300
Expires: Mon, 24 Jul 2017 20:06:34 GMT
Date: Mon, 24 Jul 2017 20:01:34 GMT
Connection: keep-alive





$ curl -i http://www.ctv.ca/robots.txt
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
X-Frame-Options: SAMEORIGIN
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET
Access-Control-Allow-Headers: Content-Type
X-UA-Compatible: IE=edge
Content-Length: 286
Cache-Control: private, no-store, must-revalidate
Expires: Mon, 24 Jul 2017 20:03:51 GMT
Date: Mon, 24 Jul 2017 20:03:51 GMT
Connection: keep-alive

user-agent:*
Disallow: /api/
Disallow: /admin/
Disallow: /servlet/*
Disallow:/generic/generated/freeheadlines/*
Disallow:/mar/images/local990/*
Disallow:/{{item.ImageUrl}}
Disallow:/CTVNews/*
Disallow:/browserconfig.xml
Disallow:/WebResource.axd
Disallow:/ScriptResource.axd
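
To see why the rules above cover the article URI, here is a minimal sketch of a matcher that treats "*" as a wildcard and "$" as an end anchor, in the style of the current Robots Exclusion Protocol. It is an illustration under assumptions: the function names are invented, only a single user-agent group like the one above is handled, Allow rules are ignored, and nothing here claims to reproduce how the Internet Archive actually interpreted the file in 2017. (Python's built-in urllib.robotparser is deliberately not used because it does prefix matching without wildcard support.)

```python
import re
from typing import List

# The rules exactly as served in the response above.
ROBOTS_TXT = """\
user-agent:*
Disallow: /api/
Disallow: /admin/
Disallow: /servlet/*
Disallow:/generic/generated/freeheadlines/*
Disallow:/mar/images/local990/*
Disallow:/{{item.ImageUrl}}
Disallow:/CTVNews/*
Disallow:/browserconfig.xml
Disallow:/WebResource.axd
Disallow:/ScriptResource.axd
"""


def disallow_patterns(robots_txt: str, agent: str = "*") -> List[str]:
    """Collect the Disallow values that follow a matching user-agent line."""
    patterns: List[str] = []
    applies = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            applies = value == agent
        elif field.lower() == "disallow" and applies and value:
            patterns.append(value)
    return patterns


def is_disallowed(path: str, patterns: List[str]) -> bool:
    """True if any pattern matches the start of the path ('*' is a wildcard)."""
    for pattern in patterns:
        anchored = pattern.endswith("$")
        core = pattern[:-1] if anchored else pattern
        regex = ".*".join(re.escape(piece) for piece in core.split("*"))
        if anchored:
            regex += "$"
        if re.match(regex, path):
            return True
    return False


if __name__ == "__main__":
    rules = disallow_patterns(ROBOTS_TXT)
    article = ("/CTVNews/SciTech/20110722/"
               "internet-website-archive-memento-project-110724/")
    print(is_disallowed(article, rules))
```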


Unfortunately, there are no other mementos in other archives, as shown by this 404 TimeMap from MemGator:

$ curl -i http://memgator.cs.odu.edu/timemap/link/http://www.ctv.ca/CTVNews/SciTech/20110722/internet-website-archive-memento-project-110724/
HTTP/1.1 404 Not Found
Server: nginx/1.11.3
Date: Mon, 24 Jul 2017 20:06:30 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 19
Connection: keep-alive
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Link, Location, X-Memento-Count, X-Generator
X-Content-Type-Options: nosniff
X-Generator: MemGator:1.0-rc7
X-Memento-Count: 0

404 page not found
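
The same check can be scripted by reading the status line and the X-Memento-Count header from a response like the one above. The sketch below works on the raw header text so that it needs no network access. The function names are invented, and it assumes that a successful TimeMap response also carries X-Memento-Count, which this post only demonstrates for the 404 case.

```python
from typing import Dict, Tuple


def parse_response_head(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split raw 'curl -i' output into (status code, lower-cased headers)."""
    lines = raw.strip().splitlines()
    status = int(lines[0].split()[1])        # e.g. "HTTP/1.1 404 Not Found"
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break                            # blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def has_mementos(raw: str) -> bool:
    """True only for a 200 response reporting at least one memento."""
    status, headers = parse_response_head(raw)
    return status == 200 and int(headers.get("x-memento-count", "0")) > 0


if __name__ == "__main__":
    # Abbreviated copy of the MemGator response shown above.
    response = """HTTP/1.1 404 Not Found
Server: nginx/1.11.3
Content-Type: text/plain; charset=utf-8
X-Generator: MemGator:1.0-rc7
X-Memento-Count: 0

404 page not found
"""
    print(has_mementos(response))
```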


Attempts to access images.ctv.ca produce 504 responses:

$ curl -i http://images.ctv.ca/archives/CTVNews/img2/20110721/470_professor_nelson_110721_225128.jpg
HTTP/1.1 504 Gateway Time-out
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 174
Expires: Mon, 24 Jul 2017 20:08:14 GMT
Date: Mon, 24 Jul 2017 20:08:14 GMT
Connection: keep-alive

<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference #97.d040f17.1500926894.d6475a1
</BODY></HTML>

$ curl -i http://images.ctv.ca/robots.txt
HTTP/1.1 504 Gateway Time-out
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 174
Expires: Mon, 24 Jul 2017 20:08:59 GMT
Date: Mon, 24 Jul 2017 20:08:59 GMT
Connection: keep-alive

<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference #97.d040f17.1500926939.d652474
</BODY></HTML>


If not for the YouTube video, I would be hard pressed to prove that I'm big in Canada.
