Tuesday, February 17, 2015

2015-02-17: Reactions To Vint Cerf's "Digital Vellum"

Don't you just love reading BuzzFeed-like articles, constructed solely of content embedded from external sources?  Yeah, me neither.  But I'm going to pull one together anyway.

Vint Cerf generated a lot of buzz last week when he gave a talk titled "Digital Vellum" at an AAAS meeting.  The AAAS version, to the best of my knowledge, is not online, but this version of "Digital Vellum" at CMU-SV from earlier the same week is probably the same.

The media (e.g., The Guardian, The Atlantic, BBC) picked up on it, because when Vint Cerf speaks people rightly pay attention.  However, the reaction from archiving practitioners and researchers was akin to having your favorite uncle forget your birthday, mostly because Cerf's talk seemed to ignore the last 20 or so years of work in preservation.  For a thoughtful discussion of Cerf's talk, I recommend David Rosenthal's blog post.  But let's get to the BuzzFeed part...

In the wake of the media coverage, I found myself retweeting many of my favorite wry responses starting with Ian Milligan's observation:

Andy Jackson went a lot further, using his web archive (!) to find out how long we've been talking about "digital dark ages":

And another one showing how long The Guardian has been talking about it:

And then Andy went on a tear with pointers to projects (mostly defunct) with similar aims as "Digital Vellum":

Andy's dead right, of course.  But perhaps Jason Scott has the best take on the whole thing:

So maybe Vint didn't forget our birthday, but we didn't get a pony either.  Instead we got a dime kitty


2015-02-17: Fixing Links on the Live Web, Breaking Them in the Archive

On February 2nd, 2015, Rene Voorburg announced the JavaScript utility robustify.js. The robustify.js code, when embedded in the HTML of a web page, helps address the challenge of link rot by detecting when a clicked link will return an HTTP 404 and using the Memento Time Travel Service to discover mementos of the URI-R. Robustify.js assigns an onclick event to each anchor tag in the HTML. When the event occurs, robustify.js makes an Ajax call to a service to test the HTTP response code of the target URI.

When robustify.js detects an HTTP 404 response code, it makes an Ajax call to a remote server, uses the Memento Time Travel Service to find mementos of the URI-R, and displays a JavaScript alert telling the user that they will be redirected to the memento.
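
The click-handling flow described above can be sketched in a few lines. This is not Voorburg's code: it is a minimal Python illustration in which `choose_destination`, `get_status`, and `find_memento` are hypothetical names, the two lookups are injected as plain functions so that no network access is needed, and the fallback to the original link when no memento is found is our assumption.

```python
from typing import Callable, Optional


def choose_destination(
    target: str,
    get_status: Callable[[str], int],
    find_memento: Callable[[str], Optional[str]],
) -> str:
    """Return the URI the browser should navigate to after a click."""
    status = get_status(target)
    if status != 404:
        # The link is not reported missing, so follow it unchanged.
        return target
    memento = find_memento(target)
    # Assumption: fall back to the original target when no memento is known.
    return memento if memento is not None else target


# Stand-in lookups built from the URIs used in this post (no network access).
missing = "http://www.dds.nl/~krantb/stellingen/"
memento = "http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/"
statuses = {missing: 404}
mementos = {missing: memento}

destination = choose_destination(
    missing,
    lambda uri: statuses.get(uri, 200),  # unknown URIs are treated as live
    mementos.get,
)
```

With these stand-ins, a click on the missing stellingen page resolves to the 1999 memento, while any other link is followed as-is.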

Our recent studies have shown that JavaScript -- particularly Ajax -- normally makes preservation more difficult, but robustify.js is a useful utility that is easily implemented to solve an important challenge. Along this thought process, we wanted to see how a tool like robustify.js would behave when archived.

We constructed two very simple test pages, both of which include links to Voorburg's missing page http://www.dds.nl/~krantb/stellingen/.
  1. http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html which does not use robustify.js
  2. http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html which does use robustify.js
In robustifyTest.html, when the user clicks on the link to http://www.dds.nl/~krantb/stellingen/, an HTTP GET request is issued by robustify.js to an API that returns an existing memento of the page:

GET /services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F HTTP/1.1
Host: digitopia.nl
Connection: keep-alive
Origin: http://www.cs.odu.edu
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4
Accept: */*
Referer: http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Fri, 06 Feb 2015 21:47:51 GMT
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.15
Access-Control-Allow-Origin: *

The resulting JSON is used by robustify.js to then redirect the user to the memento http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ as expected.
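
The request line in the exchange above can be reproduced with a short sketch. The endpoint is taken from the captured request; the function name `status_request_url` is ours, and the Python below only mimics the percent-encoding seen in the capture rather than showing how robustify.js builds the URI.

```python
from urllib.parse import quote

# Host and path as they appear in the captured request.
STATUS_ENDPOINT = "http://digitopia.nl/services/statuscode.php"


def status_request_url(target: str) -> str:
    """Build the status-check URI for a clicked link."""
    # Percent-encode ":" and "/" but keep "~" literal, matching the capture.
    return STATUS_ENDPOINT + "?url=" + quote(target, safe="~")


request_url = status_request_url("http://www.dds.nl/~krantb/stellingen/")
print(request_url)
```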

Given this success, we wanted to understand how our test pages would behave in the archives. We also included a link to the stellingen memento in our test page before archiving to understand how a URI-M would behave in the archives. We used the Internet Archive's Save Page Now feature to create the mementos at URI-Ms http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html and http://web.archive.org/web/20150206215522/http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html.

The Internet Archive re-wrote the embedded links to be relative to the archive in the memento, converting http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://www.dds.nl/~krantb/stellingen/. Upon further investigation, we noticed that robustify.js does not assign onclick events to anchor tags linking to pages within the same domain as the host page. Because all of the re-written links point within the Internet Archive (the host domain of the memento), no onclick event is assigned to any of the embedded anchor tags. Due to this design decision, robustify.js is never invoked when within the archive.
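
To make the interaction concrete, here is a small sketch of the two behaviors colliding. Both functions are simplified stand-ins under our own names: `rewrite_for_playback` only prefixes the archive and timestamp, and `gets_click_handler` reduces the same-domain rule to a hostname comparison, which is an assumption about how robustify.js performs the check.

```python
from urllib.parse import urlsplit

WAYBACK_PREFIX = "http://web.archive.org/web/"


def rewrite_for_playback(uri: str, timestamp: str) -> str:
    """Prefix a URI-R with the archive and a 14-digit timestamp."""
    return f"{WAYBACK_PREFIX}{timestamp}/{uri}"


def gets_click_handler(page_uri: str, link_uri: str) -> bool:
    """Assumed rule: only links that leave the page's host get an onclick handler."""
    return urlsplit(page_uri).hostname != urlsplit(link_uri).hostname


live_page = "http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html"
live_link = "http://www.dds.nl/~krantb/stellingen/"

archived_page = rewrite_for_playback(live_page, "20150206214019")
archived_link = rewrite_for_playback(live_link, "20150206214019")

on_live_web = gets_click_handler(live_page, live_link)
in_archive = gets_click_handler(archived_page, archived_link)
```

On the live web the link crosses hosts and is handled; in the memento both the page and the link sit under web.archive.org, so the handler is skipped.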

When the user clicks the re-written link, the 2015-02-06 memento it names does not exist, so the Internet Archive redirects the user to the closest memento, http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/. The user ends up at the 1999 memento because the Internet Archive knows how to redirect from the 2015 URI-M for a memento that does not exist to the 1999 URI-M for a memento that does exist. If the Internet Archive had no memento for http://www.dds.nl/~krantb/stellingen/, the user would simply receive a 404 and not have the benefit of robustify.js using the Memento Time Travel Service to search additional archives.
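
The redirect can be pictured as a nearest-timestamp lookup. The sketch below is only an illustration of that idea: we do not know how the Wayback Machine actually selects a capture, `closest_capture` is a hypothetical helper, and the only capture timestamp taken from our test is the 1999 one.

```python
from datetime import datetime
from typing import Iterable, Optional


def parse_timestamp(timestamp: str) -> datetime:
    """Parse a 14-digit archive timestamp such as 19990830104212."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def closest_capture(requested: str, available: Iterable[str]) -> Optional[str]:
    """Return the capture timestamp nearest to the requested one, or None."""
    captures = list(available)
    if not captures:
        # No memento at all: the user would receive a 404.
        return None
    wanted = parse_timestamp(requested)
    return min(captures, key=lambda ts: abs(parse_timestamp(ts) - wanted))


redirect_target = closest_capture("20150206214019", ["19990830104212"])
no_capture = closest_capture("20150206214019", [])
```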

The robustify.js file is archived (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) but its embedded URI-Rs are re-written by the Internet Archive.  The original, live web JavaScript has URI templates embedded in the code that are completed at run time by inserting the "yyymmddhhmmss" and "url" variable strings into the URI-R:


These templates are rewritten during playback to be relative to the Internet Archive:


Because the robustify.js is modified during archiving, we wanted to understand the impact of including the URI-M of robustify.js (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) in our test page (http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html). In this scenario, the JavaScript attempts to execute when the user clicks on the page's links, but the re-written URIs point to /web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2 (since test-r.html exists on www.cs.odu.edu, the links are relative to www.cs.odu.edu instead of archive.org).

Instead of issuing an HTTP GET for http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F, robustify.js issues an HTTP GET for http://www.cs.odu.edu/web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F which returns an HTTP 404 when dereferenced. The robustify.js script does not handle the HTTP 404 response when looking for its service, and throws an exception in this scenario. Note that the memento that references the URI-M of robustify.js does not throw an exception because the robustify.js script does not make a call to digitopia.nl/services/.
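
The misdirected request follows directly from how a root-relative reference is resolved against the page that hosts it. The sketch below handles only that one case by hand, keeping the embedded `http://` intact; `resolve_root_relative` is our own helper name, not part of robustify.js or any archive.

```python
from urllib.parse import urlsplit


def resolve_root_relative(page_uri: str, reference: str) -> str:
    """Resolve a reference that starts with a single "/" against the page's origin."""
    if not reference.startswith("/") or reference.startswith("//"):
        raise ValueError("only root-relative references are handled in this sketch")
    page = urlsplit(page_uri)
    return f"{page.scheme}://{page.netloc}{reference}"


# The re-written service URI found in the archived copy of robustify.js.
rewritten = (
    "/web/20150206214020/http://digitopia.nl/services/statuscode.php"
    "?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F"
)

from_live_page = resolve_root_relative(
    "http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html", rewritten
)
from_memento = resolve_root_relative(
    "http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html",
    rewritten,
)
```

Resolved from test-r.html, the reference lands on www.cs.odu.edu, where no such path exists; resolved from a page served by the archive, the same reference would stay within web.archive.org.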

In our test mementos, the Internet Archive also re-writes the URI-M http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/.

This memento of a memento (in a near Yo Dawg situation) does not exist. Clicking on the apparent memento of a memento link leads to the user being told by the Internet Archive that the page is available to be archived.
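
The double wrapping is what a prefix-only rewrite produces when the link is already a URI-M. The sketch below reproduces it and shows one way a rewriter could recognize such links first. The regular expression, `split_uri_m`, and `careful_rewrite` are our own illustrative assumptions, not a description of the Internet Archive's rewriting code.

```python
import re
from typing import Optional, Tuple

# Matches Internet Archive URI-Ms, including modifier suffixes such as "js_".
URI_M_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


def split_uri_m(uri: str) -> Optional[Tuple[str, str]]:
    """Return (timestamp, URI-R) if uri looks like a URI-M, else None."""
    match = URI_M_PATTERN.match(uri)
    return (match.group(1), match.group(2)) if match else None


def naive_rewrite(uri: str, timestamp: str) -> str:
    """Prefix every link, whether or not it is already a URI-M."""
    return f"http://web.archive.org/web/{timestamp}/{uri}"


def careful_rewrite(uri: str, timestamp: str) -> str:
    """Hypothetical alternative: leave links that already point at a memento."""
    return uri if split_uri_m(uri) else naive_rewrite(uri, timestamp)


uri_m = "http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/"
nested = naive_rewrite(uri_m, "20150206214019")
preserved = careful_rewrite(uri_m, "20150206214019")
```

Here `nested` is the memento-of-a-memento URI from our test page, while `preserved` is the original 1999 URI-M.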

We also created an Archive.today memento of our robustifyTest.html page: https://archive.today/l9j3O. In this memento, the functionality of the robustify script is removed, so the link sends the user to http://www.dds.nl/~krantb/stellingen/, which results in an HTTP 404 response from the live web. The link to the Internet Archive memento is re-written to https://archive.today/o/l9j3O/http://www.dds.nl/~krantb/stellingen/, which results in a redirect (via a refresh) to http://www.dds.nl/~krantb/stellingen/ and, just as before, an HTTP 404 response from the live web. Archive.today uses this redirect approach as standard operating procedure. However, Archive.today re-writes all links to URI-Ms back to their respective URI-Rs.

This is a different path to a broken URI-M than the Internet Archive takes, but results in a broken URI-M, nonetheless.  Note that Archive.today simply removes the robustify.js file from the memento, not only removing the functionality, but also removing any trace that it was present in the original page.

In an odd turn of events, our investigation into whether a JavaScript tool would behave properly in the archives has also identified a problem with URI-Ms in the archives. If web content authors continue to utilize URI-Ms to mitigate link rot or utilize tools to help discover mementos of defunct links, there is a potential that the archives may see additional challenges of this nature arising.

--Justin Brunelle