Tuesday, February 17, 2015

2015-02-17: Fixing Links on the Live Web, Breaking Them in the Archive

On February 2nd, 2015, Rene Voorburg announced the JavaScript utility robustify.js. The robustify.js code, when embedded in the HTML of a web page, helps address the challenge of link rot by detecting when a clicked link will return an HTTP 404 and using the Memento Time Travel Service to discover mementos of the URI-R. Robustify.js assigns an onclick event to each anchor tag in the HTML. When the event occurs, robustify.js makes an Ajax call to a service to test the HTTP response code of the target URI.

When an HTTP 404 response code is detected by robustify.js, it uses Ajax to make a call to a remote server, uses the Memento Time Travel Service to find mementos of the URI-R, and uses a JavaScript alert to let the user know that JavaScript will redirect the user to the memento.
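The decision described above can be summarized in a few lines. This is only a minimal sketch and not the robustify.js source: the function name chooseDestination and its return shape are hypothetical, and the status code and memento URI are passed in as plain arguments, whereas in the browser they would arrive from the Ajax calls.

```javascript
// Hypothetical sketch of the click decision; not taken from robustify.js.
// statusCode: HTTP response code reported for targetUri by a status service.
// mementoUri: URI-M found for targetUri, or null when no memento was found.
function chooseDestination(targetUri, statusCode, mementoUri) {
  if (statusCode === 404 && mementoUri) {
    // The link is dead and an archived copy exists: send the user to the memento.
    return { redirect: true, uri: mementoUri };
  }
  // The link is alive, or nothing is archived: follow the link unchanged.
  return { redirect: false, uri: targetUri };
}
```

Only the combination of a 404 and an available memento changes where the click goes; every other case falls through to the original link.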

Our recent studies have shown that JavaScript -- particularly Ajax -- normally makes preservation more difficult, but robustify.js is a useful utility that is easily implemented to solve an important challenge. Following this line of thought, we wanted to see how a tool like robustify.js would behave when archived.

We constructed two very simple test pages, both of which include links to Voorburg's missing page http://www.dds.nl/~krantb/stellingen/.
  1. http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html which does not use robustify.js
  2. http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html which does use robustify.js
In robustifyTest.html, when the user clicks on the link to http://www.dds.nl/~krantb/stellingen/, an HTTP GET request is issued by robustify.js to an API that returns an existing memento of the page:

GET /services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F HTTP/1.1
Host: digitopia.nl
Connection: keep-alive
Origin: http://www.cs.odu.edu
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4
Accept: */*
Referer: http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Fri, 06 Feb 2015 21:47:51 GMT
Content-Type: application/json; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.15
Access-Control-Allow-Origin: *

The resulting JSON is used by robustify.js to then redirect the user to the memento http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ as expected.
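The request line in the trace above can be reproduced by percent-encoding the target URI into the url query parameter. The sketch below assumes only the endpoint visible in the trace; the helper name buildStatusCheckUrl is hypothetical, and the shape of the JSON response is not shown in this post, so it is not modeled here.

```javascript
// Endpoint as it appears in the request shown above.
const STATUS_SERVICE = "http://digitopia.nl/services/statuscode.php";

// Hypothetical helper: builds the status-check URL for a link target.
// encodeURIComponent escapes ":" and "/" but leaves "~" untouched,
// which matches the encoding seen in the trace.
function buildStatusCheckUrl(targetUri) {
  return STATUS_SERVICE + "?url=" + encodeURIComponent(targetUri);
}

console.log(buildStatusCheckUrl("http://www.dds.nl/~krantb/stellingen/"));
```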

Given this success, we wanted to understand how our test pages would behave in the archives. We also included a link to the stellingen memento in our test page before archiving to understand how a URI-M would behave in the archives. We used the Internet Archive's Save Page Now feature to create the mementos at URI-Ms http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html and http://web.archive.org/web/20150206215522/http://www.cs.odu.edu/~jbrunelle/wsdl/unrobustifyTest.html.

The Internet Archive re-wrote the embedded links to be relative to the archive in the memento, converting http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://www.dds.nl/~krantb/stellingen/. Upon further investigation, we noticed that robustify.js does not assign onclick events to anchor tags linking to pages within the same domain as the host page. In the memento, an onclick event is not assigned to any of the embedded anchor tags because all of the links point to within the Internet Archive, the host domain. Due to this design decision, robustify.js is never invoked when within the archive.
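To see why the same-domain rule switches the script off, compare host names before and after re-writing. The following is a sketch under assumptions: rewriteForWayback and isSameHost are hypothetical helpers that imitate the re-writing and the host check described above, not code from the Internet Archive or from robustify.js.

```javascript
// Hypothetical helper: imitates the link re-writing described above.
function rewriteForWayback(timestamp, uriR) {
  return "http://web.archive.org/web/" + timestamp + "/" + uriR;
}

// Hypothetical helper: true when the link and the page share a host name.
function isSameHost(linkHref, pageHref) {
  return new URL(linkHref).hostname === new URL(pageHref).hostname;
}

const page = "http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html";
const link = "http://www.dds.nl/~krantb/stellingen/";

// Live web: the hosts differ, so the link would get an onclick handler.
console.log(isSameHost(link, page));

// Archive: both URIs now live on web.archive.org, so the link is skipped.
console.log(isSameHost(rewriteForWayback("20150206214019", link),
                       rewriteForWayback("20150206214019", page)));
```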

When the user clicks on the re-written link, the 2015-02-06 memento does not exist, so the Internet Archive redirects the user to the closest memento http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/. The user ends up at the 1999 memento because the Internet Archive understands how to redirect from the 2015 URI-M, for which no memento exists, to the 1999 URI-M, for which one does. If the Internet Archive had no memento for http://www.dds.nl/~krantb/stellingen/, the user would simply receive a 404 and not have the benefit of robustify.js using the Memento Time Travel Service to search additional archives.
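The "closest memento" idea can be illustrated with the 14-digit timestamps that appear in the URI-Ms. The Internet Archive's actual selection logic is not described in this post; the sketch below simply assumes that "closest" means the smallest absolute time difference, and both function names are hypothetical.

```javascript
// Parses a 14-digit YYYYMMDDhhmmss timestamp into milliseconds since the epoch (UTC).
function parseWaybackTimestamp(ts) {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (!m) throw new Error("expected a 14-digit timestamp: " + ts);
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
  return Date.UTC(year, month - 1, day, hour, minute, second);
}

// Assumption: "closest" means smallest absolute distance in time.
// Returns null when no mementos are available (the 404 case in the text).
function closestTimestamp(requested, available) {
  const want = parseWaybackTimestamp(requested);
  let best = null;
  let bestDistance = Infinity;
  for (const ts of available) {
    const distance = Math.abs(parseWaybackTimestamp(ts) - want);
    if (distance < bestDistance) {
      best = ts;
      bestDistance = distance;
    }
  }
  return best;
}
```

With the 1999 capture as the only candidate, a request for the 2015 timestamp resolves to 19990830104212, which mirrors the redirect described above.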

The robustify.js file is archived (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) but its embedded URI-Rs are re-written by the Internet Archive.  The original, live web JavaScript has URI templates embedded in the code that are completed at run time by inserting the "yyymmddhhmmss" and "url" variable strings into the URI-R:


These templates are rewritten during playback to be relative to the Internet Archive:


Because the robustify.js is modified during archiving, we wanted to understand the impact of including the URI-M of robustify.js (http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js) in our test page (http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html). In this scenario, the JavaScript attempts to execute when the user clicks on the page's links, but the re-written URIs point to /web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2 (since test-r.html exists on www.cs.odu.edu, the links are relative to www.cs.odu.edu instead of archive.org).

Instead of issuing an HTTP GET for http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F, robustify.js issues an HTTP GET for
http://www.cs.odu.edu/web/20150206214020/http://digitopia.nl/services/statuscode.php?url=http%3A%2F%2Fwww.dds.nl%2F~krantb%2Fstellingen%2F which returns an HTTP 404 when dereferenced.
The robustify.js script does not handle the HTTP 404 response when looking for its service, and throws an exception in this scenario. Note that the memento that references the URI-M of robustify.js does not throw an exception because the robustify.js script does not make a call to digitopia.nl/services/.
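The wrong host in that request follows from ordinary relative-URL resolution: a reference that begins with "/" keeps the scheme and host of the page that uses it. The sketch below uses the standard URL constructor; the helper name resolveAgainst is hypothetical.

```javascript
// The re-written, host-less service reference found in the archived script.
const rewrittenServicePath =
  "/web/20150206214020/http://digitopia.nl/services/statuscode.php" +
  "?url=" + encodeURIComponent("http://www.dds.nl/~krantb/stellingen/");

// Hypothetical helper: resolves a reference the way a browser would for a given page.
function resolveAgainst(pageUri, reference) {
  return new URL(reference, pageUri).href;
}

// From a page on www.cs.odu.edu, the request goes to www.cs.odu.edu.
console.log(resolveAgainst(
  "http://www.cs.odu.edu/~jbrunelle/wsdl/test-r.html", rewrittenServicePath));

// From a page served by the archive, the same reference stays inside web.archive.org.
console.log(resolveAgainst(
  "http://web.archive.org/web/20150206214019/http://www.cs.odu.edu/~jbrunelle/wsdl/robustifyTest.html",
  rewrittenServicePath));
```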

In our test mementos, the Internet Archive also re-writes the URI-M http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/ to http://web.archive.org/web/20150206214019/http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/.

This memento of a memento (in a near Yo Dawg situation) does not exist. Clicking on the apparent memento of a memento link leads to the user being told by the Internet Archive that the page is available to be archived.
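A doubly wrapped URI of this kind is easy to recognize mechanically. The sketch below assumes the URI-M pattern visible in this post (a 14-digit timestamp, optionally followed by a modifier such as js_); the regular expression and both function names are hypothetical and do not come from any archive's code.

```javascript
// Assumed pattern: http(s)://web.archive.org/web/<14 digits>[modifier]/<target URI>
const WAYBACK_URI_M = /^https?:\/\/web\.archive\.org\/web\/(\d{14})(?:[a-z]{2}_)?\/(.+)$/;

// Splits a Wayback-style URI-M into its timestamp and target, or returns null.
function parseWaybackUriM(uri) {
  const m = WAYBACK_URI_M.exec(uri);
  return m ? { timestamp: m[1], target: m[2] } : null;
}

// True when the target of a URI-M is itself a URI-M.
function isMementoOfMemento(uri) {
  const outer = parseWaybackUriM(uri);
  return outer !== null && parseWaybackUriM(outer.target) !== null;
}

console.log(isMementoOfMemento(
  "http://web.archive.org/web/20150206214019/http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/"));
```

A re-writer that ran such a check before wrapping a link could leave existing URI-Ms untouched, which is one possible way to avoid the situation described above.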

We also created an Archive.today memento of our robustifyTest.html page: https://archive.today/l9j3O. In this memento, the functionality of the robustify script is removed, so clicking the link sends the user to http://www.dds.nl/~krantb/stellingen/, which results in an HTTP 404 response from the live web. Archive.today re-writes all links to URI-Ms back to their respective URI-Rs, so the link to the Internet Archive memento is re-written to https://archive.today/o/l9j3O/http://www.dds.nl/~krantb/stellingen/, which results in a redirect (via a refresh) to http://www.dds.nl/~krantb/stellingen/ and an HTTP 404 response from the live web, just as before. Archive.today uses this redirect approach as standard operating procedure.

This is a different path to a broken URI-M than the Internet Archive takes, but results in a broken URI-M, nonetheless.  Note that Archive.today simply removes the robustify.js file from the memento, not only removing the functionality, but also removing any trace that it was present in the original page.
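The Archive.today behavior we observed can be imitated in a few lines. This is a sketch inferred from the re-written link in our test memento, not from Archive.today's code: the /o/ outlink format is taken from that link, while the prefix pattern and both function names are assumptions.

```javascript
// Assumed Wayback-style prefix: .../web/<14 digits>[modifier]/
const WAYBACK_PREFIX = /^https?:\/\/web\.archive\.org\/web\/\d{14}(?:[a-z]{2}_)?\//;

// Hypothetical helper: reduces a Wayback-style URI-M to its URI-R; other URIs pass through.
function toUriR(href) {
  return href.replace(WAYBACK_PREFIX, "");
}

// Hypothetical helper: builds an outlink in the form seen in our test memento.
function archiveTodayOutlink(snapshotId, href) {
  return "https://archive.today/o/" + snapshotId + "/" + toUriR(href);
}

// Both the URI-R and the Internet Archive URI-M end up at the same outlink.
console.log(archiveTodayOutlink("l9j3O", "http://www.dds.nl/~krantb/stellingen/"));
console.log(archiveTodayOutlink("l9j3O",
  "http://web.archive.org/web/19990830104212/http://www.dds.nl/~krantb/stellingen/"));
```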

In an odd turn of events, our investigation into whether a JavaScript tool would behave properly in the archives has also identified a problem with URI-Ms in the archives. If web content authors continue to utilize URI-Ms to mitigate link rot or utilize tools to help discover mementos of defunct links, there is a potential that the archives may see additional challenges of this nature arising.

--Justin Brunelle


  1. Very interesting and useful! I guess the best way to further improve the behaviour of robustify.js in the context of a webarchive would be to stop it redirecting clicks. So it should be able to detect that it is running inside a webarchive (that should take care of redirecting a user to a proper memento). Any idea on how to reliably detect being inside a webarchive using javascript?

  2. The good news is robustify.js does not assign the onclick event to an anchor tag outside of its host domain, which becomes the Wayback Machine after it is archived.

    The bad news is if content authors make a conscious effort to link to URI-Ms, we might see some strange behavior in a future memento.

    If robustify.js wants to know if it is running within an archive, it might be able to check the HTTP headers for the host page. Memento compliant archives return headers like "Memento-Datetime" and "Link: ... rel='original'", e.g.:

    $ curl -I http://web.archive.org/web/20150206214020js_/http://digitopia.nl/js/robustify-min.js
    Memento-Datetime: Fri, 06 Feb 2015 21:40:20 GMT
    Link: <http://digitopia.nl/js/robustify-min.js>; rel="original"

  3. EDIT:
    "The good news is robustify.js does not assign the onclick event to an anchor tag *INSIDE* of its host domain, which becomes the Wayback Machine after it is archived."

  4. I just updated robustify.js so that it won't attempt to alter the behaviour of links when it is run from inside a web archive (= if urls inside the script have been rewritten).
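The header check suggested in comment 2 could look roughly like the sketch below. It assumes the script has already obtained the response headers of the host page as a plain object of names and values (for example through a separate HEAD request, since a script cannot read the headers of the page that loaded it directly). The function name is hypothetical, and the Link value in the usage example is illustrative rather than a captured header.

```javascript
// Hypothetical helper: inspects response headers for Memento signals.
// headers: plain object mapping header names to values (names in any case).
function describeMementoHeaders(headers) {
  const lower = {};
  for (const [name, value] of Object.entries(headers)) {
    lower[name.toLowerCase()] = String(value);
  }
  // A Memento-Datetime header marks the response as a memento.
  const isMemento = "memento-datetime" in lower;
  // Pull the URI-R out of a Link entry whose quoted rel value contains "original".
  const m = /<([^>]*)>\s*;[^,]*\brel\s*=\s*"[^"]*\boriginal\b[^"]*"/i.exec(lower["link"] || "");
  return { isMemento: isMemento, original: m ? m[1] : null };
}

// Illustrative usage; the Link value here is an assumption, not a captured header.
console.log(describeMementoHeaders({
  "Memento-Datetime": "Fri, 06 Feb 2015 21:40:20 GMT",
  "Link": '<http://digitopia.nl/js/robustify-min.js>; rel="original"'
}));
```

This only handles quoted rel values, which is enough for a sketch but not for every Link header an archive might send.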