Monday, July 14, 2014

2014-07-14: The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript

A large part of digital preservation is crawling pages on the live Web and saving them in a format that future generations can view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often purpose-built. Because these tools are built specifically for archiving, users expect them to perform much better than a general-purpose tool.

As anyone who has looked up a complex web page in The Archive can tell you, the more complex the page, the less likely it is that all of the resources needed to replay it will be captured. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.

We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.

In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.

The Archival Acid Test evaluates features that modern browsers execute well but preservation tools have trouble handling. We have grouped these tests into three categories with various tests under each category:

The Basics

  • 1a - Local image, relative to the test
  • 1b - Local image, absolute URI
  • 1c - Remote image, absolute
  • 1d - Inline content, encoded image
  • 1e - Scheme-less resource
  • 1f - Recursively included CSS
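
Tests 1a, 1b, 1c, and 1e differ mainly in how the image reference is written, and an archiving tool has to resolve each form against the page's URI before it can fetch anything. The following is a minimal Python sketch of that resolution step using the standard library's `urljoin`. The page URI, host names, and image paths are hypothetical placeholders, not the resources the actual test uses, and treating 1b as a same-host absolute URI is an assumption.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page URI and references, for illustration only.
PAGE_URI = "http://acid.example/tests/index.html"

REFERENCES = {
    "1a relative": "img/a.png",
    "1b local absolute": "http://acid.example/tests/img/b.png",
    "1c remote absolute": "http://other.example/c.png",
    "1e scheme-less": "//other.example/e.png",
}


def resolve(reference, base=PAGE_URI):
    """Resolve a reference against the page URI, as a browser would."""
    return urljoin(base, reference)


def is_inline(reference):
    """A data URI (as in test 1d) carries its content inline; there is nothing to fetch."""
    return urlsplit(reference).scheme == "data"


RESOLVED = {name: resolve(ref) for name, ref in REFERENCES.items()}

for name, uri in RESOLVED.items():
    print(f"{name}: {uri}")
```

Note that the scheme-less reference inherits `http` from the page that embeds it, so a tool that mishandles this form may request the wrong URI or skip the resource entirely.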

JavaScript

  • 2a - Script, local
  • 2b - Script, remote
  • 2c - Script inline, DOM manipulation
  • 2d - Ajax image replacement of content that should be in archive
  • 2e - Ajax requests with content that should be included in the archive, test for false positive (e.g., same origin policy)
  • 2f - Code that manipulates DOM after a certain delay (tests the synchronicity of the tools)
  • 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
  • 2h - Code that dynamically adds stylesheets
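
One plausible reason the JavaScript tests are hard is that a tool which only reads the markup never sees resources that a script adds at run time. The Python sketch below illustrates the idea with the standard library's `html.parser`. The page and file names are hypothetical, and the extractor is a deliberately naive stand-in, not how any of the evaluated tools work.

```python
from html.parser import HTMLParser

# Hypothetical page: one image in the markup, one added by an inline script.
PAGE = """
<html><body>
<img src="static.png">
<script>
  var img = document.createElement('img');
  img.src = 'dynamic.png';
  document.body.appendChild(img);
</script>
</body></html>
"""


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag present in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def static_image_sources(html):
    """Return image URIs found by parsing the markup, without running scripts."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


FOUND = static_image_sources(PAGE)
print(FOUND)
```

The script-inserted image is absent from the result because the parser treats the script body as opaque text. A browser-based capture would request it only after executing the script.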

HTML5 Features

  • 3a - HTML5 Canvas Drawing
  • 3b - LocalStorage
  • 3c - External Webpage
  • 3d - Embedded Objects (HTML5 video)
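
Taken together, the three categories contain 18 individual tests. As a rough sketch of how a result could be tallied, the Python snippet below stores the test identifiers listed above and computes a pass count from a set of failures. The `score` helper and the failure sets passed to it are hypothetical, for illustration only, and are not part of the test suite's code.

```python
# Test identifiers as listed above, grouped by category.
TESTS = {
    "The Basics": ["1a", "1b", "1c", "1d", "1e", "1f"],
    "JavaScript": ["2a", "2b", "2c", "2d", "2e", "2f", "2g", "2h"],
    "HTML5 Features": ["3a", "3b", "3c", "3d"],
}


def score(failed):
    """Return (passed, total) given the identifiers of the failed tests."""
    all_ids = [test_id for ids in TESTS.values() for test_id in ids]
    failed_ids = set(failed)
    unknown = failed_ids - set(all_ids)
    if unknown:
        raise ValueError(f"unknown test ids: {sorted(unknown)}")
    return len(all_ids) - len(failed_ids), len(all_ids)


# Hypothetical example: a tool that fails two tests.
passed, total = score(["2g", "3b"])
print(f"{passed}/{total}")
```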

For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).

The results of this initial study (Figure 2) were accepted for publication (see the paper) at the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England.

The actual test we used is available at http://acid.matkelly.com for you to exercise with your tools/websites, and the code that runs the site is available on GitHub.

— Mat Kelly (@machawk1)

2 comments:

  1. This is a great benchmark. I've been working on the https://webrecorder.io project, which records pages as the user is browsing them. It is still in its early stages but is designed precisely for archiving complex pages and appears to get a perfect score on this test (tested in Chrome and Firefox).

    Visit:
    https://webrecorder.io/record/acid.matkelly.com

    to record, then hit "Replay" to test playback.

    Of course, this can probably be expanded to cover even more complex content, e.g., social media, streaming video, etc.

  2. Update: We were contacted by pagefreezer.com and we ran their on-demand archiving service through AAT. Their public service passes 15/18 tests, and they also have a non-public test version that passes 18/18 tests.

    Michael
