A large part of digital preservation is crawling pages on the live Web and saving them in a format that future generations can view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often built for the purpose. Because these tools are purpose-built, users expect them to perform much better than a general-purpose tool.
As anyone who has looked up a complex web page in The Archive can tell you, the more complex the page, the less likely it is that all the resources needed to replay the page will be captured. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.
We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.
In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.
The Archival Acid Test evaluates features that modern browsers execute well but preservation tools have trouble handling. We have grouped these tests into three categories, with several tests under each; the leading digit of each identifier below gives its category:
- 1a - Local image, relative to the test
- 1b - Local image, absolute URI
- 1c - Remote image, absolute
- 1d - Inline content, encoded image
- 1e - Scheme-less resource
- 1f - Recursively included CSS
- 2a - Script, local
- 2b - Script, remote
- 2c - Script inline, DOM manipulation
- 2d - Ajax image replacement of content that should be in archive
- 2e - Ajax requests with content that should be included in the archive, test for false positive (e.g., same origin policy)
- 2f - Code that manipulates DOM after a certain delay (test the synchronicity of the tools)
- 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
- 2h - Code that dynamically adds stylesheets
- 3a - HTML5 Canvas Drawing
- 3b - LocalStorage
- 3c - External Webpage
- 3d - Embedded Objects (HTML5 video)
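Tests 1a through 1e differ mainly in how the resource reference is written, and a preservation tool has to resolve each form against the page's own URI before it can fetch the resource. The following is a minimal sketch of that distinction using Python's standard `urllib.parse`; the function names, the labels, and the `example.org` / `example.net` URIs are hypothetical and are not taken from the test suite or from any of the tools evaluated.

```python
from urllib.parse import urljoin, urlsplit


def classify_reference(ref: str, page_uri: str) -> str:
    """Label a resource reference along the lines of tests 1a-1e (hypothetical labels)."""
    if ref.lower().startswith("data:"):
        return "inline"              # 1d: the content is encoded in the URI itself
    if ref.startswith("//"):
        return "scheme-less"         # 1e: inherits the scheme of the containing page
    parts = urlsplit(ref)
    if not parts.scheme:
        return "local-relative"      # 1a: must be resolved against the page URI
    same_host = parts.netloc == urlsplit(page_uri).netloc
    return "local-absolute" if same_host else "remote-absolute"  # 1b / 1c


def resolve_reference(ref: str, page_uri: str) -> str:
    """Return the absolute URI a crawler would need to request."""
    if classify_reference(ref, page_uri) == "inline":
        return ref                   # nothing to fetch; the data travels with the page
    return urljoin(page_uri, ref)


if __name__ == "__main__":
    page = "https://acid.example.org/test/index.html"  # hypothetical page URI
    for ref in ("img/a.png",
                "https://acid.example.org/img/b.png",
                "https://cdn.example.net/c.png",
                "data:image/png;base64,AAAA",
                "//cdn.example.net/d.png"):
        print(classify_reference(ref, page), resolve_reference(ref, page))
```

A tool that only handles references carrying an explicit scheme would miss the relative and scheme-less forms, which is the kind of gap these tests are meant to expose.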
For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).
The results of this initial study (Figure 2) were accepted for publication (see the paper) at the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England. (@machawk1)