Monday, August 15, 2016

2016-08-15: Mementos In the Raw, Take Two


In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives.
Ilya Kreymer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued, because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC7240. Prefer can solve this problem in an intuitive way, much like their original suggestion of content negotiation.
To recap, most web archives augment mementos when presenting them to the user, often for usability or legal purposes. The figures below show examples of these augmentations.

Figure 1: The PRONI web archive augments mementos for user experience; augmentations outlined in red

Figure 2: The UK National Archives adds additional text and a banner to differentiate their mementos from their live counterparts, because their mementos appear in Google search results
Additionally, some archives rewrite links to allow navigation within an archive. This way the end user can visit other pages within the same archive from the same time period. Smaller archives, because of the size of their collections, do not benefit as much from these rewritten links. Of course, for Memento users, these rewritten links are not really required.
In many cases, access to the original, unaltered content is needed. Some research studies, for example, require both the original HTTP response headers and the original unaltered content. Unaltered content is also needed to replay the original web content in projects like oldweb.today and TimeTravel's Reconstruct feature.
The previously proposed solution was based on the use of two TimeGates, one to access augmented content (which is the current default) and an additional one to access unaltered content. In this post, we discuss a less complex method of acquiring raw mementos. This solution provides a standard way to request raw mementos, regardless of web archive software or configuration, and eliminates the need for archive-specific or software-specific heuristics.
The raw-ness of a memento exists in several dimensions, and the level of raw-ness that is required depends on the nature of the application:
  1. No augmented content - The memento should contain no additional HTML, JavaScript, CSS, or text added for usability or any other purpose. Its content should exist as it did on the web at the moment it was captured by the web archive.
  2. No rewritten links - The links should not be rewritten. The links within the memento content should exist as they did on the web at the moment the memento was captured by the web archive.
  3. Original headers - The original HTTP response headers should be available, expressed as X-Archive-Orig-*, like X-Archive-Orig-Content-Type: text/html. Their values should be the same as those of the corresponding headers without the X-Archive-Orig- prefix (e.g. Content-Type) at the moment of capture by the web archive.
We propose a solution that uses the Prefer HTTP request header and the Preference-Applied response header from RFC7240.
Consider a client that prefers a true, raw memento for http://www.cnn.com. Using the Prefer HTTP request header, this client can provide the following request headers when issuing an HTTP HEAD/GET to a memento.
    GET /web/20160721152544/http://www.cnn.com/ HTTP/1.1
    Host: web.archive.org
    Prefer: original-content, original-links, original-headers
    Connection: close
As we see above, the client specifies which level of raw-ness it prefers in the memento. In this case, the client prefers a memento with the following features:
  1. original-content - The client prefers that the memento returned contain the same HTML, JavaScript, CSS, and/or text that existed in the original resource at the time of capture.
  2. original-links - The client prefers that the memento returned contain the links that existed in the original resource at the time of capture.
  3. original-headers - The client prefers that the memento response uses X-Archive-Orig-* to express the values of the original HTTP response headers from the moment of capture.
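A client could assemble such a request along the following lines. This is a minimal Python sketch using only the standard library; the helper name build_raw_request and the constant RAW_PREFERENCES are illustrative and not part of any Memento tooling, and the sketch assumes the archive honors the preference tokens proposed in this post. The request is constructed but not sent.

```python
from urllib.request import Request

# Preference tokens proposed in this post (not yet registered anywhere).
RAW_PREFERENCES = ("original-content", "original-links", "original-headers")


def build_raw_request(urim, preferences=RAW_PREFERENCES, method="GET"):
    """Return an unsent Request for a URI-M that carries a Prefer header."""
    return Request(urim, headers={"Prefer": ", ".join(preferences)}, method=method)


request = build_raw_request(
    "http://web.archive.org/web/20160721152544/http://www.cnn.com/"
)
print(request.get_method(), request.full_url)
print("Prefer:", request.get_header("Prefer"))
```

Passing a shorter tuple, for example ("original-links",), would express a preference for only one dimension of raw-ness.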
The memento then responds with the headers below.
    HTTP/1.1 200 OK
    Server: Tengine/2.1.0
    Date: Thu, 21 Jul 2016 17:34:15 GMT
    Content-Type: text/html;charset=utf-8
    Content-Length: 109672
    Connection: keep-alive
    set-cookie: wayback_server=60; Domain=archive.org; Path=/; Expires=Sat, 20-Aug-16 17:34:15 GMT;
    Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
    Content-Location: /web/20160721152544im_/http://www.cnn.com/
    Vary: prefer
    Preference-Applied: original-content, original-links, original-headers
    Link: <http://www.cnn.com/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.cnn.com/>; rel="timegate", <http://web.archive.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT", <http://web.archive.org/web/20160120080735/http://www.cnn.com/>; rel="first memento"; datetime="Wed, 20 Jan 2016 08:07:35 GMT", <http://web.archive.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT"
    X-Archive-Orig-x-served-by: cache-iad2120-IAD, cache-sjc3632-SJC
    X-Archive-Orig-x-cache-hits: 1, 13
    X-Archive-Orig-cache-control: max-age=60
    X-Archive-Orig-x-xss-protection: 1; mode=block
    X-Archive-Orig-content-type: text/html; charset=utf-8
    X-Archive-Orig-age: 184
    X-Archive-Orig-x-timer: S1469114744.153501,VS0,VE0
    X-Archive-Orig-set-cookie: countryCode=US; Domain=.cnn.com
    X-Archive-Orig-access-control-allow-origin: *
    X-Archive-Orig-content-security-policy: default-src 'self' http://*.cnn.com:* https://*.cnn.com:* *.cnn.net:* *.turner.com:* *.ugdturner.com:* *.vgtf.net:*; script-src 'unsafe-inline' 'unsafe-eval' 'self' *; style-src 'unsafe-inline' 'self' *; frame-src 'self' *; object-src 'self' *; img-src 'self' * data: blob:; media-src 'self' *; font-src 'self' *; connect-src 'self' *;
    X-Archive-Orig-accept-ranges: bytes
    X-Archive-Orig-vary: Accept-Encoding
    X-Archive-Orig-connection: close
    X-Archive-Orig-x-servedbyhost: prd-10-60-168-38.nodes.56m.dmtio.net
    X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
    X-Archive-Orig-via: 1.1 varnish
    X-Archive-Orig-content-length: 109672
    X-Archive-Orig-x-cache: HIT, HIT
    X-Archive-Orig-fastly-debug-digest: 1e206303e0672a50569b0c0a29903ca81f3ef5033de74682ce90ec9d13686981
The response also uses the Preference-Applied header to indicate that it is providing the original-headers and the content has its original-links and original-content. It is possible, of course, for a system to satisfy only some of these preferences, and the Preference-Applied header allows the server to indicate which ones.
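A client can compare what it asked for against what the server reports. The sketch below is one hedged way to do that in Python; the function names are hypothetical, and it only handles bare, comma-separated tokens like the three proposed here, not the value and parameter syntax that RFC7240 also allows.

```python
def parse_preferences(header_value):
    """Split a Prefer or Preference-Applied value into a set of lowercase tokens."""
    if not header_value:
        return set()
    return {token.strip().lower() for token in header_value.split(",") if token.strip()}


def unmet_preferences(prefer_value, applied_value):
    """Return the requested preferences that the server did not report as applied."""
    return parse_preferences(prefer_value) - parse_preferences(applied_value)


requested = "original-content, original-links, original-headers"

# A server that satisfies everything leaves nothing unmet.
print(sorted(unmet_preferences(requested, requested)))  # prints []

# A server that only exposes the original headers leaves two preferences unmet.
print(sorted(unmet_preferences(requested, "original-headers")))
# prints ['original-content', 'original-links']
```

A missing Preference-Applied header is treated here as "nothing applied", which is an assumption of this sketch rather than something the proposal states.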
The Vary header also contains prefer, indicating that clients can influence the memento's response by using this header. The response can then be cached for requests that have the same options in the request headers.
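To illustrate what this means for a cache, the following Python sketch builds a cache key from the URI plus the request headers named in Vary. The function cache_key is hypothetical and greatly simplified compared to a real HTTP cache; in particular it does not normalize the order of preference tokens, so two Prefer values listing the same tokens in a different order would be cached separately.

```python
def cache_key(url, request_headers, vary_value):
    """Build a cache key from the URL plus the request headers named in Vary."""
    # Header names are case-insensitive, so compare them in lowercase.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    varied = sorted(name.strip().lower() for name in vary_value.split(",") if name.strip())
    return (url,) + tuple((name, lowered.get(name, "")) for name in varied)


urim = "http://web.archive.org/web/20160721152544/http://www.cnn.com/"
raw = cache_key(urim, {"Prefer": "original-content, original-links"}, "prefer")
default = cache_key(urim, {}, "prefer")

# Requests with and without the Prefer header map to different cache entries.
print(raw != default)  # prints True
```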
Based on these preferences, the content of the response has been altered from the default. The Content-Location header informs clients of the exact URI-M that meets these preferences for this memento, in this case http://web.archive.org/web/20160721152544im_/http://www.cnn.com/.
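Because the Content-Location value in the response above is an absolute path, a client needs to combine it with the scheme and host of the request to obtain the full URI-M. A minimal Python sketch follows; resolve_urim is a hypothetical helper, and it deliberately handles only the absolute-path form seen in this example, passing any other value through unchanged.

```python
from urllib.parse import urlsplit


def resolve_urim(request_url, content_location):
    """Resolve an absolute-path Content-Location against the request's origin."""
    if not content_location.startswith("/"):
        # Assumed to already be an absolute URI; other relative forms
        # are not handled in this sketch.
        return content_location
    base = urlsplit(request_url)
    return f"{base.scheme}://{base.netloc}{content_location}"


raw_urim = resolve_urim(
    "http://web.archive.org/web/20160721152544/http://www.cnn.com/",
    "/web/20160721152544im_/http://www.cnn.com/",
)
print(raw_urim)
# prints http://web.archive.org/web/20160721152544im_/http://www.cnn.com/
```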
The memento returned contains the original content and the original links, as seen in the figure below, and the original headers provided as X-Archive-Orig-* as shown in the above response.
Figure 3: Seen in this example is a memento with original-content - no banner added - and original-links as seen in the magnified inspector output from Firefox.
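A client that needs the captured headers themselves can recover them by stripping the X-Archive-Orig- prefix from the response headers. The Python sketch below shows one way to do so; the function name original_headers is illustrative, and because it returns a plain dictionary it keeps only one value per header name, which would lose repeated headers.

```python
PREFIX = "x-archive-orig-"


def original_headers(response_headers):
    """Return the captured headers, keyed by lowercase name without the prefix."""
    return {
        name[len(PREFIX):].lower(): value
        for name, value in response_headers.items()
        if name.lower().startswith(PREFIX)
    }


# A few headers taken from the example response in this post.
response_headers = {
    "Content-Type": "text/html;charset=utf-8",
    "Memento-Datetime": "Thu, 21 Jul 2016 15:25:44 GMT",
    "X-Archive-Orig-content-type": "text/html; charset=utf-8",
    "X-Archive-Orig-date": "Thu, 21 Jul 2016 15:25:44 GMT",
    "X-Archive-Orig-via": "1.1 varnish",
}

for name, value in sorted(original_headers(response_headers).items()):
    print(f"{name}: {value}")
```

The archive's own headers, such as Content-Type and Memento-Datetime, are left out of the result because they describe the archive's response rather than the capture.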

If the client issues no Prefer header in the request, then the server can still use the Preference-Applied header to indicate which preferences are met by default. Again, the Vary header indicates that clients can influence the response via the use of the Prefer request header. The Content-Location header indicates the URI-M of the memento. The response headers for such a default memento from the Internet Archive are shown below, with its original headers expressed in the form of X-Archive-Orig-*.
    HTTP/1.1 200 OK
    Server: Tengine/2.1.0
    Date: Thu, 21 Jul 2016 16:17:09 GMT
    Content-Type: text/html;charset=utf-8
    Content-Length: 127383
    Connection: keep-alive
    set-cookie: wayback_server=60; Domain=archive.org; Path=/; Expires=Sat, 20-Aug-16 16:17:07 GMT;
    Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
    Content-Location: /web/20160721152544/http://www.cnn.com/
    Vary: prefer
    Preference-Applied: original-headers
    Link: <http://www.cnn.com/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.cnn.com/>; rel="timegate", <http://web.archive.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT", <http://web.archive.org/web/20000620180259/http://www.cnn.com/>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT", <http://web.archive.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT"
    Set-Cookie: JSESSIONID=3652A3AF37E6AF4FB5C7DEF16CC8084E; Path=/; HttpOnly
    X-Archive-Orig-x-served-by: cache-iad2120-IAD, cache-sjc3632-SJC
    X-Archive-Orig-x-cache-hits: 1, 13
    X-Archive-Guessed-Charset: utf-8
    X-Archive-Orig-cache-control: max-age=60
    X-Archive-Orig-x-xss-protection: 1; mode=block
    X-Archive-Orig-content-type: text/html; charset=utf-8
    X-Archive-Orig-age: 184
    X-Archive-Orig-x-timer: S1469114744.153501,VS0,VE0
    X-Archive-Orig-set-cookie: countryCode=US; Domain=.cnn.com
    X-Archive-Orig-access-control-allow-origin: *
    X-Archive-Orig-content-security-policy: default-src 'self' http://*.cnn.com:* https://*.cnn.com:* *.cnn.net:* *.turner.com:* *.ugdturner.com:* *.vgtf.net:*; script-src 'unsafe-inline' 'unsafe-eval' 'self' *; style-src 'unsafe-inline' 'self' *; frame-src 'self' *; object-src 'self' *; img-src 'self' * data: blob:; media-src 'self' *; font-src 'self' *; connect-src 'self' *;
    X-Archive-Orig-accept-ranges: bytes
    X-Archive-Orig-vary: Accept-Encoding
    X-Archive-Orig-connection: close
    X-Archive-Orig-x-servedbyhost: prd-10-60-168-38.nodes.56m.dmtio.net
    X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
    X-Archive-Orig-via: 1.1 varnish
    X-Archive-Orig-content-length: 109672
    X-Archive-Orig-x-cache: HIT, HIT
    X-Archive-Orig-fastly-debug-digest: 1e206303e0672a50569b0c0a29903ca81f3ef5033de74682ce90ec9d13686981
For this default memento, shown in the figure below, the links are rewritten and the presence of the Wayback banner indicates that additional content has been added.
Figure 4: This default memento contains added content in the form of a banner outlined in red on top as well as rewritten links, shown using Firefox's inspector and magnified on the bottom.
We are confident that it is legitimate to use the Prefer header in this way. Even though the original RFC contains examples requesting different representations using only the PATCH, PUT, and POST methods, a draft RFC for the "safe" HTTP preference mentions its use with GET in order to modify the content of the requested page. This draft RFC has already been implemented in Mozilla Firefox and Internet Explorer. It is also used in the W3C Open Annotation Protocol to indicate the extent to which a resource should include annotations in its representation.
Compared to our previously described approach, this solution is simpler and more intuitive. It also allows other client preferences to be introduced over time, should such a need emerge. These preferences can and should be registered in accordance with RFC7240. The client specifies which features of a memento it prefers, and the memento's response indicates, via Preference-Applied, which of those features it has satisfied.
We seek feedback on this solution, including what additional dimensions clients may prefer beyond the three we have specified.
--
Herbert Van de Sompel
- and -
Michael L. Nelson
- and -
Lyudmila Balakireva
- and -
Martin Klein
- and -
Harihar Shankar

2 comments:

  1. This is unambiguously better than our original idea!

  2. (posting on behalf of webmaster@archive.is)

    "screenshot" would be a good idea. Archive.org stores screenshots for some pages as well and there are screenshot-only archives (for example http://research.domaintools.com/research/screenshot-history/). It may make sense to split the feature into "fullsize-screenshot" (domaintools.com) and "partial-screenshot" (archive.org and archive.is).

    As for "DOM transformed", "original in .zip" - the categories covering the only instance of archive.is - I would put archive.is into the group of "derivative-document" archives which can supply the stored pages as MHT or PDF (it looks trivial to add rendering to PDF and MHT to archive.is and to archive.org-like archives; it can be done even on memento side if it is too difficult or too long to convince each archive).

    "screenshot" could be merely one of the "derivative-document" formats. Then producing on-demand screenshots from the old mementos (captured long before screenshotting was implemented by archive.org) can be done by something like oldweb.today, thus turning it into the screenshot-only archive of archive.org's extent.
