Wednesday, April 27, 2016

2016-04-27: Mementos in the Raw

While analyzing mementos in a recent experiment, we discovered problems processing archived content.  Many web archives augment the mementos they serve with additional archive-specific information, including HTML, text, and JavaScript.  We were attempting to compare content across many web archives, and had to develop custom solutions to remove these augmentations.

Most augment their mementos in order to provide additional user experience features, such as navigation to additional mementos, by rewriting links and providing additional discovery tools. From an end-user perspective, these augmented mementos enhance the usability and overall experience of web archives and are the default case for user access to mementos.  An example from the PRONI web archive is shown below, with the augmentations outlined in red.



Others have requirements to differentiate archived content from live content, because they expose archived content to web search engines. Below, we see that a Google search will return content from the UK National Archives, with one of these search results outlined in red.
To indicate the archived nature of this content, the title of the web page, outlined in red below, has been altered to indicate that this archived page is "[ARCHIVED CONTENT]".


Our experiments were adversely affected by these augmentations. We required "mementos in the raw".  In the case of our study, we needed to access the content as it had existed on the web at the time of capture.  Research by Scott Ainsworth requires accurate replay of the headers as well. These captured mementos are also invaluable to the growing number of research studies that use web archives. Captured mementos are also used by projects like oldweb.today, which needs access to the original content so that it can be rendered in old browsers and which seeks consistent content from different archives to arrive at an accurate page recreation. Fortunately, some web archives store the captured memento, but there is no uniform, standards-based way to access them across various archive implementations.

Based on the needs of these research studies and software projects:
  1. A captured memento must contain only the memento content that was present in the original document:
    • no HTML, JavaScript, CSS, or text has been added to the output
    • linked URIs are not rewritten and exist as they were in the original document (e.g., http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/news/index.html should just be http://www.lanl.gov/news/index.html)
  2. A captured memento should also provide the original HTTP headers in some form (e.g., X-Archive-Orig-Content-Type: text/html for users desiring the original Content-Type)
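The two requirements above can be illustrated with a minimal Python sketch. It assumes an archive that exposes original headers with the X-Archive-Orig- prefix and rewrites links by prepending a path ending in a 14-digit timestamp, as in the examples above. The function names original_headers and unrewrite are hypothetical and not part of any archive's software.

```python
import re

# Assumption: the archive prefixes original headers with "X-Archive-Orig-".
ORIG_PREFIX = "x-archive-orig-"

# Assumption: rewritten links look like <archive prefix>/<14-digit timestamp>/<original URI>,
# optionally with a two-letter flag such as "id_" or "im_" after the timestamp.
REWRITTEN = re.compile(r"^https?://.+?/\d{14}(?:[a-z]{2}_)?/(.+)$")


def original_headers(response_headers):
    """Recover the original headers from the prefixed headers of a memento response."""
    recovered = {}
    for name, value in response_headers.items():
        if name.lower().startswith(ORIG_PREFIX):
            recovered[name[len(ORIG_PREFIX):]] = value
    return recovered


def unrewrite(link):
    """Return the original URI for a rewritten link, or the link itself if it is not rewritten."""
    match = REWRITTEN.match(link)
    return match.group(1) if match else link


headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Archive-Orig-Content-Type": "text/html",
}
print(original_headers(headers))
print(unrewrite("http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/news/index.html"))
```

A client that has to apply this kind of post-processing is exactly the heuristic-driven client the rest of this post tries to make unnecessary.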
The following table lists some known web archives and their ability to provide captured mementos, in the form of unaltered content, the original headers, or both. A "Yes" in a column indicates that the archive is able to provide access to that dimension of captured mementos using software-specific approaches.


Those entries with a ? and other archives not listed may or may not provide access to captured mementos. This ambiguity is part of the problem.  Those archives that run OpenWayback for serving their mementos have the capability to deliver captured mementos, as detailed in the OpenWayback Administrator Manual, by use of special URIs. In fact, the OpenWayback im_ URI flag provides the desired behavior, with original headers and original content, even though the documentation states that it is supposed to "return document as an image".

Of course, not all web archives run OpenWayback, and developers have needed to create heuristics based on the software used by each individual web archive.  For example, our archive registry uses the un-rewritten-api-url attribute to provide a pattern for accessing captured mementos. Because there is no uniform approach, these pattern-based solutions are necessary but brittle, tying them to a small set of specific implementations, and making it difficult for clients to adapt to new or changing web archive software.
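As an illustration of such a pattern-based approach, the sketch below expands an un-rewritten-api-url pattern of the form used in the Registry (with {timestamp} and {url} placeholders) into a captured memento URI. The helper name expand_pattern is hypothetical, and the placeholder names are assumed from the Icelandic Web Archive entry shown later in this post.

```python
def expand_pattern(pattern, timestamp, url):
    """Fill the {timestamp} and {url} placeholders of an un-rewritten-api-url pattern."""
    # Plain replacement is used instead of str.format so that any other
    # braces in the pattern are left untouched.
    return pattern.replace("{timestamp}", timestamp).replace("{url}", url)


pattern = "http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}"
print(expand_pattern(pattern, "20091117131348", "http://www.lanl.gov/news/index.html"))
```

The client must already know the exact 14-digit timestamp and must trust that the pattern still matches the archive's software, which is why this approach is brittle.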
We propose a solution that uses the Memento specification (RFC 7089) in its current form, while still allowing uniform, standards-based access to both augmented and captured mementos.

Proposed Solution for Accessing Augmented and Captured Mementos

We propose two parallel Memento implementations: one with a TimeGate and TimeMap for access to augmented mementos (as currently exists) and another with a TimeGate and TimeMap for access to captured mementos.  A client that desires access to a specific type of memento (captured or augmented) only needs to access the TimeGate or TimeMap that specializes in finding and returning that type of memento. These parallel Memento implementations are based on the same infrastructure, the interactions are the same, and the only difference is in the nature of the memento each serves.

Clients could use the Archive Registry for discovering these TimeGates and TimeMaps. The Registry contains entries for many public web archives and version control systems, for each detailing its TimeGate and TimeMap URIs, as well as any additional information pertinent to accessing the archives. Several tools, such as the Memento Aggregator, directly use the information in the Registry. In light of discussions on the Memento Development list, we are considering creating a curated location where improvements can be submitted by the community.

A new attribute, profile, added to the timegate and timemap elements in the Registry, would allow a client to discover the TimeGate and/or TimeMap providing the type of memento it desires. A fictional enhanced Registry entry for the Icelandic Web Archive is shown below with the new profile attributes in red. Also, information currently provided in the <archive> element would either be deprecated (e.g. un-rewritten-api-url) or relocated (e.g. inside the timegate or timemap elements).

<link id="is" longname="Icelandic Web Archive">
    <timegate uri="http://wayback.vefsafn.is/wayback/" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
    <timegate uri="http://wayback.vefsafn.is/wayback/captured/" redirect="no" profile="http://mementoweb.org/terms/captured"/>
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/augmented" />
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/captured/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/captured" />
    <icon uri="http://vefsafn.is/favicon.ico"/>
    <calendar uri="http://wayback.vefsafn.is/wayback/*/"/>
    <memento uri="http://wayback.vefsafn.is/wayback/*/"/>
    <archive type="snapshot" rewritten-urls="yes" un-rewritten-api-url="http://wayback.vefsafn.is/wayback/{timestamp}id_/{url}" access-policy="public" memento-status="yes"/>
</link>
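A client could select endpoints from such an entry roughly as follows. This is a sketch against the fictional entry above (trimmed to the elements that carry a profile), using only Python's standard library. The function name endpoints_for is hypothetical, and the profile attribute is the proposed addition, not something the current Registry provides.

```python
import xml.etree.ElementTree as ET

CAPTURED = "http://mementoweb.org/terms/captured"
AUGMENTED = "http://mementoweb.org/terms/augmented"

# Fictional Registry entry from this post, trimmed to timegate/timemap elements.
REGISTRY_ENTRY = """
<link id="is" longname="Icelandic Web Archive">
    <timegate uri="http://wayback.vefsafn.is/wayback/" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
    <timegate uri="http://wayback.vefsafn.is/wayback/captured/" redirect="no" profile="http://mementoweb.org/terms/captured"/>
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/augmented"/>
    <timemap uri="http://wayback.vefsafn.is/wayback/timemap/captured/link/" paging-status="2" redirect="no" profile="http://mementoweb.org/terms/captured"/>
</link>
"""


def endpoints_for(entry_xml, profile):
    """Return the TimeGate and TimeMap URIs of an entry that match the given profile."""
    link = ET.fromstring(entry_xml.strip())
    found = {}
    for kind in ("timegate", "timemap"):
        for element in link.findall(kind):
            if element.get("profile") == profile:
                found[kind] = element.get("uri")
    return found


print(endpoints_for(REGISTRY_ENTRY, CAPTURED))
```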

This solution requires no changes to the Memento protocol and allows web archives to satisfy the needs of both end-users and software applications by returning the appropriate memento for each use-case. 
In the case of OpenWayback, this capability should be easy to add. Consider the following example from the Icelandic Archive, running OpenWayback, where the following URIs refer to the mementos of http://www.lanl.gov with a Memento-Datetime of Tue, 17 Nov 2009 13:13:48 GMT:
The memento selected from the archive for the requested datetime, and hence the database interactions, will be the same for augmented and captured mementos. The only difference is the memento URI to which each TimeGate redirects, and that difference is limited to the addition of the string im_ in the captured memento's URI. The additional TimeGate only needs to add this string to its output.
This approach, fully aligned with the Memento protocol, removes the need for client heuristics and allows archives to use syntaxes other than im_ to distinguish between captured and augmented memento URIs. A client that selects a TimeGate or TimeMap of a given nature will continue to receive that type of memento.
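The string manipulation a captured TimeGate would perform on the server side can be sketched as below. It assumes the flag is placed directly after the 14-digit timestamp, as in the {timestamp}id_ pattern of the Registry entry. The function names are hypothetical and this is not OpenWayback code.

```python
import re


def to_captured(memento_uri, flag="im_"):
    """Insert the flag after the first 14-digit timestamp path segment."""
    return re.sub(
        r"/(\d{14})/",
        lambda m: "/" + m.group(1) + flag + "/",
        memento_uri,
        count=1,
    )


def to_augmented(memento_uri):
    """Remove a two-letter flag (such as im_ or id_) that follows the timestamp."""
    return re.sub(r"/(\d{14})[a-z]{2}_/", r"/\1/", memento_uri, count=1)


augmented = "http://wayback.vefsafn.is/wayback/20091117131348/http://www.lanl.gov/news/index.html"
captured = to_captured(augmented)
print(captured)
print(to_augmented(captured) == augmented)
```

Because this logic lives in the archive's TimeGate rather than in each client, an archive that uses a different syntax only has to change its own redirect target.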

Optional Additions


With parallel "augmented" and "captured" Memento protocol support in place, as described above, we have supplied access to different types of mementos. The following section details other optional helpful changes that a client could use to identify and locate different types of mementos.

Self-Describing TimeGates, TimeMaps, and Mementos

TimeGates, TimeMaps, and mementos can self-describe their nature with an HTTP link using a profile relation, defined by RFC 6906, and a link target (Target IRI in the RFC) that indicates their augmented or captured nature.

Example TimeGate response headers implementing this self-describing ability are shown below, with the profile relation specifying the captured nature in red.

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:02:14 GMT
Server: Apache
Vary: accept-datetime
Location: http://arxiv.example.net/web/captured/20010321203610/http://a.example.org/
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/captured/http://a.example.org/>
      ; rel="timemap"; type="application/link-format"
      ; from="Tue, 15 Sep 2000 11:28:26 GMT"
      ; until="Wed, 20 Jan 2010 09:34:33 GMT",
    <http://mementoweb.org/terms/captured>; rel="profile"
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Connection: close
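A client could read the profile relation from such a Link header along the following lines. This is a simplified sketch: the regular expressions cover the header shapes shown in this post, including commas inside quoted datetimes, and are not a complete Web Linking parser. The names parse_link_header and profile_of are hypothetical.

```python
import re

# One link: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Split a Link header value into (target, attributes) pairs."""
    links = []
    for target, params in LINK_RE.findall(value):
        attrs = {}
        for name, quoted, bare in PARAM_RE.findall(params):
            attrs[name.lower()] = quoted or bare
        links.append((target, attrs))
    return links


def profile_of(link_header):
    """Return the target of the first rel="profile" link, or None if there is none."""
    for target, attrs in parse_link_header(link_header):
        if "profile" in attrs.get("rel", "").split():
            return target
    return None


# Link header of the example TimeGate response above, joined into one string.
TIMEGATE_LINK = (
    '<http://a.example.org/>; rel="original", '
    '<http://arxiv.example.net/timemap/captured/http://a.example.org/>'
    '; rel="timemap"; type="application/link-format"'
    '; from="Tue, 15 Sep 2000 11:28:26 GMT"'
    '; until="Wed, 20 Jan 2010 09:34:33 GMT", '
    '<http://mementoweb.org/terms/captured>; rel="profile"'
)
print(profile_of(TIMEGATE_LINK))
```

A client that does not know about the profile relation simply ignores the extra link, which is why existing Memento clients are unaffected.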

Example TimeMap response headers implementing this relation are shown below, again with additions in red describing this TimeMap as listing augmented mementos. The profile link is placed within the Link header so that clients can discard or consume the associated entity based on their needs. The profile link is also included in the TimeMap body so that the TimeMap itself is self-describing.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:06:50 GMT
Server: Apache
Content-Length: 4883
Content-Type: application/link-format
Link: <http://mementoweb.org/terms/augmented>; rel="profile"
Connection: close

    <http://a.example.org>;rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org>
      ; rel="self";type="application/link-format",
    <http://mementoweb.org/terms/augmented>
      ; rel="profile",
    <http://arxiv.example.net/timegate/http://a.example.org>
      ; rel="timegate",
    <http://arxiv.example.net/web/20000620180259/http://a.example.org>
      ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
    <http://arxiv.example.net/web/20091027204954/http://a.example.org>
      ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
    <http://arxiv.example.net/web/20000621011731/http://a.example.org>
      ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
    <http://arxiv.example.net/web/20000621044156/http://a.example.org>
      ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT",
    ...
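Since the profile link is also present in the TimeMap body, a client that has only stored the body can still tell what kind of mementos it lists. The sketch below reads a link-format body shaped like the example above, assuming all attribute values are quoted as they are there. The function name read_timemap is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <target> followed by zero or more ;name="value" attributes.
ENTRY_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

TIMEMAP_BODY = """
    <http://a.example.org>;rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org>
      ; rel="self";type="application/link-format",
    <http://mementoweb.org/terms/augmented>
      ; rel="profile",
    <http://arxiv.example.net/timegate/http://a.example.org>
      ; rel="timegate",
    <http://arxiv.example.net/web/20000620180259/http://a.example.org>
      ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
    <http://arxiv.example.net/web/20091027204954/http://a.example.org>
      ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
    <http://arxiv.example.net/web/20000621011731/http://a.example.org>
      ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
    <http://arxiv.example.net/web/20000621044156/http://a.example.org>
      ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT"
"""


def read_timemap(body):
    """Return (profile URI or None, list of (datetime, memento URI) sorted by datetime)."""
    profile, mementos = None, []
    for target, params in ENTRY_RE.findall(body):
        attrs = dict(ATTR_RE.findall(params))
        rels = attrs.get("rel", "").split()
        if "profile" in rels:
            profile = target
        elif "memento" in rels:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), target))
    return profile, sorted(mementos)


profile, mementos = read_timemap(TIMEMAP_BODY)
print(profile)
for when, uri in mementos:
    print(when.isoformat(), uri)
```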

Finally, a memento can specify whether it is captured or augmented using the same method.  Shown in red in the example below, the headers describe this resource as a captured memento.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:15 GMT
Server: Apache-Coyote/1.1
Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/captured/http://a.example.org/>
      ; rel="timemap"; type="application/link-format",
    <http://arxiv.example.net/timegate/captured/http://a.example.org/>
      ; rel="timegate",
    <http://mementoweb.org/terms/captured>; rel="profile"
Content-Length: 25532
Content-Type: text/html;charset=utf-8
Connection: close

These additional profile relations allow archives to describe the nature of respective TimeGates, TimeMaps, and mementos without affecting existing Memento clients.

Discovery of Other TimeGates and TimeMaps via Mementos

Here we introduce an approach for a client to get from a memento to its corresponding memento of the other type. This capability is useful in its own right, but, as will be shown, it is also a way to get to the other type of TimeGate and TimeMap.

By including another Link relation, a machine client can find the corresponding memento of another type.  Shown below, we build upon our previous example memento headers and add this new relation, marked in red, allowing clients to find this captured memento's augmented counterpart. Here a profile attribute is used with the memento relation type in order to indicate the type of memento found at the link target. This profile attribute has been requested as part of "Signposting the Scholarly Web", and is provided by a proposed update to a draft RFC detailing "link hints". This proposed update has been informally accepted by the RFC's author.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:15 GMT
Server: Apache-Coyote/1.1
Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/captured/http://a.example.org/>
      ; rel="timemap"; type="application/link-format",
    <http://arxiv.example.net/timegate/captured/http://a.example.org/>
      ; rel="timegate",
    <http://mementoweb.org/terms/captured>; rel="profile",
    <http://arxiv.example.net/web/20010321203610/http://a.example.org/>
      ; rel="memento"; profile="http://mementoweb.org/terms/augmented"
Content-Length: 25532
Content-Type: text/html;charset=utf-8
Connection: close

From there, a client can follow the link target to the augmented memento. In the example below, we have the headers for the corresponding augmented memento.  The Memento protocol already provides the associated timegate and timemap relations, shown in bold.  A client uses these relations to discover the TimeGate/TimeMap that serves this memento, and, of course, the TimeGate/TimeMap have the same augmented nature as this memento. Note that this augmented memento also links to its captured counterpart.

HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:16 GMT
Server: Apache-Coyote/1.1
Memento-Datetime: Wed, 21 Mar 2001 20:36:10 GMT
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org/>
      ; rel="timemap"; type="application/link-format",
    <http://arxiv.example.net/timegate/http://a.example.org/>
      ; rel="timegate",
    <http://mementoweb.org/terms/augmented>; rel="profile",
    <http://arxiv.example.net/web/captured/20010321203610/http://a.example.org/>
      ; rel="memento"; profile="http://mementoweb.org/terms/captured"
Content-Length: 25532
Content-Type: text/html;charset=utf-8
Connection: close

Now the client can make future requests to this TimeGate and receive responses like the one below, finding additional augmented mementos for the original resource.

HTTP/1.1 302 Found
Date: Thu, 21 Jan 2010 00:02:17 GMT
Server: Apache
Vary: accept-datetime
Location: http://arxiv.example.net/web/20100424131422/http://a.example.org/
Link: <http://a.example.org/>; rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org/>
      ; rel="timemap"; type="application/link-format"
      ; from="Tue, 15 Sep 2000 11:28:26 GMT"
      ; until="Wed, 20 Jan 2010 09:34:33 GMT",
    <http://mementoweb.org/terms/augmented>; rel="profile"
Content-Length: 0
Content-Type: text/plain; charset=UTF-8
Connection: close

Likewise, a client can issue a request to the associated TimeMap to access augmented mementos for this resource. Of course, this process can start from an augmented memento and lead a client to the TimeGate/TimeMap for its captured counterpart as well.
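The discovery walk described in this section can be sketched without any network access by working on the Link header values from the examples. The code assumes the proposed profile attribute on the memento relation and quoted attribute values. The function names counterpart and services are hypothetical, and the augmented memento's header is trimmed to the links needed here.

```python
import re

CAPTURED = "http://mementoweb.org/terms/captured"
AUGMENTED = "http://mementoweb.org/terms/augmented"

LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def links(header):
    """Split a Link header value into (target, attributes) pairs."""
    return [(t, dict(ATTR_RE.findall(p))) for t, p in LINK_RE.findall(header)]


def counterpart(header, wanted_profile):
    """Find the rel="memento" link whose profile attribute matches wanted_profile."""
    for target, attrs in links(header):
        if "memento" in attrs.get("rel", "").split() and attrs.get("profile") == wanted_profile:
            return target
    return None


def services(header):
    """Return the TimeGate and TimeMap URIs advertised by a memento."""
    found = {}
    for target, attrs in links(header):
        for rel in attrs.get("rel", "").split():
            if rel in ("timegate", "timemap"):
                found[rel] = target
    return found


# Step 1: Link header of the captured memento, pointing at its augmented counterpart.
CAPTURED_MEMENTO_LINK = (
    '<http://a.example.org/>; rel="original", '
    '<http://arxiv.example.net/timemap/captured/http://a.example.org/>'
    '; rel="timemap"; type="application/link-format", '
    '<http://arxiv.example.net/timegate/captured/http://a.example.org/>; rel="timegate", '
    '<http://mementoweb.org/terms/captured>; rel="profile", '
    '<http://arxiv.example.net/web/20010321203610/http://a.example.org/>'
    '; rel="memento"; profile="http://mementoweb.org/terms/augmented"'
)
# Step 2: Link header of that augmented memento (trimmed for this sketch).
AUGMENTED_MEMENTO_LINK = (
    '<http://a.example.org/>; rel="original", '
    '<http://arxiv.example.net/timemap/http://a.example.org/>'
    '; rel="timemap"; type="application/link-format", '
    '<http://arxiv.example.net/timegate/http://a.example.org/>; rel="timegate", '
    '<http://mementoweb.org/terms/augmented>; rel="profile"'
)

print(counterpart(CAPTURED_MEMENTO_LINK, AUGMENTED))
print(services(AUGMENTED_MEMENTO_LINK))
```

In a real client, each step would be an HTTP request to the URI found in the previous step; only the header handling is shown here.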

Conclusion


The "captured" and "augmented" parallel Memento implementations address the problem of accessing different types of mementos in a standards-based manner.  Given that the selected memento will be the same for both the captured and augmented cases and the difference will only be in the access mechanism (URI), the solution seems straightforward to implement for web archives. Existing clients will continue to function as is, and clients desiring a specific type of memento can use the Archive Registry to find the resources that support that type of memento.

In addition, the optional profile and discovery links add further value, allowing clients to identify which type of memento they have acquired and to access the other types of mementos that are available.

We look forward to feedback on this proposed solution.

--
Shawn M. Jones
- and -
Herbert Van de Sompel
- and -
Michael L. Nelson

Acknowledgements: Ilya Kremer also contributed to the initial discussion of the need for a standard method of accessing captured mementos.
