Tuesday, August 30, 2016

2016-08-30: Memento at the W3C

We are pleased to report that the W3C has embraced Memento for versioning its specifications and its wiki. Completing this effort required collaboration between the W3C and the Los Alamos National Laboratory (LANL) Research Library Prototyping Team. Here we give a brief history of this effort and an overview of the technical work done to bring Memento to the W3C.

Brief History of Memento Work with the W3C

The W3C uses Memento for two separate systems: its specifications and its wiki.
Memento was implemented on both of these systems in 2016, but there were a lot of discussions and changes in direction along the way.
In 2010, Herbert Van de Sompel presented Memento as part of the Linked Data on the Web Workshop (LDOW) at WWW. The presentation was met with much enthusiasm. In fact, Sir Tim Berners-Lee stated "this is neat and there is a real need for it". Later, he met with Herbert to suggest that Memento could be used on the W3C site itself, specifically for time-based access to W3C specifications.
That same year, Harihar Shankar had finished the first working version of the Memento MediaWiki Extension. Ted Guild of the W3C installed this extension on their wiki for easy access to prior versions of pages.
At the time, the W3C kept their specifications in CVS. LANL and the W3C began discussions about how to use Memento with their CVS system and other associated web server software. This attempt ran into problems due to permissions issues and other concerns.
Fast forward to 2013, when Shawn Jones had joined the ODU Web Science and Digital Libraries Research Group. At this point, attempts to get the Memento MediaWiki Extension installed at Wikipedia had stalled. The extension had also ceased working with the version of MediaWiki then being used at the W3C. Shawn updated the extension, analyzing different design options, and evaluating their performance. He enlisted support from the MediaWiki development team in hopes that it would be acceptable for deployment at Wikipedia. Version 2.0.0 was released in 2014.
By 2014 Yorick Chollet had joined the LANL Prototyping Team. As part of work with the W3C, Yorick produced standalone TimeGate software that could be installed and run by anyone. The W3C had also started work on a web API for their specifications. The decision was made by both groups to develop the TimeGate as a microservice that would provide a Memento interface to the W3C API.
In 2015, Herbert notified the W3C that the latest version of the Memento MediaWiki Extension was available. After some planned updates to the W3C infrastructure, the updated extension was installed in January of 2016, restoring Memento support on their wiki.
By that time the W3C specifications API was nearing completion. Harish and Herbert collaborated with José Kahan at the W3C to ensure that the W3C TimeGate microservice worked with the API. Once testing was complete, the W3C added the Memento-Datetime header and updated Link headers to their resources in order to reference the new TimeGate. At the same time the W3C moved services to HTTPS, requiring HTTPS to be implemented at the TimeGate as well. Now both the W3C specifications and the W3C wiki use Memento.

Details of Memento Support for W3C Specifications

Work on Memento for the W3C Specifications entailed coordination between three components: the W3C Apache Web server, the W3C API, and the Memento TimeGate microservice.
The diagram below provides an overview of the architecture of the Memento TimeGate microservice. The TimeGate accepts the Accept-Datetime header from Memento clients via HTTP. It then queries the W3C API using an API Handler. The result of that query is then used to discover the best revision of a specification that was active at the datetime expressed in the Accept-Datetime Header.

To demonstrate how these components work together, we will walk through Memento datetime negotiation using the specification for HTML 5 at URI-R https://www.w3.org/TR/html5/ and an Accept-Datetime value of Sat, 24 Apr 2010 15:00:00 GMT.
As shown in the curl request below, the W3C Apache Web server produces the appropriate TimeGate Link header for original resources. Memento clients use the timegate relation in this Link header to discover the URI-G of the TimeGate for this resource.
# curl -I "https://www.w3.org/TR/html5/"
HTTP/1.1 200 OK
Date: Fri, 05 Aug 2016 20:41:42 GMT
Last-Modified: Fri, 24 Oct 2014 16:15:24 GMT
ETag: "20acd-5062d7cffff00"
Accept-Ranges: bytes
Content-Length: 133837
Cache-Control: max-age=31536000
Expires: Sat, 05 Aug 2017 20:41:42 GMT
P3P: policyref="http://www.w3.org/2014/08/p3p.xml"
Link: <https://timetravel.mementoweb.org/w3c/timegate/https://www.w3.org/TR/html5/>;rel="timegate"
Access-Control-Allow-Origin: *
Content-Type: text/html; charset=utf-8
Strict-Transport-Security: max-age=15552000; includeSubdomains; preload
Content-Security-Policy: upgrade-insecure-requests
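A client has to pull the URI-G out of the Link header before it can continue. The following is a minimal Python sketch of that step, not the code used by any Memento client; the function names `parse_link_header` and `find_relation` are hypothetical, and the parser assumes every link parameter value is double-quoted, as in the responses shown in this post.

```python
import re

# One link-value: <uri> followed by zero or more ;name="value" parameters.
# Assumption: parameter values are always double-quoted.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_header(value):
    """Return a list of (uri, params) pairs from a Link header value."""
    links = []
    for uri, raw_params in LINK_VALUE.findall(value):
        links.append((uri, dict(PARAM.findall(raw_params))))
    return links


def find_relation(value, relation):
    """Return the first URI whose rel attribute contains `relation`, or None."""
    for uri, params in parse_link_header(value):
        # rel may hold several space-separated relation types.
        if relation in params.get("rel", "").split():
            return uri
    return None


link = ('<https://timetravel.mementoweb.org/w3c/timegate/'
        'https://www.w3.org/TR/html5/>;rel="timegate"')
uri_g = find_relation(link, "timegate")
print(uri_g)
```

Matching on quoted parameters rather than splitting on commas matters here, because datetime values such as "Tue, 22 Jan 2008 00:00:00 GMT" contain commas themselves.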
To continue datetime negotiation, a Memento client would then issue an HTTP request like the one below to this TimeGate - maintained by LANL.
HEAD /w3c/timegate/http://www.w3.org/TR/html5/ HTTP/1.1
Host: timetravel.mementoweb.org
Accept-Datetime: Sat, 24 Apr 2010 15:00:00 GMT
Connection: close
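As a rough illustration, a request like the one above can be prepared with the Python standard library. This is only a sketch: `build_timegate_request` is a hypothetical helper, it assumes the caller passes a timezone-aware datetime, and it builds the request without sending it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(uri_g, when):
    """Return an unsent HEAD request for a TimeGate (hypothetical helper).

    `when` is assumed to be a timezone-aware datetime.
    """
    # Accept-Datetime uses the HTTP-date form ending in a literal "GMT";
    # format_datetime(..., usegmt=True) produces that form for UTC datetimes.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(uri_g, method="HEAD",
                   headers={"Accept-Datetime": accept_datetime})


request = build_timegate_request(
    "https://timetravel.mementoweb.org/w3c/timegate/https://www.w3.org/TR/html5/",
    datetime(2010, 4, 24, 15, 0, 0, tzinfo=timezone.utc),
)
print(request.get_method(), request.full_url)
# urllib stores header names capitalized, hence the key "Accept-datetime".
print(request.get_header("Accept-datetime"))
```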
The Memento TimeGate microservice extracts the shortname from the original URI, html5 in this case. It then queries the W3C API for this shortname directly, receiving a JSON response like the abridged one below. This response contains a version history of the specification.
... ABRIDGED FOR BREVITY - SALIENT PARTS BELOW ...
"_embedded": {
  "version-history": [
    {
      "status": "Recommendation",
      "uri": "http:\/\/www.w3.org\/TR\/2014\/REC-html5-20141028\/",
      "date": "2014-10-28",
      "informative": false,
      "title": "HTML5",
      "shortlink": "http:\/\/www.w3.org\/TR\/html5\/",
      "editor-draft": "http:\/\/www.w3.org\/html\/wg\/drafts\/html\/master\/",
      "process-rules": "http:\/\/www.w3.org\/2005\/10\/Process-20051014\/",
      "_links": {
        "self": {
          "href": "https:\/\/api.w3.org\/specifications\/html5\/versions\/20141028"
        },
        "editors": {
          "href": "https:\/\/api.w3.org\/specifications\/html5\/versions\/20141028\/editors"
        },
        "deliverers": {
          "href": "https:\/\/api.w3.org\/specifications\/html5\/versions\/20141028\/deliverers"
        },
        "specification": {
          "href": "https:\/\/api.w3.org\/specifications\/html5"
        },
        "predecessor-version": {
          "href": "https:\/\/api.w3.org\/specifications\/html5\/versions\/20141028\/predecessors"
        }
      }
    },
... MULTIPLE OTHER VERSIONS FOLLOW - ABRIDGED FOR BREVITY ...
From this JSON response, the TimeGate looks for the version-history array inside the _embedded object. From each entry in that array, it then extracts the uri and date. It then compares the value of the HTTP request's Accept-Datetime header with the URIs and dates from this version history to find the URI-M of the best memento that was active at the Accept-Datetime value.
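The selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the TimeGate's actual code: `shortname_from_uri` and `best_version` are hypothetical names, each date is treated as midnight UTC, and falling back to the earliest version when the requested datetime predates the whole history is our assumption. The sample list is hand-built from three of the HTML5 versions listed in this post, using the same uri and date keys as the API response.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def shortname_from_uri(uri_r):
    """Extract the shortname, assuming a URI-R like https://www.w3.org/TR/<shortname>/."""
    return uri_r.rstrip("/").rsplit("/", 1)[-1]


def best_version(version_history, accept_datetime):
    """Return the URI-M of the version that was active at accept_datetime."""
    target = parsedate_to_datetime(accept_datetime)
    candidates = []
    for entry in version_history:
        # Assumption: the API's "date" is a calendar day; use midnight UTC.
        published = datetime.strptime(entry["date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc)
        candidates.append((published, entry["uri"]))
    if not candidates:
        return None
    candidates.sort()
    # The active version is the latest one published on or before the target.
    active = [uri for published, uri in candidates if published <= target]
    # Assumption: before the first version, fall back to the first version.
    return active[-1] if active else candidates[0][1]


# Hand-built sample in the shape of the "version-history" array.
history = [
    {"uri": "http://www.w3.org/TR/2010/WD-html5-20100624/", "date": "2010-06-24"},
    {"uri": "http://www.w3.org/TR/2009/WD-html5-20090825/", "date": "2009-08-25"},
    {"uri": "http://www.w3.org/TR/2010/WD-html5-20100304/", "date": "2010-03-04"},
]
print(shortname_from_uri("https://www.w3.org/TR/html5/"))
print(best_version(history, "Sat, 24 Apr 2010 15:00:00 GMT"))
```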
In the case of our example, the datetime requested is Sat, 24 Apr 2010 15:00:00 GMT. Using the version history from the W3C API, the TimeGate discovers that the URI-M of the best memento that was active at the Accept-Datetime value is http://www.w3.org/TR/2010/WD-html5-20100304/. This URI-M is then used as the value of the Location header of the TimeGate's response. Because the TimeGate has access to the entire version history, it easily generates additional Link relations in its response, filling in the first and last relations in addition to the URI of the timemap. The TimeGate's full response, including the Location and Link headers, is shown below.
# curl -I -H 'Accept-Datetime: Sat, 24 Apr 2010 15:00:00 GMT' 'https://timetravel.mementoweb.org/w3c/timegate/https://www.w3.org/TR/html5/'
HTTP/1.1 302 Found
Server: nginx/1.8.0
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
Connection: keep-alive
Date: Fri, 05 Aug 2016 21:18:29 GMT
Vary: accept-datetime
Location: http://www.w3.org/TR/2010/WD-html5-20100304/
Link: <http://www.w3.org/TR/html5/>; rel="original",
  <https://timetravel.mementoweb.org/w3c/timemap/link/http://www.w3.org/TR/html5/>; rel="timemap"; type="application/link-format",
  <https://timetravel.mementoweb.org/w3c/timemap/json/http://www.w3.org/TR/html5/>; rel="timemap"; type="application/json",
  <http://www.w3.org/TR/2008/WD-html5-20080122/>; rel="first memento"; datetime="Tue, 22 Jan 2008 00:00:00 GMT",
  <http://www.w3.org/TR/2010/WD-html5-20100304/>; rel="memento"; datetime="Thu, 04 Mar 2010 00:00:00 GMT",
  <http://www.w3.org/TR/2014/REC-html5-20141028/>; rel="last memento"; datetime="Tue, 28 Oct 2014 00:00:00 GMT"
A Memento client would then interpret the HTTP 302 status code as a redirect and make a subsequent request to the URI-M from the Location header. In the response, the W3C Apache Web server provides the Memento-Datetime header, identifying this resource as a memento. Also provided are the timegate and original relations in the Link header, so further datetime negotiation can occur if necessary.
# curl -I "http://www.w3.org/TR/2010/WD-html5-20100304/"
HTTP/1.1 200 OK
Date: Fri, 05 Aug 2016 21:19:07 GMT
Last-Modified: Tue, 08 Feb 2011 20:10:44 GMT
Memento-Datetime: Tue, 08 Feb 2011 20:10:44 GMT
ETag: "1d74a-49bcaf17c5900"
Accept-Ranges: bytes
Content-Length: 120650
Cache-Control: max-age=31536000
Expires: Sat, 12 Aug 2017 14:31:18 GMT
P3P: policyref="http://www.w3.org/2014/08/p3p.xml"
Link: <https://timetravel.mementoweb.org/w3c/timegate/http://www.w3.org/TR/html5/>;rel="timegate",
  <http://www.w3.org/TR/html5/>;rel="original"
Vary: upgrade-insecure-requests
Access-Control-Allow-Origin: *
Content-Type: text/html; charset=utf-8
From this example, we see that datetime negotiation is now possible for W3C specifications, allowing users to find prior versions of any W3C specification using a given datetime. As seen in the datetime negotiation example above and in the link relations diagram below, the relations in the Link header make this possible, even though LANL maintains the TimeGate and the W3C maintains the original resource (current version of the specification) and the mementos (past versions of the specification).



And, of course, TimeMaps work as well, with a TimeMap microservice using the W3C API to find the version history of the page. An example TimeMap is shown below.
# curl 'https://timetravel.mementoweb.org/w3c/timemap/link/https://www.w3.org/TR/html5/'
<https://www.w3.org/TR/html5/>; rel="original",
<https://timetravel.mementoweb.org/w3c/timegate/https://www.w3.org/TR/html5/>; rel="timegate",
<https://timetravel.mementoweb.org/w3c/timemap/link/https://www.w3.org/TR/html5/>; rel="self"; type="application/link-format",
<https://timetravel.mementoweb.org/w3c/timemap/json/https://www.w3.org/TR/html5/>; rel="timemap"; type="application/json",
<http://www.w3.org/TR/2008/WD-html5-20080122/>; rel="first memento"; datetime="Tue, 22 Jan 2008 00:00:00 GMT",
<http://www.w3.org/TR/2008/WD-html5-20080610/>; rel="memento"; datetime="Tue, 10 Jun 2008 00:00:00 GMT",
<http://www.w3.org/TR/2009/WD-html5-20090212/>; rel="memento"; datetime="Thu, 12 Feb 2009 00:00:00 GMT",
<http://www.w3.org/TR/2009/WD-html5-20090423/>; rel="memento"; datetime="Thu, 23 Apr 2009 00:00:00 GMT",
<http://www.w3.org/TR/2009/WD-html5-20090825/>; rel="memento"; datetime="Tue, 25 Aug 2009 00:00:00 GMT",
<http://www.w3.org/TR/2010/WD-html5-20100304/>; rel="memento"; datetime="Thu, 04 Mar 2010 00:00:00 GMT",
<http://www.w3.org/TR/2010/WD-html5-20100624/>; rel="memento"; datetime="Thu, 24 Jun 2010 00:00:00 GMT",
<http://www.w3.org/TR/2010/WD-html5-20101019/>; rel="memento"; datetime="Tue, 19 Oct 2010 00:00:00 GMT",
<http://www.w3.org/TR/2011/WD-html5-20110113/>; rel="memento"; datetime="Thu, 13 Jan 2011 00:00:00 GMT",
<http://www.w3.org/TR/2011/WD-html5-20110405/>; rel="memento"; datetime="Tue, 05 Apr 2011 00:00:00 GMT",
<http://www.w3.org/TR/2011/WD-html5-20110525/>; rel="memento"; datetime="Wed, 25 May 2011 00:00:00 GMT",
<http://www.w3.org/TR/2012/WD-html5-20120329/>; rel="memento"; datetime="Thu, 29 Mar 2012 00:00:00 GMT",
<http://www.w3.org/TR/2012/WD-html5-20121025/>; rel="memento"; datetime="Thu, 25 Oct 2012 00:00:00 GMT",
<http://www.w3.org/TR/2012/CR-html5-20121217/>; rel="memento"; datetime="Mon, 17 Dec 2012 00:00:00 GMT",
<http://www.w3.org/TR/2014/CR-html5-20140429/>; rel="memento"; datetime="Tue, 29 Apr 2014 00:00:00 GMT",
<http://www.w3.org/TR/2014/WD-html5-20140617/>; rel="memento"; datetime="Tue, 17 Jun 2014 00:00:00 GMT",
<http://www.w3.org/TR/2014/CR-html5-20140731/>; rel="memento"; datetime="Thu, 31 Jul 2014 00:00:00 GMT",
<http://www.w3.org/TR/2014/PR-html5-20140916/>; rel="memento"; datetime="Tue, 16 Sep 2014 00:00:00 GMT",
<http://www.w3.org/TR/2014/REC-html5-20141028/>; rel="last memento"; datetime="Tue, 28 Oct 2014 00:00:00 GMT"
Contrast this TimeMap of 19 versions with the 1,243 observations made by the Internet Archive for the same page. If studying the evolution of a standard, 19 explicit versions are easier to work with than more than 1,000 observations, many of which are for the same version.
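For readers who want to work with such a TimeMap programmatically, here is a small Python sketch that extracts the mementos and their datetimes from a link-format body. The name `mementos_from_timemap` is hypothetical, the parser assumes double-quoted parameter values as in the TimeMap above, and the sample body is a three-entry excerpt rather than the full response.

```python
import re
from email.utils import parsedate_to_datetime

# Assumption: every parameter value is double-quoted, as in the TimeMap above.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def mementos_from_timemap(body):
    """Return a sorted list of (datetime, uri_m) pairs from a link-format TimeMap."""
    mementos = []
    for uri, raw_params in LINK_VALUE.findall(body):
        params = dict(PARAM.findall(raw_params))
        # "first memento" and "last memento" also carry the memento relation.
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)


# Excerpt of the TimeMap shown above.
timemap = '''<https://www.w3.org/TR/html5/>; rel="original",
<http://www.w3.org/TR/2008/WD-html5-20080122/>; rel="first memento"; datetime="Tue, 22 Jan 2008 00:00:00 GMT",
<http://www.w3.org/TR/2010/WD-html5-20100304/>; rel="memento"; datetime="Thu, 04 Mar 2010 00:00:00 GMT",
<http://www.w3.org/TR/2014/REC-html5-20141028/>; rel="last memento"; datetime="Tue, 28 Oct 2014 00:00:00 GMT"'''

versions = mementos_from_timemap(timemap)
for when, uri_m in versions:
    print(when.date().isoformat(), uri_m)
```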

Details of Memento Support on the W3C Wiki

The W3C is also running the full Memento MediaWiki Extension on their wiki. The full Memento MediaWiki Extension provides TimeGates and TimeMaps as well as other additional information in the Link headers of its HTTP responses. Shown below is an example HTTP response for the original resource https://www.w3.org/wiki/HTML/Elements/link.
# curl -I "https://www.w3.org/wiki/HTML/Elements/link"
HTTP/1.1 200 OK
X-Powered-By: PHP/5.4.45-0+deb7u4
X-Content-Type-Options: nosniff
Link: <https://www.w3.org/wiki/HTML/Elements/link>; rel="original latest-version",
  <https://www.w3.org/wiki/Special:TimeGate/HTML/Elements/link>; rel="timegate",
  <https://www.w3.org/wiki/Special:TimeMap/HTML/Elements/link>; rel="timemap"; type="application/link-format"; from="Mon, 14 Mar 2011 19:25:12 GMT"; until="Thu, 21 Jul 2011 22:24:53 GMT",
  <https://www.w3.org/wiki/index.php?title=HTML/Elements/link&oldid=48683>; rel="first memento"; datetime="Mon, 14 Mar 2011 19:25:12 GMT",
  <https://www.w3.org/wiki/index.php?title=HTML/Elements/link&oldid=52749>; rel="last memento"; datetime="Thu, 21 Jul 2011 22:24:53 GMT"
Content-language: en
Vary: Accept-Encoding,Cookie
Cache-Control: s-maxage=18000, must-revalidate, max-age=0
Last-Modified: Wed, 03 Aug 2016 04:40:32 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 24053
Accept-Ranges: bytes
Date: Wed, 03 Aug 2016 19:27:11 GMT
X-Varnish: 877421307 877181026
Age: 35199
Via: 1.1 varnish
X-Cache: HIT
Strict-Transport-Security: max-age=15552000; includeSubdomains; preload
Content-Security-Policy: upgrade-insecure-requests
Content-Security-Policy-Report-Only: default-src *.w3.org; img-src *.w3.org data:; style-src *.w3.org 'unsafe-inline'; script-src *.w3.org 'unsafe-inline'; frame-ancestors *.w3.org; report-uri https://www.w3.org/csp-report/29ce9kZ/wro
For prior versions of the same resource, the Memento-Datetime and Link headers are returned as well.
# curl -I "https://www.w3.org/wiki/index.php?title=HTML/Elements/link&oldid=52749"
HTTP/1.1 200 OK
X-Powered-By: PHP/5.4.45-0+deb7u4
X-Content-Type-Options: nosniff
Memento-Datetime: Thu, 21 Jul 2011 22:24:53 GMT
Link: <https://www.w3.org/wiki/HTML/Elements/link>; rel="original latest-version",
  <https://www.w3.org/wiki/Special:TimeGate/HTML/Elements/link>; rel="timegate",
  <https://www.w3.org/wiki/Special:TimeMap/HTML/Elements/link>; rel="timemap"; type="application/link-format"; from="Mon, 14 Mar 2011 19:25:12 GMT"; until="Thu, 21 Jul 2011 22:24:53 GMT",
  <https://www.w3.org/wiki/index.php?title=HTML/Elements/link&oldid=48683>; rel="first memento"; datetime="Mon, 14 Mar 2011 19:25:12 GMT",
  <https://www.w3.org/wiki/index.php?title=HTML/Elements/link&oldid=52749>; rel="last memento"; datetime="Thu, 21 Jul 2011 22:24:53 GMT"
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8
Content-Length: 24966
Accept-Ranges: bytes
Date: Sat, 06 Aug 2016 19:12:58 GMT
X-Varnish: 878886405
Age: 0
Via: 1.1 varnish
X-Cache: MISS
Strict-Transport-Security: max-age=15552000; includeSubdomains; preload
Content-Security-Policy: upgrade-insecure-requests
Content-Security-Policy-Report-Only: default-src *.w3.org; img-src *.w3.org data:; style-src *.w3.org 'unsafe-inline'; script-src *.w3.org 'unsafe-inline'; frame-ancestors *.w3.org; report-uri https://www.w3.org/csp-report/29ce9kZ/wro
For more information on the extension, we suggest consulting those resources, as well as its GitHub and MediaWiki sites.

Conclusions

Since its inception, we have identified many use cases for Memento, from reconstructing web pages from many existing archives to avoiding spoilers in fiction to managing the temporal nature of semantic web data. We are happy that the W3C has adopted Memento for use in their work as well.
Even though the W3C maintains the Apache server holding mementos and original resources, and LANL maintains the systems running the W3C TimeGate software, it is the relations within the Link headers that tie everything together. It is an excellent example of the harmony possible with meaningful Link headers. Memento allows users to negotiate in time with a single web standard, making web archives, semantic web resources, and now W3C specifications all accessible the same way. Memento provides a standard alternative to a series of implementation-specific approaches.
We have been trying to bring Memento support to Wikipedia for the past few years, demonstrating the technology at conferences, working with their development team, and even getting direct feedback on the software from MediaWiki developers such as LegoTKM, Jeroen De Dauw, and ricordisamoa. Unfortunately, our discussions about deployment to Wikipedia have so far been unsuccessful. Perhaps they can be our next major adopter?

--
Herbert Van de Sompel
- and -
Harihar Shankar
- and -

Thursday, August 25, 2016

2016-08-25: Documenting the Now Advisory Board Meeting Trip Report

On August 21-23, 2016, I attended the Advisory Board Meeting for the Documenting the Now (DocNow) project at Washington University in St. Louis.  The DocNow project, funded by the Andrew Mellon Foundation, "aims to collect, archive, and provide access to social media feeds chronicling historically significant events, particularly concerning social justice."  In practice, this means providing a friendly interface for interacting with trending events on Twitter (e.g., #BlackLivesMatter and affiliated hashtags).  This is significant because tools like twarc (created by Ed Summers, the technical lead for DocNow), a widely used Twitter archiving command line tool, are beyond the reach of non-expert users.

DocNow has a strong project team and a diverse advisory board, of which I am honored to be a member.  The team has been pretty active on GitHub, Slack, Twitter, etc., but those are no substitute for an extended f2f meeting.

The day began on the 22nd with a welcome and a contextualization for DocNow by the first panel (Jessica Johnson, Mark Anthony Neal, Sarah Jackson).  The sessions were recorded and will be released within the next week or two (video), so I won't try to completely reconstruct the discussion here, but some of the highlights that I noted include: 1) archives are necessary to create the context in which to evaluate content (the example was #FreeWakaFlocka being confused as a Sesame Street reference), 2) real-time self-reflection / self-awareness of Twitter being a communications channel and archival record, and 3) a preview of the ethics involved in processing personal redaction / take down requests.  Some of the resources I noted were: Research Ethics for Students and Teachers: Social Media in the Classroom, Hijacking #myNYPD: Social Media Dissent and Networked Counterpublics, and African American celebrity dissent and a tale of two public spheres: a critical and comparative analysis of the mainstream and black press, 1949-2005.

Panel #2 featured the personal reflections of activists Reuben Riggs, Kayla Reed, Alexis Templeton, and Rasheen Aldridge, expertly moderated by Jonathan Fenderson.  I'm certainly not going to try to summarize their compelling contributions -- you really need to watch the video.  One resource I noted was the story of the Palestinian woman giving notes to Ferguson protesters about how to deal with tear gas.  I also noted that the activists' use of social media was, at least initially, not entirely focused on Twitter.  This has implications because as researchers, we tend to focus on Twitter exclusively, largely because it's the easiest to interact with.




Panel 3 (Yvonne Ng, Stacie Williams, Alexandra Dolan-Mescal, Dexter Thomas) resumed the discussion of ethics from the end of Panel 1 (video).  Yvonne worked through a set of examples about archivists / reports including videos (e.g., from YouTube) with PII (see: Ethical Guidelines for Using Video in Human Rights Reporting).  The mood in the room at the time was definitely trending to protecting / anonymizing.  I asked the question of how to reconcile this level of editing with the guidance from Panel 2, which included (in so many words) "be sure to document everything, including the ugly".  I don't think we really successfully addressed this question.  Stacie covered the story of aggregating various #WhatIWasWearing tweets and getting consent from the authors.  Dexter echoed the issue of consent, drawing from his experience at the LA Times.  Alexandra even went as far as saying "it's a surveillance tool", and questioned the archiving process in general.

I was on Panel 4, along with Brooke Foucault-Welles and Deen Freelon (video).  I went last and was so focused on my upcoming presentation that my notes for my co-panelists are uneven.  Deen discussed some of his open source tools, and briefly mentioned the problem of disappearing tweets.  I did write down Brooke's closing three points: 1) "data storage is cheap, data usability is expensive" (with some stories of her "data wrangling"), 2) "tradeoff between parsimonious vs. inclusivity", which summarizes nicely as the "stegosaurus problem" -- apparently they were relatively rare but preserved well, and 3) "diversifying data", including the context of the larger platform itself and the observation that the Twitter of 2009 is not the same as the Twitter of 2014.

I talked about why we need multiple, independent web archives.




Panel 5 had Brian Deitz, Jarrett Drake, Natalie Baur, and Samantha Abrams, discussing documenting a community (video).  Samantha discussed her work as a "guerilla archivist", quasi-officially archiving #theRealUW (see her blog post "On establishing a web archiving platform").  Brian echoed some of the same points, and contrasted #ChapelHillShooting vs. #Our3Winners.  Natalie discussed creating an archive around the time the US normalized relations with Cuba, and Jarrett discussed #OccupyNassau.

The final panel of the day featured Sylvie Rollason-Cass, Ilya Kreymer, Matt Phillips, and Nicholas Taylor (video).  Ilya gave a demo of webrecorder.io, and I believe everyone else had slides even though I can't find them: Sylvie covered the range of services and projects from Archive-It, Matt reviewed Perma.cc and other projects at LIL, and Nicholas talked about the WASAPI project.

The second day was a half day, and wasn't recorded.  Alexandra led us in a User Story Map exercise in an effort to further flesh out user requirements.  She had four user types defined (I didn't write them down), but there was discussion about adding a fifth: the "authority" persona that would use the archive to expose and punish the participants.





We concluded the day with Dan Chudnov giving a short demo of the current tool.  I won't really go into details since it is likely to change significantly (they were adamant about it being an early discussion piece), but it is far ahead of tools like twarc for supporting guided exploration.  2016-09-01 Edit: a temporary prototype of the tool is now available:




I think the meeting was very successful, and I'm grateful to the organizers (Desiree Jones-Smith, Bergis Jules, et al.) for including me on the Advisory Board and inviting me to St. Louis.  I'll add the video links when they're uploaded (2016-09-12 edit: 4 video links added), and in the meantime you can rewind the #docnowcommunity hashtag to get a feel for the many things I missed (Samantha is keeping a list of resources shared over #docnowcommunity).

--Michael

2016-08-25: Two WS-DL Classes Offered for Fall 2016


Two Web Science & Digital Library (WS-DL) courses will be offered in Fall 2016: CS 418/518 and CS 734/834.
Note that Dr. Michele Weigle is not teaching this semester.  Obviously there is demand for CS 418/518, but if you're considering CS 734/834 you might be interested in this student's quote from a recent exit exam:
[and] Dr. Nelson’s Information Retrieval course are the two which I feel have prepared me most for job interviews and work in the working world of computer science.
We're not yet sure what WS-DL courses will be offered in Spring 2017, so take advantage of these offerings in the Fall.

--Michael

Monday, August 15, 2016

2016-08-15: Mementos In the Raw, Take Two


In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives.
Ilya Kreymer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued, because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC7240. Prefer can solve this problem in an intuitive way, much like their original suggestion of content negotiation.
To recap, most web archives augment mementos when presenting them to the user, often for usability or legal purposes. The figures below show examples of these augmentations.

Figure 1: The PRONI web archive augments mementos for user experience; augmentations outlined in red

Figure 2: The UK National Archives adds additional text and a banner to differentiate their mementos from their live counterparts, because their mementos appear in Google search results
Additionally, some archives rewrite links to allow navigation within an archive. This way the end user can visit other pages within the same archive from the same time period. Smaller archives, because of the size of their collections, do not benefit as much from these rewritten links. Of course, for Memento users, these rewritten links are not really required.
In many cases, access to the original, unaltered content is needed. This is, for example, the case for some research studies that require the original HTTP response headers and the original unaltered content. Unaltered content is also needed to replay the original web content in projects like oldweb.today and the TimeTravel's Reconstruct feature.
The previously proposed solution was based on the use of two TimeGates, one to access augmented content (which is the current default) and an additional one to access unaltered content. In this post, we discuss a less complex method of acquiring raw mementos. This solution provides a standard way to request raw mementos, regardless of web archive software or configuration, and eliminates the need for archive-specific or software-specific heuristics.
The raw-ness of a memento exists in several dimensions, and the level of raw-ness that is required depends on the nature of the application:
  1. No augmented content - The memento should contain no additional HTML, JavaScript, CSS, or text added for usability or any other purpose. Its content should exist as it did on the web at the moment it was captured by the web archive.
  2. No rewritten links - The links should not be rewritten. The links within the memento content should exist as they did on the web at the moment the memento was captured by the web archive.
  3. Original headers - The original HTTP response headers should be available, expressed as X-Archive-Orig-*, like X-Archive-Orig-Content-Type: text/html. Their values should be the same as those of the corresponding headers without the X-Archive-Orig- prefix (e.g. Content-Type) at the moment of capture by the web archive.
We propose a solution that uses the Prefer HTTP request header and the Preference-Applied response header from RFC7240.
Consider a client that prefers a true, raw memento for http://www.cnn.com. Using the Prefer HTTP request header, this client can provide the following request headers when issuing an HTTP HEAD/GET to a memento.
GET /web/20160721152544/http://www.cnn.com/ HTTP/1.1
Host: web.archive.org
Prefer: original-content, original-links, original-headers
Connection: close
As we see above, the client specifies which level of raw-ness it prefers in the memento. In this case, the client prefers a memento with the following features:
  1. original-content - The client prefers that the memento returned contain the same HTML, JavaScript, CSS, and/or text that existed in the original resource at the time of capture.
  2. original-links - The client prefers that the memento returned contain the links that existed in the original resource at the time of capture.
  3. original-headers - The client prefers that the memento response uses X-Archive-Orig-* to express the values of the original HTTP response headers from the moment of capture.
The memento then responds with the headers below.
HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 17:34:15 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 109672
Connection: keep-alive
set-cookie: wayback_server=60; Domain=archive.org; Path=/; Expires=Sat, 20-Aug-16 17:34:15 GMT;
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Content-Location: /web/20160721152544im_/http://www.cnn.com/
Vary: prefer
Preference-Applied: original-content, original-links, original-headers
Link: <http://www.cnn.com/>; rel="original",
  <http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format",
  <http://web.archive.org/web/http://www.cnn.com/>; rel="timegate",
  <http://web.archive.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT",
  <http://web.archive.org/web/20160120080735/http://www.cnn.com/>; rel="first memento"; datetime="Wed, 20 Jan 2016 08:07:35 GMT",
  <http://web.archive.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT"
X-Archive-Orig-x-served-by: cache-iad2120-IAD, cache-sjc3632-SJC
X-Archive-Orig-x-cache-hits: 1, 13
X-Archive-Orig-cache-control: max-age=60
X-Archive-Orig-x-xss-protection: 1; mode=block
X-Archive-Orig-content-type: text/html; charset=utf-8
X-Archive-Orig-age: 184
X-Archive-Orig-x-timer: S1469114744.153501,VS0,VE0
X-Archive-Orig-set-cookie: countryCode=US; Domain=.cnn.com
X-Archive-Orig-access-control-allow-origin: *
X-Archive-Orig-content-security-policy: default-src 'self' http://*.cnn.com:* https://*.cnn.com:* *.cnn.net:* *.turner.com:* *.ugdturner.com:* *.vgtf.net:*; script-src 'unsafe-inline' 'unsafe-eval' 'self' *; style-src 'unsafe-inline' 'self' *; frame-src 'self' *; object-src 'self' *; img-src 'self' * data: blob:; media-src 'self' *; font-src 'self' *; connect-src 'self' *;
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-vary: Accept-Encoding
X-Archive-Orig-connection: close
X-Archive-Orig-x-servedbyhost: prd-10-60-168-38.nodes.56m.dmtio.net
X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
X-Archive-Orig-via: 1.1 varnish
X-Archive-Orig-content-length: 109672
X-Archive-Orig-x-cache: HIT, HIT
X-Archive-Orig-fastly-debug-digest: 1e206303e0672a50569b0c0a29903ca81f3ef5033de74682ce90ec9d13686981
The response also uses the Preference-Applied header to indicate that it is providing the original-headers and the content has its original-links and original-content. It is possible, of course, for a system to satisfy only some of these preferences, and the Preference-Applied header allows the server to indicate which ones.
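A client can compare what it asked for with what the server reports. The Python sketch below illustrates that check; `build_prefer_header` and `unmet_preferences` are hypothetical helpers, and they assume the preferences are bare tokens without values or parameters, as in the headers shown in this post.

```python
def build_prefer_header(preferences):
    """Join bare preference tokens into a Prefer header value."""
    return ", ".join(preferences)


def unmet_preferences(requested, preference_applied):
    """Return the requested preferences missing from Preference-Applied.

    Assumption: preferences are bare tokens (no "=value" or parameters).
    `preference_applied` may be None when the header is absent.
    """
    applied = {
        token.strip().lower()
        for token in (preference_applied or "").split(",")
        if token.strip()
    }
    return [pref for pref in requested if pref.lower() not in applied]


wanted = ["original-content", "original-links", "original-headers"]
print(build_prefer_header(wanted))
# A server that only honors original-headers leaves two preferences unmet.
print(unmet_preferences(wanted, "original-headers"))
```

If the returned list is non-empty, the client knows the memento is not fully raw and can decide whether that is acceptable for its application.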
The Vary header also contains prefer, indicating that clients can influence the memento's response by using the Prefer request header. The response can then be cached and reused for requests that carry the same preferences in their request headers.
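One way to picture the caching consequence is that a cache honoring Vary: prefer has to treat the Prefer request value as part of its lookup key. The Python sketch below illustrates that idea under simplifying assumptions and is not a description of how any particular cache works: cache_key is a hypothetical helper, the request URI is assumed, and Prefer is treated as an unordered set of bare tokens.

```python
def cache_key(uri, request_headers, vary):
    """Build a lookup key from the URI plus each request header named in Vary."""
    # Header names are case-insensitive, so compare them in lowercase.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    parts = [uri]
    for name in (n.strip().lower() for n in vary.split(",")):
        if not name:
            continue
        value = lowered.get(name, "")
        if name == "prefer":
            # Assumption: token order and case do not change the meaning,
            # so normalize them before building the key.
            tokens = sorted(t.strip().lower() for t in value.split(",") if t.strip())
            value = ",".join(tokens)
        parts.append(f"{name}={value}")
    return "|".join(parts)


# Assumed request URI for the memento discussed above.
uri = "http://web.archive.org/web/20160721152544/http://www.cnn.com/"
raw_key = cache_key(uri, {"Prefer": "original-links, original-content"}, "prefer")
default_key = cache_key(uri, {}, "prefer")
print(raw_key != default_key)
```

With this key, a request carrying Prefer and a request without it map to different cache entries, so a cached default memento is not served to a client that asked for original content.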
Based on these preferences, the content of the response has been altered from the default. The Content-Location header informs clients of the exact URI-M that meets these preferences for this memento, in this case http://web.archive.org/web/20160721152544im_/http://www.cnn.com/.
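Because the Content-Location value in the response above is a path rather than a full URI, a client has to resolve it against the URI it requested to obtain the URI-M. A minimal Python sketch, assuming the request was sent to the URI-M shown in the code (the request itself is not part of the response above) and using a hypothetical helper name:

```python
from urllib.parse import urljoin


def resolve_uri_m(request_uri, content_location):
    """Resolve a possibly relative Content-Location against the request URI."""
    return urljoin(request_uri, content_location)


# Assumed request URI; the Content-Location value is copied from the response.
request_uri = "http://web.archive.org/web/20160721152544/http://www.cnn.com/"
content_location = "/web/20160721152544im_/http://www.cnn.com/"

print(resolve_uri_m(request_uri, content_location))
```

The resolved value can then be stored or shared as the URI-M of the representation that satisfies the stated preferences.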
The memento returned contains the original content and the original links, as seen in the figure below, and the original headers provided as X-Archive-Orig-* as shown in the above response.
Figure 3: Seen in this example is a memento with original-content - no banner added - and original-links as seen in the magnified inspector output from Firefox.

If the client issues no Prefer header in the request, the server can still use the Preference-Applied header to indicate which preferences are met by default. Again, the Vary header indicates that clients can influence the response via the Prefer request header, and the Content-Location header indicates the URI-M of the memento. The response headers for such a default memento from the Internet Archive are shown below, with its original headers expressed in the form of X-Archive-Orig-*.
HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 16:17:09 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 127383
Connection: keep-alive
set-cookie: wayback_server=60; Domain=archive.org; Path=/; Expires=Sat, 20-Aug-16 16:17:07 GMT;
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Content-Location: /web/20160721152544/http://www.cnn.com/
Vary: prefer
Preference-Applied: original-headers
Link: <http://www.cnn.com/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.cnn.com/>; rel="timegate", <http://web.archive.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT", <http://web.archive.org/web/20000620180259/http://www.cnn.com/>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT", <http://web.archive.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT"
Set-Cookie: JSESSIONID=3652A3AF37E6AF4FB5C7DEF16CC8084E; Path=/; HttpOnly
X-Archive-Orig-x-served-by: cache-iad2120-IAD, cache-sjc3632-SJC
X-Archive-Orig-x-cache-hits: 1, 13
X-Archive-Guessed-Charset: utf-8
X-Archive-Orig-cache-control: max-age=60
X-Archive-Orig-x-xss-protection: 1; mode=block
X-Archive-Orig-content-type: text/html; charset=utf-8
X-Archive-Orig-age: 184
X-Archive-Orig-x-timer: S1469114744.153501,VS0,VE0
X-Archive-Orig-set-cookie: countryCode=US; Domain=.cnn.com
X-Archive-Orig-access-control-allow-origin: *
X-Archive-Orig-content-security-policy: default-src 'self' http://*.cnn.com:* https://*.cnn.com:* *.cnn.net:* *.turner.com:* *.ugdturner.com:* *.vgtf.net:*; script-src 'unsafe-inline' 'unsafe-eval' 'self' *; style-src 'unsafe-inline' 'self' *; frame-src 'self' *; object-src 'self' *; img-src 'self' * data: blob:; media-src 'self' *; font-src 'self' *; connect-src 'self' *;
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-vary: Accept-Encoding
X-Archive-Orig-connection: close
X-Archive-Orig-x-servedbyhost: prd-10-60-168-38.nodes.56m.dmtio.net
X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
X-Archive-Orig-via: 1.1 varnish
X-Archive-Orig-content-length: 109672
X-Archive-Orig-x-cache: HIT, HIT
X-Archive-Orig-fastly-debug-digest: 1e206303e0672a50569b0c0a29903ca81f3ef5033de74682ce90ec9d13686981
For this default memento, shown in the figure below, the links are rewritten and the presence of the Wayback banner indicates that additional content has been added.
Figure 4: This default memento contains added content in the form of a banner outlined in red on top as well as rewritten links, shown using Firefox's inspector and magnified on the bottom.
We are confident that it is legitimate to use the Prefer header in this way. Even though the original RFC (RFC 7240) contains examples requesting different representations using only the PATCH, PUT, and POST methods, a draft RFC for the "safe" HTTP preference mentions its use with GET in order to modify the content of the requested page. This draft RFC has already been implemented in Mozilla Firefox and Internet Explorer. The Prefer header is also used in the W3C Open Annotation Protocol to indicate the extent to which a resource should include annotations in its representation.
Compared to our previously described approach, this solution is more elegant in its simplicity and intuitiveness. It also allows other client preferences to be introduced over time, should the need emerge. These preferences can and should be registered in accordance with RFC 7240. The client specifies which features of a memento it prefers, and the memento indicates which of those features its response actually satisfies.
We seek feedback on this solution, including what additional dimensions clients may prefer beyond the three we have specified.
--
Herbert Van de Sompel
- and -
Michael L. Nelson
- and -
Lyudmila Balakireva
- and -
Martin Klein
- and -
Harihar Shankar