2013-07-15: Wayback Machine Upgrades Memento Support

By Michael L. Nelson - July 15, 2013

Just over a week ago, the Internet Archive upgraded their support for Memento in the Wayback Machine. The Wayback Machine has had native Memento support for about 2.5 years, but they've just recently implemented a number of changes and now the Wayback Machine and version 08 of the Memento Internet Draft are synchronized. The changes will be mostly unseen by casual users, but developers will appreciate the changes that should make things even simpler. Perhaps even more importantly, these changes have been reflected in the open source version of the Wayback Machine, so the numerous sites that are running this software (for example, see the IIPC member list) should enjoy native Memento support upon their next upgrade.

The first and most significant change is that there is now just a single URI prefix for mementos (URI-M). Previously, the URI-M discovered through the Wayback Machine's UI was different from the URI-M discovered through the Memento interface (e.g., using the MementoFox add-on). For example, for the original resource thecribs.com (@ 2003-09-30) you used to have both:

Wayback UI: http://web.archive.org/web/20030930231814/http://www.thecribs.com/

Memento: http://api.wayback.archive.org/memento/20030930231814/http://www.thecribs.com/

(The second URI is not linked; the api.wayback.archive.org/* interface is now turned off and those URIs now produce 404s.)

The problem was that the web.archive.org URIs rewrote the URIs in the HTML to point back into the archive (i.e., "Archival Replay Mode"), but lacked the necessary Memento-Datetime and Link HTTP response headers. The api.wayback.archive.org URIs had the necessary HTTP response headers, but lacked the rewritten HTML for Archival Replay Mode. So while both types of URIs (web.archive.org and api.wayback.archive.org) worked in their respective environments, a Memento user could not share (via email, Twitter, etc.) an api.wayback.archive.org URI with a non-Memento user, and likewise a Memento user would not have the additional Memento functionality with a web.archive.org URI.

Long story short: a single URI does it all now:

% curl -I http://web.archive.org/web/20030930231814/http://www.thecribs.com/

HTTP/1.1 200 OK
Server: Tengine/1.4.6
Date: Mon, 15 Jul 2013 13:09:52 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 1168
Connection: keep-alive
set-cookie: wayback_server=99; Domain=archive.org; Path=/; Expires=Wed, 14-Aug-13 13:09:51 GMT;
Memento-Datetime: Tue, 30 Sep 2003 23:18:14 GMT
Link: <http://www.thecribs.com/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.thecribs.com/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.thecribs.com/>; rel="timegate", <http://web.archive.org/web/20030930231814/http://www.thecribs.com/>; rel="first memento"; datetime="Tue, 30 Sep 2003 23:18:14 GMT", <http://web.archive.org/web/20031222131737/http://www.thecribs.com/>; rel="next memento"; datetime="Mon, 22 Dec 2003 13:17:37 GMT", <http://web.archive.org/web/20130518134906/http://www.thecribs.com/>; rel="last memento"; datetime="Sat, 18 May 2013 13:49:06 GMT"
X-Archive-Guessed-Charset: UTF-8
X-Archive-Orig-Connection: close
X-Archive-Orig-Content-Type: text/html
X-Archive-Orig-Server: Apache/1.3.27 (Unix)
X-Archive-Orig-Date: Tue, 30 Sep 2003 23:18:14 GMT
X-Archive-Wayback-Perf: [IndexLoad: 140, IndexQueryTotal: 140, , RobotsFetchTotal: 3, , RobotsRedis: 3, RobotsTotal: 3, Total: 466, WArcResource: 322]
X-Archive-Playback: 1
X-Page-Cache: MISS
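
For developers who want to consume these headers programmatically, here is a minimal Python sketch of pulling the relations out of a Link header like the one above. The function names (parse_link_header, links_by_rel) are hypothetical, and the regular expressions assume every parameter value is double-quoted, as in the response shown; a production client would want a complete Link header parser.

```python
import re

# One link-value: <URI> followed by zero or more ;name="value" parameters.
# Assumption: parameter values are always double-quoted, as in the example.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return a list of (uri, params) tuples from a Link header value."""
    return [(uri, dict(PARAM_RE.findall(raw)))
            for uri, raw in LINK_RE.findall(value)]


def links_by_rel(value):
    """Index links by each space-separated token in their rel parameter."""
    index = {}
    for uri, params in parse_link_header(value):
        for rel in params.get("rel", "").split():
            index.setdefault(rel, []).append((uri, params))
    return index


# A shortened copy of the Link header from the response above.
sample = (
    '<http://www.thecribs.com/>; rel="original", '
    '<http://web.archive.org/web/http://www.thecribs.com/>; rel="timegate", '
    '<http://web.archive.org/web/20030930231814/http://www.thecribs.com/>; '
    'rel="first memento"; datetime="Tue, 30 Sep 2003 23:18:14 GMT"'
)

if __name__ == "__main__":
    for rel, targets in links_by_rel(sample).items():
        print(rel, [uri for uri, _ in targets])
```

Note that a relation such as "first memento" is two tokens, so the same link is indexed under both "first" and "memento".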

Never noticed the dual URI thing? That's fine, neither did most other people. I included the above details only to document how things used to work in case you run across an old-style api.wayback.archive.org URI. Otherwise, don't worry about it.

The URI merger also changes the base URIs for the TimeMaps and TimeGates, respectively:

http://web.archive.org/web/timemap/link/{URI-R}
http://web.archive.org/web/{URI-R}
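
As a rough illustration of these two templates, the sketch below builds the TimeMap and TimeGate URIs for an original resource and prepares (but does not send) a TimeGate request carrying an Accept-Datetime header. The helper names and the WAYBACK constant are hypothetical; only the URI patterns come from the templates above.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime

# Base prefix taken from the URI templates above.
WAYBACK = "http://web.archive.org/web/"


def timemap_uri(uri_r):
    """TimeMap URI for an original resource (URI-R)."""
    return WAYBACK + "timemap/link/" + uri_r


def timegate_uri(uri_r):
    """TimeGate URI for an original resource (URI-R)."""
    return WAYBACK + uri_r


def timegate_request(uri_r, when):
    """Build a HEAD request asking the TimeGate for the memento near `when`.

    `when` must be a timezone-aware datetime in UTC so that it can be
    rendered in the HTTP date format used by Accept-Datetime.
    """
    req = urllib.request.Request(timegate_uri(uri_r), method="HEAD")
    req.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return req


if __name__ == "__main__":
    when = datetime(2003, 9, 30, 23, 18, 14, tzinfo=timezone.utc)
    req = timegate_request("http://www.thecribs.com/", when)
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

Passing the prepared request to urllib.request.urlopen would follow the TimeGate's redirect to the selected memento; that step is left out here so the sketch runs without network access.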

The second change that may impact people is that TimeMaps now support paging. The page size is large (currently 10,000), but popular sites like www.cnn.com have > 14,000 mementos. Instead of having explicit "page 1", "page 2", etc., each page of a paged TimeMap now has a "self" link with "from" and "until" parameters to indicate the left-hand and right-hand temporal endpoints, respectively, for that page. It then links to the next page with a "timemap" link whose "from" parameter indicates the left-hand temporal endpoint of the next page (the "until" value might not be known if the last page is still being "filled", so to speak). It is easier to look at the example:

% curl http://web.archive.org/web/timemap/link/http://www.cnn.com/

<http://www.cnn.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT"; until="Sat, 30 Jun 2012 21:46:19 GMT",
<http://web.archive.org/web/http://www.cnn.com/>; rel="timegate",
<http://web.archive.org/web/20000620180259/http://cnn.com/>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://web.archive.org/web/20000621011731/http://cnn.com/>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://web.archive.org/web/20000621140928/http://cnn.com/>; rel="memento"; datetime="Wed, 21 Jun 2000 14:09:28 GMT",
...
[lots of links deleted] 
...
<http://web.archive.org/web/20120630193538/http://www.cnn.com/>; rel="memento"; datetime="Sat, 30 Jun 2012 19:35:38 GMT",
<http://web.archive.org/web/20120630214619/http://www.cnn.com/>; rel="last memento"; datetime="Sat, 30 Jun 2012 21:46:19 GMT",
<http://web.archive.org/web/timemap/link/20120630214620/http://www.cnn.com/>; rel="timemap"; type="application/link-format"; from="Sat, 30 Jun 2012 21:46:20 GMT"

% curl http://web.archive.org/web/timemap/link/20120630214620/http://www.cnn.com/

<http://www.cnn.com/>; rel="original",
<http://web.archive.org/web/timemap/link/20120630214620/http://www.cnn.com/>; rel="self"; type="application/link-format"; from="Sat, 30 Jun 2012 22:51:21 GMT"; until="Sun, 14 Jul 2013 23:57:29 GMT",
<http://web.archive.org/web/http://www.cnn.com/>; rel="timegate",
<http://web.archive.org/web/20120630225121/http://cnn.com/>; rel="first memento"; datetime="Sat, 30 Jun 2012 22:51:21 GMT",
<http://web.archive.org/web/20120701001356/http://www.cnn.com/>; rel="memento"; datetime="Sun, 01 Jul 2012 00:13:56 GMT",
<http://web.archive.org/web/20120701015627/http://www.cnn.com/>; rel="memento"; datetime="Sun, 01 Jul 2012 01:56:27 GMT",
...
[lots of links deleted]
...

Together, the multiple pages form a single logical TimeMap and the pages are only for convenience of transport. The server determines how many links go into a single page. Most TimeMaps have < 10,000 URI-Ms so you might not notice this change right away, but please be aware that your applications can no longer assume they're getting the entire TimeMap with a single HTTP GET.
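
One way a client might cope with paging is sketched below in Python: read a page, yield its mementos, then follow the rel="timemap" link to the next page until none is left. This is an illustration under assumptions, not the Wayback Machine's own client code: iter_mementos and fetch are hypothetical names, parameter values are assumed to be double-quoted as in the examples, and any non-"self" link with rel="timemap" is assumed to point to the next page.

```python
import re
import urllib.request

# One link-value: <URI> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def fetch(uri):
    """Default fetcher: GET the TimeMap page and return its body as text."""
    with urllib.request.urlopen(uri) as resp:
        return resp.read().decode("utf-8")


def iter_mementos(timemap_uri, fetch=fetch):
    """Yield (URI-M, datetime string) pairs across all pages of a TimeMap."""
    seen = set()              # guards against a page that links to itself
    next_page = timemap_uri
    while next_page and next_page not in seen:
        seen.add(next_page)
        body = fetch(next_page)
        next_page = None
        for uri, raw in LINK_RE.findall(body):
            params = dict(PARAM_RE.findall(raw))
            rels = params.get("rel", "").split()
            if "memento" in rels:      # also matches "first memento", etc.
                yield uri, params.get("datetime")
            elif "timemap" in rels:    # the "self" link has rel="self"
                next_page = uri


if __name__ == "__main__":
    start = "http://web.archive.org/web/timemap/link/http://www.cnn.com/"
    for count, (uri_m, when) in enumerate(iter_mementos(start), 1):
        if count > 5:
            break
        print(when, uri_m)
```

The fetch parameter is injectable so the paging logic can be exercised against canned pages without touching the network.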

The third change is about defining a standard way for the archive to tell the client "this is not a memento, so do not attempt memento processing on it"*. This is new in section 4.5.8 of version 8 of the Internet Draft. The idea is that most of the resources embedded in, for example, http://web.archive.org/web/20030930231814/http://www.thecribs.com/ are mementos captured at some point in the past. However, some of the images, javascript, etc. are injected by the archive to assist in playback and are not actual mementos and thus the client should not attempt negotiation on those resources. Rather than having clients maintain regular expressions for what is and what is not a memento at various archives, the server can now just send back this HTTP response header:

Link: <http://mementoweb.org/terms/donotnegotiate>; rel="type"

Here is the full HTTP response for http://web.archive.org/static/js/jwplayer/jwplayer.js, a javascript file injected into the archived HTML to assist in the archival playback:

If you study the HTTP responses for both http://web.archive.org/web/20030930231814/http://www.thecribs.com/ and http://web.archive.org/static/js/jwplayer/jwplayer.js, you will see that the former has "X-Archive-Playback: 1" and the latter has "X-Archive-Playback: 0". In summary, section 4.5.8 of the Internet Draft just standardizes the current "X-Archive-Playback: 0" header with a Link header that is applicable to all kinds of Memento archives (and not just Wayback Machines).
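
On the client side, honoring this header could look roughly like the Python sketch below. The function name should_negotiate is hypothetical, headers are assumed to arrive as a simple name-to-value mapping, and parameter values are assumed to be double-quoted as in the Link header shown above.

```python
import re

# Target URI from the Link header shown above.
DO_NOT_NEGOTIATE = "http://mementoweb.org/terms/donotnegotiate"

# One link-value: <URI> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def should_negotiate(headers):
    """Return False if the response opts out of Memento processing.

    `headers` maps header names to values; names are matched
    case-insensitively because HTTP header names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() != "link":
            continue
        for uri, raw in LINK_RE.findall(value):
            rels = dict(PARAM_RE.findall(raw)).get("rel", "").split()
            if uri == DO_NOT_NEGOTIATE and "type" in rels:
                return False
    return True


if __name__ == "__main__":
    injected = {"Link": '<http://mementoweb.org/terms/donotnegotiate>; rel="type"'}
    memento = {"Link": '<http://www.thecribs.com/>; rel="original"'}
    print(should_negotiate(injected), should_negotiate(memento))
```

Checking a header rather than matching URI patterns is the point of the change: the client needs no archive-specific regular expressions.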

We hope you will give the new Wayback Memento interfaces a test drive and let us know if you see any errors or have additional comments. The new interfaces were integrated in the LANL and ODU aggregators last week, so if you are using those you should have seen a switch already. We'd like to thank Ilya Kremer (IA) and Lyudmila Balakireva (LANL) for all of their feedback and efforts during this implementation and Kris Carpenter (IA) for her continued support of Memento.

--Michael

* or, if you prefer: "All these URIs are mementos except this one. Attempt no negotiation there. Use them together. Use them in peace."

Comments

Chris Adams, July 23, 2013 at 2:32 PM
As an aside, I noticed some display issues for the curl fragments in my RSS reader which turned out to be caused by the curl output examples: the XML responses contained characters which have meaning in HTML (<>&) but are not escaped.

Michael L. Nelson, July 29, 2013 at 12:01 PM
Hi Chris: the curl examples are inside an HTML textarea block, in which a browser should treat everything inside as not requiring escaping. If you view this in another browser (i.e., not your RSS reader), it should do the right thing.

Unknown, October 9, 2014 at 3:14 AM
{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "url": "http://web.archive.org/web/20130919044612/http://example.com/",
      "timestamp": "20130919044612",
      "status": "200"
    }
  }
}

prashantc, November 23, 2014 at 4:06 PM
Hi, I'm fairly new to archive retrieval. I'm working with Memento archives. I'm still a little confused as to what to do. I'm currently attempting to reconstruct an HTML page on the server. My server-side application makes a "TimeGate" request and gets the most recent Memento URL. I ping the URL for the archived URI. I haven't injected any Javascript into the page or done any HTML rewrite. As you mentioned, the images aren't loaded correctly. Neither are other externally linked files. It would be great if you help me out with this. Thanks.

Michael L. Nelson, November 24, 2014 at 10:13 AM
prashantc: I'm not sure I understand your question -- example URIs would help tremendously. Note that if you follow the 302 from the TimeGate you arrive at the archive's best memento for your requested Accept-Datetime. It is possible that that memento doesn't have all the embedded resources archived. If you know the missing URI and look for it in http://www.mementoweb.org/timegate/{URI} and you get a 404, then nobody has it archived and there's not much you can do about it.

--Michael

prashantc, November 24, 2014 at 5:02 PM
Hello Michael. Thank you for your help. Unfortunately, I have a local setup so I can't provide any example URIs. Let me try explaining the setup: I have an HTML file (say text.html) that references an image file (say htmlimage.png). Both these files have been archived correctly. I know this because when I ping http://mementoweb.org/timegate/text.html in my browser, the content loads correctly along with the image.
However, I am trying to do the same on the server side, using C. Using HTTPGet, I ping the same URL (written above) and it internally makes two memento requests: 1) for text.html and 2) for htmlimage.png. The output for this overall request is sent back to the client. On the client side, the content in text.html is loaded correctly. However, htmlimage.png appears missing/broken. I hope this helps. Thank you for your help.

Michael L. Nelson, November 24, 2014 at 6:27 PM
prashantc: I still don't have enough details to understand what is happening. It almost sounds like you're trying to get both the HTML and the PNG in a single request to the TimeGate (which can't be done)? Also, I'm not sure what you mean by "output for this overall request is sent back to the client." I think we're going to need to look at HTTP traces before I can understand what you're doing. Also, maybe we could move this discussion to the memento-dev email list. Additional questions: are you setting "Accept-Datetime" for *all* of the requests? The IA will behave differently if that value is present or absent. I also assume you're chasing the redirects all the way to the end?

LookForMarry, August 12, 2015 at 11:09 PM
Since the first of August 2015, the API does not work.
I try to check domain age instead of archive.org

Michael L. Nelson, August 12, 2015 at 11:22 PM
I'm not sure I understand your comment -- the Memento API still works.

Unknown, May 25, 2016 at 4:26 AM
Hello!
Thank you for this project.
Can I get a first date (first memento) in your JSON API?

Michael L. Nelson, May 28, 2016 at 10:20 AM
If you have any memento, you can get the first value by doing a HEAD on the memento and looking for rel="first memento" in the Link headers. If you need it in a JSON response, take a look at: http://timetravel.mementoweb.org/guide/api/ for various options.

cheyrn, November 15, 2018 at 6:31 PM
Where can I ask questions about this? I've tried the memento-dev mailing list and my post hasn't been approved for several days. One question is, some of the resources have memento addresses that add cs_ or js_ etc. id_ I see mentioned as retrieving the unrewritten resource, but what is the significance of the other 3 letter additions? In at least one case I was unable to retrieve a resource, but I could by removing "im_" from the memento URL.

Michael L. Nelson, November 15, 2018 at 10:18 PM
Hi, I'm not sure why the memento-dev message didn't go through. I know some of the list managers are out of the country; I'll ping them.

cs_ == CSS
js_ == javascript
im_ == image

they basically function the same as id_, but they are used when the html suggests that the target URI is an image, CSS, etc. see:

https://github.com/webrecorder/pywb/blob/master/docs/manual/rewriter.rst

note that id_ im_ etc. are not fully standardized across different wayback machine implementations. most are shared, but not all.

Michael
Web Science and Digital Libraries Research Group
