2020-03-26: Memento Compliance Audit of PyWB
This document is an audit report of the latest development version of PyWB, a Web archive replay system, for its Memento (RFC 7089) compliance. As a growing number of public Web archives are moving towards deploying PyWB, it becomes critical to comply with standards to ensure that tools in the archiving ecosystem continue to function as expected.
To audit the Memento compliance of PyWB I established the following setup:
- Captured example.com five times in separate WARC files, a few minutes apart, using warcio
- Created various test instances of PyWB's develop branch, which is one commit ahead of the v-2.4.0-rc6-test version (commit hash: 92e459bda52a2b03f33a4b0b8094ed424248d2a5)
- Initialized a collection named example and loaded the freshly captured WARC files into it for replay
- Placed multiple custom configuration files that are loaded by setting the PYWB_CONFIG_FILE environment variable for each test instance
- Preserved the state of the relevant folder tree in pywbtest.tar.gz for replication and reproducibility
- Made the test instances publicly accessible at:
  - Default: https://pywbtest.ws-dl.cs.odu.edu/
  - No Frame Replay: https://pywbtest-nofr.ws-dl.cs.odu.edu/
  - TimeGate Redirect: https://pywbtest-tgrd.ws-dl.cs.odu.edu/
  - No Frame Replay With TimeGate Redirect: https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/
Notes:
- The keywords MUST and MUST NOT indicate strict compliance issues, while SHOULD and SHOULD NOT suggest established practices and recommendations that some existing tools may rely on
- These testing services run behind a reverse proxy, which (rather than PyWB) is responsible for HTTPS and HTTP/2, so the --http1.1 flag is used in the curl commands below to enforce HTTP/1.1
- Commands below can be clicked to toggle the visibility of their outputs or can be run in a terminal while the testing services remain live
- Each command is prefixed with a counter in the form of [C<NUMBER>] to allow referencing
tree pywbtest
pywbtest
├── collections
│ └── example
│ ├── archive
│ │ ├── example-20200323133704.warc.gz
│ │ ├── example-20200323133917.warc.gz
│ │ ├── example-20200323134145.warc.gz
│ │ ├── example-20200323134509.warc.gz
│ │ └── example-20200323134606.warc.gz
│ ├── indexes
│ │ └── index.cdxj
│ ├── static
│ └── templates
├── config-nofr-tgrd.yaml
├── config-nofr.yaml
├── config-tgrd.yaml
├── static
└── templates
8 directories, 6 files
Table of Contents
- Default Mode
- No Frame Replay Mode
- TimeGate Redirect Mode
- No Frame Replay With TimeGate Redirect Mode
- Summary
- Acknowledgements
1. Default Mode
In this mode we do not use any custom configuration file.
1.1. Banner Memento
After navigating via the Web UI I selected the third of the five mementos and inspected it using cURL:
curl -iL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/
HTTP/1.1 200 OK
Content-Length: 1573
Content-Type: text/html
Date: Mon, 23 Mar 2020 17:44:17 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
<!DOCTYPE html>
<html>
<head>
<style>
html, body
{
height: 100%;
margin: 0px;
padding: 0px;
border: 0px;
overflow: hidden;
}
</style>
<script src='https://pywbtest.ws-dl.cs.odu.edu/static/wb_frame.js'> </script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest.ws-dl.cs.odu.edu/static/default_banner.css'/>
</head>
<body style="margin: 0px; padding: 0px;">
<div id="wb_iframe_div">
<iframe id="replay_iframe" frameborder="0" seamless="seamless" scrolling="yes" class="wb_iframe" allow="autoplay; fullscreen"></iframe>
</div>
<script>
var cframe = new ContentFrame({"url": "https://example.com/" + window.location.hash,
"prefix": "https://pywbtest.ws-dl.cs.odu.edu/example/",
"request_ts": "20200323134145",
"iframe": "#replay_iframe"});
</script>
</body>
</html>
This is the banner container that loads the main page memento inside an iframe. It exposes the link relations timemap, timegate, and a single memento with the mp_ suffix.
Now, let's make a request with the datetime one second earlier, i.e., 20200323134144 instead of 20200323134145, for which there is no memento at that exact moment.
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134144/https://example.com/
HTTP/1.1 200 OK
Content-Length: 1573
Content-Type: text/html
Date: Mon, 23 Mar 2020 17:45:41 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134144mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:44 GMT"
Memento-Datetime: Mon, 23 Mar 2020 13:41:44 GMT
And another request to a domain name that does not exist in the archive:
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134144/https://missing.example.com/
HTTP/1.1 200 OK
Content-Length: 1581
Content-Type: text/html
Date: Mon, 23 Mar 2020 17:46:47 GMT
Link: <https://missing.example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://missing.example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://missing.example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134144mp_/https://missing.example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:44 GMT"
Memento-Datetime: Mon, 23 Mar 2020 13:41:44 GMT
These two requests SHOULD have returned 302 and 404 status codes respectively, but both return 200. More importantly, they return Memento-Datetime headers that merely mirror the datetime string in the request URI, which means a Memento client may wrongly treat these responses as actual mementos.
1.2. TimeMap
Now, let's fetch the TimeMap as reported in the Link header of the first request above (command C2).
curl -iL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/
HTTP/1.1 200 OK
Content-Length: 1097
Content-Type: application/link-format
Date: Mon, 23 Mar 2020 17:48:10 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime
<https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="self"; type="application/link-format"; from="Mon, 23 Mar 2020 13:37:04 GMT",
<https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate",
<https://example.com/>; rel="original",
<https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example",
<https://pywbtest.ws-dl.cs.odu.edu/example/20200323133917mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:39:17 GMT"; collection="example",
<https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example",
<https://pywbtest.ws-dl.cs.odu.edu/example/20200323134509mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:45:09 GMT"; collection="example",
<https://pywbtest.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"; collection="example"
This has at least two issues:
- It returns a Vary: accept-datetime header, but datetime-based content negotiation on a TimeMap endpoint is not defined in Memento
- Memento links in the response payload introduce a collection attribute that looks harmless, but such arbitrary attributes are not allowed by Web Linking (RFC 5988) unless extended in another specification. If this information is important, it can be incorporated as per the Item and Collection Link Relations (RFC 6573)
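The second issue can be spotted mechanically. The following is a minimal sketch, not part of PyWB or any Memento library, that parses a link-format TimeMap with regular expressions and reports attribute names outside an allow-list. The allow-list in KNOWN_ATTRS and the function names are assumptions made for illustration, and the sample input is a three-entry excerpt of the response above.

```python
import re

# Three entries copied from the TimeMap response above; the rest are omitted.
TIMEMAP = (
    '<https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; '
    'rel="self"; type="application/link-format"; from="Mon, 23 Mar 2020 13:37:04 GMT",\n'
    '<https://example.com/>; rel="original",\n'
    '<https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; '
    'rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"'
)

# Assumed allow-list of attribute names expected in a TimeMap (illustrative only).
KNOWN_ATTRS = {"rel", "datetime", "type", "from", "until", "license"}

# One link: <target> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, attrs) tuples from a link-format TimeMap."""
    return [(uri, dict(ATTR_RE.findall(params)))
            for uri, params in LINK_RE.findall(text)]


def unknown_attrs(links):
    """Return the sorted attribute names that are not in the allow-list."""
    return sorted({name for _, attrs in links
                   for name in attrs if name not in KNOWN_ATTRS})


def memento_datetimes(links):
    """Return the datetime of every link whose rel includes 'memento'."""
    return [attrs["datetime"] for _, attrs in links
            if "memento" in attrs.get("rel", "").split()]


links = parse_timemap(TIMEMAP)
print(unknown_attrs(links))
print(memento_datetimes(links))
```

The same list of memento datetimes can also be used to cross-check a Memento-Datetime header reported elsewhere, such as the 13:41:44 value returned by the banner container earlier, which has no counterpart in the TimeMap.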
1.3. Main Page Memento
Now, let's fetch the middle memento entry from the reported TimeMap above (command C5), which is the main page memento, not the banner container.
curl -iL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Location: https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
Transfer-Encoding: chunked
<!doctype html>
<html>
<head><!-- WB Insert -->
<script>
wbinfo = {};
wbinfo.top_url = "https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/";
// Fast Top-Frame Redirect
if (window == window.top && wbinfo.top_url) {
var loc = window.location.href.replace(window.location.hash, "");
loc = decodeURI(loc);
if (loc != decodeURI(wbinfo.top_url)) {
window.location.href = wbinfo.top_url + window.location.hash;
}
}
wbinfo.url = "https://example.com/";
wbinfo.timestamp = "20200323134145";
wbinfo.request_ts = "20200323134145";
wbinfo.prefix = decodeURI("https://pywbtest.ws-dl.cs.odu.edu/example/");
wbinfo.mod = "mp_";
wbinfo.is_framed = true;
wbinfo.is_live = false;
wbinfo.coll = "example";
wbinfo.proxy_magic = "";
wbinfo.static_prefix = "https://pywbtest.ws-dl.cs.odu.edu/static/";
wbinfo.enable_auto_fetch = false;
</script>
<script src='https://pywbtest.ws-dl.cs.odu.edu/static/wombat.js'> </script>
<script>
wbinfo.wombat_ts = "20200323134145";
wbinfo.wombat_sec = "1584970905";
wbinfo.wombat_scheme = "https";
wbinfo.wombat_host = "example.com";
wbinfo.wombat_opts = {};
if (window && window._WBWombatInit) {
window._WBWombatInit(wbinfo);
}
</script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest.ws-dl.cs.odu.edu/static/default_banner.css'/>
<!-- End WB Insert -->
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
- The value of the Memento-Datetime header is reported in the Date header as well (this behavior is found in many other places too where the main page memento is returned), though the two headers have different semantics
- The Link header only reports one memento relation (i.e., the current memento)
  - While it is not mandatory to provide other memento relations such as first, prev, next, and last, this has been the norm and tools were built with these expectations. For example, ReconstructiveBanner Custom Element relies on these values to enable navigational links
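Both observations can be expressed as simple header checks. The sketch below is illustrative only: the function names and the plain-dict representation of response headers are assumptions, and the sample values are copied from the response above with the Link header shortened.

```python
import re

# Navigational relation types that tools commonly expect next to "memento".
NAV_RELS = {"first", "prev", "next", "last"}


def link_rels(link_header):
    """Return the set of relation types used in a Link header value."""
    rels = set()
    for value in re.findall(r'rel="([^"]*)"', link_header):
        rels.update(value.split())  # a rel value may hold several types
    return rels


def audit_memento_headers(headers):
    """Return a list of findings for the headers (a dict) of a memento response."""
    findings = []
    date = headers.get("Date")
    if date and date == headers.get("Memento-Datetime"):
        findings.append("Date repeats Memento-Datetime")
    if not link_rels(headers.get("Link", "")) & NAV_RELS:
        findings.append("no navigational memento relations")
    return findings


# Header values copied from the main page memento response (Link shortened).
sample = {
    "Date": "Mon, 23 Mar 2020 13:41:45 GMT",
    "Memento-Datetime": "Mon, 23 Mar 2020 13:41:45 GMT",
    "Link": '<https://example.com/>; rel="original", '
            '<https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; '
            'rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"',
}
print(audit_memento_headers(sample))
```

Note that equal Date and Memento-Datetime values are only a hint: a response served in the very second of capture would trigger the same finding.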
Now, let's make a request with the datetime one second earlier, i.e., 20200323134144 instead of 20200323134145, for which there is no memento at that exact moment.
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134144mp_/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Location: https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
Traditionally, we would expect a 302 redirect to the closest memento in this case, but PyWB returns the payload of the closest memento and indicates the actual URI-M in the Content-Location header, both when the request URI is an exact match (e.g., the earlier example) and when the datetime is only nearby. The Content-Location header is generally used where some content negotiation is involved or a resource is created or updated, but I have not seen it being used in plain GET requests. While it can be argued that URI-Ms with datetimes different from the exact matches are a form of implicit content negotiation (albeit using a path parameter and not a header) and that this avoids unnecessary round-trips, my concern here is that too many URIs point to the same resource. I do not think that the Content-Location header can be used to convey the canonical link relation. Also, the use of the Content-Location header suggests that the user-agent should use the value of the header in the future in place of the request URI, which is problematic if a more appropriate (closer) memento becomes available later. At the least, a Cache-Control header should make it explicit that this response is not cacheable. Many researchers were relying on the traditional behavior to identify a terminal memento by following any redirects until a Memento-Datetime header is found in the response, but this behavior will force them to reconsider their scripts.
Another issue in the Link header of both of the above requests (commands C6 and C7) is the inclusion of the non-standard collection attribute (same as the TimeMap payload discussed earlier in section 1.2).
Now, a request to a domain name that does not exist in the archive:
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134144mp_/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1084
Content-Type: text/html
Date: Mon, 23 Mar 2020 22:13:44 GMT
This looks good. In contrast, the corresponding banner container URI-M (i.e., the one without the mp_ suffix) returns 200, as illustrated earlier.
1.4. TimeGate
Now, let's interact with the TimeGate endpoint as reported in the Link headers of many earlier responses. The corresponding PyWB documentation claims that the behavior is consistent with the Memento Pattern 2.2.
curl -iL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/
HTTP/1.1 200 OK
Content-Length: 1559
Content-Type: text/html
Date: Tue, 24 Mar 2020 00:15:36 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime
<!DOCTYPE html>
<html>
<head>
<style>
html, body
{
height: 100%;
margin: 0px;
padding: 0px;
border: 0px;
overflow: hidden;
}
</style>
<script src='https://pywbtest.ws-dl.cs.odu.edu/static/wb_frame.js'> </script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest.ws-dl.cs.odu.edu/static/default_banner.css'/>
</head>
<body style="margin: 0px; padding: 0px;">
<div id="wb_iframe_div">
<iframe id="replay_iframe" frameborder="0" seamless="seamless" scrolling="yes" class="wb_iframe" allow="autoplay; fullscreen"></iframe>
</div>
<script>
var cframe = new ContentFrame({"url": "https://example.com/" + window.location.hash,
"prefix": "https://pywbtest.ws-dl.cs.odu.edu/example/",
"request_ts": "",
"iframe": "#replay_iframe"});
</script>
</body>
</html>
curl -IL --http1.1 -H "Accept-Datetime: Fri, 01 Jan 1999 12:34:56 GMT" https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/
HTTP/1.1 200 OK
Content-Length: 1573
Content-Type: text/html
Date: Tue, 24 Mar 2020 00:15:09 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/19990101123456mp_/https://example.com/>; rel="memento"; datetime="Fri, 01 Jan 1999 12:34:56 GMT"
Memento-Datetime: Fri, 01 Jan 1999 12:34:56 GMT
Vary: accept-datetime
The first of these two requests (without an explicit Accept-Datetime header) SHOULD return the most recent memento, and the second one (with the explicit Accept-Datetime header) MUST resolve to the first memento, as the requested datetime is far in the past, well before the very first capture of the URI-R in the test archive.
The first response does not include any memento relation in the Link header and fails to provide a Memento-Datetime header. The second response does include both, but the datetime value is an echo of the requested Accept-Datetime value, not the datetime of the actual corresponding memento. Additionally, there are no navigational memento relations (as discussed earlier in section 1.3).
Requesting a non-existent resource returns a 200 status along with the issues discussed above, as shown below:
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/https://missing.example.com/
HTTP/1.1 200 OK
Content-Length: 1567
Content-Type: text/html
Date: Tue, 24 Mar 2020 00:41:22 GMT
Link: <https://missing.example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://missing.example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://missing.example.com/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime
It turned out that this reported TimeGate endpoint belongs to the banner container, not the main page memento. Unfortunately, this is the only TimeGate endpoint that is discoverable from any other response, be it a TimeMap, banner memento, or main page memento.
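For reference, the selection expected from a TimeGate for this collection can be sketched in a few lines. This is an illustration of the expectations stated above, not PyWB's implementation: the function names are hypothetical, the capture list is taken from the WARC file names, and picking the nearest capture for in-between datetimes is an assumed (though common) policy.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# 14-digit capture timestamps of the five mementos in the test collection.
CAPTURES = [
    "20200323133704",
    "20200323133917",
    "20200323134145",
    "20200323134509",
    "20200323134606",
]


def to_datetime(ts14):
    """Convert a 14-digit timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(ts14, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def negotiate(captures, accept_datetime=None):
    """Pick the capture that a TimeGate would be expected to resolve to.

    Without Accept-Datetime the most recent capture is chosen; otherwise the
    capture nearest to the requested datetime (assumed policy) is chosen, which
    yields the first capture for any datetime before the first capture.
    """
    if accept_datetime is None:
        return max(captures)
    target = parsedate_to_datetime(accept_datetime)
    return min(captures, key=lambda ts: abs(to_datetime(ts) - target))


print(negotiate(CAPTURES))
print(negotiate(CAPTURES, "Fri, 01 Jan 1999 12:34:56 GMT"))
```

Under these assumptions the two TimeGate requests above would be expected to resolve to the captures from 13:46:06 and 13:37:04 respectively, neither of which is reflected in the banner container responses.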
Out of curiosity I tested a potential TimeGate endpoint with mp_ as a path parameter, which turned out to be compliant with the documented behavior. However, both the PyWB documentation and the Link header in responses fail to acknowledge this.
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/mp_/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Location: https://pywbtest.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:46:06 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:46:06 GMT
Vary: accept-datetime
X-Archive-Orig-Age: 590951
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:46:06 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7FA7)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
curl -IL --http1.1 -H "Accept-Datetime: Fri, 01 Jan 1999 12:34:56 GMT" https://pywbtest.ws-dl.cs.odu.edu/example/mp_/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Location: https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:37:04 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:37:04 GMT
Vary: accept-datetime
X-Archive-Orig-Age: 351038
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:37:04 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7F13)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/mp_/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1084
Content-Type: text/html
Date: Mon, 23 Mar 2020 13:39:29 GMT
These look good, except for the following issues (which were discussed earlier as well):
- The values of the Date and Memento-Datetime headers are the same
- The non-standard collection attribute is present in the Link header
- The Link header reports only one memento relation (i.e., the current memento) and not the navigational memento relations such as first, prev, next, and last, which has been the norm and tools were built with these expectations. For example, MemGator relies on these relations to provide the consolidated navigational mementos in TimeGate and other related endpoints. If PyWB instances deployed in various public archives choose to omit these, MemGator's response will be less accurate unless it performs a more costly TimeMap request to establish the truth
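To illustrate what the missing relations would look like, the sketch below derives first, prev, next, and last links for one memento from the full list of capture timestamps, which is roughly the information a client has after the extra TimeMap request. The function names are hypothetical, the URI-M layout mirrors the mp_ URIs seen in this mode, and merging relation types that share a target into one link is a presentation choice, not a requirement.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

PREFIX = "https://pywbtest.ws-dl.cs.odu.edu/example/"
URI_R = "https://example.com/"
CAPTURES = ["20200323133704", "20200323133917", "20200323134145",
            "20200323134509", "20200323134606"]


def http_date(ts14):
    """Format a 14-digit timestamp as an HTTP date in GMT."""
    dt = datetime.strptime(ts14, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return format_datetime(dt, usegmt=True)


def nav_links(captures, current):
    """Return Link values for the first/prev/next/last mementos of `current`."""
    ordered = sorted(captures)
    i = ordered.index(current)
    rels = {}  # timestamp -> relation types pointing at it
    rels.setdefault(ordered[0], []).append("first")
    if i > 0:
        rels.setdefault(ordered[i - 1], []).append("prev")
    if i < len(ordered) - 1:
        rels.setdefault(ordered[i + 1], []).append("next")
    rels.setdefault(ordered[-1], []).append("last")
    links = []
    for ts in sorted(rels):
        rel = " ".join(rels[ts] + ["memento"])
        links.append('<%s%smp_/%s>; rel="%s"; datetime="%s"'
                     % (PREFIX, ts, URI_R, rel, http_date(ts)))
    return links


for link in nav_links(CAPTURES, "20200323134145"):
    print(link)
```

For the middle memento this yields four separate links, while for the second memento the first and prev relations collapse into a single link with rel="first prev memento".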
2. No Frame Replay Mode
In this mode we disable framed replay so that the archival banner is embedded directly in mementos.
cat config-nofr.yaml
framed_replay: false
2.1. Memento
First, a request to an existing memento with an exactly matching datetime:
curl -iL --http1.1 https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Location: https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
Transfer-Encoding: chunked
<!doctype html>
<html>
<head><!-- WB Insert -->
<script>
wbinfo = {};
wbinfo.top_url = "https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/";
wbinfo.url = "https://example.com/";
wbinfo.timestamp = "20200323134145";
wbinfo.request_ts = "20200323134145";
wbinfo.prefix = decodeURI("https://pywbtest-nofr.ws-dl.cs.odu.edu/example/");
wbinfo.mod = "";
wbinfo.is_framed = false;
wbinfo.is_live = false;
wbinfo.coll = "example";
wbinfo.proxy_magic = "";
wbinfo.static_prefix = "https://pywbtest-nofr.ws-dl.cs.odu.edu/static/";
wbinfo.enable_auto_fetch = false;
</script>
<script src='https://pywbtest-nofr.ws-dl.cs.odu.edu/static/wombat.js'> </script>
<script>
wbinfo.wombat_ts = "20200323134145";
wbinfo.wombat_sec = "1584970905";
wbinfo.wombat_scheme = "https";
wbinfo.wombat_host = "example.com";
wbinfo.wombat_opts = {};
if (window && window._WBWombatInit) {
window._WBWombatInit(wbinfo);
}
</script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest-nofr.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest-nofr.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest-nofr.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest-nofr.ws-dl.cs.odu.edu/static/default_banner.css'/>
<!-- End WB Insert -->
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
Next, a request to a memento with a nearby datetime:
curl -IL --http1.1 https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134144/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Location: https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
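As in section 1.3, the closest memento is reported via Content-Location on a 200 response rather than through a redirect. A script that used to follow redirects until it sees a Memento-Datetime header could be adapted along the following lines. This is a sketch under stated assumptions: the chain of (request URI, status, headers) tuples is a hypothetical data structure that a client would fill in while following redirects, header names are assumed to be normalized, and Content-Location is assumed to be an absolute URI as in the responses shown here.

```python
def effective_urim(chain):
    """Return the URI-M of the memento reached by a request chain, or None.

    `chain` is a list of (request_uri, status, headers) tuples in the order
    the requests were made. The first response carrying Memento-Datetime is
    treated as the memento; its Content-Location, when present, is preferred
    over the request URI.
    """
    for uri, status, headers in chain:
        if status == 200 and "Memento-Datetime" in headers:
            return headers.get("Content-Location", uri)
    return None


BASE = "https://pywbtest-nofr.ws-dl.cs.odu.edu/example/"

# The nearby-datetime request above: no redirect, URI-M in Content-Location.
pywb_chain = [
    (BASE + "20200323134144/https://example.com/", 200, {
        "Memento-Datetime": "Mon, 23 Mar 2020 13:41:45 GMT",
        "Content-Location": BASE + "20200323134145/https://example.com/",
    }),
]

# The traditional behavior (hypothetical): a 302 followed by the memento.
redirect_chain = [
    (BASE + "20200323134144/https://example.com/", 302,
     {"Location": BASE + "20200323134145/https://example.com/"}),
    (BASE + "20200323134145/https://example.com/", 200,
     {"Memento-Datetime": "Mon, 23 Mar 2020 13:41:45 GMT"}),
]

print(effective_urim(pywb_chain))
print(effective_urim(redirect_chain))
```

Both chains resolve to the same URI-M, so a script written this way would work with either behavior.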
Finally, a request to a non-existing memento:
curl -IL --http1.1 https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134144/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1104
Content-Type: text/html
Date: Tue, 24 Mar 2020 01:09:17 GMT
This behavior is similar to the main page memento of the default configuration (section 1.3) and inherits the same issues as discussed earlier.
2.2. TimeMap
curl -iL --http1.1 https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/
HTTP/1.1 200 OK
Content-Length: 1117
Content-Type: application/link-format
Date: Tue, 24 Mar 2020 01:19:51 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format"
Vary: accept-datetime
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="self"; type="application/link-format"; from="Mon, 23 Mar 2020 13:37:04 GMT",
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate",
<https://example.com/>; rel="original",
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323133704/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example",
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323133917/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:39:17 GMT"; collection="example",
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example",
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134509/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:45:09 GMT"; collection="example",
<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134606/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"; collection="example"
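For readers who want to inspect such a TimeMap programmatically, the following is a rough Python sketch of a link-format parser. It is not a complete implementation of the link-format grammar: it assumes each entry has the shape <URI>; name="value"; ... as seen in the output above, and the function names parse_links and mementos are made up for this illustration.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s";,]+))')


def parse_links(text):
    """Parse link-format text (or a Link header value) into a list of dicts."""
    links = []
    for uri, raw_params in LINK_RE.findall(text):
        params = {}
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        links.append({"uri": uri, **params})
    return links


def mementos(links):
    """Return (datetime, URI-M) pairs, oldest first, for rel="memento" entries."""
    found = []
    for link in links:
        if "memento" in link.get("rel", "").split():
            found.append((parsedate_to_datetime(link["datetime"]), link["uri"]))
    return sorted(found)


timemap = (
    '<https://example.com/>; rel="original",\n'
    '<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323133704/https://example.com/>; '
    'rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example",\n'
    '<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323133917/https://example.com/>; '
    'rel="memento"; datetime="Mon, 23 Mar 2020 13:39:17 GMT"; collection="example"'
)
for when, urim in mementos(parse_links(timemap)):
    print(when.isoformat(), urim)
```

Quoted values are matched as a unit, so the commas inside datetime values do not split an entry. Non-standard attributes such as collection are kept in the parsed dictionaries, which makes them easy to spot.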
This behavior is similar to the TimeMap of the default configuration (section 1.2) and inherits the same issues as discussed earlier.
2.3. TimeGate
curl -iL --http1.1 https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Location: https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134606/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:46:06 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134606/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:46:06 GMT
Vary: accept-datetime
X-Archive-Orig-Age: 590951
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:46:06 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7FA7)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
Transfer-Encoding: chunked
<!doctype html>
<html>
<head><!-- WB Insert -->
<script>
wbinfo = {};
wbinfo.top_url = "https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/";
wbinfo.url = "https://example.com/";
wbinfo.timestamp = "20200323134606";
wbinfo.request_ts = "";
wbinfo.prefix = decodeURI("https://pywbtest-nofr.ws-dl.cs.odu.edu/example/");
wbinfo.mod = "";
wbinfo.is_framed = false;
wbinfo.is_live = false;
wbinfo.coll = "example";
wbinfo.proxy_magic = "";
wbinfo.static_prefix = "https://pywbtest-nofr.ws-dl.cs.odu.edu/static/";
wbinfo.enable_auto_fetch = false;
</script>
<script src='https://pywbtest-nofr.ws-dl.cs.odu.edu/static/wombat.js'> </script>
<script>
wbinfo.wombat_ts = "20200323134606";
wbinfo.wombat_sec = "1584971166";
wbinfo.wombat_scheme = "https";
wbinfo.wombat_host = "example.com";
wbinfo.wombat_opts = {};
if (window && window._WBWombatInit) {
window._WBWombatInit(wbinfo);
}
</script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest-nofr.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest-nofr.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest-nofr.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest-nofr.ws-dl.cs.odu.edu/static/default_banner.css'/>
<!-- End WB Insert -->
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
curl -IL --http1.1 -H "Accept-Datetime: Fri, 01 Jan 1999 12:34:56 GMT" https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Location: https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323133704/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:37:04 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323133704/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:37:04 GMT
Vary: accept-datetime
X-Archive-Orig-Age: 351038
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:37:04 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7F13)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
curl -IL --http1.1 https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1104
Content-Type: text/html
Date: Tue, 24 Mar 2020 01:44:26 GMT
This behavior is similar to the TimeGate endpoint of main page mementos with the default configuration (section 1.4) and inherits the same issues as discussed earlier.
3. TimeGate Redirect Mode
In this mode we enable the redirection behavior of the TimeGate.
cat config-tgrd.yaml
redirect_to_exact: true
3.1. Banner Memento
The banner memento behaves the same way as in the default configuration (section 1.1).
3.2. Main Page Memento
The main page memento returns an intermediary 307 response if the datetime value in the URI-M does not exactly match an existing memento, as shown below:
curl -IL --http1.1 https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134144mp_/https://example.com/
HTTP/1.1 307 Temporary Redirect
Content-Length: 0
Date: Tue, 24 Mar 2020 03:31:19 GMT
Link: <https://example.com/>; rel="original"
Location: https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
The values of the Date and Memento-Datetime headers are the same in the terminal response.
3.3. TimeMap
The TimeMap behaves the same way as in the default configuration, except that it lists mementos without the mp_ suffix.
3.4. TimeGate
The corresponding PyWB documentation suggests that this behavior is consistent with Memento Pattern 2.3. However, its description suggests that Memento Pattern 2.1 was actually meant.
PyWB documentation states:
As this approach always includes a redirect, use of this system is discouraged when the intent is to render mementos. However, this approach is useful when the goal is to determine the URI-M and to provide backwards compatibility.
I think this mutual exclusion is problematic because it leaves the choice between the two configurations to archive admins, even though it concerns clients more, and admins have no way to enable both. Ideally, both endpoints should be available simultaneously to cater to both scenarios, without the need for a configuration option.
Let's make some requests to the advertised TimeGate endpoint that belongs to the frame memento:
curl -iL --http1.1 https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/
HTTP/1.1 307 Temporary Redirect
Content-Length: 0
Date: Tue, 24 Mar 2020 01:59:55 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"
Location: https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134606/https://example.com/
Vary: accept-datetime
HTTP/1.1 200 OK
Content-Length: 1603
Content-Type: text/html
Date: Tue, 24 Mar 2020 01:59:55 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"
Memento-Datetime: Mon, 23 Mar 2020 13:46:06 GMT
<!DOCTYPE html>
<html>
<head>
<style>
html, body
{
height: 100%;
margin: 0px;
padding: 0px;
border: 0px;
overflow: hidden;
}
</style>
<script src='https://pywbtest-tgrd.ws-dl.cs.odu.edu/static/wb_frame.js'> </script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest-tgrd.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest-tgrd.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest-tgrd.ws-dl.cs.odu.edu/static/default_banner.css'/>
</head>
<body style="margin: 0px; padding: 0px;">
<div id="wb_iframe_div">
<iframe id="replay_iframe" frameborder="0" seamless="seamless" scrolling="yes" class="wb_iframe" allow="autoplay; fullscreen"></iframe>
</div>
<script>
var cframe = new ContentFrame({"url": "https://example.com/" + window.location.hash,
"prefix": "https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/",
"request_ts": "20200323134606",
"iframe": "#replay_iframe"});
</script>
</body>
</html>
curl -IL --http1.1 -H "Accept-Datetime: Fri, 01 Jan 1999 12:34:56 GMT" https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/
HTTP/1.1 307 Temporary Redirect
Content-Length: 0
Date: Tue, 24 Mar 2020 02:03:58 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"
Location: https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323133704/https://example.com/
Vary: accept-datetime
HTTP/1.1 200 OK
Content-Length: 1603
Content-Type: text/html
Date: Tue, 24 Mar 2020 02:03:58 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"
Memento-Datetime: Mon, 23 Mar 2020 13:37:04 GMT
curl -IL --http1.1 https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1104
Content-Type: text/html
Date: Tue, 24 Mar 2020 02:05:07 GMT
Now, some requests to the non-advertised TimeGate endpoint that belongs to the main page memento:
curl -IL --http1.1 https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/mp_/https://example.com/
HTTP/1.1 307 Temporary Redirect
Content-Length: 0
Date: Tue, 24 Mar 2020 02:07:42 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"
Location: https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/
Vary: accept-datetime
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:46:06 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323134606mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:46:06 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:46:06 GMT
X-Archive-Orig-Age: 590951
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:46:06 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7FA7)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
curl -IL --http1.1 -H "Accept-Datetime: Fri, 01 Jan 1999 12:34:56 GMT" https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/mp_/https://example.com/
HTTP/1.1 307 Temporary Redirect
Content-Length: 0
Date: Tue, 24 Mar 2020 02:15:29 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"
Location: https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/
Vary: accept-datetime
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:37:04 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:37:04 GMT
X-Archive-Orig-Age: 351038
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:37:04 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7F13)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
curl -IL --http1.1 https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/mp_/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1104
Content-Type: text/html
Date: Tue, 24 Mar 2020 02:16:29 GMT
On a positive note, unlike the default configuration, non-existing resources are detected even in the banner memento and return a 404 status code. Other than this, there are numerous Memento-compliance issues in these responses:
- The values of the Date and Memento-Datetime headers are the same in terminal responses after redirects
- The memento relation in the Link header always returns a main page memento (i.e., one with the mp_ suffix) even when the banner memento is requested and the payload represents the banner page
- Both the PyWB documentation and the Memento RFC talk about a 302 status code when content negotiation is performed in this style, but the implementation returns 307 instead, which is against the specification and breaks tools like MemGator and Memento Validator
- If the 307 redirect were to be considered as a temporary resource then it MUST NOT include a Vary header with an accept-datetime value in it, and there SHOULD be a usual 302 response somewhere in the redirection chain
- If the purpose of replacing 302 with 307 is to support methods like POST and OPTIONS then the matter must be discussed with the community to resolve it collaboratively in a transparent manner, because the Memento RFC does not support this
- Presence of an arbitrary collection attribute and absence of navigational memento relations as discussed earlier
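Some of the observations listed above lend themselves to automated checking. The sketch below, in Python, takes a redirect chain as a list of (status, headers) pairs, for example transcribed from curl -IL output, and reports three of them: a redirect status other than 302, a memento link relation that differs from the Location header, and a terminal response whose Date equals its Memento-Datetime. The function names and the input shape are assumptions made for this illustration, and the simple regular expression only handles Link headers laid out like the ones in this report.

```python
import re


def memento_uri(link_header):
    """Return the URI of the first link whose rel is exactly "memento", or None."""
    match = re.search(r'<([^>]*)>\s*;\s*rel="memento"', link_header)
    return match.group(1) if match else None


def audit_chain(hops):
    """Report findings for a list of (status, headers) pairs in the order received."""
    findings = []
    for status, headers in hops:
        h = {name.lower(): value for name, value in headers.items()}
        if 300 <= status < 400:
            if status != 302:
                findings.append(f"redirect status {status} instead of 302")
            target = memento_uri(h.get("link", ""))
            if target and target != h.get("location"):
                findings.append("rel=memento link differs from Location")
        elif status == 200:
            if "memento-datetime" in h and h["memento-datetime"] == h.get("date"):
                findings.append("Date equals Memento-Datetime")
    return findings


base = "https://pywbtest-tgrd.ws-dl.cs.odu.edu/example/"
banner_chain = [
    (307, {
        "Link": f'<https://example.com/>; rel="original", '
                f'<{base}20200323134606mp_/https://example.com/>; rel="memento"; '
                f'datetime="Mon, 23 Mar 2020 13:46:06 GMT"',
        "Location": f"{base}20200323134606/https://example.com/",
    }),
    (200, {
        "Date": "Tue, 24 Mar 2020 01:59:55 GMT",
        "Memento-Datetime": "Mon, 23 Mar 2020 13:46:06 GMT",
    }),
]
print(audit_chain(banner_chain))
```

The sample chain is taken from the first request to the advertised TimeGate endpoint in this section. The check on the Vary header is left out on purpose, because whether it applies depends on how the 307 response is interpreted.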
4. No Frame Replay With TimeGate Redirect Mode
In this mode we disable the banner container frame and enable the redirection behavior of the TimeGate.
cat config-nofr-tgrd.yaml
framed_replay: false
redirect_to_exact: true
4.1. Memento
curl -iL --http1.1 https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
Transfer-Encoding: chunked
<!doctype html>
<html>
<head><!-- WB Insert -->
<script>
wbinfo = {};
wbinfo.top_url = "https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/";
wbinfo.url = "https://example.com/";
wbinfo.timestamp = "20200323134145";
wbinfo.request_ts = "20200323134145";
wbinfo.prefix = decodeURI("https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/");
wbinfo.mod = "";
wbinfo.is_framed = false;
wbinfo.is_live = false;
wbinfo.coll = "example";
wbinfo.proxy_magic = "";
wbinfo.static_prefix = "https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/static/";
wbinfo.enable_auto_fetch = false;
</script>
<script src='https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/static/wombat.js'> </script>
<script>
wbinfo.wombat_ts = "20200323134145";
wbinfo.wombat_sec = "1584970905";
wbinfo.wombat_scheme = "https";
wbinfo.wombat_host = "example.com";
wbinfo.wombat_opts = {};
if (window && window._WBWombatInit) {
window._WBWombatInit(wbinfo);
}
</script>
<script>
window.banner_info = {
is_gmt: true,
liveMsg: decodeURIComponent("Live on"),
calendarAlt: decodeURIComponent("Calendar icon"),
calendarLabel: decodeURIComponent("View All Captures"),
choiceLabel: decodeURIComponent("Language:"),
loadingLabel: decodeURIComponent("Loading..."),
logoAlt: decodeURIComponent("Logo"),
locale: "en",
curr_locale: "",
locales: [],
locale_prefixes: {},
prefix: "https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/",
staticPrefix: "https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/static"
};
</script>
<!-- default banner, create through js -->
<script src='https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/static/default_banner.js'> </script>
<link rel='stylesheet' href='https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/static/default_banner.css'/>
<!-- End WB Insert -->
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
width: 600px;
margin: 5em auto;
padding: 2em;
background-color: #fdfdff;
border-radius: 0.5em;
box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
div {
margin: 0 auto;
width: auto;
}
}
</style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134145/https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
curl -IL --http1.1 https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134144/https://example.com/
HTTP/1.1 307 Temporary Redirect
Content-Length: 0
Date: Tue, 24 Mar 2020 03:25:18 GMT
Link: <https://example.com/>; rel="original"
Location: https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
curl -IL --http1.1 https://pywbtest-nofr-tgrd.ws-dl.cs.odu.edu/example/20200323134144/https://missing.example.com/
HTTP/1.1 404 Not Found
Content-Length: 1124
Content-Type: text/html
Date: Tue, 24 Mar 2020 03:27:46 GMT
The behavior here is similar to the main page memento as described above in the TimeGate redirect mode (section 3.2).
4.2. TimeMap
The TimeMap behaves the same way as in the default configuration (section 1.2), except that it lists mementos without the mp_ suffix.
4.3. TimeGate
The TimeGate endpoint behaves the same way as described above in the TimeGate redirect mode (section 3.4) and inherits the same issues as discussed earlier. The only difference is in the payload of the final response, which is the same as in the no frame replay mode described earlier (section 2.3).
Summary
I audited the latest development version of PyWB (as of March 23, 2020) with a number of different configurations for its Memento compliance and found numerous issues of varying severity levels that may break various tools of the Web archiving and Memento ecosystem.
Critical violations (MUST be fixed):
- TimeGate in redirect mode MUST use 302-style content negotiation and not 307, which is not part of the Memento RFC; should 307-style be mandatory, the matter must be discussed with the community to resolve it collaboratively in a transparent manner (see section 3.4) -- [Reported in ipwb#545]
- As per RFC 5988 arbitrary attributes are not allowed in Link, hence the collection attribute in the Link header and TimeMap entity MUST be removed or incorporated as per RFC 6573 (see sections 1.2 and 1.3) -- [Reported in ipwb#546]
- When accessing a main page memento, entries in the Link header MUST correspond to the main page memento, and not the corresponding banner memento (see section 1.4) -- [Reported in ipwb#547]
- In main page mementos the value of the Memento-Datetime header overwrites the Date header; these headers have distinct semantics, and their values MUST NOT be the same, except in rare cases when a memento is replayed within one second of its capture (see section 1.3) -- [Reported in ipwb#548]
- Banner mementos blindly echo back the requested datetime in the Memento-Datetime header with a 200 status code, irrespective of whether an exactly matching memento exists or whether there are any mementos at all (see section 1.4) -- [Reported in ipwb#549]
- During datetime-based content negotiation a temporary resource MUST NOT include a Vary header with an accept-datetime value (see section 3.4) -- [Reported in ipwb#550]
Moderate issues (SHOULD be revisited):
- TimeMaps SHOULD NOT support content negotiation based on the Accept-Datetime header (see section 1.2) -- [Reported in ipwb#551]
- If there are variations of mementos (e.g., banner, rewritten, raw), the community SHOULD discuss how to report them in the Link header and TimeMaps and which ones should be reported in certain responses (see section 1.4) -- [Reported in ipwb#552]
- In case of implicit datetime content negotiation (i.e., using the datetime string of the URI-M path and not the Accept-Datetime header) a 302 redirect should be returned to the closest memento instead of returning 200 and relying on the Content-Location header, so as not to pollute the URI-M space and to ensure that caches and the many tools that rely on this behavior function properly (see section 1.3) -- [Reported in ipwb#553]
- Expose both styles of content negotiation (i.e., 200 and 302) simultaneously, so that user-agents get to decide which one to consume, not the webmaster (see section 3.4) -- [Reported in ipwb#554]
- Navigational memento link relations (i.e., first, prev, next, and last) are recommended to be included in the Link header of TimeGate and memento responses as many tools rely on them (see sections 1.3 and 1.4) -- [Reported in ipwb#555]
- Fix the PyWB documentation to align with the implementation (see section 3.4) -- [Reported in ipwb#556]
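Two of the findings above concern the contents of the Link header itself: the extra collection attribute and the missing navigational relations. The following Python sketch shows one way a tool could flag both. The set of expected attribute names is an assumption chosen for this illustration rather than a list copied from any RFC, and the function name review_link_header is hypothetical.

```python
import re

LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)')
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^\s";,]+))')

# Assumption: attribute names this sketch treats as expected in Memento responses.
EXPECTED_ATTRS = {"rel", "type", "datetime", "from", "until", "license"}
NAVIGATIONAL = {"first", "prev", "next", "last"}


def review_link_header(value):
    """Return unexpected attribute names and missing navigational relations."""
    unexpected = set()
    rels = set()
    for _uri, raw_params in LINK_RE.findall(value):
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            name = name.lower()
            if name not in EXPECTED_ATTRS:
                unexpected.add(name)
            if name == "rel":
                # A rel value may hold several space-separated relation types.
                rels.update((quoted or bare).lower().split())
    return {
        "unexpected_attributes": sorted(unexpected),
        "missing_navigational": sorted(NAVIGATIONAL - rels),
    }


link = (
    '<https://example.com/>; rel="original", '
    '<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", '
    '<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://pywbtest-nofr.ws-dl.cs.odu.edu/example/20200323134145/https://example.com/>; '
    'rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"'
)
print(review_link_header(link))
```

The sample value is the Link header of a memento response from section 2.1. A report with an empty list in both fields would indicate that neither of the two findings applies to the checked response.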
Acknowledgements
This audit report has greatly benefited from the feedback of Herbert Van de Sompel, Martin Klein, and Michael L. Nelson. I am grateful for their contributions, but I am responsible for any errors that may be present.
--Sawood Alam