2018-12-03: Acidic Regression of WebSatchel
Shawn Jones (@shawnmjones) recently made me aware of WebSatchel, a personal tool for saving copies of Web pages via a browser extension. The service is somewhat akin to browser-based tools like Pocket (bundled with Firefox after Mozilla's 2017 acquisition), among many others. Many such tools use a browser extension that lets the user send a URI to a service, which then creates a server-side snapshot of the page. This URI delegation procedure, in which the page is saved by-reference, aligns with Internet Archive's "Save Page Now", which we have discussed numerous times on this blog. In comparison, our own tool, WARCreate, saves "by-value", capturing the page as rendered in the user's browser.
With my interest in any sort of personal archiving tool, I downloaded the WebSatchel Chrome extension, created a free account, signed in, and tried to save the test page from the Archival Acid Test (which we created in 2014). My intention in doing this was to evaluate the preservation capabilities of the tool-behind-the-tool, i.e., that which is invoked when I click "Save Page" in WebSatchel. I was shown this interface:
Note the thumbnail of the screenshot captured. In the 2014 iteration of the Archival Acid Test (retained at the same URI-R for posterity), the red square indicates content that requires the user to interact with the page before it loads and thus becomes accessible for preservation. With respect to evaluating only the tool's capture ability, the red square in the thumbnail may not be indicative of the capture itself. Repeating the procedure, this time "surfacing" the red square on the live Web (i.e., interacting with the page before telling WebSatchel to grab it), resulted in a thumbnail where all squares were blue. This suggests that WebSatchel generates the thumbnail using the browser's screenshot extension API at the time of URI submission rather than from a screenshot of its own capture. That the screenshot is limited to the viewport (rather than the whole page) also points to this.
Mis(re-)direction
I then clicked the "Open Save Page" button and was greeted with a slightly different result. This capture resided at https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html.
curling that URI results in an inappropriately used HTTP 302 status code that appears to indicate a redirect to a login page.
$ curl -I https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 19:44:59 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Location: websatchel.com/j/public/login
Content-Type: text/html
Note the lack of a scheme in the Location header. RFC2616 (HTTP/1.1) Section 14.30 requires the Location value to be an absolute URI (per RFC3986 Section 4.3). To check whether this hostname-leading redirect pattern might be legitimate, I also consulted the more current RFC7231 Section 7.1.2, which revises the Location value to be a URI reference in the spirit of RFC3986. This updated HTTP/1.1 RFC allows relative references, as was already done in practice prior to RFC7231. Per the standards, WebSatchel's Location pattern causes browsers to interpret the hostname as a relative path, causing a redirect to https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/websatchel.com/j/public/login.
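RFC3986's reference-resolution algorithm, implemented in Python by urllib.parse.urljoin, illustrates why a scheme-less Location value is treated as a relative path rather than a hostname. A minimal sketch, using a shortened base URI for illustration:

```python
from urllib.parse import urljoin

# A Location value without a scheme (no "https://") is a relative
# reference per RFC 3986, so it is merged with the path of the URI
# that produced the redirect rather than replacing the authority.
base = "https://websatchel.com/j/pages/"  # shortened, illustrative base
location = "websatchel.com/j/public/login"  # value observed in the response

resolved = urljoin(base, location)
print(resolved)
# The hostname is absorbed into the path instead of becoming the host.
```

The same resolution happens in any standards-conforming browser or HTTP client, which is why the server's intended redirect target is never reached.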
$ curl -I https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/websatchel.com/j/public/login
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 20:13:04 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Location: websatchel.com/j/public/login
...and so on, recursively, until the browser reports "Too Many Redirects".
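The loop can be simulated by repeatedly applying the same resolution: each hop resolves the scheme-less Location against the previous, ever-longer URI, so the chain never converges and the browser's hop limit eventually trips. A small sketch, assuming the server always returns the same Location value:

```python
from urllib.parse import urljoin

# Scheme-less Location value returned by each 302 in the chain.
LOCATION = "websatchel.com/j/public/login"
MAX_HOPS = 20  # typical browser limit before "Too Many Redirects"

url = "https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html"
chain = []
for _ in range(MAX_HOPS):
    # The client resolves the relative reference against the current URI...
    url = urljoin(url, LOCATION)
    chain.append(url)

# ...so every hop yields a new, longer URI; the login page is never reached.
assert len(set(chain)) == MAX_HOPS
```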
Interacting with the Capture
Despite the redirect issue, the capture itself retained the red square when I interacted with it: in the session where all squares were blue on the live Web, that same square was red when viewing the capture. In addition, two of the "Advanced" tests (advanced relative to 2014 crawler capability, not particularly new to the Web at the time) were missing: one representing an iframe (without anything CORS-related behind the scenes) and one an embedded HTML5 object (using the standard video element, nothing related to Custom Elements).
"Your" Captures
I had hoped to also evaluate archival leakage (a.k.a. Zombies), but the service did not seem to provide a way for me to save my captures to my own system; i.e., "your" archives are remotely (and solely) hosted. While investigating a way to liberate my captures, I noticed that the default account is simply a trial of the service that ends a month after account creation, backed by a relatively steep monthly pricing model. The "free" account is also limited to 1 GB per account and 3 pages per day, with no access to their "page marker" feature, WebSatchel's system for a sort of text-highlighting annotation.
Interoperability?
WebSatchel has browser extensions for Firefox, Chrome, MS Edge, and Opera, but the data liberation scheme leaves a bit to be desired, especially for personal preservation. As a quick final test, without holding my breath for too long, I used my browser's DevTools to observe the HTTP response headers for the URI of my Acid Test capture. As above, attempting to access the capture via curl would require circumventing the infinite redirect and manually going through an authentication procedure. As expected, nothing resembling Memento-Datetime was present in the response headers.
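For contrast, a Memento-compliant archive (e.g., Internet Archive's Wayback Machine) exposes the capture time in a Memento-Datetime response header (RFC7089), which a client can detect and parse with standard date handling. A minimal sketch, using hypothetical header sets in place of live responses:

```python
from email.utils import parsedate_to_datetime

def memento_datetime(headers):
    """Return the capture datetime if the response is Memento-aware, else None."""
    value = headers.get("Memento-Datetime")
    return parsedate_to_datetime(value) if value else None

# Hypothetical headers, as a Wayback Machine memento might return:
wayback = {"Memento-Datetime": "Mon, 03 Dec 2018 19:44:59 GMT"}
# ...and as observed from the WebSatchel capture (no Memento headers):
websatchel = {"Content-Type": "text/html"}

print(memento_datetime(wayback))     # a timezone-aware datetime
print(memento_datetime(websatchel))  # None
```

A tool exposing this header (plus TimeMap/TimeGate links) would let captures interoperate with the broader Memento ecosystem rather than remaining siloed behind a login.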
—Mat (@machawk1)