2018-01-08: Introducing Reconstructive - An Archival Replay ServiceWorker Module


Web pages are generally composed of many resource such as images, style sheets, JavaScript, fonts, iframe widgets, and other embedded media. These embedded resources can be referenced in many ways (such as relative path, absolute path, or a full URL). When the same page is archived and replayed from a different domain under a different base path, these references may not resolve as intended, hence, may result in a damaged memento. For example, a memento (an archived copy) of the web page https://www.odu.edu/ can be seen at https://web.archive.org/web/20180107155037/https://www.odu.edu/. Note that domain name has changed from www.odu.edu to web.archive.org and some extra path segments are added to it. In order for this page to render properly, various resource references in it are rewritten, for example, images/logo-university.png in a CSS file is replaced with /web/20171225230642im_/http://www.odu.edu/etc/designs/odu/images/logo-university.png.

Traditionally, web archival replay systems rewrite link and resource references in HTML/CSS/JavaScript responses so that they resolve to their corresponding archival version. Failure to do so would result in a broken rendering of archived pages (composite mementos) as the embedded resource references might resolve to their live version or an invalid location. With the growing use of JavaScript in web applications, often resources are injected dynamically, hence rewriting such references is not possible from the server side. To mitigate this issue, some JavaScript is injected in the page that overrides the global namespace to modify the DOM and monitor all network activity. In JCDL17 and WADL17 we proposed a ServiceWorker-based solution to this issue that requires no server-side rewriting, but catches every network request, even those that were initiated due to dynamic resource injection. Read our paper for more details.
Sawood Alam, Mat Kelly, Michele C. Weigle and Michael L. Nelson, "Client-side Reconstruction of Composite Mementos Using ServiceWorker," In JCDL '17: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries. June 2017, pp. 237-240.


URL Rewriting


There are primarily three ways to reference a resource from another resource, namely, relative path, absolute path, and absolute URL. All three have their own challenges when served from an archive (or from a different origin and/or path than the original). In the case of archival replay, both the origin and base paths are changed from the original while original origin and paths usually become part of the new path. Relative paths are often the easiest to replay as they are not tied to the origin or the root path, but they cannot be used for external resources. Absolute paths and absolute URLs on the other hand are resolved incorrectly or live-leaked when a primary resource is served from an archive, neither of these conditions are desired in archival replay. There is a fourth way of referencing a resource called schemeless (or protocol-relative) that starts with two forward slashes followed by a domain name and paths. However, usually web archives ignore the scheme part of the URI when canonicalizing URLs, so we can focus on just three main ways. The following table illustrates examples of each with their resolution issues.


Reference type Example Resolution after relocation
Relative path images/logo.png Potentially correct
Absolute path /public/images/logo.png Potentially incorrect
Absolute URL http://example.com/public/images/logo.png Potentially live leakage

Archival replay systems (such as OpenWayback and PyWB) rewrite responses before serving to the client in a way that various resource references point to their corresponding archival page. Suppose a page, originally located at http://example.com/public/index.html, has an image in it that is referenced as <img src="/public/images/logo.png">. When the same page is served from an archive at http://archive.example.org/<datetime>/http://example.com/public/index.html, the image reference needs to be rewritten as <img src="/<datetime>/http://example.com/public/images/logo.png"> in order for it to work as desired. However, URLs constructed by JavaScript, dynamically on the client-side are difficult to rewrite just by the static analysis of the code at server end. With the rising usage of JavaScript in web pages, it is becoming more challenging for the archival replay systems to correctly replay archived web pages.

ServiceWorker


ServiceWorker is a new web API that can be used to intercept all the network requests within its scope or originated from its scope (with a few exceptions such as an external iframe source). A web page first delivers a ServiceWorker script and installs it in the browser, which is registered to watch for all requests from a scoped path under the same origin. Once installed, it persists for a long time and intercepts all subsequent requests withing its scope. An active ServiceWorker sits in the middle of the client and the server as a proxy (which is built-in to the browser). It can change both requests and responses as necessary. The primary use-case of the API is to provide better offline experience in web apps by serving pages from a client-side cache when there is no network or populating/synchronizing the cache. However, we found it useful to solve an archival replay problem.

Reconstructive


We created Reconstructive, a ServiceWorker module for archival replay that sits on the client-side and intercepts every potential archival request to properly reroute it. This approach requires no rewrites from the server side. It is being used successfully in our IPFS-based archival replay system called InterPlanetary Wayback (IPWB). The main objective of this module is to help reconstruct (hence the name) a composite memento (from one or more archives) while preventing from any live-leaks (also known as zombie resources) or wrong URL resolutions.



The following figure illustrates an example where an external image reference in an archived web page would have leaked to the live-web, but due to the presence of Reconstructive, it was successfully rerouted to the corresponding archived copy instead.


In order to reroute requests to the URI of a potential archived copy (also known as Memento URI or URI-M) Reconstructive needs the request URL and the referrer URL, of which the latter must be a URI-M. It extracts the datetime and the original URI (or URI-R) of the referrer then combines them with the request URL as necessary to construct a potential URI-M for the request to be rerouted to. If the request URL is already a URI-M, it simply adds a custom request header X-ServiceWorker and fetches the response from the server. When necessary, the response is rewritten on the client-side to fix some quirks to make sure that the replay works as expected or to optionally add an archival banner. The following flowchart diagram shows what happens in every request/response cycle of a fetch event in Reconstructive.


We have also released an Archival Capture Replay Test Suite (ACRTS) to test the rerouting functionality in different scenarios. It is similar to our earlier Archival Acid Test, but more focused on URI references and network activities. The test suite comes with a pre-captured WARC file of a live test page. captured resources are all green while the live site has everything red. The WARC file can be replayed using any archival replay system to test how well the system is replaying archived resources. In the test suite a green box means properly rerouting, red box means a live-leakage, while white/gray means incorrectly resolving the reference.


Module Usage


The module is intended to be used by archival replay systems backed by Memento endpoints. It can be a web archive such as IPWB or a Memento aggregator such as MemGator. In order use the module, write a ServiceWorker script (say, serviceworker.js) with your own logic to register and update it. In that script, import reconstructive.js script (locally or externally) which will make the Reconstructive module available with all of its public members/functions. Then bind the fetch event listener to the publicly exposed Reconstructive.reroute function.

importScripts('https://oduwsdl.github.io/Reconstructive/reconstructive.js');
const rc = new Reconstructive();
self.addEventListener('fetch', rc.reroute);

This will start rerouting every request according to a default URI-M pattern while excluding some requests that match a default set of exclusion rules. However, URI-M pattern, exclusion rules, and many other configuration options can be customized. It even allows customization of the default response rewriting function and archival banner. The module can also be configured to only reroute a subset of the requests while letting the parent ServiceWorker script deal with the rest. For more details read the user documentation, example usage (registration process and sample ServiceWorker), or heavily documented module code.

Archival Banner


Reconstructive module has implemented a custom element named <reconstructive-banner> to provide an archival banner functionality. The banner element utilizes Shadow DOM to prevent any styles from the banner to leak into the page or the other way. Banner inclusion can be enabled by setting the showBanner configuration option to true when initializing Reconstructive module after which it will be added to every navigational page. Unlike many other archival banners in use, it does not use an iframe or stick to the top of the page. It floats at the bottom of the page, but goes out of the way when not needed. The banner element is currently in its early stage with very limited information and interactivity, but it is intended to be evolved in to a more functional component.

<script src="https://oduwsdl.github.io/Reconstructive/reconstructive-banner.js"></script>
<reconstructive-banner urir="http://example.com/" datetime="20180106175435"></reconstructive-banner>


Limitations


It is worth noting that we rely on some fairly new web APIs that might not have a very good and consistent support across all browsers and may potentially change in future. At the time of writing this post ServiceWorker support is available in about 74% active browsers globally. To help the server identify whether a request is coming from Reconstructive (to provide fallback of server-side rewriting), we add a custom request header X-ServiceWorker.

As per current specifications, there can be only one service worker active on a given scope. This means, if an archived page has its own ServiceWorker, it cannot work along with Reconstructive. However, in usual web apps ServiceWorkers are generally used for better user experience and gracefully degrade to remain functional (this is not guaranteed though). The best we can do in this case is to rewrite every ServiceWorker registration code (on client-side) in any archived page before serving the response to disable it so that Reconstructive continues to work.

Conclusions


We conceptualized an idea, experimented with it, published a peer-reviewed paper on it, implemented it in a more production-ready fashion, used it in a novel archival replay system, and made the code publicly available under the MIT License. We also released a test suite ACRTS that can be useful by itself. This work is supported in part by NSF grant III 1526700.

Resources



Update [Jan 22, 2018]: Updated usage example after converting the module to an ES6 Class and updated links after changing the repo name from "reconstructive" to "Reconstructive".

--
Sawood Alam

Comments