Monday, January 8, 2018

2018-01-08: Introducing Reconstructive - An Archival Replay ServiceWorker Module


Web pages are generally composed of many resources such as images, style sheets, JavaScript, fonts, iframe widgets, and other embedded media. These embedded resources can be referenced in many ways (such as by relative path, absolute path, or full URL). When the same page is archived and replayed from a different domain under a different base path, these references may not resolve as intended and may result in a damaged memento. For example, a memento (an archived copy) of the web page https://www.odu.edu/ can be seen at https://web.archive.org/web/20180107155037/https://www.odu.edu/. Note that the domain name has changed from www.odu.edu to web.archive.org and some extra path segments have been added. In order for this page to render properly, various resource references in it are rewritten; for example, images/logo-university.png in a CSS file is replaced with /web/20171225230642im_/http://www.odu.edu/etc/designs/odu/images/logo-university.png.

Traditionally, web archival replay systems rewrite link and resource references in HTML/CSS/JavaScript responses so that they resolve to their corresponding archival versions. Failure to do so results in a broken rendering of archived pages (composite mementos), as the embedded resource references might resolve to their live versions or to an invalid location. With the growing use of JavaScript in web applications, resources are often injected dynamically, and rewriting such references is not possible from the server side. To mitigate this issue, some replay systems inject JavaScript into the page that overrides the global namespace to modify the DOM and monitor all network activity. At JCDL '17 and WADL '17 we proposed a ServiceWorker-based solution to this issue that requires no server-side rewriting, yet catches every network request, even those initiated by dynamic resource injection. Read our paper for more details.
Sawood Alam, Mat Kelly, Michele C. Weigle and Michael L. Nelson, "Client-side Reconstruction of Composite Mementos Using ServiceWorker," In JCDL '17: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries. June 2017, pp. 237-240.


URL Rewriting


There are primarily three ways to reference a resource from another resource: relative path, absolute path, and absolute URL. All three have their own challenges when served from an archive (or from a different origin and/or path than the original). In archival replay, both the origin and the base path change from the original, while the original origin and path usually become part of the new path. Relative paths are often the easiest to replay as they are not tied to the origin or the root path, but they cannot be used for external resources. Absolute paths and absolute URLs, on the other hand, either resolve incorrectly or leak to the live web when a primary resource is served from an archive; neither of these outcomes is desired in archival replay. There is a fourth way of referencing a resource, called schemeless (or protocol-relative), that starts with two forward slashes followed by a domain name and path. However, web archives usually ignore the scheme part of the URI when canonicalizing URLs, so we can focus on the three main ways. The following table illustrates examples of each with their resolution issues.


Reference type | Example                                    | Resolution after relocation
Relative path  | images/logo.png                            | Potentially correct
Absolute path  | /public/images/logo.png                    | Potentially incorrect
Absolute URL   | http://example.com/public/images/logo.png  | Potentially live leakage

Archival replay systems (such as OpenWayback and PyWB) rewrite responses before serving them to the client so that various resource references point to their corresponding archival pages. Suppose a page, originally located at http://example.com/public/index.html, has an image in it that is referenced as <img src="/public/images/logo.png">. When the same page is served from an archive at http://archive.example.org/<datetime>/http://example.com/public/index.html, the image reference needs to be rewritten as <img src="/<datetime>/http://example.com/public/images/logo.png"> in order for it to work as desired. However, URLs constructed dynamically by JavaScript on the client side are difficult to rewrite through static analysis of the code at the server end. With the rising use of JavaScript in web pages, it is becoming more challenging for archival replay systems to correctly replay archived web pages.
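
As a rough illustration (not the actual OpenWayback or PyWB rewriter), the server-side rewriting step described above can be sketched as follows, assuming a Wayback-style /<datetime>/<urir> layout; the datetime and the original page URL are assumed to come from the URI-M of the page being rewritten.

// A minimal sketch of server-side reference rewriting; real rewriters handle many more cases.
function rewriteReference(src, datetime, originalPageUrl) {
  // e.g., src = "/public/images/logo.png", originalPageUrl = "http://example.com/public/index.html"
  const absoluteUrl = new URL(src, originalPageUrl).href; // resolve against the original page
  return `/${datetime}/${absoluteUrl}`;                   // => "/<datetime>/http://example.com/public/images/logo.png"
}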

ServiceWorker


ServiceWorker is a relatively new web API that can be used to intercept all the network requests within its scope or originating from its scope (with a few exceptions, such as an external iframe source). A web page first delivers a ServiceWorker script and installs it in the browser, where it is registered to watch all requests from a scoped path under the same origin. Once installed, it persists for a long time and intercepts all subsequent requests within its scope. An active ServiceWorker sits between the client and the server as a proxy (built into the browser). It can change both requests and responses as necessary. The primary use case of the API is to provide a better offline experience in web apps by serving pages from a client-side cache when there is no network and by populating/synchronizing that cache. However, we found it useful for solving an archival replay problem.
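
For readers unfamiliar with the API, the life cycle described above looks roughly like this (an illustrative sketch, not part of Reconstructive itself):

// In the page: register a worker script for a scope, if the browser supports it.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/serviceworker.js', { scope: '/' });
}

// In /serviceworker.js: every in-scope request passes through the fetch handler,
// which can respond from the network, a cache, or a synthesized response.
self.addEventListener('fetch', event => {
  event.respondWith(fetch(event.request));
});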

Reconstructive


We created Reconstructive, a ServiceWorker module for archival replay that sits on the client side and intercepts every potential archival request to properly reroute it. This approach requires no rewriting from the server side. It is being used successfully in our IPFS-based archival replay system, InterPlanetary Wayback (IPWB). The main objective of this module is to help reconstruct (hence the name) a composite memento (from one or more archives) while preventing any live-leaks (also known as zombie resources) or incorrect URL resolutions.



The following figure illustrates an example where an external image reference in an archived web page would have leaked to the live web, but due to the presence of Reconstructive, it was successfully rerouted to the corresponding archived copy instead.


In order to reroute requests to the URI of a potential archived copy (also known as a Memento URI, or URI-M), Reconstructive needs the request URL and the referrer URL, of which the latter must be a URI-M. It extracts the datetime and the original URI (or URI-R) of the referrer, then combines them with the request URL as necessary to construct a potential URI-M to which the request is rerouted. If the request URL is already a URI-M, it simply adds a custom request header, X-ServiceWorker, and fetches the response from the server. When necessary, the response is rewritten on the client side to fix some quirks, to make sure that the replay works as expected, or to optionally add an archival banner. The following flowchart diagram shows what happens in every request/response cycle of a fetch event in Reconstructive.
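
The core of that derivation can be sketched roughly as follows, assuming a Wayback-style /<datetime>/<urir> pattern; this is an illustrative simplification, not the actual Reconstructive.reroute implementation, which handles many more cases and is configurable.

// Hypothetical, simplified URI-M derivation; the regex assumes a 14-digit datetime.
const URIM_PATTERN = /\/(\d{14})\/(https?:\/\/.+)/;

function deriveUrim(requestUrl, referrerUrl) {
  const request = new URL(requestUrl);
  const referrer = new URL(referrerUrl);
  if (URIM_PATTERN.test(request.pathname)) {
    return requestUrl; // already a URI-M; just add the X-ServiceWorker header and fetch
  }
  const match = referrer.pathname.match(URIM_PATTERN);
  if (!match) {
    return requestUrl; // referrer is not a URI-M; leave the request alone
  }
  const [, datetime, urir] = match;
  // Resolve the requested path against the referrer's original URI (URI-R),
  // then wrap the result back into the archive's URI-M pattern.
  const resolved = new URL(request.pathname + request.search, urir).href;
  return `${referrer.origin}/${datetime}/${resolved}`;
}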


We have also released an Archival Capture Replay Test Suite (ACRTS) to test the rerouting functionality in different scenarios. It is similar to our earlier Archival Acid Test, but more focused on URI references and network activity. The test suite comes with a pre-captured WARC file of a live test page: the captured resources are all green, while the live site has everything red. The WARC file can be replayed using any archival replay system to test how well the system replays archived resources. In the test suite, a green box means the reference was properly rerouted, a red box means a live-leakage, and a white/gray box means the reference resolved incorrectly.


Module Usage


The module is intended to be used by archival replay systems backed by Memento endpoints. Such a system can be a web archive such as IPWB or a Memento aggregator such as MemGator. In order to use the module, write a ServiceWorker script (say, serviceworker.js) with your own logic to register and update it. In that script, import the reconstructive.js script (locally or externally), which will make the Reconstructive module available with all of its public members/functions. Then bind the fetch event listener to the publicly exposed reroute function.

importScripts('https://oduwsdl.github.io/Reconstructive/reconstructive.js');
const rc = new Reconstructive();
self.addEventListener('fetch', rc.reroute);

This will start rerouting every request according to a default URI-M pattern while excluding requests that match a default set of exclusion rules. However, the URI-M pattern, exclusion rules, and many other configuration options can be customized. The module even allows customization of the default response rewriting function and the archival banner. It can also be configured to reroute only a subset of the requests while letting the parent ServiceWorker script deal with the rest. For more details, read the user documentation, the example usage (registration process and sample ServiceWorker), or the heavily documented module code.
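
As a rough sketch of such customization (the option names shown here, urimPattern, showBanner, and debug, follow the general shape of the user documentation and should be verified against the current API):

// Hedged example configuration; verify option names against the documentation.
const rc = new Reconstructive({
  urimPattern: `${self.location.origin}/memento/<datetime>/<urir>`, // assumed placeholder syntax
  showBanner: true,  // inject the <reconstructive-banner> element on navigational pages
  debug: false
});

// Reroute only same-origin requests; let the parent ServiceWorker handle the rest.
self.addEventListener('fetch', event => {
  if (event.request.url.startsWith(self.location.origin)) {
    rc.reroute(event);
  }
});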

Archival Banner


The Reconstructive module implements a custom element named <reconstructive-banner> to provide archival banner functionality. The banner element utilizes Shadow DOM to prevent any styles from the banner leaking into the page or vice versa. Banner inclusion can be enabled by setting the showBanner configuration option to true when initializing the Reconstructive module, after which it will be added to every navigational page. Unlike many other archival banners in use, it does not use an iframe or stick to the top of the page. It floats at the bottom of the page, but moves out of the way when not needed. The banner element is currently in an early stage with very limited information and interactivity, but it is intended to evolve into a more functional component.

<script src="https://oduwsdl.github.io/Reconstructive/reconstructive-banner.js"></script>
<reconstructive-banner urir="http://example.com/" datetime="20180106175435"></reconstructive-banner>


Limitations


It is worth noting that we rely on some fairly new web APIs that do not yet have consistent support across all browsers and may potentially change in the future. At the time of writing this post, ServiceWorker support is available in about 74% of actively used browsers globally. To help the server identify whether a request is coming from Reconstructive (so that it can fall back to server-side rewriting), we add a custom request header, X-ServiceWorker.
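
On the server side, such a fallback could be wired up along these lines (an illustrative sketch in plain Node.js; loadMemento and rewriteServerSide are hypothetical stand-ins, not part of Reconstructive or IPWB):

const http = require('http');

// Hypothetical stand-ins for an archive's memento lookup and server-side rewriter.
const loadMemento = url => `<html><!-- memento for ${url} --></html>`;
const rewriteServerSide = body => body; // a real archive would rewrite references here

http.createServer((req, res) => {
  // Node lowercases incoming header names, so check for x-serviceworker.
  const viaReconstructive = 'x-serviceworker' in req.headers;
  const body = loadMemento(req.url);
  // Skip server-side rewriting when Reconstructive will reroute on the client.
  res.end(viaReconstructive ? body : rewriteServerSide(body));
}).listen(8080);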

As per the current specification, there can be only one ServiceWorker active on a given scope. This means that if an archived page has its own ServiceWorker, it cannot work along with Reconstructive. However, in typical web apps ServiceWorkers are generally used to improve the user experience, and such pages usually degrade gracefully and remain functional without one (though this is not guaranteed). The best we can do in this case is to rewrite any ServiceWorker registration code (on the client side) in archived pages before serving the response, disabling the page's own worker so that Reconstructive continues to work.
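
Such a rewrite could be as crude as neutralizing registration calls in the archived response body before it reaches the browser (a hedged sketch, not the mitigation used by any particular replay system):

// Replace calls like navigator.serviceWorker.register('sw.js', {...}) with a no-op
// that returns a never-settling promise, so chained .then() callbacks never fire
// and the archived page's own ServiceWorker is never installed.
function disableNestedServiceWorker(body) {
  return body.replace(
    /navigator\.serviceWorker\.register/g,
    '((...args) => new Promise(() => {}))'
  );
}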

Conclusions


We conceptualized an idea, experimented with it, published a peer-reviewed paper on it, implemented it in a more production-ready fashion, used it in a novel archival replay system, and made the code publicly available under the MIT License. We also released a test suite, ACRTS, that can be useful by itself. This work is supported in part by NSF grant III 1526700.

Resources



Update [Jan 22, 2018]: Updated usage example after converting the module to an ES6 Class and updated links after changing the repo name from "reconstructive" to "Reconstructive".

--
Sawood Alam

Sunday, January 7, 2018

2018-01-07: Review of WS-DL's 2017

The Web Science and Digital Libraries Research Group had a steady 2017, with one MS student graduated, one research grant awarded ($75k), 10 publications, and 15 trips to conferences, workshops, hackathons, internships, etc.  In the last four years (2016--2013) we have graduated five PhD and three MS students, so the focus for this year was "recruiting" and we did pick up seven new students: three PhD and four MS.  We had so many new and prospective students that Dr. Weigle and I created a new CS 891 web archiving seminar to indoctrinate introduce them to web archiving and graduate school basics.

We had 10 publications in 2017:
  • Mohamed Aturban published a tech report about the difficulties in simply computing fixity information about archived web pages (spoiler alert: it's a lot harder than you might think; blog post).  
  • Corren McCoy published a tech report about ranking universities by their "engagement" with Twitter.  
  • Yasmin AlNoamany, now a post-doc at UC Berkeley,  published two papers based on her dissertation about storytelling: a tech report about the different kinds of stories that are possible for summarizing archival collections, and a paper at Web Science 2017 about how our automatically created stories are indistinguishable from those created by experts.
  • Lulwah Alkwai published an extended version of her JCDL 2015 best student paper in ACM TOIS about the archival rate of web pages in Arabic, English, Danish, and Korean languages (spoiler alert: English (72%), Arabic (53%), Danish (35%), and Korean (32%)).
  • The rest of our publications came from JCDL 2017:
    •  Alexander published a paper about his 2016 summer internship at Harvard and the Local Memory Project, which allows for archival collection building based on material from local news outlets. 
    • Justin Brunelle, now a lead researcher at Mitre, published the last paper derived from his dissertation.  Spoiler alert: if you use headless crawling to activate all the javascript, embedded media, iframes, etc., be prepared for your crawl time to slow and your storage to balloon.
    • John Berlin had a poster about the WAIL project, which allows easily running Heritrix and the Wayback Machine on your laptop (those who have tried know how hard this was before WAIL!).
    • Sawood Alam had a proof-of-concept short paper about "ServiceWorker", a new web API that allows for rewriting URIs in web pages and could have significant impact on how we transform web pages in archives.  I had to unexpectedly present this paper since, thanks to a flight cancellation the day before, John and Sawood were in a taxi headed to the venue during the scheduled presentation time!
    • Mat Kelly had both a poster (and separate, lengthy tech report) about how difficult it is to simply count how many archived versions of a web page an archive has (spoiler alert: it has to do with deduping, scheme transition of http-->https, status code conflation, etc.).  This won best poster at JCDL 2017!
We were fortunate to be able to travel to about 15 different workshops, conferences, hackathons:
WS-DL did not host any external visitors this year, but we were active with the colloquium series in the department and the broader university community:
In the popular press, we had two main coverage areas:
  • RJI ran three separate articles about Shawn, John, and Mat participating in the 2016 "Dodging the Memory Hole" meeting. 
  • On a less auspicious note, it turns out that Sawood and I had inadvertently uncovered the Optionsbleed bug three years ago, but failed to recognize it as an attack. This fact was covered in several articles, sometimes with the spin of us withholding or otherwise being cavalier with the information.
We've continued to update existing and release new software and datasets via our GitHub account. Given the evolving nature of software and data, sometimes it can be difficult to pin down a specific release date, but this year our significant releases and updates include:
For funding, we were fortunate to continue our string of eight consecutive years with new funding.  The NEH and IMLS awarded us a $75k, 18-month grant, "Visualizing Webpage Changes Over Time", for which Dr. Weigle is the PI and I'm the Co-PI.  This is an area we've recognized as important for some time and we're excited to have a separate project dedicated to visualizing archived web pages.

Another point you can probably infer from the discussion above but I decided to make explicit is that we're especially happy to be able to continue to work with so many of our alumni.  The nature of certain jobs inevitably takes some people outside of the WS-DL orbit, but as you can see above in 2017 we were fortunate to continue to work closely with Martin (2011) now at LANL, Yasmin (2016) now at Berkeley, and Justin (2016) now at Mitre.  

WS-DL annual reviews are also available for 2016, 2015, 2014, and 2013.  Finally, I'd like to thank all those who at various conferences and meetings have complimented our blog, students, and WS-DL in general.  We really appreciate the feedback, some of which we include below.

--Michael

Saturday, January 6, 2018

2018-01-06: Two WSDL Classes Offered for Spring 2018


Two Web Science & Digital Library (WS-DL) courses will be offered in Spring 2018:
Although they are not WS-DL courses per se, WS-DL member Corren McCoy is also teaching CS 462 Cybersecurity Fundamentals again this semester, and WS-DL alumnus Dr. Charles Cartledge is teaching two classes: CS 395 "Data Wrangling" and CS 395 "Data Analysis".

--Michael

Tuesday, January 2, 2018

2018-01-02: Link to Web Archives, not Search Engine Caches

Fig.1 Link TheFoundingSon Web Cache
Fig.2 TheFoundingSon Archived Post
In a recent article in Wired, "Yup, the Russian propagandists were blogging lies on Medium too," Matt Burgess makes reference to three now-suspended Twitter accounts: @TheFoundingSon (archived), @WadeHarriot (archived), and @jenn_abrams (archived), and their activity on the blogging service Medium.

Fig.3 TheFoundingSon Suspended Medium Account
Burgess reports that these accounts were suspended on Twitter and Medium, and quotes a Medium spokesperson as saying:
 With regards to the recent reporting around Russian accounts specifically, we’re paying close attention and working to ensure that our trust and safety processes continue to evolve and identify any accounts that violate our rules.
Unfortunately, to provide evidence of the pages' former content, Burgess links to Google caches instead of web archives.  At the time of this writing, two of the three links to @TheFoundingSon's blog posts included in the Wired article produced a 404 response from Google (the search engine hosting the cached pages) when clicked (see Fig.1).

Only one link (Fig. 4), related to science and politics, was still available a few days after the article was written.

Fig.4 TheFoundingSon Medium Post Related to Science and Politics
Why is only one out of three web cache links still available? Search Engine (SE) caches are useful for covering transient errors on the live web, but they are not archives and thus are not suitable for long-term access. In previous work our group has studied SE caches ("Characterization of Search Engine Caches") and the rate at which SE caches are purged ("Observed Web Robot Behavior on Decaying Web Subsites"). SE caches used to play a larger role in providing access to the past web (e.g., "How much of the web is archived?"), but improvements in the Internet Archive (e.g., it no longer has a quarantine period and now has a "Save Page Now" function) and restrictions on SE APIs (e.g., the monetization of the Yahoo BOSS API) have greatly reduced the role of SE caches in providing access to the past web.
To answer our original question of why two of the three links were no longer useful: Burgess used SE caches as evidence of web pages that were removed from Medium's servers, and research has shown that search engines purge the index and cache of resources that are no longer available, so we can expect all the links in the Wired article pointing to SE caches to eventually decay.
If I were going to inquire about the type of blog @TheFoundingSon was writing, I could query https://medium.com/@TheFoundingSon in the IA's Wayback Machine at web.archive.org (Fig.5).
Fig.5 TheFoundingSon Web Archived Pages

Doing so provides a list of ten archived URIs:
  1. https://web.archive.org/web/20170223230217/https://medium.com/@TheFoundingSon/5-things-hillary-is-going-to-do-on-debate-night-81412f6878ab
  2. https://web.archive.org/web/20170223012115/https://medium.com/@TheFoundingSon/blindfolded-election-2016-bc269463dc7
  3. https://web.archive.org/web/20170626021233/https://medium.com/@TheFoundingSon/catholicism-is-evil-and-islam-is-religion-of-peace-74f2d7947162
  4. https://web.archive.org/web/20170222145442/https://medium.com/@TheFoundingSon/gun-control-absurdity-34cabd52f0e4
  5. https://web.archive.org/web/20170120073029/https://medium.com/@TheFoundingSon/hillarys-actions-vs-trump-s-words-what-is-louder-92798789eaf6
  6. https://web.archive.org/web/20170119183629/https://medium.com/@TheFoundingSon/lessons-huffpost-wants-us-to-learn-from-orlando-ac74f2a27922
  7. https://web.archive.org/web/20170222224659/https://medium.com/@TheFoundingSon/making-america-deplorable-37b9cea48b4b
  8. https://web.archive.org/web/20170223094335/https://medium.com/@TheFoundingSon/one-missed-wake-up-call-6cb87200cc2a
  9. https://web.archive.org/web/20170120021905/https://medium.com/@TheFoundingSon/see-something-say-nothing-b144aa5d4d39
  10. https://web.archive.org/web/20170807072351/https://medium.com/@TheFoundingSon/votes-that-count-7766810f0809
The archived web pages are in a time capsule, preserved for generations to come, in contrast to SE caches, which decay in a very short period of time. It is interesting to see that for @WadeHarriot, the account with the smallest number of Twitter followers before its suspension, Wired resorted to the IA for the posting about 'lies' from Hillary Clinton; the other link was a web search engine cache. Both web pages are available in the IA.

Another advantage of web archives over search engine caches is that web archives allow us to analyze changes of a web page through time.  For example, @TheFoundingSon on 2016-06-16 had 14,253 followers, and on 2017-09-01 it had 41,942 followers.



The data to plot @JennAbrams and @TheFoundingSon Twitter follower counts over time were obtained by utilizing a tool created by Orkun Krand while working at the ODU Web Science Digital Library Group (@WebSciDL). Our tool, which will be released in the near future (UPDATE [2018-03-15]: The tool was completed by Miranda Smith, and it is now available for download), makes use of the IA and Mementos. Ideally, we would like to capture as many copies (mementos) as possible of available resources, not only in the IA, but in all the web archives around the world. However, our Follower-Count-History tool only uses the IA, because some random Twitter pages most likely will not be found in other web archives, and since our tool uses HTML scraping to extract the data, other archives may store their web pages in a different format than the IA does.
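
For readers who want to try something similar, the mementos the IA holds for a page can be enumerated with the public Wayback CDX API, roughly as follows (an illustrative sketch, not the actual Follower-Count-History tool):

// List the Internet Archive's mementos for a Twitter profile via the CDX API.
const profile = 'https://twitter.com/TheFoundingSon';
const api = `https://web.archive.org/cdx/search/cdx?url=${encodeURIComponent(profile)}&output=json&fl=timestamp,original`;

fetch(api)
  .then(res => res.json())
  .then(rows => {
    // The first row is a header; the rest are [timestamp, original] pairs.
    rows.slice(1).forEach(([timestamp, original]) => {
      console.log(`https://web.archive.org/web/${timestamp}/${original}`);
    });
  });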


The IA allows us to analyze these Twitter accounts in greater detail. We could not graph the follower count over time for @WadeHarriot because only one memento was available in the web archives. However, multiple mementos were found for the other two accounts. The Followers-Count-Over-Time tool provided the data to plot the two graphs shown above. Looking at the graph of @TheFoundingSon, we notice that its Twitter followers doubled from around 15K to around 30K in only six months, and it continued an accelerated ascent, reaching over 40K followers before its suspension. A similar analysis can be made of the @jenn_abrams account. Before October of 2015, @jenn_abrams had around 30K followers; a year later, that number had almost doubled to around 55K, topping 70K followers before its suspension. We could question whether the followers of these accounts are real people, or whether the rate of accumulation of followers was normal for Twitter, but we will leave those questions for another post.

SE caches are an important part of the web infrastructure, but linking to them is a bad idea since they are expected to decay. Instead, we should link to web archives. They are more stable, and as shown in the Twitter-Followers-Count-Over-Time graphs, they allow time series analysis when we can find multiple mementos for the same URI.

- Plinio Vargas

HTTP responses for some links found in the Wired article.

UPDATE [2018-02-19]: CNN said "cache" when it meant "archive". A CNN article made use of the IA to show the website (no longer available on the live web) of a company accused of selling stolen identities of personnel living in the United States to Russian agents in order to facilitate banking transactions in the US. CNN used the term "cached version" to describe how they were able to make a link to the past. CNN meant to use the term "archive".

UPDATE [2018-03-15]: The tool was completed by Miranda Smith, and it is now available for download.