Thursday, March 15, 2018

2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertiser

Paywalls have become increasingly prevalent in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscriber-based sites, such as the Financial Times or the Wall Street Journal, because these sites provide only a snippet of an article and then confront the user with a "Subscribe Now" prompt to view the remaining content. The New York Times, like other news sites, also has subscriber-based content, but access is limited only after a user has exceeded a set number of stories. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls; they are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took all of these URI-Ms from the homepages of their respective news sites and examined how the Internet Archive captured them over a period of a month.

The image above shows requests sent to the Internet Archive's Memento API, with the initial request at 0 days and follow-up requests at 1, 7, and 30 days after the initial datetime, to see whether the URI-M retrieved resolved to something other than 404. The initial requests to these mementos all returned a 404 status code. Adding a day to the memento datetime and requesting a new copy from the Internet Archive resulted in some of the URI-Ms resolving with a 200 response code, showing that these articles had become available. Adding 7 days to the initial request datetime shows that by this time the Internet Archive had found copies for all but one URI-M. The same result holds when adding 30 days to the initial memento request datetime. The response code "0" indicates no response, caused by an infinite redirect loop. The chart is consistent with the idea that content is released for free once a period of time has passed.
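
To illustrate the probing process, here is a minimal Python sketch (not the exact code used in the study) that re-requests a memento at the initial datetime plus 0, 1, 7, and 30 days and records the final status code; the example URI-R and datetime are hypothetical:

import requests
from datetime import datetime, timedelta

WAYBACK = 'https://web.archive.org/web/{dt}/{urir}'  # 14-digit datetime + original URI

def probe(urir, dt14, offsets=(0, 1, 7, 30)):
    # Re-request a memento at dt14 plus each offset (in days) and record the final status code
    results = {}
    dt = datetime.strptime(dt14, '%Y%m%d%H%M%S')
    for days in offsets:
        stamp = (dt + timedelta(days=days)).strftime('%Y%m%d%H%M%S')
        try:
            r = requests.get(WAYBACK.format(dt=stamp, urir=urir), timeout=30)
            results[days] = r.status_code   # 200 when a copy exists, 404 otherwise
        except requests.TooManyRedirects:
            results[days] = 0               # the "no response" case caused by a redirect loop
    return results

print(probe('https://www.wsj.com/articles/example-article', '20161101120000'))  # hypothetical URI-R and datetime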

The New York Times articles end up redirecting to a different part of the New York Times website. Although each of these URIs resolves with a 404 status code, an earlier capture shows that it was a login page asking for signup or subscription:

Paywalls in Academia

Paywalls restrict not just news content but also academic content. When users follow a DOI assigned to a paper, they are often redirected to a splash page showing a short description of the paper but not the actual PDF document. One example is a DOI that currently points to the splash page for a published paper whose content is only available via purchase:

In order to actually access the content, a user is first redirected to the splash page and is then required to purchase the requested content.

If we search for this DOI in the Internet Archive, we find that it ultimately leads to a memento of the same paywall we found on the live web. This shows that both the DOI and the paywall are archived, but the PDF is not ("Buy (PDF) USD 39.95").

Organizations that are willing to pay for a subscription to an association that hosts academic papers will have access to the content. A popular example is the ACM Digital Library. When users visit pages on sites like SpringerLink, they may not get the blue "Download PDF" button but rather a grey button signifying that the download is disabled for a non-subscribed user.

Van de Sompel et al. investigated 1.6 million URI references from arXiv and PubMed Central and found that over 500,000 of the URIs were locating URIs, indicating the current document location. These URIs can expire over time, which defeats the purpose of using DOIs.

Searching for Similarity

When considering hard-paywall sites like the Financial Times (FT) and the Wall Street Journal (WSJ), it is intuitive that most of the paywall pages a non-subscribed user sees will be largely the same. We experimented with 10 of the top WSJ articles on 11/01/2016, each scraped from the homepage of WSJ. From these 10 articles we did pairwise comparisons by taking the SimHash of each article's HTML representation and then computing the Hamming distance between each unique pair of SimHash bit strings.
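
As a rough sketch of that comparison (assuming a simple word tokenizer and a 64-bit, MD5-based SimHash, which may differ from the implementation we actually used), the pairwise Hamming distances can be computed as follows, where articles is a hypothetical mapping from article URI to its raw HTML:

import hashlib
import re
from itertools import combinations

def simhash(text, bits=64):
    # Build a SimHash fingerprint by summing the hash bits of every token
    v = [0] * bits
    for token in re.findall(r'\w+', text.lower()):
        h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    # Number of differing bits between two SimHash fingerprints
    return bin(a ^ b).count('1')

def pairwise_distances(articles):
    fingerprints = {uri: simhash(html) for uri, html in articles.items()}
    return {(u1, u2): hamming(fingerprints[u1], fingerprints[u2])
            for u1, u2 in combinations(fingerprints, 2)}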

We found that pages with completely different representations stood out with higher Hamming distances of 40+ bits, while articles with the same styled representation had at most a 3-bit Hamming distance, regardless of whether the article was a snippet or a full-length article. This showed that SimHash was not well suited for discovering differences in content but rather differences in content representation, such as changes in CSS, HTML, or JavaScript. It did not help our observations that WSJ was including entire font-family data strings inside its HTML at the time. In reference to Maciej Ceglowski's post on "The Website Obesity Crisis," WSJ injecting a CSS font-family data string does not make for a healthy "web pyramid":

From here, I decided to explore using a binary image classifier on thumbnails of news sites, labeling each image as a "paywall_page" or a "content_page." To accomplish this I used TensorFlow and the readily applicable examples provided by the "Tensorflow for poets" tutorial. Utilizing the MobileNet model, I trained on 122 paywall images and 119 content-page images, mainly news homepages and articles. The images were collected using Google Images and manually classified as content or paywall pages.

I trained the model on the new images for 4,000 iterations, which produced an accuracy of 80-88%. As a result, I built a simple web application named paywall-classify, which can be found on GitHub, that uses Puppeteer to take screenshots of a given list of URIs (maximum 10) at a resolution of 1920x1080 and then uses TensorFlow to classify the resulting images. More instructions on how to use the application can be found in the repository README.
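
For reference, the classification step looks roughly like the TF 1.x-style sketch below; the graph and label file names, the input/output tensor names, and the normalization are assumptions that depend on how the retraining was run, so treat this as illustrative rather than the application's actual code:

import tensorflow as tf  # TF 1.x API, matching the "Tensorflow for poets" era

GRAPH_PATH = 'retrained_graph.pb'     # hypothetical output of the retraining step
LABELS_PATH = 'retrained_labels.txt'  # hypothetical: "paywall page" / "content page", one per line

def load_graph(path):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name='')
    return graph

def classify(image_path, graph, input_name='Placeholder:0', output_name='final_result:0', size=224):
    # Decode and resize the screenshot with a small throwaway TF graph
    with tf.Session() as sess:
        img = tf.image.decode_png(tf.read_file(image_path), channels=3)
        img = tf.image.resize_images(img, [size, size])
        img = tf.expand_dims(tf.cast(img, tf.float32) / 255.0, 0)
        image_data = sess.run(img)
    # Run the retrained MobileNet graph and pair each label with its score
    with tf.Session(graph=graph) as sess:
        scores = sess.run(graph.get_tensor_by_name(output_name),
                          {graph.get_tensor_by_name(input_name): image_data})[0]
    labels = [line.strip() for line in open(LABELS_PATH)]
    return sorted(zip(labels, scores), key=lambda p: -p[1])

print(classify('screenshot.png', load_graph(GRAPH_PATH)))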

There are many other techniques that could be considered for image classification of web pages, for example, slicing a full-page image of a news website into sections. However, this approach would more than likely be biased toward the content classification, as the "subscribe now" message seems to always be at the top of an article, meaning only 1 of n slices would contain it. For this application I also did not consider the possibility of scrolling down a page to trigger a JavaScript popup of a paywall message.

Other approaches might utilize textual analysis, such as performing Naive Bayes classification on terms collected from a paywall page and then building a classifier from there. 

What to take away

It is actually difficult to find a cause as to why some of the URI-Ms listed result in 404 responses while other articles from those sites return a 200 response for their first memento. The New York Times has a limit of 10 "free" articles per user, so perhaps at crawl time the Internet Archive hit its quota. As Mat Kelly et al. discuss in Impact of URI Canonicalization on Memento Count, there are "archived 302s", indicating that at crawl time a live web site returned an HTTP 302 redirect, meaning these New York Times articles may actually have been redirecting to a login page at crawl time.

-Grant Atkins (@grantcatkins)

Wednesday, March 14, 2018

2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Twitter's API gives us only a limited ability to collect historical data about a user's followers. The information about when one account started following another is unavailable, so the popularity of an account and how it grows cannot be tracked with the API alone. Another pitfall is that when an account is deleted, Twitter does not provide any data about the account after the deletion date; it is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to be archived, then a follower count for a specific date can be collected.

The previous method for determining followers over time was to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation date of a follower is only a lower bound for when it could have started following the account under observation, so the method is accurate only if new accounts immediately follow the account under observation. The order in which Twitter returns followers is subject to unannounced change, so it cannot be depended on to work long term. It also cannot show when an account starts losing followers, because the API only returns users still following the account. This tool instead gathers and plots the follower count from mementos, or archived web pages, collected from the Internet Archive to show growth rates, track deleted accounts, and help pinpoint when an account might have bought bots to increase its follower numbers.

I improved on a Python script, created by Orkun Krand, that collects the follower counts for a specific Twitter username from the mementos found in the Internet Archive. The code can be found on GitHub. Through the historical pages kept in the Internet Archive, the number of followers can be observed for the specific date of each collected memento. The script collects the follower count by matching the CSS selectors associated with the follower count in most of the major layouts Twitter has implemented. If a Twitter page is not popular enough to be archived, or is too new, then no data can be collected for that user.
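
Conceptually, the collection step works like the minimal sketch below; the real script supports more Twitter layouts and options, and the CSS selectors shown here are illustrative examples that change whenever Twitter redesigns its pages:

import re
import requests
from bs4 import BeautifulSoup

TIMEMAP = 'http://web.archive.org/web/timemap/link/https://twitter.com/{user}'

# Illustrative selectors only; the real script handles several Twitter layouts
FOLLOWER_SELECTORS = [
    'a[data-nav="followers"] span[data-count]',
    'a[data-nav="followers"] strong',
]

def memento_uris(user):
    # Parse the link-format timemap from the Internet Archive into memento URI-Ms
    timemap = requests.get(TIMEMAP.format(user=user), timeout=30).text
    return re.findall(r'<(https?://web\.archive\.org/web/\d{14}/[^>]+)>;\s*rel="[^"]*memento', timemap)

def follower_count(urim):
    # Dereference one memento and try each selector until a follower count is found
    soup = BeautifulSoup(requests.get(urim, timeout=30).text, 'html.parser')
    for selector in FOLLOWER_SELECTORS:
        element = soup.select_one(selector)
        if element:
            return element.get('data-count') or element.get_text(strip=True)
    return None

for urim in memento_uris('USAGym')[:5]:
    print(urim, follower_count(urim))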

This code is especially useful for investigating users that have been deleted from Twitter. The Russian troll account @Ten_GOP, impersonating the Tennessee GOP, was deleted once discovered. However, with the Internet Archive we can still study its growth rate while it was active and being archived.
In February 2018, there was an outcry as conservatives lost, mostly temporarily, thousands of followers when Twitter suspended suspected bot accounts. This script enables investigating which users lost followers, and for how long. It is important to note that the default behavior of collecting one memento per month is not expected to have the granularity to capture behaviors that typically happen over a small time frame. To correct that, the flag [-e] to collect all mementos for an account should be used. The Republican political commentator @mitchellvii lost followers in two recorded incidents. In January 2017, from the 1st to the 4th, @mitchellvii lost 1,270 followers. In April 2017, from the 15th to the 17th, @mitchellvii lost 1,602 followers. Using only the Twitter API to collect follower growth would not reveal this phenomenon.


Dependencies:
  • Python 3
  • R* (to create graph)
  • bs4
  • urllib
  • archivenow* (push to archive)
  • datetime* (push to archive)

How to run the script:

$ git clone
$ cd FollowerCountHistory
$ ./ [-h] [-g] [-e] [-p | -P] <twitter-username-without-@> 


The program will create a folder named <twitter-username-without-@>. This folder will contain two .csv files. One, labeled <twitter-username-without-@>.csv, will contain the dates collected, the number of followers for each date, and the URL for that memento. The other, labeled <twitter-username-without-@>-Error.csv, will contain all the dates of mementos where the follower count was not collected, along with the reason why. All files and folders are named after the Twitter username provided, after it has been sanitized for system safety.

If the flag [-g] is used, then the script will create an image <twitter-username-without-@>-line.png of the data plotted on a line chart created by the follower_count_linechart.R script. An example of that graph is shown as the heading image for the user @USAGym, the official USA Olympic gymnastics team. The popularity of the page changes with the cycle of the Summer Olympics, evidenced by most of the follower growth occurring in 2012 and 2016.

Example Output:

./ -g -p USAGym
242 archive points found
Not Pushing to Archive. Last Memento Within Current Month.
null device 

cd usagym/; ls
usagym.csv  usagym-Error.csv  usagym-line.png

How it works:

$ ./ --help

usage: [-h] [-g] [-p | -P] [-e] uname

Follower Count History. Given a Twitter username, collect follower counts from
the Internet Archive.

positional arguments:

uname       Twitter username without @

optional arguments:

-h, --help  show this help message and exit
-g          Generate a graph with data points
-p          Push to Internet Archive
-P          Push to all archives available through ArchiveNow
-e          Collect every memento, not just one per month

First, the timemap, the list of all mementos for that URI, is collected for the user's Twitter page. Then, the script collects the dates from the timemap for each memento. Finally, it dereferences each memento and extracts the follower count if all of the following apply:
    1. A previously created .csv of the name the script would generate does not contain the date.
    2. The memento is not in the same month as a previously collected memento, unless [-e] is used.
    3. The page format can be interpreted to find the follower count.
    4. The follower count can be converted to Arabic numerals (a sketch of this normalization follows below).
A .csv is created, or appended to, to contain the date, number of followers, and memento URI for each collected data point.
An error .csv is created, or appended to, with the date, number of followers, and memento URI for each data point that was not collected. This file will contain repeats if the script is run repeatedly, because old entries are not deleted when new errors are written.
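
Regarding step 4 above, the follower count in a memento may appear as a grouped, abbreviated, or non-Latin-digit string; the following is a small sketch of the kind of normalization involved (not the script's actual code):

import unicodedata

def to_int(count_text):
    # Normalize strings like '12,345', '1.2M', or localized digits into an integer
    text = count_text.strip().replace(',', '').replace(' ', '')
    # Convert any non-ASCII digits (e.g., Arabic-Indic numerals) to their decimal values
    text = ''.join(str(unicodedata.digit(ch)) if ch.isdigit() and not ch.isascii() else ch
                   for ch in text)
    multiplier = 1
    if text[-1:].upper() in ('K', 'M'):
        multiplier = 1000 if text[-1:].upper() == 'K' else 1000000
        text = text[:-1]
    try:
        return int(float(text) * multiplier)
    except ValueError:
        return None

print(to_int('1.2M'), to_int('12,345'))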

If the [-g] flag is used, a .png of the line chart, <twitter-username-without-@>-line.png, will be created.
If the [-p] flag is used, the URI will be pushed to the Internet Archive to create a new memento if there is no current memento.
If the [-P] flag is used, the URI will be pushed to all archives available through archivenow to create new mementos if there is no current memento in the Internet Archive.
If the [-e] flag is used, every memento will be collected instead of collecting just one per month.

As a note for future use, if the Twitter layout undergoes another change, the code will need to be updated to continue successfully collecting data.

Special thanks to Orkun Krand, whose work I am continuing.
--Miranda Smith (@mir_smi)

Monday, March 12, 2018

2018-03-12: NEH ODH Project Directors' Meeting

Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC.  We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC.

The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report).

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs.

After the keynote, each project director was allowed 3 slides and 3 minutes to present an overview of their newly funded work.  There were 45 projects highlighted and short descriptions of each are available through the award announcements (awarded in August 2017, awarded in December 2017).  Remember, video is coming soon for all of the 3-minute lightning talks.

Here are my 3 slides, previewing our grid, animation/slider, and timeline views for visualizing significant webpage changes over time.

Visualizing Webpage Changes Over Time from Michele Weigle

Following the lightning talks, the ODH at Ten celebration began with a keynote by Honorable John Unsworth, NEH National Council Member and University Librarian and Dean of Libraries at the University of Virginia.

I was honored to be invited to participate in the closing panel highlighting the impact that ODH support had on our individual careers and looking ahead to future research directions in digital humanities. 
Panel: Amanda French (George Washington), Jesse Casana (Dartmouth College), Greg Crane (Tufts), Julia Flanders (Northeastern), Dan Cohen (Northeastern),  Michele Weigle (Old Dominion), Matt Kirschenbaum (University of Maryland)

Thanks to the ODH staff, especially ODH Director Brett Bobley and our current Program Officer Jen Serventi, for organizing a great meeting.  It was also great to be able to catch up with our first ODH Program Officer, Perry Collins. We are so appreciative of the support for our research from NEH ODH.

Here are more tweets from our day at ODH:


Sunday, March 4, 2018

2018-03-04: Installing Stanford CoreNLP in a Docker Container

Fig. 1: Example of Text Labeled with the CoreNLP Part-of-Speech, Named-Entity Recognizer and Dependency Annotators.
The Stanford CoreNLP suite provides a wide range of important natural language processing applications such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working due to different environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.
How to run the CoreNLP server on localhost port 9000 from a Docker container
  1. Install Docker if not already available
  2. Pull the image from the repository and run the container.
Using the server
The server can be used from the browser, from the command line, or from custom scripts:
  1. Browser: To use the CoreNLP server from the browser, open your browser and visit http://localhost:9000/. This presents the user interface (Fig. 1) of the CoreNLP server.
  2. Command line (NER example):
    Fig. 2: Sample request URL sent to the Named Entity Annotator 
    To use the CoreNLP server from the terminal, learn how to send requests to the particular annotator from the CoreNLP usage webpage, or learn from the request URL the browser (1.) sends to the server. For example, the request URL in Fig. 2 was sent to the server by the browser and corresponds to a command that uses the Named-Entity Recognition system to label the supplied text.
  3. Custom script (NER example): I created a Python function nlpGetEntities() that uses the NER annotator to label a user-supplied text; a sketch of such a request is shown below.
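For illustration, the following minimal Python sketch (along the lines of nlpGetEntities(), but not the original function) posts text to the server with the NER annotator enabled and collects the labeled tokens:

import json
import requests

CORENLP = 'http://localhost:9000/'

def nlp_get_entities(text):
    # Ask the CoreNLP server for NER annotations in JSON and keep the labeled tokens
    properties = {'annotators': 'ner', 'outputFormat': 'json'}
    response = requests.post(CORENLP,
                             params={'properties': json.dumps(properties)},
                             data=text.encode('utf-8'),
                             timeout=60)
    response.raise_for_status()
    doc = response.json()
    return [(token['word'], token['ner'])
            for sentence in doc['sentences']
            for token in sentence['tokens']
            if token.get('ner', 'O') != 'O']

print(nlp_get_entities('President Barack Obama visited Old Dominion University in Norfolk.'))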
To stop the server, stop the Docker container (e.g., with docker stop <container-name>).
The Dockerfile I created targets CoreNLP version 3.8.0 (2017-06-09). There is a newer version of the service (3.9.1). I believe it should be easy to adapt the Dockerfile to install the latest version by replacing all occurrences of "2017-06-09" with "2018-02-27" in the Dockerfile.  However, I have not tested this operation since version 3.9.1 is marginally different from version 3.8.0 for my use case, and I have not tested version 3.9.1 with my application benchmark. 


Tuesday, February 27, 2018

2018-02-27: Summary of Gathering Alumni Information from a Web Social Network

While researching my dissertation topic (slides 2--28) on social media profile discovery, I encountered a related paper titled Gathering Alumni Information from a Web Social Network written by Gabriel Resende Gonçalves, Anderson Almeida Ferreira, and Guilherme Tavares de Assis, which was published in the proceedings of the 9th IEEE Latin American Web Congress (LA-WEB). In this paper, the authors detailed their approach to define a semi-automated method to gather information regarding alumni of a given undergraduate program at Brazilian higher education institutions. Specifically, they use the Google Custom Search Engine (CSE) to identify candidate LinkedIn pages based on a comparative evaluation of similar pages in their training set. The authors contend alumni are efficiently found through their process, which is facilitated by focused crawling of data publicly available on social networks posted by the alumni themselves. The proposed methodology consists of three main modules and two data repositories, which are depicted in Figure 1. Using this functional architecture, the authors constructed a tool that gathers professional data on the alumni in undergraduate programs of interest, then proceeds to classify the associated HTML page to determine relevance. A summary of their methodology is presented here.

Functional architecture of the proposed method
Figure 1 - Functional architecture of the proposed method


The first repository, Pages Repository, stores the web pages from the initial set of data samples which are used to start the classification process. This set is comprised of alumni lists obtained from five universities across Brazil. The lists contain the names of students enrolled between 2000 and 2010 in undergraduate programs, namely Computer Science at three institutions, Metallurgical Engineering at one institution, and Chemistry at one institution. The total number of alumni available on all lists is 6,093. For the purpose of validation, a random set of 15 alumni are extracted from each list as training examples during each run of their classifier. The second repository, Final Database, is the database where academic data on each alumnus is stored for further analysis.


The first module, Searcher, determines the candidate pages from a Google result set that might belong to the alumni group. LinkedIn is the social network of choice from which the authors leverage public pages on the web which have been indexed by a search engine. The search is initiated using a combination of the first, middle and last names of a given alumnus, then, relevant data concerning the undergraduate program, program degree, and institution are extracted from the candidate pages. The authors chose not to search using LinkedIn's Application Programming Interface (API) due to its inherent limitations. Specifically, the API requires authentication by a registered LinkedIn user and searches are restricted to the first degree connections of the user conducting the search. As an alternative, the authors use the Google Custom Search Engine which provides access to Google's  massive repository of indexed pages, but is limited to 100 daily free searches returning 100 results per query.

We should note in the years since this paper was published in 2014, LinkedIn has instituted a number of security measures to impede data harvesting of public profiles. They employ a series of automated tools, FUSE, Quicksand, Sentinel, and Org Block, that are used to monitor suspicious activity and block web scraping. Requests are throttled based on the requester's IP address (see HIQ Labs V. LinkedIn Corporation).  Anonymous viewing of a large number of public LinkedIn profile pages, even if retrieved using Google's boolean search criteria, is not always possible. After an undisclosed number of  public profile views, LinkedIn forces the user to either sign up or log in as a way to thwart scraping by 3rd party applications (Figure 2).

LinkedIn Anonymous Search Limit Reached
Figure 2 - LinkedIn Anonymous Search Limit Reached
The second module, Filter, determines the significance of the candidate pages provided by the Searcher module via the Pages Repository. The classification process determines the similarity among pages using the academic information on the LinkedIn page as terms which are then separated into categories that describe the undergraduate program, institution, and degree. The authors proceed to use Cosine Similarity to build a relationship between candidate pages from the Searcher module and the initial training set based on term frequency and specify a 30% threshold for the minimum percentage of pages on which a term must appear.
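
To make that concrete, the following small Python sketch illustrates term-frequency vectors, a 30% document-frequency cutoff, and cosine similarity against the training pages; the similarity threshold and variable names are assumptions for illustration, not the authors' code:

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors (Counter objects)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def frequent_terms(training_docs, threshold=0.3):
    # Keep terms that appear on at least 30% of the training pages
    df = Counter(t for doc in training_docs for t in set(doc))
    return {t for t, n in df.items() if n / len(training_docs) >= threshold}

def is_similar(candidate_terms, training_docs, min_sim=0.5):
    # Compare a candidate page against the centroid of the training pages
    vocab = frequent_terms(training_docs)
    centroid = Counter(t for doc in training_docs for t in doc if t in vocab)
    candidate = Counter(t for t in candidate_terms if t in vocab)
    return cosine(candidate, centroid) >= min_sim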

The third module, Extraction, extracts the demographic and academic information from the HTML pages returned by the Filter module using regular expressions as shown in Figure 3. The extracted information is stored in the Final Database for further analysis using the Naive Bayes bag-of-words model to identify specific alumni of the desired undergraduate program.

Figure 3 - Regular Expressions Used by Extraction Module

Results and Takeaways

The authors acknowledge that obtaining an initial list of alumni names is not a major obstacle. However, collecting the initial set of sample pages from a social network, such as LinkedIn, may be time consuming and labor intensive even with small data sets. Their evaluation, as shown in Figure 4, indicates satisfactory precision and the methodology proposed in their paper is able to find an average of 7.5% to 12.2% of alumni for undergraduate programs with more than 1,000 alumni.

Pages Retrieved and Precision Results For Proposed Method and Baseline
Figure 4 - Pages Retrieved and Precision Results For Proposed Method and Baseline
Given the highly structured design of LinkedIn HTML pages, we would expect the Filter and Extraction modules to identify and successfully retrieve a higher percentage of alumni, even without applying a machine learning technique. The bulk of this paper's research is predicated upon access to public data on the web. If social media networks choose to present barriers that impede the collection of this public information, continued research by these authors and others will be significantly impacted. With regard to LinkedIn public profiles, we can only anticipate the imminent outcome of pending litigation which will determine who controls publicly available data.

--Corren McCoy (@correnmccoy)

Gonçalves, G. R., Ferreira, A. A., de Assis, G. T., & Tavares, A. I. (2014, October). Gathering alumni information from a web social network. In Web Congress (LA-WEB), 2014 9th Latin American (pp. 100-108). IEEE.

Monday, January 8, 2018

2018-01-08: Introducing Reconstructive - An Archival Replay ServiceWorker Module

Web pages are generally composed of many resources such as images, style sheets, JavaScript, fonts, iframe widgets, and other embedded media. These embedded resources can be referenced in many ways (such as a relative path, an absolute path, or a full URL). When the same page is archived and replayed from a different domain under a different base path, these references may not resolve as intended and hence may result in a damaged memento. For example, in a memento (an archived copy) of a web page, the domain name has changed from the original site's to the archive's, and some extra path segments are added to it. In order for such a page to render properly, various resource references in it are rewritten; for example, images/logo-university.png in a CSS file is replaced with /web/20171225230642im_/ followed by the image's original URL.

Traditionally, web archival replay systems rewrite link and resource references in HTML/CSS/JavaScript responses so that they resolve to their corresponding archival version. Failure to do so would result in a broken rendering of archived pages (composite mementos) as the embedded resource references might resolve to their live version or an invalid location. With the growing use of JavaScript in web applications, often resources are injected dynamically, hence rewriting such references is not possible from the server side. To mitigate this issue, some JavaScript is injected in the page that overrides the global namespace to modify the DOM and monitor all network activity. In JCDL17 and WADL17 we proposed a ServiceWorker-based solution to this issue that requires no server-side rewriting, but catches every network request, even those that were initiated due to dynamic resource injection. Read our paper for more details.
Sawood Alam, Mat Kelly, Michele C. Weigle and Michael L. Nelson, "Client-side Reconstruction of Composite Mementos Using ServiceWorker," In JCDL '17: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries. June 2017, pp. 237-240.

URL Rewriting

There are primarily three ways to reference a resource from another resource: relative path, absolute path, and absolute URL. All three have their own challenges when served from an archive (or from a different origin and/or path than the original). In the case of archival replay, both the origin and the base path are changed from the original, while the original origin and path usually become part of the new path. Relative paths are often the easiest to replay as they are not tied to the origin or the root path, but they cannot be used for external resources. Absolute paths and absolute URLs, on the other hand, are resolved incorrectly or leak to the live web when a primary resource is served from an archive; neither of these outcomes is desired in archival replay. There is a fourth way of referencing a resource, called schemeless (or protocol-relative), that starts with two forward slashes followed by a domain name and path. However, web archives usually ignore the scheme part of the URI when canonicalizing URLs, so we can focus on just the three main ways. The following table illustrates examples of each with their resolution issues.

Reference type Example Resolution after relocation
Relative path images/logo.png Potentially correct
Absolute path /public/images/logo.png Potentially incorrect
Absolute URL Potentially live leakage

Archival replay systems (such as OpenWayback and PyWB) rewrite responses before serving them to the client so that various resource references point to their corresponding archival copies. Suppose a page has an image in it that is referenced as <img src="/public/images/logo.png">. When the same page is served from an archive under a path of the form /<datetime>/<original-URL>, the image reference needs to be rewritten as <img src="/<datetime>/<original-image-URL>"> in order for it to work as desired. However, URLs constructed by JavaScript dynamically on the client side are difficult to rewrite just by static analysis of the code at the server end. With the rising usage of JavaScript in web pages, it is becoming more challenging for archival replay systems to correctly replay archived web pages.


ServiceWorker is a new web API that can be used to intercept all the network requests within its scope or originating from its scope (with a few exceptions such as an external iframe source). A web page first delivers a ServiceWorker script and installs it in the browser, where it is registered to watch for all requests from a scoped path under the same origin. Once installed, it persists for a long time and intercepts all subsequent requests within its scope. An active ServiceWorker sits between the client and the server as a proxy (which is built into the browser). It can change both requests and responses as necessary. The primary use case of the API is to provide a better offline experience in web apps by serving pages from a client-side cache when there is no network and by populating/synchronizing the cache. However, we found it useful for solving an archival replay problem.


We created Reconstructive, a ServiceWorker module for archival replay that sits on the client side and intercepts every potential archival request to properly reroute it. This approach requires no rewriting on the server side. It is being used successfully in our IPFS-based archival replay system, InterPlanetary Wayback (IPWB). The main objective of this module is to help reconstruct (hence the name) a composite memento (from one or more archives) while preventing any live-leaks (also known as zombie resources) or wrong URL resolutions.

The following figure illustrates an example where an external image reference in an archived web page would have leaked to the live-web, but due to the presence of Reconstructive, it was successfully rerouted to the corresponding archived copy instead.

In order to reroute requests to the URI of a potential archived copy (also known as a Memento URI or URI-M), Reconstructive needs the request URL and the referrer URL, of which the latter must be a URI-M. It extracts the datetime and the original URI (or URI-R) from the referrer, then combines them with the request URL as necessary to construct a potential URI-M for the request to be rerouted to. If the request URL is already a URI-M, it simply adds a custom request header, X-ServiceWorker, and fetches the response from the server. When necessary, the response is rewritten on the client side to fix some quirks to ensure that the replay works as expected or to optionally add an archival banner. The following flowchart diagram shows what happens in every request/response cycle of a fetch event in Reconstructive.
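
To make the rerouting logic concrete, here is a minimal Python sketch of the idea; the module itself is JavaScript, and the URI-M pattern below (an archive prefix followed by a 14-digit datetime and the URI-R) is an assumption for illustration:

import re
from urllib.parse import urljoin

# Hypothetical URI-M pattern; Reconstructive's actual pattern is configurable in the module
URIM_PATTERN = re.compile(r'^(?P<prefix>https?://[^/]+/memento)/(?P<datetime>\d{14})/(?P<urir>.+)$')

def reroute(request_url, referrer_url):
    # Construct a candidate URI-M for a request made from within an archived page
    if URIM_PATTERN.match(request_url):
        return request_url                 # already a URI-M, fetch as is
    ref = URIM_PATTERN.match(referrer_url)
    if not ref:
        return request_url                 # not in an archival context, leave it alone
    if request_url.startswith(('http://', 'https://')):
        urir = request_url                 # absolute URL: archive it at the referrer's datetime
    else:
        urir = urljoin(ref.group('urir'), request_url)  # relative or absolute path: resolve against the URI-R
    return '{}/{}/{}'.format(ref.group('prefix'), ref.group('datetime'), urir)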

We have also released an Archival Capture Replay Test Suite (ACRTS) to test the rerouting functionality in different scenarios. It is similar to our earlier Archival Acid Test, but more focused on URI references and network activities. The test suite comes with a pre-captured WARC file of a live test page: captured resources are all green, while the live site has everything red. The WARC file can be replayed using any archival replay system to test how well the system replays archived resources. In the test suite, a green box means proper rerouting, a red box means a live-leakage, and a white/gray box means the reference was incorrectly resolved.

Module Usage

The module is intended to be used by archival replay systems backed by Memento endpoints. It can be a web archive such as IPWB or a Memento aggregator such as MemGator. In order to use the module, write a ServiceWorker script (say, serviceworker.js) with your own logic to register and update it. In that script, import the reconstructive.js script (locally or externally), which will make the Reconstructive module available with all of its public members/functions. Then bind the fetch event listener to the publicly exposed reroute function.

const rc = new Reconstructive();
self.addEventListener('fetch', rc.reroute);

This will start rerouting every request according to a default URI-M pattern while excluding some requests that match a default set of exclusion rules. However, the URI-M pattern, the exclusion rules, and many other configuration options can be customized. It even allows customization of the default response rewriting function and the archival banner. The module can also be configured to reroute only a subset of the requests while letting the parent ServiceWorker script deal with the rest. For more details read the user documentation, the example usage (registration process and sample ServiceWorker), or the heavily documented module code.

Archival Banner

The Reconstructive module implements a custom element named <reconstructive-banner> to provide archival banner functionality. The banner element utilizes Shadow DOM to prevent any styles from the banner leaking into the page, or the other way around. Banner inclusion can be enabled by setting the showBanner configuration option to true when initializing the Reconstructive module, after which the banner will be added to every navigational page. Unlike many other archival banners in use, it does not use an iframe or stick to the top of the page. It floats at the bottom of the page, but goes out of the way when not needed. The banner element is currently in an early stage with very limited information and interactivity, but it is intended to evolve into a more functional component.

<script src=""></script>
<reconstructive-banner urir="" datetime="20180106175435"></reconstructive-banner>


It is worth noting that we rely on some fairly new web APIs that might not have good and consistent support across all browsers and may potentially change in the future. At the time of writing this post, ServiceWorker support is available in about 74% of active browsers globally. To help the server identify whether a request is coming from Reconstructive (and provide a fallback of server-side rewriting), we add a custom request header, X-ServiceWorker.

As per the current specification, there can be only one ServiceWorker active on a given scope. This means that if an archived page has its own ServiceWorker, it cannot work along with Reconstructive. However, in usual web apps ServiceWorkers are generally used to provide a better user experience, and such pages gracefully degrade to remain functional without one (though this is not guaranteed). The best we can do in this case is to rewrite every ServiceWorker registration code (on the client side) in any archived page before serving the response, disabling it so that Reconstructive continues to work.


We conceptualized an idea, experimented with it, published a peer-reviewed paper on it, implemented it in a more production-ready fashion, used it in a novel archival replay system, and made the code publicly available under the MIT License. We also released a test suite ACRTS that can be useful by itself. This work is supported in part by NSF grant III 1526700.


Update [Jan 22, 2018]: Updated usage example after converting the module to an ES6 Class and updated links after changing the repo name from "reconstructive" to "Reconstructive".

Sawood Alam

Sunday, January 7, 2018

2018-01-07: Review of WS-DL's 2017

The Web Science and Digital Libraries Research Group had a steady 2017, with one MS student graduated, one research grant awarded ($75k), 10 publications, and 15 trips to conferences, workshops, hackathons, internships, etc.  In the last four years (2013--2016) we have graduated five PhD and three MS students, so the focus for this year was "recruiting", and we did pick up seven new students: three PhD and four MS.  We had so many new and prospective students that Dr. Weigle and I created a new CS 891 web archiving seminar to introduce them to web archiving and graduate school basics.

We had 10 publications in 2017:
  • Mohamed Aturban published a tech report about the difficulties in simply computing fixity information about archived web pages (spoiler alert: it's a lot harder than you might think; blog post).  
  • Corren McCoy published a tech report about ranking universities by their "engagement" with Twitter.  
  • Yasmin AlNoamany, now a post-doc at UC Berkeley,  published two papers based on her dissertation about storytelling: a tech report about the different kinds of stories that are possible for summarizing archival collections, and a paper at Web Science 2017 about how our automatically created stories are indistinguishable from those created by experts.
  • Lulwah Alkwai published an extended version of her JCDL 2015 best student paper in ACM TOIS about the archival rate of web pages in Arabic, English, Danish, and Korean languages (spoiler alert: English (72%), Arabic (53%), Danish (35%), and Korean (32%)).
  • The rest of our publications came from JCDL 2017:
    •  Alexander published a paper about his 2016 summer internship at Harvard and the Local Memory Project, which allows for archival collection building based on material from local news outlets. 
    • Justin Brunelle, now a lead researcher at Mitre, published the last paper derived from his dissertation.  Spoiler alert: if you use headless crawling to activate all the javascript, embedded media, iframes, etc., be prepared for your crawl time to slow and your storage to balloon.
    • John Berlin had a poster about the WAIL project, which allows easily running Heritrix and the Wayback Machine on your laptop (those who have tried know how hard this was before WAIL!)
    • Sawood Alam had a proof-of-concept short paper about "ServiceWorker", a new javascript library that allows for rewriting URIs in web pages and could have significant impact on how we transform web pages in archives.  I had to unexpectedly present this paper since, thanks to a flight cancellation the day before, John and Sawood were in a taxi headed to the venue during the scheduled presentation time!
    • Mat Kelly had both a poster (and separate, lengthy tech report) about how difficult it is to simply count how many archived versions of a web page an archive has (spoiler alert: it has to do with deduping, scheme transition of http-->https, status code conflation, etc.).  This won best poster at JCDL 2017!
We were fortunate to be able to travel to about 15 different workshops, conferences, hackathons:

WS-DL did not host any external visitors this year, but we were active with the colloquium series in the department and the broader university community:
In the popular press, we had two main coverage areas:
  • RJI ran three separate articles about Shawn, John, and Mat participating in the 2016 "Dodging the Memory Hole" meeting. 
  • On a less auspicious note, it turns out that Sawood and I had inadvertently uncovered the Optionsbleed bug three years ago, but failed to recognize it as an attack. This fact was covered in several articles, sometimes with the spin of us withholding or otherwise being cavalier with the information.
We've continued to update existing and release new software and datasets via our GitHub account. Given the evolving nature of software and data, sometimes it can be difficult to pin down a specific release date, but this year our significant releases and updates include:
For funding, we were fortunate to continue our string of eight consecutive years with new funding.  The NEH and IMLS awarded us a $75k, 18-month grant, "Visualizing Webpage Changes Over Time", for which Dr. Weigle is the PI and I'm the Co-PI.  This is an area we've recognized as important for some time and we're excited to have a separate project dedicated to visualizing archived web pages.

Another point you can probably infer from the discussion above but I decided to make explicit is that we're especially happy to be able to continue to work with so many of our alumni.  The nature of certain jobs inevitably takes some people outside of the WS-DL orbit, but as you can see above in 2017 we were fortunate to continue to work closely with Martin (2011) now at LANL, Yasmin (2016) now at Berkeley, and Justin (2016) now at Mitre.  

WS-DL annual reviews are also available for 2016, 2015, 2014, and 2013.  Finally, I'd like to thank all those who at various conferences and meetings have complimented our blog, students, and WS-DL in general.  We really appreciate the feedback, some of which we include below.