Friday, March 24, 2017

2017-03-24: The Impact of URI Canonicalization on Memento Count

Mat reports that relying solely on a Memento TimeMap to evaluate how well a URI is archived is not sufficient.

We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report.


Memento represents a set of captures for a URI (e.g., http://google.com) with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible when dereferencing the URI-M (resolving the URI-M to an archived representation of a resource).

Variations of the original URI are canonicalized (coalescing, for instance, https://google.com and http://www.google.com:80/), and the original URI (the URI-R in Memento terminology) is also included in the TimeMap with a literal "original" relationship value.

<http://ws-dl.blogspot.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/>; rel="self"; type="application/link-format"; from="Wed, 29 Sep 2010 00:03:40 GMT"; until="Mon, 20 Mar 2017 19:09:10 GMT",
<http://web.archive.org/web/http://ws-dl.blogspot.com/>; rel="timegate",
<http://web.archive.org/web/20100929000340/http://ws-dl.blogspot.com/>; rel="first memento"; datetime="Wed, 29 Sep 2010 00:03:40 GMT",
<http://web.archive.org/web/20110202180231/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 02 Feb 2011 18:02:31 GMT",
<http://web.archive.org/web/20110902171049/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:10:49 GMT",
<http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:12:56 GMT",
...
<http://web.archive.org/web/20151205080546/http://www.ws-dl.blogspot.com/>; rel="memento"; datetime="Sat, 05 Dec 2015 08:05:46 GMT",
<http://web.archive.org/web/20161104143102/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 04 Nov 2016 14:31:02 GMT",
<http://web.archive.org/web/20161109005749/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 09 Nov 2016 00:57:49 GMT",
<http://web.archive.org/web/20170119233646/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Thu, 19 Jan 2017 23:36:46 GMT",
<http://web.archive.org/web/20170320190910/http://ws-dl.blogspot.com/>; rel="last memento"; datetime="Mon, 20 Mar 2017 19:09:10 GMT"
Figure 1. An abbreviated TimeMap for
http://ws-dl.blogspot.com from Internet Archive

For instance, to view the TimeMap for this very blog from the Internet Archive, a user may request http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/ (Figure 1). Each URI-M (e.g., http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/) is listed with a corresponding relationship (rel) and datetime value. Note that the www.ws-dl.blogspot.com and ws-dl.blogspot.com subdomain variants are both included in the same TimeMap, a product of the canonicalization procedure. The TimeMap for this URI-R currently contains 60 URI-Ms. Internet Archive's web interface reports 58 captures -- a subtly different count. This difference gets much more extreme with other URI-Rs.
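To get a feel for how such coalescing works, here is an illustrative sketch using the surt Python library (as used by tools like pywb); this is not necessarily the Internet Archive's exact production canonicalization, just the general idea of reducing variant URIs to one canonical key.

# Illustrative sketch of URI canonicalization using the `surt` library
# (pip install surt); the Internet Archive's production rules may differ.
from surt import surt

variants = [
    "http://google.com",
    "https://google.com",
    "http://www.google.com:80/",
]

for uri in variants:
    # All three variants typically reduce to the same canonical key,
    # e.g. "com,google)/", so their captures share a single TimeMap.
    print(uri, "->", surt(uri))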

The quality of each memento (e.g., in terms of the completeness of capture of embedded resources) cannot be determined from the TimeMap alone: a URI-M must be dereferenced and each embedded resource requested when rendering the base URI-M. Comprehensively evaluating memento quality over time is something we have already covered (see our TPDL 2013, JCDL 2014, and IJDL 2015 papers/article).

In performing some studies and developing web archiving tools, we needed to know how many captures existed for a particular URI using both a Memento aggregator and the TimeMap from an archive's Memento endpoint. For http://google.com, counting the number of URIs in a TimeMap with a rel value of "memento" produces a count of 695,525 (as of May 2017). The numbers obtained from Internet Archive's calendar interface and CDX endpoint are currently much smaller (e.g., the calendar interface currently states 62,339 captures for google.com).
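As a rough illustration of how such a TimeMap-based count can be obtained (a minimal sketch, not the actual scripts used in the study, and it ignores TimeMap pagination), the link-format TimeMap can be fetched and its memento entries tallied:

# Minimal sketch: count the memento entries in a link-format TimeMap.
import re
import requests

TIMEMAP = "http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/"

response = requests.get(TIMEMAP)
response.raise_for_status()

urims = []
for entry in response.text.split(",\n"):  # naive split on link entries
    if re.search(r'rel="[^"]*memento[^"]*"', entry):  # also matches first/last memento
        match = re.match(r"\s*<([^>]+)>", entry)
        if match:
            urims.append(match.group(1))

print(len(urims), "URI-Ms listed for this URI-R")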

Dereferencing all of these URI-Ms would take a very long time due to network latency in accessing the archive as well as limits on pipelining (though the latter can be mitigated by distributing the task). We did exactly this for google.com and found that the large majority of the URI-Ms produced a redirect to another URI-M in the TimeMap. This led us to conclude that this procedure is not sufficient for counting the mementos in an archive's holdings.
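The classification step can be sketched roughly as follows, assuming the urims list gathered above; a real run would add retries, rate limiting, and parallelism, and some archives may answer HEAD requests differently than GET, so this is an illustration rather than the procedure we actually ran.

# Minimal sketch: classify what dereferencing each URI-M produces.
# allow_redirects=False keeps us at the first response so redirects
# (3xx) are counted rather than silently followed.
import requests

def classify(urims):
    tally = {"representation": 0, "redirect": 0, "error": 0}
    for urim in urims:
        status = requests.head(urim, allow_redirects=False).status_code
        if 200 <= status < 300:
            tally["representation"] += 1
        elif 300 <= status < 400:
            tally["redirect"] += 1
        else:
            tally["error"] += 1
    return tally

# e.g., print(classify(urims)) using the list extracted from the TimeMap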

Figure 2. Dereferencing URI-Ms may produce a representation, a redirect, or an archived error.

For google.com we found that nearly 85% of the URI-Ms resulted in a redirect when dereferenced. We repeated this procedure for seven other TimeMaps for large web sites (e.g., yahoo.com, instagram.com, wikipedia.org) and found a wide range of redirect ratios with this naïve counting method (88.2%, 67.3%, and 44.6% redirects, respectively). We also repeated this procedure with thirteen academic institutions' URI-Rs to see whether this trend persisted.

We have posted an extensive report of our findings as a tech report available on arXiv (linked below).

— Mat (@machawk1)

Mat Kelly, Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel. "Impact of URI Canonicalization on Memento Count," Technical Report arXiv:1703.03302, 2017.

Monday, March 20, 2017

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (Junk text), HTML, Javascript, comments, and CSS text.
Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include Javascript, HTML, comments, or CSS text.

Fig. 3: Boilerplate removal result for the Justext method for a news website. Extracted text includes less extraneous text than BeautifulSoup's get_text() and NLTK's (OLD) clean_html(), but the page title is absent.
Fig. 4: Boilerplate removal result for the Python-goose method for a news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but the page title and first paragraph are absent.
Fig. 5: Boilerplate removal result for the Python-boilerpipe (ArticleExtractor) method for a news website. Extracted text includes less extraneous text than BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext.
Boilerplate removal refers to the task of extracting the main text content of webpages. This is done through the removal of content such as navigation links, header and footer sections, etc. Even though this task is a common prerequisite for most text processing tasks, I have not found an authoritative, versatile solution. In order to better understand how some common options for boilerplate removal perform against one another, I developed a simple experiment to measure how well the methods perform when compared to a gold standard text extraction method (myself). Python-boilerpipe (ArticleExtractor mode) performed best on my small sample of 10 news documents, with an average Jaccard Index score of 0.7530 and a median Jaccard Index score of 0.8964. The Jaccard scores for each document for a given boilerplate removal method were calculated over the sets (bags of words) created from the text extracted from the news documents and the gold standard text.

Some common boilerplate removal methods
  1. BeautifulSoup's get_text()
    • Description: BeautifulSoup is a very (if not the most) popular Python library used to parse HTML. It offers a boilerplate removal method - get_text() - which can be invoked on a tag element such as the body element of a webpage. Empirically, the get_text() method does not do a good job of removing all the Javascript, HTML markup, comments, and CSS text of webpages, and it includes extraneous text along with the extracted text (see the usage sketch after this list).
    • Recommendation: I don't recommend exclusive use of get_text() for boilerplate removal.
  2. NLTK's (OLD) clean_html()
    • Description: The Natural Language Toolkit (NLTK) used to provide a method called clean_html() for boilerplate removal. This method used regular expressions to parse and subsequently remove HTML, Javascript, CSS, comments, and white space. However, NLTK has since deprecated this implementation and suggests the use of BeautifulSoup's get_text() method, which, as we have already seen, does not do a good job.
    • Recommendation: This method does a good job removing HTML, Javascript, CSS, comments, and white space. However, it includes boilerplate such as navigation link, header, and footer text. Therefore, if your application is not sensitive to extraneous text and you just care about including all text from a page, this method is sufficient.
  3. Justext
    • Description: According to Mišo Belica, the creator of Justext, it was designed to preserve mainly text containing full sentences, making it well suited for creating linguistic resources. Justext also provides an online demo.
    • Recommendation: Justext is a decent boilerplate removal method that performed almost as well as the best boilerplate removal method from our experiment (Python-boilerpipe). But note that Justext may omit page titles.
  4. Python-goose
    • Description: Python-goose is a Python rewrite of an application originally written in Java and subsequently Scala. According to the author, the goal of Goose is to process news articles or article-type pages and extract the main body text, metadata, and the most probable image candidate.
    • Recommendation: Python-goose is a decent boilerplate removal method, but it was outperformed by Python-boilerpipe. Also note that Python-goose may omit page titles just like Justext.
  5. Python-boilerpipe
    • Description: Python-boilerpipe is a python wrapper of the original Java library for boilerplate removal and text extraction from HTML pages.
    • Recommendation: Python-boilerpipe outperformed all the other boilerplate removal methods in my small test sample. I currently use this method as the boilerplate removal method for my applications.
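The sketch below shows roughly how each of the surveyed libraries can be invoked on the same HTML string. The calls are the commonly documented ones and may differ across versions (for example, on Python 3 the goose package is distributed as goose3), so treat this as an illustration rather than the exact survey code; the URL is a placeholder.

# Rough usage sketch of the surveyed boilerplate removal libraries.
import requests
from bs4 import BeautifulSoup               # BeautifulSoup's get_text()
import justext                              # Justext
from goose import Goose                     # Python-goose (goose3 on Python 3)
from boilerpipe.extract import Extractor    # Python-boilerpipe

html = requests.get("http://www.example.com/news-article").text  # placeholder URL

# 1. BeautifulSoup
bs_text = BeautifulSoup(html, "html.parser").get_text()

# 2. NLTK's clean_html() was removed in NLTK 3.x, so it is omitted here.

# 3. Justext: keep only paragraphs classified as non-boilerplate
paragraphs = justext.justext(html, justext.get_stoplist("English"))
justext_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

# 4. Python-goose
goose_text = Goose().extract(raw_html=html).cleaned_text

# 5. Python-boilerpipe (ArticleExtractor mode)
boilerpipe_text = Extractor(extractor="ArticleExtractor", html=html).getText()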
The experiment used 10 news documents, with the following corresponding gold standard text documents:

  1. Gold standard text for news document - 1
  2. Gold standard text for news document - 2
  3. Gold standard text for news document - 3
  4. Gold standard text for news document - 4
  5. Gold standard text for news document - 5
  6. Gold standard text for news document - 6
  7. Gold standard text for news document - 7
  8. Gold standard text for news document - 8
  9. Gold standard text for news document - 9
  10. Gold standard text for news document - 10
The HTML for the 10 news documents was obtained by dereferencing each of the 10 URLs with curl. This means the boilerplate removal methods operated on plain HTML (without running Javascript). I also ran the boilerplate removal methods on archived copies of the 10 documents from archive.is. This was based on the rationale that since archive.is runs Javascript and transforms the original page, this might impact the results. My experiment showed that running boilerplate removal on the archived copies reduced the similarity between the gold standard texts and the output texts of all the boilerplate removal methods except BeautifulSoup's get_text() method (Table 2).

Second, for each document, I manually copied the text I considered to be the main body of the document to create a total of 10 gold standard texts. Third, I removed the boilerplate from the 10 documents using the 8 methods outlined in Table 1. This led to a total of 80 extracted text documents (10 for each boilerplate removal method). Fourth, for each of the 80 documents, I computed the Jaccard Index (the intersection divided by the union of both sets) over the document and its respective gold standard. Fifth, for each of the 8 boilerplate removal methods outlined in Table 1, I computed the average of the Jaccard scores for the 10 news documents (Table 1).
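A minimal sketch of the Jaccard computation over bag-of-words sets follows; a simple lowercased whitespace tokenization is assumed here, which may differ from the tokenization actually used in the survey.

# Jaccard Index between an extracted text and its gold standard,
# treating each document as a set (bag of words) of lowercased tokens.
def jaccard(extracted_text, gold_text):
    extracted = set(extracted_text.lower().split())
    gold = set(gold_text.lower().split())
    if not extracted and not gold:
        return 1.0
    return len(extracted & gold) / len(extracted | gold)

# For one boilerplate removal method over the 10 documents:
# scores = [jaccard(method_output, gold) for method_output, gold in pairs]
# average = sum(scores) / len(scores); median = statistics.median(scores)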

Result

Table 1: Boilerplate removal results for live web news documents

Rank | Method | Average of Jaccard Indices (10 documents) | Median of Jaccard Indices (10 documents)
1 | Python-boilerpipe.ArticleExtractor | 0.7530 | 0.8964
2 | Justext | 0.7134 | 0.8339
3 | Python-boilerpipe.DefaultExtractor | 0.6706 | 0.7073
4 | Python-goose | 0.7009 | 0.6822
5 | Python-boilerpipe.CanolaExtractor | 0.6227 | 0.6472
6 | Python-boilerpipe.LargestContentExtractor | 0.6188 | 0.6444
7 | NLTK's (OLD) clean_html() | 0.3847 | 0.3479
8 | BeautifulSoup's get_text() | 0.1959 | 0.2201


Table 2: Boilerplate removal results for archived news documents showing lower similarity compared to live web version (Table 1)
Rank | Method | Average of Jaccard Indices (10 documents) | Median of Jaccard Indices (10 documents)
1 | Python-boilerpipe.ArticleExtractor | 0.6240 | 0.7121
2 | Python-boilerpipe.DefaultExtractor | 0.5534 | 0.7010
3 | Justext | 0.5956 | 0.6414
4 | Python-boilerpipe.CanolaExtractor | 0.5028 | 0.5274
5 | Python-boilerpipe.LargestContentExtractor | 0.4961 | 0.4669
6 | Python-goose | 0.4209 | 0.4289
7 | NLTK's (OLD) clean_html() | 0.3365 | 0.3232
8 | BeautifulSoup's get_text() | 0.2630 | 0.2687

Python-boilerpipe (ArticleExtractor mode) outperformed all the other methods. I acknowledge that this experiment is by no means rigorous, for important reasons which include:
  • The test sample is very small.
  • Only news documents were considered.
  • The use of the Jaccard similarity measure forces documents to be represented as sets. This eliminates order (the permutation of words) and duplicate words. Consequently, if a boilerplate removal method omits some occurrences of a word, this information will be lost in the Jaccard similarity calculation.
Nevertheless, I believe this small experiment sheds some light on the different behaviors of the different boilerplate removal methods. For example, BeautifulSoup's get_text() does not do a good job removing HTML, Javascript, CSS, and comments, unlike NLTK's clean_html(), which removes these well but includes extraneous text. Also, Justext and Python-goose do not include a large body of extraneous text, even though they may omit a news article's title. Finally, based on these experimental results, Python-boilerpipe is the best boilerplate removal method.

2017-04-13 Edit: At the request of Ryan Baumann, I included Python-readability in this survey.
Fig. 6: Boilerplate removal result for the Python-readability method for a news website. Extracted text does not include Javascript, HTML comments, or CSS. However, the extracted text includes non-contiguous segments of extraneous HTML. Also, the title and first paragraph were omitted.
Description: Python-readability was developed by Yuri Baburov. It is a Python port of a Ruby port of arc90's readability project. It attempts to pull out the main body text of a document and clean it up.
Recommendation: Python-readability ranked 5th in Table 1 with an average Jaccard Index score of 0.5990 and a median Jaccard Index score of 0.6567. Similarly, it ranked 5th in Table 2 with an average Jaccard Index score of 0.5021 and a median Jaccard Index score of 0.5236. This library removes Javascript, HTML comments, and CSS, but it does not do a good job removing all the HTML from the output text. Also note that Python-readability may omit page titles. If you choose to use this library, consider further clean-up operations to remove the extraneous HTML.

2017-10-07 Edit: I included Scrapy, Newspaper, and news-please in the boilerplate removal survey. Please note that these libraries were not designed exclusively for boilerplate removal - boilerplate removal is a single feature among a collection of other primary functionalities. Therefore, my recommendation on the use of any of these libraries considers only its effectiveness at boilerplate removal.


Additional methods for boilerplate removal

  1. Scrapy
    • Description: Scrapy is a very popular Python library used for crawling and extracting structured data from websites. Boilerplate removal is provided by the remove_tags() function. This method performed poorly in the survey since its output combined extraneous text with Javascript text, CSS text, and empty spaces. Scrapy ranked 8th in Table 1 with an average Jaccard Index score of 0.2140 and a median Jaccard Index score of 0.2235. It also ranked 8th in Table 2 with an average Jaccard Index score of 0.2635 and a median Jaccard Index score of 0.2692.
    • Recommendation: I don't recommend exclusive use of remove_tags() for boilerplate removal.
  2. Newspaper
    • Description: Newspaper is a python library developed by Lucas Ou-Yang, designed primarily for news article scraping and curation. Some Newspaper features include: a multi-threaded article download capability, news URL identification, text extraction from HTML, top/all image extraction from HTML, and summary/keyword extraction from text.
    • Recommendation: Newspaper is a decent boilerplate removal method, but it was outperformed by Python-boilerpipe. Also note that Newspaper may omit page titles. Newspaper ranked 5th in Table 1 (marginally outperformed by Python-goose) with an average Jaccard Index score of 0.6709 and a median Jaccard Index score of 0.6822. It also ranked 5th in Table 2, with an average Jaccard Index score of 0.4941 and a median Jaccard Index score of 0.4741.
  3. news-please
    • Description: Felix Hamborg introduced me to news-please based on his research on news crawling and extraction. news-please is a multi-language, open-source crawler and extractor of news articles. It is designed to help users crawl news websites and extract metadata such as titles, lead paragraphs, main content, publication date, author, and main image. news-please combines Scrapy, Newspaper, and Python-readability.
    • Recommendation: See Newspaper recommendation because news-please had the same performance scores as Newspaper. This is no surprise because news-please utilizes Newspaper.

Fig. 7: Boilerplate removal result for Scrapy method for a news website. Extracted text includes extraneous text (Junk text), Javascript, CSS and empty spaces.
Fig. 8: Boilerplate removal result for Newspaper and news-please methods for a news website. No extraneous text, but some missing text such as the title.
--Nwala

Thursday, March 9, 2017

2017-03-09: A State Of Replay or Location, Location, Location

We have written blog posts about the time traveling zombie apocalypse in web archives and how the lack of client-side JavaScript execution at preservation time prevented the SOPA protest of certain websites from being seen in the archive. A more recent post described how CNN's use of JavaScript to load and render the contents of its homepage has made it unarchivable since November 1st, 2016. The CNN post detailed how the "tricks" utilized to circumvent CORS restrictions on HTTP requests made by JavaScript to talk to their CDN were the root cause of why the page is unarchivable/unreplayable. I will now present to you a variation of this which is more insidious and less obvious than what was occurring with the CNN archives.

TL;DR

In this blog post, I will show in detail what caused a particular web page to fail on replay. In particular, the replay failure occurred due to the lack of the necessary authentication and HTTP methods for the custom resources this page requires for viewing. Thus the page's JavaScript thought the current page being viewed required the viewer to sign in, which always causes a redirection before the page has loaded. Also, depending on a replay system's rewrite mechanisms, the JavaScript of the page can collide with the replay system's own code, causing undesired effects. The biggest issue highlighted in this blog post is that certain archives' replay systems are employing unbounded JavaScript rewrites that, albeit only in certain situations, fundamentally break the original page's JavaScript, putting its execution into states its creators could not have prepared for or thought possible when viewing the page on the live web. It must be noted that this blog post is the result of my research into the modifications made to a web page in order to archive and replay it faithfully as it was on the live web.

Background

Consider the following URI https://www.mendeley.com/profiles/helen-palmer which when viewed on the live web behaves as you would expect any page not requiring a login to behave.
But before I continue, some background about mendeley.com, since you may not have known about this website, as I did not before it was brought to my attention. mendeley.com is a LinkedIn of sorts for researchers that provides additional services geared specifically toward them. Like LinkedIn, mendeley.com has publicly accessible profile pages listing a researcher's interests, their publications, educational history, professional experience, and following/follower network. All of this is accessible without a login, and the only features you would expect to require a login, such as following the user or reading one of their listed publications, take you to a login page. But this behavior of the live web page is not maintained when it is replayed after being archived.

A State of Replay

Now consider the memento of https://www.mendeley.com/profiles/helen-palmer from Archive-It on 2016-12-15T23:19:00. When the page starts to load and becomes partially rendered, an abrupt redirection occurs, taking you to
www.mendeley.com/sign-in/?routeTo=https%3A%2F%2Fwww.mendeley.com%2Fprofiles%2Fhelen-palmer
which is a 404 in the archive.
Obviously, this should not be happening, since this is not the behavior of the page on the live web. It is likely that the page's JavaScript is misbehaving when running on the host wayback.archive-it.org. Before we investigate what is causing this to happen, let us see if the redirection also occurs when replaying a memento from the Internet Archive on 2017-01-26T21:48:31 and a memento from Webrecorder on 2017-02-12T23:27:56.
Webrecorder
Internet Archive
The video below shows this occurring in all three archives



and, as seen in the video below, this happens on other pages on mendeley.com as well.


Comparing The Page On The Live Web To Replay On Archive-It

Unfortunately, both are unable to replay the page due to the redirection, which lends credibility to the original assumption that the page's JavaScript is causing it. Before diving into JavaScript detective mode, let us see if the output from the developer console can give us any clues. Seen below is the browser console with XMLHttpRequest (XHR) logging enabled when viewing
https://www.mendeley.com/profiles/helen-palmer
on the live web. Besides the Optimizely (user experience/analytics platform) XHR requests, the page's own JavaScript makes several requests to the site's backend at
https://api.mendeley.com
and a single GET request for
https://www.mendeley.com/profiles/helen-palmer/co-authors
A breakdown of the requests to api.mendeley.com is listed below:
  • GET api.mendeley.com/catalog (x8)
  • GET api.mendeley.com/documents (x1)
  • GET api.mendeley.com/scopus/article_authors (x8)
  • POST api.mendeley.com/events/_batch (x1)
From these network requests, we can infer that the live web page is dynamically populating the publications list of its profile pages and perhaps some other elements of the page. Now let's check the browser console from the Archive-It memento on 2016-12-15T23:19:00.
Many errors occur, as seen in the browser console from the Archive-It memento, but it is the XHR request errors and the lack of XHR requests made that are significant. The first significant XHR error is a 404 that occurred when trying to execute a GET request for
http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmerco-authors/
This is a rewrite error (URI-R -> URI-M). The live web page's JavaScript requested
https://www.mendeley.com/profiles/helen-palmer/co-authors
but when replayed the archived JavaScript made the request for
https://www.mendeley.com/profiles/helen-palmerco-authors
Stranger yet, the "XHR finished loading" console entry indicates the request was made to
http://wayback.archive-it.org/profiles/helen-palmerco-authors
not the URI-M that received the 404. Thankfully, we can consult the developer tools included in our web browsers to see the request/response headers for each request. The corresponding headers for
http://wayback.archive-it.org/profiles/helen-palmerco-authors
are seen below
The request really returned a 302 and was indeed made to
http://wayback.archive-it.org/profiles/helen-palmerco-authors
but the Location indicated in the response points to the "correct" URI-M
http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/helen-palmerco-authors
The other significant difference from the live web's XHR requests is that the archived page's JavaScript is no longer requesting the resources from api.mendeley.com. We now have a single request for
http://wayback.archive-it.org/profiles/refreshToken
This request suffered the same fate as the previous request, 302 with location of
http://wayback.archive-it.org/8130/20161215231900/https://www.mendeley.com/profiles/refreshToken
and then the redirection happens. Now we have a better understanding of what is happening with the Archive-It memento. The question about the Internet Archive and Webrecorder mementos remains.

Does This Occur In Other Archives?

The console output from the Internet Archive memento on 2017-01-26T21:48:31, seen below, shows that the requests to api.mendeley.com are not made. The request for the refresh token is made, and unlike the Archive-It memento, the request to co-authors is rewritten successfully and does not receive a 404, but the redirection still occurs, as seen below:
Likewise, with the memento from Webrecorder on 2017-02-12T23:27:56, seen below, the request made to co-authors is rewritten successfully and the request for the refresh token is made, but the page still redirects to the sign-in page like the others.
Since the redirection occurs for the Internet Archive and Webrecorder mementos as well, we can now finally ask: what happened to the api.mendeley.com requests, and what in the page's JavaScript is making replay fail?

Location, Location, Location

The Mendeley website defines a global object that contains definitions for URLs to be used by the page's JavaScript when talking to the backend. That global object, seen below (from the Archive-It memento), is untouched by the archive's rewriting mechanisms. There is another inline script tag that adds some preloaded state for use by the page's JavaScript, seen below (also from Archive-It). Here we find our first instance of erroneous JavaScript rewriting. As you can see, the __PRELOADED_STATE__ object has a key of WB_wombat_self_location, which is a rewrite targeting window.location or self.location. Clearly, this is not correct when you consider the contents of this object, which describe a physical location. When comparing this entry to the live web key, seen below, the degree of error in this rewrite becomes apparent. Some quick background on the WB_wombat prefix before continuing: the WB_wombat prefix normally indicates that the replay system is using the wombat.js library from PyWb, and conversely Webrecorder. Archive-It is not; rather, it is using its own rewrite library called ait-client-rewrite.js. The only similarity between the two is the use of the name wombat.
Finding the refresh-token code in the page's JavaScript was not difficult; seen below is the section of code that likely causes the redirect. You will notice that the redirect occurs when it is determined that the viewer is not authorized to be seeing this page. This becomes clearer when seeing the code that executes retrieval of the refresh token. Here, we see two things: mendeley.com has a maximum number of retries for actions that require some form of authentication (this is the case for the majority of the resources the page's JavaScript requests), and the second instance of erroneous JavaScript rewriting:
e = t.headers.WB_wombat_self_location;
It is clear that Archive-It is using regular expressions to rewrite any <pojo>.location to WB_wombat_self_location: on inspection of that code section, you can see that the page's JavaScript is actually looking for the Location sent in the headers, commonly for 3xx or 201 responses (RFC 7231#6.3.2). This is further confirmed by the following line from the code seen above:
e && this.settings.followLocation && 201 === t.status
The same can be seen in this code section from the Webrecorder memento. That leaves the Internet Archive memento, but the Internet Archive does not do such rewrites, making this a non-issue for them. These files can be found in a gist I created if you desire to inspect them for yourself. At this point you must be thinking, case closed, we have found out what went wrong. So did I, but I was not so sure, as the redirection occurs in the Internet Archive memento as well.
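To make this failure mode concrete, here is a small Python illustration of my own (not Archive-It's actual rewriting code) showing what a context-blind regular-expression rewrite of .location does to both the intended target and the header access shown above:

# Illustration only: a context-blind regex rewrite of "<something>.location",
# similar in spirit to, but not taken from, ait-client-rewrite.js.
import re

original_js = """
var target = window.location.href;   // rewriting this is the intent
var redirect = t.headers.location;   // an HTTP Location header, not the window's location
"""

rewritten_js = re.sub(r"(\w+)\.location", r"\1.WB_wombat_self_location", original_js)
print(rewritten_js)
# Both lines get rewritten: the regex cannot tell that "t.headers.location"
# refers to a response header, so on replay the page's own logic reads an
# undefined property and misbehaves.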

Digging Deeper

I downloaded the Webrecorder memento, loaded it into my own instance of PyWb, and used its fuzzy-match rewrite rules (via regexes) to insert print statements at locations in the code I believed would surface additional errors. The fruit of this labor can be seen below.
As seen above, the requests to
api.mendeley.com/documents and api.mendeley.com/events/_batch
are actually being made, yet they are not shown as going out by the developer tools, which is extremely odd. However, the effects of this can be seen in the two errors shown after the console entries for
/profiles/helen-palmer/co-authors
and
anchor_setter href https://www.mendeley.com/profiles/helen-palmer/co-authors
which are store:publications:set.error and data:co-authors:list.error. These are the errors that I believe to be the root cause of the redirection. Before I address why that is and what the anchor_setter console entry means, we need to return to considering the HTTP requests made by the browser when viewing the live web page, and not just those the browser's built-in developer tools show us.

Understanding A Problem By Proxy

To achieve this I used an open-source alternative to Charles called James. James is an HTTP proxy and monitor that allows us to intercept and view the requests made by the browser when viewing
https://www.mendeley.com/profiles/helen-palmer
on the live web. The image below displays the HTTP requests made by the browser starting at the time when the request for co-authors was made.
The blue rectangle highlights the requests also made when replayed via Archive-It, the Internet Archive, and Webrecorder, which include the request for co-authors (data:co-authors:list.error). The red rectangle highlights the request made for retrieving the publications (store:publications:set.error). The pinkish-purple rectangle highlights a block of HTTP OPTIONS (RFC 7231#4.3.7) requests made when requesting resources from api.mendeley.com. The request in the red rectangle also has an OPTIONS request made before the
GET request for api.mendeley.com/catalog?=[query string]
This is happening because, to quote from the MDN entry for the HTTP OPTIONS request:
Preflighted requests in CORS
In CORS, a preflight request with the OPTIONS method is sent, so that the server can respond whether it is acceptable to send the request with these parameters. The Access-Control-Request-Method header notifies the server as part of a preflight request that when the actual request is sent, it will be sent with a POST request method. The Access-Control-Request-Headers header notifies the server that when the actual request is sent, it will be sent with a X-PINGOTHER and Content-Type custom headers. The server now has an opportunity to determine whether it wishes to accept a request under these circumstances.
What is meant by preflighted is that this request is made implicitly by the browser, and the reason it is sent before the actual JavaScript-made request is that the content type being requested is
application/vnd.mendeley-document.1+json
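For illustration, the preflight the browser sends implicitly looks roughly like the following sketch, approximated from the client side with Python's requests library; the endpoint and header values shown are assumptions for illustration, and the live API may additionally require authentication.

# Rough approximation of a CORS preflight request from the client side.
# Browsers send this OPTIONS request implicitly; values are illustrative.
import requests

response = requests.options(
    "https://api.mendeley.com/documents",        # assumed endpoint
    headers={
        "Origin": "https://www.mendeley.com",
        "Access-Control-Request-Method": "GET",
        "Access-Control-Request-Headers": "accept, authorization",
    },
)
print(response.status_code)
print(response.headers.get("Access-Control-Allow-Methods"))
print(response.headers.get("Access-Control-Allow-Headers"))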
A full list of the content types the Mendeley pages request is enumerated in a gist, likewise the JavaScript that makes the requests for each content type. Again, let's compare the browser requests as seen by James from the live web to the archived versions, to see if what our browser was not showing us for the live web version is happening in the archive. Seen below are the browser-made HTTP requests as seen by James for the Archive-It memento on 2016-12-15T23:19:00.
The
helen-palmer/co-authors -> helen-palmerco-authors
rewrite issue is indeed occurring, with the requests not being made for the URI-M but hitting wayback.archive-it.org first, the same as with profiles/refreshToken. We do not see any of the requests for api.mendeley.com, as you would expect. Another strange thing is that both of the requests for refreshToken get a 302 status until a 200 response comes back, but now from a memento on 2016-12-15T23:19:01. The memento from the Internet Archive on 2017-01-26T21:48:31 suffers similarly, as seen below, but the request for helen-palmer/co-authors remains intact. The biggest difference here is that the memento from the Internet Archive is bouncing through time much more than the Archive-It memento.
The memento from Webrecorder on 2017-02-12T23:27:56 suffers similarly as did the memento from Archive-It, but this time something new happens as seen below.
The request for refreshToken goes through the first time and resolves to a 200 but we have the
helen-palmer/co-authors -> helen-palmerco-authors
rewrite error occurring. Only this time the request stays a memento request but promptly resolves to a 404 due to the rewrite error. Both the Archive-It memento and the Webrecorder memento share this rewrite error, and both use wombat to some extent, so what gives? The explanation likely lies with the use of wombat (at least for the Webrecorder memento), as the library overrides a number of the global DOM elements and friends at the prototype level (enumerated for clarity via this link). This is done to bring the URL rewrites to the JavaScript level and to ensure the requests made become rewritten at request time. In order to better understand the totality of this, recall the image seen below (this time with sections highlighted), which I took after inserting print statements into the archived JavaScript via PyWb's fuzzy-match rewrite rules.
The console entry anchor_setter href represents an instance when the archived JavaScript for mendeley.com/profiles/helen-palmer is about to make an XHR request, and is logged from the wombat override of the a tag's href setter method. I added this print statement to my instance of PyWb's wombat because the Mendeley JavaScript uses a promise-based XHR request library called axios. The axios library utilizes an anchor tag to determine if the URL for the request being made is same-origin, and does its own processing of the URL to be tested after using the anchor tag to normalize it. As you can see from the image above, the URL being set is relative to the page but becomes a normalized URL after being set on the anchor tag (I logged the before and after of just the set method). It must be noted that the version of wombat I used likely differs from the versions being employed by Webrecorder and maybe Archive-It. But from the evidence presented, it appears to be a collision between the rewriting code and the axios library's own code.

HTTP Options Request Replay Test

Now I can imagine that the heads of the readers of this blog post may be hurting, or that I may have lost a few along the way; I apologize for that. However, I have one more outstanding issue to clear up: what happened to the api.mendeley.com requests, especially the OPTIONS requests? The OPTIONS requests were not executed for one of two reasons. The first is that the page's JavaScript could not receive the expected responses because the auth-flow requests failed when replayed from an archive. The second is that one of the requests for content type
application/vnd.mendeley-document.1+json
failed due to the lack of replay of the HTTP OPTIONS method, or did not return what the page expected when replayed. To test this out, I created a page hosted using GitHub Pages called replay test. This page's goal is to throw some gotchas at archival and replay systems. One of those gotchas is an HTTP OPTIONS request (using axios) to https://n0tan3rd.github.io/replay_test, which is promptly replied to by GitHub with a 405 Not Allowed. An interesting property of GitHub's response to this request is that the body is HTML, which the live web page displays once the request is complete. We may assume a service like Webrecorder would be able to replay this. Wrong: it does not, nor does the Internet Archive. What does happen is the following, as seen when replayed via Webrecorder, which created the capture.
The same can be seen in the capture from the Internet Archive below
What you are seeing is the response to my OPTIONS request, which is to respond as if my browser had made a GET request to view the capture. This means the headers and status code I was expecting were never sent; instead I saw a 200 response for viewing the capture, not for the request I made for the resource. This implies that Mendeley's JavaScript will never be able to make the requests for its resources that are of content type
application/vnd.mendeley-document.1+json
when replayed from an archive. Phew, this now concludes the investigation, and I leave what else my replay_test page does as an exercise for the reader.

Conclusions

So what is the solution for this? But first we must consider.... I'm joking. I can see only two solutions. The first is that replay systems used by archives that rely on regular expressions for JavaScript rewrites need to start thinking like JavaScript compilers, such as Babel, when doing the rewriting. Regular expressions cannot understand the context of the statement being rewritten, whereas compilers like Babel can. This would ensure the validity of the rewrite and avoid rewriting JavaScript code that has nothing to do with the window's location. The second is to archive and replay the full server-client HTTP request-response chain.
- John Berlin
2017-03-19 Update:
The rewrite error occurring on Webrecorder has been corrected.
Thanks to Ilya Kreymer for his help in diagnosing the issue on Webrecorder. The getToken code of the page retrieves the token from a cookie via document.cookie, which was not preserved, thus causing the redirect. Ilya has created a capture with the cookie preserved. When replayed, that capture does not redirect.

Tuesday, March 7, 2017

2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive in San Francisco, CA. The workshop was two days long, February 23-24, 2017, and was held in conjunction with a National Web Symposium hosted at the Internet Archive. Four members of the Web Science and Digital Libraries (WSDL) group from Old Dominion University had the opportunity to attend: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and Web Archive Hackathon 2.0.

This workshop was supported by the Internet Archive, Rutgers University, and the University of Waterloo. The workshop brought together a small group of around 20 researchers who worked together to develop new open-source tools for web archives. The three organizers of this workshop were: Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (Assistant Professor, Department of History, University of Waterloo), and Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo).
It was a big moment for me when I first saw the Internet Archive building, with an Internet Archive truck parked outside. Since 2009, the IA headquarters have been at 300 Funston Avenue in San Francisco, a former Christian Science Church. Inside the building, in the main hall, there are multiple mini statues of every archivist who has worked at the IA for over three years.
On Wednesday night, we had a welcome dinner and a short introduction of the members who had arrived.
Day 1 (February 23, 2017)
On Thursday, we started with breakfast and headed to the main hall, where several presentations took place. Matthew Weber presented "Stating the Problem, Logistical Comments". Dr. Weber started by stating the goals, which included developing a common vision of web archiving and tool development, and learning to work with born-digital resources for humanities and social science research.
Next, Ian Milligan presented “History Crawling with Warcbase”. Dr. Milligan gave an overview of Warcbase. Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The tool is used to analyze web archives using Spark, and to take advantage of HBase to provide random access as well as analytics capabilities.

Next, Jefferson Bailey (Internet Archive) presented "Data Mining Web Archives". He talked about conceptual issues in access to web archives, which include: provenance (much data, but not all as expected), acquisition (highly technical; crawl configs; soft 404s), border issues (the web never really ends), the lure of evermore data (more data is not better data), and attestation (higher sensitivity to elision than in traditional archives?). He also explained the different formats in which the Internet Archive can provide its data, which include CDX, the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). In addition, he presented an overview of some research projects based on collaboration with the IA. Some of the projects he mentioned were: the ALEXANDRIA project, Web Archives for Longitudinal Knowledge, Global Event and Trend Archive Research & Integrated Digital Event Archiving, and many more.
Next, Vinay Goel (Internet Archive) presented "API Overview". He presented the Beta Wayback Machine, which searches the IA based on a URL or a word related to a site's home page. He mentioned that search results are presented based on anchor text search.
Justin Littman (George Washington University Libraries) presented "Social Media Collecting with Social Feed Manager". SFM is open-source software that collects social media from the APIs of Twitter, Tumblr, Flickr, and Sina Weibo.

The final talk was by Ilya Kreymer (Rhizome), who presented an overview of the tool Webrecorder. The tool provides an integrated platform for creating high-fidelity web archives while browsing, sharing, and disseminating archived content.
After that, we had a short coffee break and started to form three groups. In order to form the groups, all participants were encouraged to write a few words on the topic they would like to work on; some words that appeared were: fake news, news, twitter, etc. Similar notes were grouped together along with their associated members. The resulting groups were Local News, Fake News, and End of Term Transition.

Group: Local News (Good News/Bad News)
  • Sawood Alam, Old Dominion University
  • Lulwah Alkwai, Old Dominion University
  • Mark Beasley, Rhizome
  • Brenda Berkelaar, University of Texas at Austin
  • Frances Corry, University of Southern California
  • Ilya Kreymer, Rhizome
  • Nathalie Casemajor, INRS
  • Lauren Ko, University of North Texas

Group: Fake News
  • Erika Siregar, Old Dominion University
  • Allison Hegel, University of California, Los Angeles
  • Liuqing Li, Virginia Tech
  • Dallas Pillen, University of Michigan
  • Melanie Walsh, Washington University

Group: End of Term Transition
  • Mohamed Aturban, Old Dominion University
  • Justin Littman, George Washington University
  • Jessica Ogden, University of Southampton
  • Yu Xu, University of Southern California
  • Shawn Walker, University of Washington
Every group started to work on its dataset, brainstormed different research questions to answer, and formed a plan of work. Then we basically worked all through the day and ended the night with a working dinner at the IA.

Day 2 (February 24, 2017)
On Friday we started by eating breakfast, and then each team continued to work on their projects.
Every Friday the IA has a free lunch where hundreds of people come together; some are artists, activists, engineers, librarians, and many more. After that, a public tour of the IA takes place.
We had some light talks after lunch. The first talk was by Justin Littman, where he presented an overview of his new tool called "Fbarc". This tool archives webpages from Facebook using the Graph API.
Nick Ruest (Digital Assets Librarian at York University) gave a talk on Twitter. Next, Shawn Walker (University of Washington) presented "We are doing it wrong!". He explained how the current process of collecting social media does not reflect how people actually view social media.

After that, all the teams presented their projects. Starting with our team, we called our project "Good News/Bad News". We utilized historical captures (mementos) of various local news sites' homepages from Archive-It to prepare our seed dataset. In order to transform the data for our usage, we utilized Webrecorder, a WAT converter, and some custom scripts. We extracted the various headlines featured on the homepage of each site for each day. With the extracted headlines, we analyzed sentiment on various levels, including individual headlines, individual sites, and the whole nation, using the VADER-Sentiment-Analysis Python library. To leverage more machine learning capabilities for clustering and classification, we built a Latent Semantic Indexer (LSI) model using a Ruby library called Classifier Reborn. Our LSI model helped us convey the overlap of discourse across the country. We also explored the possibility of building a Word2Vec model using TensorFlow for advanced machine learning, but due to the limited amount of available time we could not pursue it, despite its great potential. To distinguish between the local and the national discourse, we planned on utilizing Term Frequency-Inverse Document Frequency, but could not put it together in time. For the visualization, we planned on showing an interactive US map along with a heat map of the newspaper locations, with the newspaper ranking as the size of the spot and the color indicating whether it is good news (green) or bad news (red). Also, when a newspaper is selected, a list of associated headlines is revealed (color coded as Good/Bad), along with a pie chart showing the overall percentages of Good/Bad/Neutral, related headlines from various other news sites across the country, and a word cloud of the top 200 most frequently used words. This visualization could also have a time slider that shows the change in sentiment for the newspapers over time. We had many more interesting visualization ideas to express our findings, but the limited amount of time only allowed us to go this far. We have made all of our code and necessary data available in a GitHub repo and are trying to make a live installation available for exploration soon.
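As a minimal sketch of the headline-scoring step (assuming the vaderSentiment package, illustrative headlines, and simple thresholds; this is not our exact datathon code, which lives in the GitHub repo):

# Score headlines with VADER and bucket them as good/bad/neutral news.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

headlines = [
    "Local shelter rescues dozens of animals after storm",   # illustrative
    "City council raises taxes amid budget shortfall",       # illustrative
]

analyzer = SentimentIntensityAnalyzer()
for headline in headlines:
    compound = analyzer.polarity_scores(headline)["compound"]
    label = "good" if compound > 0.05 else "bad" if compound < -0.05 else "neutral"
    print(label, round(compound, 3), headline)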


Next, the team "Fake News" presented their work. The team started with the research questions: "Is it fake news to misquote a presidential candidate by just one word? What about two? Three? When exactly does fake news become fake?". Based on these questions, they hypothesized that "Fake news doesn't only happen from the top down, but also happens at the very first moment of interpretation, especially when shared on social media networks". With this in mind, they wanted to determine how Twitter users were recording, interpreting, and sharing the words spoken by Donald Trump and Hillary Clinton in real time. Furthermore, they also wanted to find out how the "facts" (the accurate transcription of the words) began to evolve into counter-facts or alternate versions of those words. They analyzed Twitter data from the second presidential debate and focused on the most prominent keywords, such as "locker room", "respect for women", and "jail". The analysis results were visualized using a word tree and a bar chart. They also conducted a sentiment analysis, which produced a surprising result: most tweets had positive sentiment toward the locker-room talk. Further analysis showed that apparently sarcastic/insincere comments skewed the sentiment analysis, hence the positive sentiments.


After that, the team "End of Term Transition" presented their project. The group was trying to use public archives to estimate change in the main government domains at the time of each US presidential administration transition. For each of these official websites, they planned to identify the kind and the rate of change using multiple techniques, including Simhash, TF-IDF, edit distance, and efficient thumbnail generation. They investigated each of these techniques in terms of its performance and accuracy. The datasets were collected from the Internet Archive Wayback Machine around the 2001, 2005, 2009, 2013, and 2017 transitions. The team made their work available on GitHub.

Finally, a surprise new team joined: team "Nick", presented by Nick Ruest (Digital Assets Librarian at York University). Nick has been exploring Twitter API mysteries, and he showed some visualizations of odd peaks that occurred.

After the teams presented their work, the judges announced the team with the most points, and the winning team was "End of Term Transition".

This workshop was extremely interesting and I enjoyed it fully. The fourth datathon, Archives Unleashed 4.0: Web Archive Datathon, was announced and will take place at the British Library, London, UK, June 11-13, 2017. Thanks to Matthew Weber, Ian Milligan, and Jimmy Lin for organizing this event, and to Jefferson Bailey, Vinay Goel, and everyone at the Internet Archive.

-Lulwah M. Alkwai