Thursday, March 15, 2018

2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertister

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscriber based sites, such as The Financial Times or Wall Street Journal, because these sites only provided snippets of an article, and then users would be confronted with a "Subscribe Now" sign to view the remaining content. The New York Times, as well as other news sites, also have subscriber based content but access is only limited once a user has exceeded a set number of stories seen. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took all of these URI-Ms from the homepage of their respective news sites and tried to see how the Internet Archive captured these URI-Ms over a period of a month within the Internet Archive.



The image above shows requests sent to the Internet Archive's memento API with the initial request being 0 days and then adding 1, 7, 30 days to the inital request to see if the URI-M retrieved resolved to something other than 404. The initial request to these mementos all had a 404 status code. Adding a day to the memento and then requesting a new copy from the Internet Archive resulted in some of the URI-Ms resolving to with a 200 response code showing that these articles became available. Adding 7 days to the initial request date time shows that by this time the Internet Archive has found copies for all but 1 URI-M. This result is then repeated when adding 30 days to the initial memento request date time. The response code "0" indicates no response code caused by a infinite redirect loop. The chart follows the idea that content is released as free once a period of time has passed.

For the New York Times articles, they end up redirecting to a different part of the New York Times website: https://web.archive.org/web/20100726195833/http://www.nytimes.com/glogin. Although each of the URIs resolve with a 404 status code an earlier capture shows that it was a login page asking for signup or subscription:

Paywalls in Academia

Paywalls restrict not just news content but also academic content. When users are directly linked through a DOI assigned to a paper,  they are often redirected to a splash page showing a short description of a paper but not the actual pdf document. An example of this is: https://doi.org/10.1007/s00799-014-0120-4. This URI currently points to springer.com for a published paper, but the content is only available via purchase:

In order to actually access the content a user is first redirected to the splash page https://link.springer.com/article/10.1007%2Fs00799-014-0120-4, and is then required to purchase the requested content.

If we search for this DOI in the Internet Archive, http://web.archive.org/web/20180316075554/https://doi.org/10.1007/s00799-014-0120-4, we find that it will ultimately lead to a springer.com memento, http://web.archive.org/web/20160916200215/http://link.springer.com/article/10.1007/s00799-014-0120-4, of the same paywall we found on the live web. This shows that both the DOI and the springer.com paywall are archived, but the PDF is not ("Buy (PDF) USD 39.95").

Organizations that are willing to pay for a subscription to an association that host the academic papers will have access to content. A popular example is the ACM Digital Library. When users visit pages like springerlink, they may not have the option of getting the blue "Download PDF" button but rather a grey button signifying it is disabled for a non-subscribed user.

Van de Sompel et al. investigated 1.6 million URI references from arXiv and PubMed Central and found that over 500,000 of the URIs were locating URIs indicating the current document location. These URIs can expire over time and removes the use of DOIs.

Searching for Similarity

When considering hard paywall sites like Financial Times (FT) and Wall Street Journal (WSJ) it's intuitive that most of the paywall pages a non-subscribed user sees will be relatively the same. We experimented with 10 of the top WSJ articles on 11/01/2016 where each article was scraped from the homepage of WSJ. From these 10 articles we did pairwise comparisons between each article by taking the SimHash of each article's HTML representation and then computing the Hamming distance between each unique paired SimHash bit string. 

We found that pages with completely different representations stood out with a higher hamming distance of 40+ bits, while articles that had the same styled representation had at most a 3-bit hamming distance, regardless if the article was a snippet or a full length article. This showed that SimHash was not well suited for discovering differences in content but rather differences in content representation such as changes in: CSS, HTML, or Javascript. It didn't help our observations that WSJ was including entire font-family data text inside of their HTML at the time. In reference to Maciej Ceglowski's post on "The Website Obesity Crisis," WSJ injecting a CSS font-family data string does not aid in a healthy "web pyramid":



From here, I decided to explore the option of using a binary image classifier for a thumbnail of a news site, labeling an image as a "paywall_page" or a "content_page." To accomplish this I decided to use Tensorflow and the very easily applicable examples provided by the "Tensorflow for poets" tutorial. Utilizing the MobileNet model, I trained 122 paywall images and 119 content page images, mainly news homepages and articles. The images were collected using Google Images and manually classified as content or paywall pages.


I trained the model with the new images for 4000 iterations and this produced an accuracy of 80-88%. As a result, I built a simple web application named paywall-classify, that can be found on Github, that utilizes Puppeteer to take screenshots for a given list of URIs (maximum 10) at a resolution of 1920x1080 and then uses Tensorflow to classify images as well. More instructions on how to use the application can be found in the repository readme.

There are many other techniques that could be considered for image classification of webpages, for example, slicing a full page image of a news website into sections. However this approach would more than likely show bias towards the content classification as the "subscribe now" seems to always be at the top of an article meaning slicing would only have this portion in 1/n slices. For this application I also didn't consider the possibility of scrolling down a page to trigger a javascript popup of a paywall message.

Other approaches might utilize textual analysis, such as performing Naive Bayes classification on terms collected from a paywall page and then building a classifier from there. 

What to take away

It's actually difficult to find a cause as to why some the URI-Ms listed result in 404 responses while other articles for those sites may be a 200 response on their first memento. The New York Times has a limit of 10 "free" articles for each user, so perhaps at crawl time the Internet Archive hit its quota. As per Mat Kelly et al. in Impact of URI Canonicalization on Memento Count, they talk about "archived 302s", indicating at crawl time a live web site returns an HTTP 302 redirect, meaning these New York Times articles may actually be redirecting to a login page at crawl time.

-Grant Atkins (@grantcatkins)

1 comment:

  1. Thanks you for sharing the article. The data that you provided in the blog is infromative and effectve. Through you blog I gained so much knowledge. Also check my collection at selenium Online Training Blog

    ReplyDelete