Tuesday, September 19, 2017

2017-09-19: Carbon Dating the Web, version 4.0

With this release of Carbon Date we are introducing new features to automate testing and to enforce standard Python formatting conventions. This version is dubbed Carbon Date v4.0.



We've also decided to switch from MementoProxy and take advantage of MemGator, the Memento aggregator tool built by Sawood Alam.

Of course, with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow our team to catch and address these issues more quickly than before, as explained below.

The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has since changed its API to only offer a 30-day trial with 1000 requests per month unless someone wants to pay. We also discovered a few more use cases for Pubdate extraction by applying it to the mementos retrieved from MemGator. By default, MemGator provides the Memento-Datetime retrieved from an archive's HTTP headers. However, news articles can contain metadata indicating the actual publication date or time. This gives our tool a more accurate time of an article's publication.

What's New

With APIs changing over time, we decided we needed a proper way to test Carbon Date. To address this issue, we chose the popular Travis CI. Travis CI enables us to test our application every day using a cron job. Whenever an API changes, a piece of code breaks, or something is styled in an unconventional way, we'll get a notification saying something has broken.
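To give a rough idea of what this looks like, the sketch below shows a minimal .travis.yml for a Python 3 project that runs unit tests and a PEP 8 style check on every build. This is an illustrative assumption of such a configuration, not a copy of our actual file, and the daily run itself is enabled as a cron job in the Travis CI repository settings rather than in the file:

    # Illustrative .travis.yml (assumed example, not the project's actual configuration)
    language: python
    python:
      - "3.6"
    install:
      - pip install -r requirements.txt
      - pip install pycodestyle          # PEP 8 style checker
    script:
      - pycodestyle .                    # fail the build on style violations
      - python -m unittest discover      # run the test suite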

Carbon Date contains modules for getting dates for URIs from Google, Bing, Bitly, and MemGator. Over time the code has accumulated various styles with no single convention. To address this issue, we decided to conform all of our Python code to the PEP 8 formatting conventions.

We found that when using Google query strings to collect dates, we would always get a date at midnight. This is simply because there is no timestamp, but rather just a year, month, and day. This caused Carbon Date to always choose this as the lowest (earliest) date. Therefore, we've changed it to be the last second of the day instead of the first second of the day. For example, the date '2017-07-04T00:00:00' becomes '2017-07-04T23:59:59', which gives better precision for the creation timestamp.
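The idea behind this adjustment can be sketched in a few lines of Python. This is a minimal illustration of the rule described above (the function name is ours for the example), not the exact code in Carbon Date:

    from datetime import datetime, time

    def to_end_of_day(date_str):
        """Shift a date-only value (midnight) to the last second of that day."""
        parsed = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%S")
        if parsed.time() == time(0, 0, 0):      # no real timestamp, only a date
            parsed = parsed.replace(hour=23, minute=59, second=59)
        return parsed.isoformat()

    print(to_end_of_day("2017-07-04T00:00:00"))  # -> 2017-07-04T23:59:59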

We've also decided to change the JSON response format to something more conventional, as shown below.
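A minimal, hypothetical example of such a response is sketched here; the field names and values are assumptions made for illustration, not the project's exact schema, so consult the repository documentation for the real format:

    {
      "uri": "http://example.org/index.html",
      "estimated-creation-date": "2017-07-04T23:59:59",
      "sources": {
        "google": {"earliest": "2017-07-04T23:59:59"},
        "bitly": {"earliest": "2017-07-06T02:11:54"},
        "archives": {"earliest": "2017-07-05T13:27:10"}
      }
    }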

Other sources explored

It has been a long-term goal to continuously find and add sources to the Carbon Date application that offer a creation date. However, not all of the sources we explore deliver what we expect. Among others, we tested several URL shortener APIs, but they were unsuccessful in returning a URI creation date.
The Bitly URL shortener still remains the best, as the Bitly API allows a lookup of full URLs, not just shortened ones.

How to use

Carbon Date is built on top of Python 3 (most machines have Python 2 by default). Therefore we recommend installing Carbon Date with Docker.

We also host the server version here: http://cd.cs.odu.edu/. However, carbon dating is computationally intensive and the site can only handle 50 concurrent requests, so the web service should be used only for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally via Docker.


Instructions:

After installing Docker, you can do the following:
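As a minimal sketch of that workflow (assuming the image is published on Docker Hub as oduwsdl/carbondate; check the project README for the exact image name and run arguments), you would pull the image and then run the tool inside a container:

    docker pull oduwsdl/carbondate
    docker run -it --rm oduwsdl/carbondate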

2013 Dataset explored

The Carbon Date application was originally built by Hany SalahEldeen and described in his 2013 paper. In 2013, a dataset of 1200 URIs was created to test the application, and it was considered the "gold standard dataset." It's now four years later, and we decided to test that dataset again.

We found that the 2013 dataset had to be updated. The dataset originally contained URIs and actual creation dates collected from WHOIS domain lookups, sitemaps, atom feeds, and page scraping. When we ran the dataset through the Carbon Date application, we found that Carbon Date successfully estimated 890 creation dates, but 109 URIs had estimated dates older than their actual creation dates. This was because various web archives held mementos with creation dates older than what the original sources provided, or because sitemaps might have taken updated page dates as original creation dates. Therefore, we've taken the oldest version of the archived URI as the actual creation date to test against.

We found that 628 of the 890 estimated creation dates matched the actual creation date, achieving 70.56% accuracy - up from the 32.78% originally reported by Hany SalahEldeen. Below you can see a second-degree polynomial curve fit to the real creation dates.

Troubleshooting:

Q: I can't install Docker on my computer for various reasons. How can I use this tool?
A: If you can't use Docker, then my recommendation is to download the source code from GitHub, create a Python virtual environment, and install the dependencies with pip from there.

Q: After a certain number of requests, Google doesn't give a date when I think it should. What's happening?
A: Google is very good at catching programs (robots) that aren't using its APIs. Carbon Date is not using an API but rather issuing a query string, like a browser would, and then looking at the results. You might have hit a CAPTCHA, so Google might lock Carbon Date out for a while.

Q: I sent a simple website like http://apple.com to Carbon Date to check the date of creation, but it says it was not found in any archive. Why is that?
A: Websites like apple.com, cnn.com, google.com, etc., all have an exceedingly large number of mementos. The MemGator tool is searching through tens of thousands of mementos for these websites across multiple archiving websites. This request can take minutes, which eventually leads to a timeout, which in turn means Carbon Date will return zero archives.

Q: I have another issue not listed here, where can I ask questions?
A: This project is open source on GitHub. Just navigate to the issues tab on GitHub, start a new issue, and ask away!

Carbon Date 4.0? What about 3.0?

With this being Carbon Date 4.0, there have been three previous blog posts for this project! You can find them here:

10/24/17 Update - API route change:

An important update has gone out for those using our demo server as an endpoint! The API endpoint to retrieve the JSON for the carbon date of a URI is now "http://cd.cs.odu.edu/cd/" instead of "http://cd.cs.odu.edu/cd?url=", where the requested URI should follow the "/cd/". This addresses an issue we had with parameter handling in a URI. Also, we've updated the UI page to show the estimated date if one was found. To send the carbon date of a URI to a friend, you could send it like so: http://cd.cs.odu.edu/#example.org/index.html, which redirects them to our UI with the carbon date service running as soon as they land on the page.
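For example, assuming the requested URI is appended directly after the new route as described above (the exact handling of the scheme prefix is an assumption here), a command-line request against the demo server might look like this:

    curl "http://cd.cs.odu.edu/cd/http://example.org/index.html"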

-Grant Atkins

Wednesday, September 13, 2017

2017-09-13: Pagination Considered Harmful to Archiving



Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015


Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015
https://web.archive.org/web/20151030092546/https://www.usnews.com/education/best-global-universities/rankings

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages which comprised the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (or 28%) had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursardar and Shipman (2017). Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews, single page writings) and surveyed regarding their expectation for later access to additional content that was linked from or appeared on the main page. Specifically, they investigated (1) how relationships between page content affect expectations and (2) how perceptions of content value relate to internet resources. In other words, if I save the main page as a resource, what else should I expect to be saved along with it?

I experienced this paradox firsthand when I attempted to locate a historical entry from the 2016 edition of the U.S. News Best Global University Rankings. As shown in Figure 1, October 30, 2015 is a particular date of interest because on the day prior, a revision of the original ranking for the University at Buffalo-SUNY was reported. The university's ranking was revised due to incorrect data related to the number of PhD awards. A re-calculation of the ranking metrics resulted in a change of the university's ranking position from a tie at No. 344 to a tie at No. 181.


Figure 3 - Summary of U.S. News
https://web.archive.org/web/*/https://www.usnews.com/education/best-global-universities/rankings



 

Figure 4 - Capture of U.S. News Revision for Buffalo-SUNY
https://web.archive.org/web/20160314033028/http://www.usnews.com/education/best-global-universities/rankings?page=19

A search of the Internet Archive, Figure 3, shows the U.S. News web site was saved 669 times between October 28, 2014 and September 3, 2017. We should first note that regardless of the ranking year you choose to locate via a web search, U.S. News reuses the same URL from year to year. Therefore, an inquiry against the live web will always direct you to their most recent publication. As of September 3, 2017, the redirect would be to the 2017 edition of their ranking list. Next, as shown in Figure 2, the 2016 U.S. News ranking list consisted of 750 universities presented in groups of 10 spread across 75 web pages. Therefore, the revised entry for the University at Buffalo-SUNY at rank No. 181 should appear on page 19, Figure 4.

Page No.   Captures   Start Date   End Date
1          669        10/28/2014   09/03/2017
2          434        10/28/2014   08/15/2017
3          171        10/28/2014   07/17/2017
4          43         10/28/2014   01/13/2017
5          37         10/28/2014   01/13/2017
6          7          10/28/2014   06/29/2017
7          4          09/29/2015   01/13/2017
8          1          01/12/2017   01/12/2017
9          2          01/28/2016   01/12/2017
10         1          01/12/2017   01/12/2017
11         2          01/12/2017   02/22/2017
12         1          02/22/2017   02/22/2017
13         1          02/22/2017   02/22/2017
14         2          02/22/2017   03/30/2017
15         2          03/12/2017   03/12/2017
16         2          03/30/2017   06/30/2017
17         2          06/20/2015   07/16/2017
18         4          06/19/2015   07/16/2017
19         3          06/18/2015   07/16/2017
Table 1 - Page Captures of U.S. News (Abbreviated Listing)

While I could readily locate the main page of the 2016 list as it appeared on October 30, 2015, I noted that subsequent pages were archived with diminishing frequency and over a much shorter period of time. We see in Table 1, after the first three pages, there can be a significant variance in the frequency with which the remaining site pages are crawled. And, as was noted earlier, more than a quarter (28%) of the ranking list cannot be reconstructed at all. Ainsworth and Nelson examined the degree of temporal drift that can occur during the display of sparsely archived pages using the Sliding Target policy allowed by the web archive user interface (UI); namely many years in just a few clicks. Since a substantial portion of the U.S. News ranking list is missing, it is very likely the web browsing experience will result in a hybrid list of universities that encompasses different ranking years as the user follows the page links.


Figure 5 - Frequency of Page Captures

Ultimately, we found that page 19 had been captured three times during the specified time frame. However, the page containing the revised ranking that was of interest, Figure 4, was not available in the archive until March 14, 2016; almost five months after the ranking list had been updated. Further, in Figure 5, we note heavy activity for the first few and last few pages of the ranking list, which may occur because, as shown in Figure 2, these links are presented prominently on page 1. The remaining pages 3 through 5 must be discovered manually by clicking on the next page. We also note in Figure 5 the sporadic capture scheme for these intermediate pages.

Current web designs which feature pagination create a frustrating experience for the user when subsequent pages are omitted from the archive. It was my expectation that all pages associated with the ranking list would be saved in order to maintain the integrity of the complete listing of universities as they appeared on the publication date. My intuition is consistent with Poursardar and Shipman, who, among their other conclusions, noted that navigational distance from the primary page can affect perceptions regarding what is considered to be viable content that should be preserved. However, for multi-page articles, nearly 80% of the participants in their study considered linked information in the later pages as part of the resource. This perception was especially profound "when the content of the main and connected pages are part of a larger composition or set of information," as in perhaps a ranking list.


Overall, the findings of Poursardar and Shipman, along with our personal observations, indicate that archiving systems require an alternative methodology or domain rules that recognize when content spread across multiple pages represents a single collection or composite resource that should be preserved in its entirety. From a design perspective, we can only wonder why there isn't a "view all" link on multi-page content such as the U.S. News ranking list. This feature might present a way to circumvent paginated design schemes so the Internet Archive could obtain a complete view of a particular web site, especially if the "view all" link is located on the first few pages, which appear to be crawled most often. On the other hand, the use of pagination might also represent a conscious choice by the web designer or site owner as a way to limit page scraping, even though people can still find a way to do so. Ultimately, the collateral damage associated with this type of design scheme is an uneven distribution in the archive, resulting in an incomplete archival record.

Sources:

Scott G. Ainsworth and Michael L. Nelson. "Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive." International Journal on Digital Libraries 16: 129-144. DOI: 10.1007/s00799-014-0120-4

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Faryaneh Poursardar and Frank Shipman, "What Is Part of That Resource? User Expectations for Personal Archiving," Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries, 2017.
-- Corren (@correnmccoy)