Tuesday, September 19, 2017

2017-09-19: Carbon Dating the Web, version 4.0

With this release of Carbon Date, we are introducing new features to track testing and enforce standard Python formatting conventions. This version is dubbed Carbon Date v4.0.

We've also decided to switch from MementoProxy and take advantage of the Memgator Aggregator tool built by Sawood Alam.

Of course, with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow our team to catch and address these issues more quickly than before, as explained below.

The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has changed its API to allow only 30-day trials with 1000 requests per month unless someone wants to pay. We also discovered a few more use cases for Pubdate extraction by applying it to the mementos retrieved from Memgator. By default, Memgator provides the Memento-Datetime retrieved from an archive's HTTP headers. However, news articles can contain metadata indicating the actual publication date or time. This gives our tool a more accurate time of an article's publication.
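As a rough illustration of reading a publication date from page metadata, here is a minimal sketch using only the Python standard library. The class name, function name, and list of meta keys are assumptions made for this example; Carbon Date's own Pubdate module may look at different fields.

```python
from html.parser import HTMLParser


class PubdateParser(HTMLParser):
    """Collects the first publication date found in <meta> tags."""

    # Assumption: these keys are illustrative, not Carbon Date's actual list.
    CANDIDATE_KEYS = ("article:published_time", "pubdate", "date")

    def __init__(self):
        super().__init__()
        self.pubdate = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once a date has been found.
        if tag != "meta" or self.pubdate is not None:
            return
        attributes = dict(attrs)
        key = (attributes.get("property") or attributes.get("name") or "").lower()
        if key in self.CANDIDATE_KEYS and attributes.get("content"):
            self.pubdate = attributes["content"]


def extract_pubdate(html):
    """Return the metadata publication date as a string, or None if absent."""
    parser = PubdateParser()
    parser.feed(html)
    return parser.pubdate


# Hypothetical memento content for demonstration
sample = (
    '<html><head>'
    '<meta property="article:published_time" content="2017-07-04T09:15:00Z">'
    '</head></html>'
)
print(extract_pubdate(sample))
```

When a memento carries no such metadata, the function returns None and the Memento-Datetime from the archive remains the only available date.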

What's New

With APIs changing over time, we decided we needed a proper way to test Carbon Date. To address this, we chose the popular Travis CI, which lets us test our application every day using a cron job. Whenever an API changes, a piece of code breaks, or code is styled in an unconventional way, we'll get a notification saying something has broken.

Carbon Date contains modules for getting dates for URIs from Google, Bing, Bitly and Memgator. Over time the code has accumulated various styles with no consistent convention. To address this issue, we decided to conform all of our Python code to PEP 8 formatting conventions.

We found that when using Google query strings to collect dates we would always get a date at midnight. This is simply because there is no timestamp, but rather just a year, month and day. Since midnight is the earliest possible time on a given day, Carbon Date would always choose this as the lowest date. Therefore we've changed this to be the last second of the day instead of the first. For example, the date '2017-07-04T00:00:00' becomes '2017-07-04T23:59:59', which lets a more precise timestamp from another source on the same day take precedence.
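The adjustment described above can be sketched in a few lines of Python. The function name and the second source's timestamp are hypothetical, chosen only to show the effect; this is not the project's actual code.

```python
from datetime import datetime, time


def end_of_day(date_string):
    """Move a date-only value (or midnight timestamp) to 23:59:59 of that day."""
    # Only the YYYY-MM-DD part is meaningful for a date-only source.
    parsed = datetime.strptime(date_string[:10], "%Y-%m-%d")
    shifted = datetime.combine(parsed.date(), time(23, 59, 59))
    return shifted.strftime("%Y-%m-%dT%H:%M:%S")


# Hypothetical estimates from two sources for the same day
google_estimate = end_of_day("2017-07-04T00:00:00")
archive_estimate = "2017-07-04T09:15:00"

# Timestamps in this fixed format sort chronologically as plain strings.
earliest = min(google_estimate, archive_estimate)
print(google_estimate, earliest)
```

Without the shift, the midnight value would win every same-day comparison even though it carries no real time information.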

We've also decided to change the JSON format to something more conventional, as shown below:

Other sources explored

It has been a long-term goal to continuously find and add sources to the Carbon Date application that offer a creation date. However, not all the sources we explore deliver what we expect. Below is a list of APIs and other sources that were tested but did not return a URI creation date. We explored URL shortener APIs such as:
The Bitly URL shortener remains the best, as the Bitly API allows a lookup of full URLs, not just shortened ones.

How to use

Carbon Date is built on top of Python 3 (most machines have Python 2 by default). Therefore we recommend installing Carbon Date with Docker.

We do also host the server version here: http://cd.cs.odu.edu/. However, carbon dating is computationally intensive, the site can only hold 50 concurrent requests, and thus the web service should be used just for small tests as a courtesy to other users. If you have the need to Carbon Date a large number of URLs, you should install the application locally via Docker.


After installing Docker you can do the following:

2013 Dataset explored

The Carbon Date application was originally built by Hany SalahEldeen and described in his 2013 paper. That year, a dataset of 1200 URIs was created to test the application, and it was considered the "gold standard dataset." It's now four years later and we decided to test that dataset again.

We found that the 2013 dataset had to be updated. The dataset originally contained URIs and actual creation dates collected from WHOIS domain lookups, sitemaps, Atom feeds and page scraping. When we ran the dataset through the Carbon Date application, we found Carbon Date successfully estimated 890 creation dates, but 109 URIs had estimated dates older than their actual creation dates. This happened because various web archives held mementos with dates older than what the original sources provided, or because sitemaps might have recorded updated page dates as original creation dates. Therefore, we've taken the oldest archived version of the URI as the actual creation date to test against.

We found that 628 of the 890 estimated creation dates matched the actual creation date, an accuracy of 70.56%, compared with 32.78% in the original evaluation conducted by Hany SalahEldeen. Below you can see a second-degree polynomial curve used to fit the real creation dates.
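For readers who want to check the arithmetic, the accuracy figure follows directly from the two counts reported above. The helper name below is just for illustration.

```python
def accuracy_percent(matched, estimated):
    """Share of estimated creation dates that matched, as a percentage."""
    return round(100 * matched / estimated, 2)


# Counts reported in this post: 628 matches out of 890 estimated dates
accuracy = accuracy_percent(628, 890)
print(accuracy)
```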


Q: I can't install Docker on my computer for various reasons. How can I use this tool?
A: If you can't use Docker, my recommendation is to download the source code from GitHub, create a Python virtual environment, and install the dependencies there with pip.

Q: After x amount of requests Google doesn't give a date when I think it should. What's happening?
A: Google is very good at catching programs (robots) that aren't using its APIs. Carbon Date does not use an API but rather sends a query string, like a browser would, and then looks at the results. You might have hit a CAPTCHA, in which case Google might lock Carbon Date out for a while.

Q: I sent a simple website like http://apple.com to Carbon Date to check the date of creation, but it says it was not found in any archive. Why is that?
A: Websites like apple.com, cnn.com, google.com, etc., all have an exceedingly large number of mementos. The Memgator tool is searching for tens of thousands of mementos for these websites across multiple archiving websites. This request can take minutes which eventually leads to a timeout, which in turn means Carbon Date will return zero archives.

Q: I have another issue not listed here, where can I ask questions?
A: This project is open source on GitHub. Just navigate to the Issues tab on GitHub, start a new issue and ask away!

Carbon Date 4.0? What about 3.0?

With this being Carbon Date 4.0, there have been three previous blog posts for this project! You can find them here:
-Grant Atkins
