Friday, November 14, 2014

2014-11-14: Carbon Dating the Web, version 2.0



For over a year, Hany SalahEldeen's Carbon Date service has been out of service, mainly because of API changes in some of the underlying modules on which the service is built. Consequently, I have taken up the responsibility of maintaining the service, beginning with the following changes, now available in Carbon Date v2.0.

Carbon Date v2.0


The Carbon Date service now makes its requests to the different modules (archives, backlinks, etc.) concurrently, using threads.
The server framework has been changed from Bottle to CherryPy, which is still a minimalist Python WSGI framework, but is more robust and features a threaded server.
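
To illustrate the concurrent requests mentioned above, here is a minimal sketch using Python's standard threading module. It is not the actual Carbon Date code: the function name run_modules_concurrently and the idea that each module is a callable taking a URL and returning an estimated date (or None) are assumptions made for the example.

```python
import threading


def run_modules_concurrently(url, modules):
    """Run every module on `url` in its own thread and collect the results.

    `modules` maps a module name to a callable that takes a URL and returns
    an estimated creation date, or None. Both are hypothetical stand-ins
    for the real archive and backlink modules.
    """
    results = {}
    lock = threading.Lock()

    def worker(name, func):
        try:
            value = func(url)
        except Exception:
            # One failing module should not take down the whole request.
            value = None
        with lock:
            results[name] = value

    threads = [
        threading.Thread(target=worker, args=(name, func))
        for name, func in modules.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every module has answered
    return results


if __name__ == "__main__":
    # Stub modules used only for demonstration.
    demo_modules = {
        "archives": lambda u: "2010-01-01",
        "backlinks": lambda u: None,
    }
    print(run_modules_concurrently("http://example.com/", demo_modules))
```

With this layout, the total wait is roughly that of the slowest module rather than the sum of all modules.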

How to use the Carbon Date service

There are three ways:
  • Through the website, http://cd.cs.odu.edu/: Given that carbon dating is highly computationally intensive, the site should be used only for small tests, as a courtesy to other users. If you need to Carbon Date a large number of URLs, install the application locally (local.py or server.py).
  • Through the local server (server.py): The second way to use the Carbon Date service is through the local server application, which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.
  • Through the local application (local.py): The third way to use the Carbon Date service is through the local Python application, which can be found at the same repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.

The backlinks efficiency problem

Upon running the Carbon Date service, you will notice a significant difference between the runtime of the backlinks module and that of the other modules. This is because carbon dating backlinks is the most expensive operation in the carbon dating process. Consequently, in the local application (local.py), the backlinks module is switched off by default and is reactivated with the --compute-backlinks option. For example, to Carbon Date cnn.com with the backlinks module switched on, pass --compute-backlinks when running local.py.
Some effort was put towards optimizing the backlinks module; however, my conclusion is that the current implementation cannot be optimized.
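
The opt-in behaviour described above (backlinks off by default, enabled with --compute-backlinks) could be wired up roughly as follows. This is a minimal sketch using Python's argparse, not the actual local.py code. The build_parser name and the positional url argument are assumptions. Only the --compute-backlinks option comes from the service itself.

```python
import argparse


def build_parser():
    """Build a hypothetical command line parser with an opt-in backlinks flag."""
    parser = argparse.ArgumentParser(
        description="Sketch of a Carbon Date style command line (illustrative only)"
    )
    # Assumption: the URL is given as a positional argument.
    parser.add_argument("url", help="URL to carbon date")
    # store_true makes the flag default to False, so backlinks stay off
    # unless the user explicitly asks for them.
    parser.add_argument(
        "--compute-backlinks",
        action="store_true",
        help="also carbon date backlinks (slow)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.compute_backlinks)
```

Keeping the expensive module behind an explicit flag means a default run pays only for the cheaper modules.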

This is because of the cascade of operations associated with each inlink. Given a single backlink (an inlink, that is, an incoming link to the URL), the application retrieves all of the backlink's mementos (which could range from tens to hundreds). Thereafter, the application searches those mementos for the first one in which the link to the URL occurs.

At first glance, one may suggest binary search, since the mementos are in chronological order. However, a link can be added to a page, removed, and added again over time, so there are potentially multiple, non-contiguous mementos which contain the URL. If we check the midpoint memento and the URL is not there, we cannot act upon this information to halve the search space, since either the left half or the right half of the list of mementos could contain the first occurrence of the URL. Therefore, a linear scan is the only method that is guaranteed to find the first occurrence.
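
The argument above can be made concrete with a small sketch. It is illustrative only and not taken from the Carbon Date code. Both function names are hypothetical, and each memento is modelled simply as the set of links found on that archived copy. The binary search shown is the standard "first true" variant, which is only correct when the link, once present, never disappears.

```python
def first_memento_with_link(mementos, contains_link):
    """Linear scan: index of the earliest memento containing the link, else None.

    `mementos` must be in chronological order.
    """
    for index, memento in enumerate(mementos):
        if contains_link(memento):
            return index
    return None


def binary_search_first(mementos, contains_link):
    """Binary search for the first memento containing the link.

    Only valid if presence is monotonic (absent ... absent, present ... present).
    """
    lo, hi = 0, len(mementos)
    while lo < hi:
        mid = (lo + hi) // 2
        if contains_link(mementos[mid]):
            hi = mid       # first occurrence is at mid or earlier
        else:
            lo = mid + 1   # assumes nothing earlier contains the link
    return lo if lo < len(mementos) else None


if __name__ == "__main__":
    target = "http://example.com/"
    # The link appears in the second memento, vanishes, then reappears.
    mementos = [set(), {target}, set(), set(), set(), {target}, {target}]
    has_link = lambda links: target in links

    print(first_memento_with_link(mementos, has_link))  # 1
    print(binary_search_first(mementos, has_link))      # 5
```

In this example the binary search skips the early appearance at index 1, because the empty midpoint memento wrongly rules out the left half.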

I am grateful to everyone who contributed to the debugging of Carbon Date, including George Micros and the members of the Old Dominion University Introduction to Web Science class (Fall 2014). Further recommendations or comments about how this service can be improved are welcome and will be appreciated.

--Nwala
