2016-09-20: Carbon Dating the Web, version 3.0

Because of API changes, the old Carbon Date tool is out of date and some modules, such as Topsy, no longer work. I have taken over maintaining and extending the service, beginning with the changes now available in Carbon Date v3.0.

Carbon date 3.0

What's new

New services have been added, such as Bing search, Twitter search, and pubdate parsing.

The new software architecture lets us load specific scripts or disable specific services at runtime.

The server framework has been changed from CherryPy to Tornado, which is also a minimalist Python web server but offers better performance.

How to use the Carbon Date service

Through the website, http://carbondate.cs.odu.edu: Given that carbon dating is computationally intensive, the site can only hold 50 concurrent requests, and thus the web service should be used just for small tests as a courtesy to other users. If you have the need to Carbon Date a large number of URLs, you should install the application locally. Note that the old link http://cd.cs.odu.edu still works.

Through local installation: The project source can be found at the following repository: https://github.com/oduwsdl/CarbonDate. Consult README.md for instructions on how to install the application.

Dockerizing the Carbon Date Tool

Carbon Date now supports only Python 3. Because of potential package conflicts between Python 2 and Python 3 (most machines have Python 2 installed by default), we recommend running Carbon Date in Docker.

Instructions:
Build the Docker image from source

Install Docker.
Clone the GitHub source to a local directory.
Run
docker build -t carbon .
Then you can choose either server or local mode

server mode
docker run --rm -it -p 8888:8888 carbon ./main.py -s
Don't forget to map a host port to the server port in the container.
Then, in the browser, visit
http://localhost:8888
for the index page, or
in the terminal, request
http://localhost:8888/cd?url=http://cnn.com
for a direct query.
local mode
docker run --rm -it carbon ./main.py -l search http://example.org
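
The direct query in server mode can also be issued from a script. The following is a minimal sketch, assuming the container is running with port 8888 mapped as shown above; the helper names build_query_url and carbon_date are hypothetical and not part of Carbon Date, and the response is returned as raw text because its format is not described here.

```python
from urllib.request import urlopen


def build_query_url(target, host="localhost", port=8888):
    """Build the direct-query URL in the same form as the example above."""
    return f"http://{host}:{port}/cd?url={target}"


def carbon_date(target, host="localhost", port=8888, timeout=300):
    """Send the query to a running Carbon Date server and return the raw body.

    The generous timeout is an assumption, chosen because carbon dating
    is computationally intensive.
    """
    with urlopen(build_query_url(target, host, port), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling carbon_date("http://cnn.com") would request the same address as the terminal example.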

Or get the deployed image automatically from Docker Hub:

System Design

To make the Carbon Date tool easier to maintain and develop, the structure of the application has been refactored. The system now has four layers:

When a query is sent to the application, it proceeds as follows:

Adding a new module to Carbon Date

All modules are now loaded and executed automatically. The module manipulator searches for the entry function of each module and calls it. A new module is picked up and executed without altering any other script, as long as it defines its entry function in the way described below.

Name the module's main script cdGet<Module name>.py
and ensure the entry function is named:

get<Module name>(url, outputArray, indexOfOutputArray, verbose=False, **kwargs)

Alternatively, choose your own entry function name by assigning it as a string to the 'entry' variable at the beginning of your script.
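
The naming convention can be illustrated with a small loader. This is only a sketch of how such discovery might work, not the actual module manipulator: the functions find_entry, load_modules and demo are hypothetical, and the two throwaway modules written by demo exist purely for illustration.

```python
import importlib.util
import pathlib
import tempfile


def find_entry(module, prefix="cdGet"):
    """Return the entry function of a loaded module, or None if it has none."""
    # A module may name its own entry point through a string variable 'entry'.
    name = getattr(module, "entry", None)
    if not isinstance(name, str):
        # Default convention: cdGetBaidu.py exposes getBaidu.
        name = "get" + module.__name__[len(prefix):]
    func = getattr(module, name, None)
    return func if callable(func) else None


def load_modules(directory, prefix="cdGet"):
    """Import every cdGet*.py script in a directory; map module name to entry function."""
    entries = {}
    for path in sorted(pathlib.Path(directory).glob(prefix + "*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        func = find_entry(module, prefix)
        if func is not None:
            entries[path.stem[len(prefix):]] = func
    return entries


def demo():
    """Write two throwaway modules, load them, and call each entry function."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        # Module following the default naming convention.
        (root / "cdGetBaidu.py").write_text(
            "def getBaidu(url, outputArray, indexOfOutputArray, verbose=False, **kwargs):\n"
            "    outputArray[indexOfOutputArray] = 'from-baidu'\n"
        )
        # Module that overrides the entry function name.
        (root / "cdGetCustom.py").write_text(
            "entry = 'lookup'\n"
            "def lookup(url, outputArray, indexOfOutputArray, verbose=False, **kwargs):\n"
            "    outputArray[indexOfOutputArray] = 'from-custom'\n"
        )
        entries = load_modules(root)
        output = [""] * len(entries)
        for index, name in enumerate(sorted(entries)):
            entries[name]("http://example.org", output, index)
        return sorted(entries), output
```

In this sketch, dropping another cdGet<Module name>.py file into the directory is enough for it to be found on the next call to load_modules.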

For example, suppose a new module uses baidu.com as a search engine to find the potential creation date of a URI. The script should be named cdGetBaidu.py, and the entry function should be:

getBaidu(url, outputArray, indexOfOutputArray, verbose=False, **kwargs)

core.py passes outputArray and indexOfOutputArray into the function, along with "displayArray" in the kwargs. Note that core.py uses outputArray to compute the earliest creation date, so only one value should be assigned to it. The displayArray holds the value to be returned for display; it can be the same as the resulting creation date or anything else in the form of an array of tuples.
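
To show why each module writes a single value into its own slot, here is a rough sketch of the calling side. It is not the actual core.py: running the modules in parallel threads, the timestamp format, and the names run_modules and earliest are all assumptions made for illustration.

```python
import threading
from datetime import datetime

# Assumed timestamp format; the real modules may use a different one.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def run_modules(url, entries):
    """Call every entry function in its own thread and collect the results."""
    outputArray = [""] * len(entries)   # one creation date per module
    displayArray = [""] * len(entries)  # one display value per module
    threads = []
    for index, func in enumerate(entries):
        thread = threading.Thread(
            target=func,
            args=(url, outputArray, index),
            kwargs={"displayArray": displayArray},
        )
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return outputArray, displayArray


def earliest(outputArray):
    """Return the earliest non-empty date in outputArray, or '' if none was found."""
    dates = [datetime.strptime(value, DATE_FORMAT) for value in outputArray if value]
    return min(dates).strftime(DATE_FORMAT) if dates else ""
```

Because every module owns exactly one index, no two modules write to the same slot, and a module that finds nothing simply leaves its slot empty.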

In this example, once we get the result from baidu.com, the code to return these values is:
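
A minimal sketch of such a module might look like the following. The lookup itself is replaced by a hypothetical stub, find_earliest_baidu_date, that returns a placeholder value, and storing the same date in displayArray is just one of the options described above.

```python
def find_earliest_baidu_date(url):
    """Hypothetical stand-in for the real baidu.com lookup (not shown here)."""
    # Placeholder value for illustration only, not a real search result.
    return "2000-01-01T00:00:00"


def getBaidu(url, outputArray, indexOfOutputArray, verbose=False, **kwargs):
    """Entry function for a hypothetical cdGetBaidu.py module."""
    creation_date = find_earliest_baidu_date(url)
    if verbose:
        print("baidu.com:", creation_date)

    # Exactly one value goes into outputArray; core.py uses it
    # to compute the earliest creation date across all modules.
    outputArray[indexOfOutputArray] = creation_date

    # displayArray arrives through kwargs; here it simply mirrors the date.
    if "displayArray" in kwargs:
        kwargs["displayArray"][indexOfOutputArray] = creation_date
    return creation_date
```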

Source maintenance

Some web services may change, so the modules that depend on them need to be updated regularly.

In particular, the Twitter module should be updated whenever Twitter changes its page hierarchy. Currently, cdGetTwitter.py crawls the Twitter search page and parses the timestamp of each tweet in the results, so the existing algorithm may stop working if Twitter moves the tweets' timestamps to other tags in the future.

The Twitter script should therefore be updated periodically until Twitter allows users to retrieve tweets more than one week old through the Twitter API.

I am grateful to everyone who helped me with Carbon Date, especially Sawood Alam, who helped greatly with deploying the server and gave countless pieces of advice about refactoring the application, and John Berlin, who advised me to use Tornado instead of CherryPy. Further recommendations or comments about how this service can be improved are welcome and will be appreciated.

--Zetan


Web Science and Digital Libraries Research Group