2013-04-19: Carbon Dating the Web
(note: Carbon Date 2.0 was released on 2014-11-14)
In the course of our research we often needed to determine when a certain web resource was created. In many cases this question is fairly straightforward to answer by examining the resource itself: articles often carry publication timestamps, social media posts carry posting times, and for other resources the creation date can be estimated by reading the content. This is simple to do manually, but it is hard to automate when the dataset of resources is large.
To solve this problem we conducted several experiments to determine automatically when a resource was created. When a resource is created it is often indexed by search engines, archived in public web archives, and shared on social media, leaving trails of its existence. We trace those trails and use the earliest appearance across all of them as a close estimate of the creation date. The timeline below illustrates a common scenario in the lifetime of a resource.
We also examined whether the resource’s HTTP headers carry a Last-Modified timestamp and whether it is feasible to use it as an estimate of the creation date. In addition, we examine the resource’s backlinks and estimate their creation dates, which can be easier to extract and which give us further insight into when the resource itself was created.
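The earliest-trail idea can be sketched in a few lines of Python. This is a minimal illustration, not the CarbonDate implementation: the function names `parse_timestamp` and `estimate_creation_date`, the source labels, and the sample values are assumptions made for this example.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SS' or 'YYYY-MM-DD'; return None if unparseable."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            continue
    return None


def estimate_creation_date(evidence):
    """Return (source, datetime) for the earliest dated trail, or None.

    `evidence` maps a source label (hypothetical names) to a timestamp
    string; sources with missing or unparseable values are ignored.
    """
    dated = {source: parse_timestamp(value) for source, value in evidence.items()}
    dated = {source: when for source, when in dated.items() if when is not None}
    if not dated:
        return None
    earliest_source = min(dated, key=dated.get)
    return earliest_source, dated[earliest_source]


# Illustrative values only.
evidence = {
    "Last Modified": "2012-04-20T21:52:07",
    "Bitly": "2011-03-24T10:44:12",
    "Search engine": "2009-11-16",
    "Archives": "2009-09-30T11:58:25",
    "Backlinks": None,  # no backlink date found
}
print(estimate_creation_date(evidence))
```

With these values the archive timestamp is the earliest, so it is the one selected as the estimate. Sources that return nothing simply drop out of the comparison.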
To test the accuracy of our estimates we collected 1200 resources whose creation dates we could extract manually from different sources. We tested our model and were able to estimate a creation date for over 75% of the resources, with 33% matching the exact creation date.
HTTP/1.0 200 OK
Date: Fri, 01 Mar 2013 04:44:47 GMT
Server: WSGIServer/0.1 Python/2.6.5
Content-Length: 550
Content-Type: application/json; charset=UTF-8

{
  "URI": "http://www.mementoweb.org",
  "Estimated Creation Date": "2009-09-30T11:58:25",
  "Last Modified": "2012-04-20T21:52:07",
  "Bitly": "2011-03-24T10:44:12",
  "Topsy.com": "2009-11-09T20:53:20",
  "Backlinks": "2011-01-16T21:42:12",
  "Google.com": "2009-11-16",
  "Archives": {
    "Earliest": "2009-09-30T11:58:25",
    "By Archive": {
      "wayback.archive-it.org": "2009-09-30T11:58:25",
      "api.wayback.archive.org": "2009-09-30T11:58:25",
      "webarchive.nationalarchives.gov.uk": "2010-04-02T00:00:00"
    }
  }
}
We have also published the code on GitHub. You can download it from https://github.com/HanySalahEldeen/CarbonDate, along with the installation instructions. To use this service, first register with Bitly and Topsy and get their corresponding API keys. Second, modify the config file by adding your keys. Finally, launch server.py on your designated IP and port.
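Once the server is running, a client only needs to append the target URI to the service path and decode the JSON body. The sketch below assumes the request layout used by the public demo (the service path followed directly by the target URI) and that a self-hosted copy serves the same /cd/ path. The base address `http://localhost:8080/cd/` and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def build_request_url(base, target_uri):
    """Join the service base path and the URI to be dated."""
    return base.rstrip("/") + "/" + target_uri


def fetch_estimate(base, target_uri, timeout=30):
    """Query a running CarbonDate server and return the decoded JSON body.

    Assumes the server answers with a JSON object such as the sample
    response shown in this post.
    """
    with urlopen(build_request_url(base, target_uri), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def estimated_creation_date(payload):
    """Return the 'Estimated Creation Date' field, or None if it is absent."""
    return payload.get("Estimated Creation Date")


# Hypothetical local deployment; adjust to your own IP and port.
BASE = "http://localhost:8080/cd/"
print(build_request_url(BASE, "http://www.mementoweb.org"))
```

Calling `fetch_estimate(BASE, "http://www.mementoweb.org")` against a live server would return a dictionary that can be passed to `estimated_creation_date`.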
This work was published at the third annual TempWeb workshop at the WWW 2013 conference in Rio de Janeiro, Brazil.
- Hany M. SalahEldeen, Michael L. Nelson, Carbon Dating The Web: Estimating the Age of Web Resources, Proceedings of TempWeb03, WWW 2013. (Also available as a Technical Report http://arxiv.org/abs/1304.5213).
http://cd.cs.odu.edu/cd/
No longer seems to work
Our apologies, the server was down for maintenance. Now it is up and running.
"Topsy.com": "Topsy Key has expired",
Sorry to keep seeming to moan - your website is very useful to anyone researching a news story.
mikej
www.i-programmer.info
Hi Mike: the various keys (e.g., Topsy, Bitly) are rate limited; when they exceed X requests/hour, you have to wait until the next hour (or day or whatever). Since this is a public demo service, we can't really control who uses it or how much. Your best bet is to install your own copy of CD w/ your own keys: https://github.com/HanySalahEldeen/CarbonDate It should not be too hard, and it will ensure you get timely results. We're glad you like CD!
Thanks for that - it makes perfect sense.
We did a news item on it because of me saying how useful it was :-)
Have a look at
http://www.i-programmer.info/news/81-web-general/5939-carbon-dating-the-web.html
and tell us if we got anything wrong.
The article looks good to me -- much thanks for writing it!
curl -i http://cd.cs.odu.edu/cd/http://www.mementoweb.org ... cdplayerklein.blogspot.de