As the second step I needed to obtain the corresponding tags for each URI. I tried to be a good programmer and used the Delicious API to query for the tags instead of parsing the web interface. In order to use the API (v1) you need an account with Delicious/Yahoo. The request for
for example returns an XML-formatted response with the top five popular tags:
search google search engine engine web
The API returns at most the top five tags per URI even though there may be more than five visible through the web interface.
I split my URI set into five batches and ran five times a thousand queries with the same account and from the same IP address, all within 30 minutes. To my surprise, roughly 50% of the URIs did not return any tags even though they are indexed by Delicious. My intentions were good, but a 50% loss was too much, so I turned my attention to screen scraping the HTML page. You need to generate the MD5 hash value of each URI (including http://) and append it to the proper URI. For example, for http://www.google.com you need to request
By parsing the source with simple regular expressions you can extract at most the top 30 tags, along with how often users have applied each tag to this URI. This path turned out to be fast and reliable, and it gives better results since you get more than just five tags.
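The two scraping steps above (hash the URI, then pull tags out of the page source with a regular expression) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are made up, and the HTML pattern is hypothetical because the actual Delicious markup is not shown in this post, so it would need to be adapted to the real page.

```python
import hashlib
import re


def delicious_url_hash(uri):
    """Return the hex MD5 digest of the full URI, scheme included."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


# HYPOTHETICAL markup: a tag name followed by its usage count.
# The real page layout is an assumption and must be checked by hand.
TAG_PATTERN = re.compile(
    r'<span class="tag">([^<]+)</span>\s*<span class="count">(\d+)</span>'
)


def extract_tags(html, limit=30):
    """Return up to `limit` (tag, frequency) pairs in page order."""
    pairs = TAG_PATTERN.findall(html)[:limit]
    return [(name.strip(), int(count)) for name, count in pairs]
```

Note that the hash is computed over the exact string, so http://www.google.com and http://www.google.com/ produce different digests and therefore different pages.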
The discrepancy between the API and the web interface however raised some questions and so I will share some statistics about my data and provide theories trying to explain the observed behavior:
I only collected 4969 unique URIs. Apparently the recent tool distinguishes between, e.g., google.com and www.google.com, and possibly www.google.com/.
The API did not return any tags for 78 URIs but the web interface provided tags for all 4969 URIs. Maybe the API accesses a smaller index than the web interface? The recent tool however may pull data from the "live" index. Similar behavior was observed by Frank McCown for search engine caches (JCDL 2007).
I got down to 78 URIs from the original 50% by distributing the queries over five different IP addresses and re-querying the API dozens of times, stretched over an entire day. The API seems to be sensitive to high-frequency requests or is simply not very powerful.
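The re-querying described here can be approximated by a small throttled retry loop. The sketch below is not the script I actually ran; `fetch_tags` is a hypothetical callable standing in for whatever issues the API request (it should return an empty list when no tags come back), and the round count and pause are arbitrary placeholders.

```python
import time


def requery_missing(uris, fetch_tags, max_rounds=5, pause_seconds=1.0,
                    sleep=time.sleep):
    """Re-query URIs that returned no tags, pausing after every request.

    fetch_tags is a HYPOTHETICAL callable: URI -> list of tags
    (an empty list means the API returned nothing for that URI).
    Returns (results, still_missing).
    """
    results = {}
    pending = list(uris)
    for _ in range(max_rounds):
        if not pending:
            break
        still_missing = []
        for uri in pending:
            tags = fetch_tags(uri)
            if tags:
                results[uri] = tags
            else:
                still_missing.append(uri)
            sleep(pause_seconds)  # throttle to avoid high-frequency requests
        pending = still_missing
    return results, pending
```

Whatever is left in the second return value after the last round corresponds to the URIs for which the API never produced tags.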
For the 78 URIs I obtained a mean of 23.2 tags with a standard deviation of 7.8. The minimum number of tags was two (for one URI) and the maximum was 30 (for 38 URIs). 51 of the 78 URIs had 20 or more and 73 URIs had 10 or more tags through the web interface. This just underlines the point: the API is not reliable.
I further found that in 465 cases the API returned fewer than five tags where the web interface returned more tags. This "under-reporting" (meaning the API should have reported the top five) is another strong indicator that the API pulls from a smaller and possibly dated index.
One can argue whether or not the order of tags matters. I found that out of the 4891 URIs with tags from the API, 1759 had a different order compared to the web interface data. 191 times I observed a change at rank 1. These changes account for 718 cases where terms were added to or removed from the union of both tag sets (API vs. web interface). On average 1.11 terms moved in or out of the intersection of both sets.
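One possible way to make this kind of comparison for a single URI is sketched below. It is an assumption about how such counts could be derived, not a record of the exact procedure: the function name is made up, and truncating the web list to the length of the API list before comparing is my own choice for the example.

```python
def compare_tag_lists(api_tags, web_tags):
    """Compare API tags against the same number of top-ranked web tags.

    Both arguments are lists of tag strings ordered by rank.
    ASSUMPTION: the web list is cut to the length of the API list.
    """
    web_top = web_tags[:len(api_tags)]
    return {
        # True if the two rankings are not identical
        "order_differs": api_tags != web_top,
        # True if the most popular tag is not the same in both sources
        "rank1_differs": bool(api_tags) and bool(web_top)
                         and api_tags[0] != web_top[0],
        # Terms present in only one of the two lists
        "terms_changed": sorted(set(api_tags) ^ set(web_top)),
    }
```

Summing `order_differs` and `rank1_differs` over all URIs, and counting the entries in `terms_changed`, would give totals of the same kind as those reported above.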
The moral of all this? As much as you may appreciate an API, in the case of Delicious you can obtain more (better?) data by screen scraping the HTML page.