Saturday, November 10, 2012

2012-11-10: Site Transitions, Cool URIs, URI Slugs, Topsy

Recently I was emailing a friend and wanted to update her about the recent buzz we have enjoyed with Hany SalahEldeen's TPDL 2012 paper about the loss rate of resources shared over Twitter.  I remembered that an article in the MIT Technology Review from the Physics arXiv blog started the whole wave of popular press (e.g., MIT Technology Review, BBC, The Atlantic, Spiegel).  To help convey the amount of social media sharing of these stories, I was sending links to the stories by way of the social media search engine Topsy.  I only recently discovered Topsy, but it has quickly become one of my favorite sites.  It does many things, but the part I enjoy most is the ability to prepend "http://topsy.com/" to a URI to discover how many times that URI has been shared and who is sharing it.  For example:

http://www.bbc.com/future/story/20120927-the-decaying-web

becomes:

http://topsy.com/http://www.bbc.com/future/story/20120927-the-decaying-web

and you can see all the tweets that have linked to the bbc.com URI. 
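
The prepending trick is easy to script.  The following is a minimal sketch; the function name topsy_uri is my own, and it assumes only the simple prefixing convention described above, not any official Topsy API:

```python
TOPSY_PREFIX = "http://topsy.com/"

def topsy_uri(uri: str) -> str:
    """Build the Topsy lookup URI by prepending the Topsy prefix (hypothetical helper)."""
    return TOPSY_PREFIX + uri

print(topsy_uri("http://www.bbc.com/future/story/20120927-the-decaying-web"))
```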

While composing my email I recalled the Technology Review article was one of the first (September 19, 2012) and most popular, so I did a Google search for the article and converted the resulting URI from:

http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

to:

http://topsy.com/http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

I was surprised when I saw Topsy reported 0 posts about the MIT TR story, because I recalled the count being quite large.  I thought maybe it was a transient error and didn't think too much about it until later that night, when I was on my home computer, where I had bookmarked the MIT TR Topsy URI, and it said "900 posts".  Then I looked carefully: the URI I had bookmarked now issues a 301 redirection to another URI:

% curl -I http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352561072"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 15:24:32 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 15:24:32 GMT
X-Varnish: 1779081554
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


A little poking around revealed that technologyreview.com reorganized and rebranded their site on October 24, 2012, and Google had already replaced the prior URI for the article with the new URI.  Their site uses Drupal, and it appears their old site did as well, but the URIs have changed.  The base URIs (e.g., http://www.technologyreview.com/view/429274/) have stayed the same (and are thus almost "cool"), but the slug has lengthened from 8 terms ("history as recorded on twitter is vanishing from") to the full title ("history as recorded on twitter is vanishing from the web say computer scientists").  Slugs are a nice way to make the URI more human readable, and can be useful in determining what the URI was "about" if (or when) it becomes 404 (see also Martin Klein's dissertation on lexical signatures).  The base URI will 301 redirect to the URI with the slug:

% curl -I http://www.technologyreview.com/view/429274/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352563816"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 16:10:16 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 16:10:16 GMT
X-Varnish: 1779473907
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


But this redirection is transparent to the user, so the URIs in the tweets that Topsy analyzes are the versions with slugs.  This results in two URIs for the article: the version from Sept 19 -- Oct 24 that has 900 tweets, and the Oct 24 -- now version that currently has 3 tweets (up from 0 when I first noticed this).  technologyreview.com is to be commended for not breaking the pre-update URIs (see the post about how ctv.ca handled a similar situation) and for issuing 301 redirections to the new versions, but it would have been preferable to have maintained the old URIs completely.  Perhaps the new software installation has a different default slug length; I'm not familiar with Drupal, and the code examples I can find do not define a limit.
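
To illustrate how a term limit alone could produce the two slugs, here is a minimal sketch.  It is not Drupal's actual slug code; slugify and its max_terms parameter are hypothetical names, and the input is just the sequence of terms observed in the two URIs:

```python
import re
from typing import Optional

def slugify(title: str, max_terms: Optional[int] = None) -> str:
    """Lowercase the title, keep alphanumeric terms, and join them with hyphens.

    max_terms is a hypothetical limit on how many terms are kept;
    None means the full title is used.
    """
    terms = re.findall(r"[a-z0-9]+", title.lower())
    if max_terms is not None:
        terms = terms[:max_terms]
    return "-".join(terms)

# Terms as they appear in the post-October 24 slug.
TITLE_TERMS = ("history as recorded on twitter is vanishing from "
               "the web say computer scientists")

print(slugify(TITLE_TERMS, max_terms=8))  # the shorter, pre-October 24 form
print(slugify(TITLE_TERMS))               # the full-title form
```

With a limit of 8 terms the sketch yields the old slug, and with no limit it yields the new one, which is consistent with (but does not prove) the guess that only a default changed.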

Splitting PageRank with URI aliases is a well-known problem that can be addressed with 301 redirects (e.g., this is why most URI shorteners like bitly issue 301 redirects instead of 302s: the PageRank will accumulate at the target and not the short URI).  It would be nice if Topsy also merged redirects when computing its pages, or at least provided that as an option.  In the example above, either of the Topsy URIs (pre- and post-October 24) would then report 900+3 = 903 posts.
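
The merging could look something like the sketch below.  This is only an illustration of the idea, not how Topsy works: the redirects table stands in for 301 responses that a crawler would have observed, and resolve and merged_counts are hypothetical names.

```python
OLD = "http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/"
NEW = ("http://www.technologyreview.com/view/429274/"
       "history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/")

def resolve(uri, redirects):
    """Follow recorded 301 redirects to the final target, stopping on a cycle."""
    seen = set()
    while uri in redirects and uri not in seen:
        seen.add(uri)
        uri = redirects[uri]
    return uri

def merged_counts(counts, redirects):
    """Sum per-URI post counts under the URI each one ultimately redirects to."""
    merged = {}
    for uri, n in counts.items():
        target = resolve(uri, redirects)
        merged[target] = merged.get(target, 0) + n
    return merged

redirects = {OLD: NEW}        # the 301 shown in the curl output above
counts = {OLD: 900, NEW: 3}   # post counts Topsy reported for each URI

print(merged_counts(counts, redirects)[NEW])  # 903
```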

--Michael

Edit: I did some more investigating and found that the slug doesn't matter, only the Drupal node ID of "429274" (those familiar with Drupal probably already knew that).  Here's a URI that should obviously return 404 but instead redirects to the URI with the full title as the slug:

% curl -I http://www.technologyreview.com/view/429274/lasdkfjlajfdsljkaldsf/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352581871"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 21:11:11 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 21:11:11 GMT
X-Varnish: 1782237238
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


This makes the Drupal slug very close to the original Phelps & Wilensky concept of "Robust Hyperlinks Cost Just Five Words Each", which formed the basis for Martin's dissertation mentioned above.  While this is convenient in that it reduces the number of 404s in the world, it is also a bit of a white lie; user agents need to be careful to not assume that the original URI ever existed even though it is issuing a redirect to a target URI. 
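
Since only the node ID matters, a client that wants to treat these aliases as one article could key on the host and node ID and ignore the slug.  The following is a minimal sketch that assumes the /view/<node ID>/<slug>/ path layout seen in the URIs above; article_key and same_article are hypothetical names, and nothing here is part of Drupal itself:

```python
import re
from urllib.parse import urlsplit

# Assumed path layout: /view/<numeric node ID>/<optional slug>/
NODE_RE = re.compile(r"^/view/(\d+)(?:/|$)")

def article_key(uri):
    """Return (host, node ID) for a /view/ URI, or None if the path does not match."""
    parts = urlsplit(uri)
    m = NODE_RE.match(parts.path)
    return (parts.netloc.lower(), m.group(1)) if m else None

def same_article(a, b):
    """True when both URIs carry the same host and node ID, whatever the slug."""
    key_a, key_b = article_key(a), article_key(b)
    return key_a is not None and key_a == key_b

bogus = "http://www.technologyreview.com/view/429274/lasdkfjlajfdsljkaldsf/"
full = ("http://www.technologyreview.com/view/429274/"
        "history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/")

print(same_article(bogus, full))  # True
```

Note that this says nothing about whether the first URI ever existed, which is exactly the caution raised above.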
