Monday, August 7, 2017

2017-08-07: rel="canonical" does not mean what you think it means

The rel="identifier" draft has been submitted to the IETF.  Some of the feedback we've received via Twitter and email is a variation of 'why don't you use rel="canonical" to link to the DOI?'  We discussed this in our original blog post about rel="identifier", but in fairness that post covered a great many things, and through updates and comments it has become quite lengthy.

The short answer is that rel="canonical" handles cases where there are two or more URIs for a single resource (AKA "URI aliases"), whereas  rel="identifier" specifies relationships between multiple resources.

Having two or more URIs for the same resource is also known as "DUST: different URLs, similar text".  This is commonplace with SEO and catalogs (see the 2009 Google blog post and help center article about rel="canonical").  RFC 6596 gives abstract examples, but below we will examine real-world examples (only one of which I'm fully prepared to buy).

Consider these two lexically different URIs for the same resource (in this case, Amazon's page for DJ Shadow's upcoming EP "The Mountain Has Fallen"):
  1. https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow
  2. https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q
The first URI is what I got when I searched amazon.com for "dj shadow" and clicked on a search result.  The second URI is the "canonical" version that should be indexed by Google et al.  The first URI uses an HTML <link> element to inform search engines about the second URI so they know they haven't found two different resources with two different URIs:

$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow" | grep -i canonical
<link rel="canonical" href="https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" />
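What grep does above can also be done structurally, which is closer to what a crawler would do.  The following is a minimal sketch, assuming Python's standard html.parser module; the names CanonicalFinder and find_canonical are ours and purely illustrative, and the sample input is just the one <link> element returned above, not the full Amazon page.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link> element whose rel includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.canonical = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attr = dict(attrs)
        # rel holds space-separated, case-insensitive tokens.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical.append(attr["href"])


def find_canonical(html):
    """Return the list of canonical hrefs declared in an HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# The single element grep found in the Amazon response above.
sample = '<link rel="canonical" href="https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" />'
print(find_canonical(sample))
```

A real crawler would feed the whole response body to the parser; the function returns a list because nothing stops a page from declaring more than one canonical link.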


The raw HTML of the two responses is not exactly the same (if it were, it would be trivial for search engines to dedup), but the rendered pages are essentially the same; the exception is the navigation trail ("‹ Back to search results for "dj shadow"") vs. the categorization ("CDs & Vinyl › Dance & Electronic › Electronica") on the left-hand side, right above the EP artwork:

$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q/ref=sr_1_3?s=music&ie=UTF8&qid=1502078863&sr=1-3&keywords=dj+shadow" | wc
   12711   16648  446841


$ curl -i -A "mozilla" --silent "https://www.amazon.com/Mountain-Has-Fallen-EP/dp/B073JS3Y9Q" | wc
   12802   17120  459761




It is clear there is no need for a search engine to index both pages.   The raw HTML is nearly (but not exactly!) the same, and unless your crawler is aware of amazon.com URI patterns, it would not easily discover that the two URIs refer to the same resource.  We can construct a similar example with ebay.com: again the raw HTML differs slightly, but in this case I cannot tell a difference in the rendered HTML:

$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr" | fmt | grep --context canonical | tail -3
    hreflang="es-ni" /><link rel="canonical"
    href="http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713"
    /><lmeta Property="og:image"


$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713" | wc
    2678    9225  189098


$ curl -i -A "mozilla" --silent "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr" | wc
    2688    9246  189235
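To make the "collapse aliases" idea concrete, here is a rough sketch of how a crawler might group the URIs it has fetched by the canonical URI each page declares.  The function name collapse_aliases and the input shape (a dict from crawled URI to declared canonical href) are assumptions made for illustration; we only verified above that the long eBay URI points at the short one, so the short URI is treated here as declaring nothing.

```python
from collections import defaultdict


def collapse_aliases(declared):
    """Group crawled URIs by the canonical URI each one declares.

    `declared` maps a crawled URI to the href of its rel="canonical" link,
    or to None when no declaration is known; such a URI stands for itself.
    """
    groups = defaultdict(list)
    for uri, canonical in declared.items():
        groups[canonical or uri].append(uri)
    return dict(groups)


SHORT = "http://www.ebay.com/itm/1970-Ford-Torino-King-Cobra-/222587854713"
LONG = SHORT + "?hash=item33d3451f79:g:6G8AAOSwiBpZcMhO&vxp=mtr"

crawled = {
    LONG: SHORT,   # shown in the curl output above
    SHORT: None,   # not checked in this post; assumed to stand for itself
}

for canonical, aliases in collapse_aliases(crawled).items():
    print(canonical, "<-", len(aliases), "crawled URI(s)")
```

Both URIs end up in one group keyed by the short URI, which is the only one the index needs to keep.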




So why can't we use rel="canonical" for, say, DOIs and publisher pages?  In the case of DOIs, a technical reason is that the resource identified by the DOI and the resource identified by the publisher's page are not the same resource.  Admittedly this is a detour into the esoteric realm of HTTP 303 semantics, but the HTTP URI of a DOI does not have a representation and the publisher's URI does; the resources identified by these URIs are related but fundamentally different.

Another reason would be when you wish to specify part-whole relationships between resources that comprise the resource identified by a DOI.  For example, XML vs. HTML, Zip file(s) of associated code and data, embedded (and "recontextualizable"!) images, sound, or video, etc.  This would be for the purpose of expressing identity, and would not preclude combination with navigation (e.g., rel="up") or SEO links (e.g., rel="canonical"). These identification patterns are presented in more detail at the Signposting web site.

Another argument against using rel="canonical" for linking to DOIs (and friends) is that publishers are already using canonical to manage SEO within their own domain.  In the example below, springer.com signals to search engines that the URI in the third redirect from the DOI is canonical and not the previous two:

$ curl -iL --silent http://dx.doi.org/10.1007/978-3-319-43997-6_35 | egrep -i "(HTTP/1.1 [0-9]|^location:|rel=.canonical)"
HTTP/1.1 303 See Other
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 301 Moved Permanently
Location: https://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 302 Found
Location: https://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
HTTP/1.1 200 OK
        <link rel="canonical" href="https://link.springer.com/chapter/10.1007/978-3-319-43997-6_35"/>
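The same filtering that egrep does can be sketched in a few lines of Python, which makes the shape of the redirect chain explicit.  This is only an illustration: it parses the transcript pasted from above rather than making any requests, and the function name redirect_chain is ours.

```python
import re

# The filtered curl output from above, copied verbatim.
TRANSCRIPT = """\
HTTP/1.1 303 See Other
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 301 Moved Permanently
Location: https://link.springer.com/10.1007/978-3-319-43997-6_35
HTTP/1.1 302 Found
Location: https://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
HTTP/1.1 200 OK
        <link rel="canonical" href="https://link.springer.com/chapter/10.1007/978-3-319-43997-6_35"/>
"""


def redirect_chain(transcript):
    """Return one (status code, Location or None) pair per HTTP response."""
    hops = []
    for line in transcript.splitlines():
        status = re.match(r"HTTP/\d(?:\.\d)? (\d{3})", line)
        if status:
            hops.append([int(status.group(1)), None])
            continue
        location = re.match(r"location:\s*(\S+)", line, re.IGNORECASE)
        if location and hops:
            hops[-1][1] = location.group(1)
    return [tuple(hop) for hop in hops]


for code, location in redirect_chain(TRANSCRIPT):
    print(code, location)
```

The DOI sits at the very start of this chain and the canonical declaration only appears on the final 200 response, inside the publisher's own domain.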


Furthermore, publishers are specifying DOIs with a variety of incompatible ad hoc approaches (see the prior blog post for examples), meaning there is demand for this function even though there is currently no standardized method of achieving it.

But there are other applications for rel="identifier" outside of scholarly content.   Consider the Wikipedia page for DJ Shadow.  As I type this, it has not yet been edited to include the upcoming EP mentioned above, but there's a good chance that by the time you read this that will have changed.


I can reference the particular version of the page using the "permalink", which yields the URI https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397.   That page will remain static, and never mention "The Mountain Has Fallen".  That page does use rel="canonical" to link back to the generic, current version of the page:

$ curl --silent -i "https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>


This is entirely expected and desirable: we don't want Google to separately index the thousands of prior versions of this page, just the latest version.  The generic version of the page also asserts that it is canonical:

$ curl --silent -i "https://en.wikipedia.org/wiki/DJ_Shadow" | grep "rel=.canonical"
<link rel="canonical" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>

But if I were using a reference manager to cite https://en.wikipedia.org/wiki/DJ_Shadow, and if that page also had:

<link rel="identifier" href="https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397"/>

Then the reference manager would cite the specific version of the page, providing a machine-readable version of the human-readable guidance already provided under the "Cite This Page" link.  This use of rel="identifier" would not collide with the rel="canonical" which is already in place for SEO*.  In this Wikipedia example, the two rels coexist and specify URI preferences for different purposes:
  • rel="canonical": preferred for content indexing
  • rel="identifier": preferred for referencing
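
The two preferences above can be sketched as a small selection function, roughly what a crawler and a reference manager would each do with the links they find.  Everything here is hypothetical: the function name preferred_uris, the dict-of-links input, and the choice to fall back to the requested URI when a relation is absent are our assumptions, and Wikipedia does not currently emit the rel="identifier" link.

```python
def preferred_uris(requested_uri, links):
    """Pick the URI to index and the URI to cite for one page.

    `links` maps a link relation type to its target href, as harvested
    from the page's <link> elements.  Falling back to the requested URI
    when a relation is missing is an assumption of this sketch.
    """
    return {
        "index": links.get("canonical", requested_uri),
        "cite": links.get("identifier", requested_uri),
    }


generic = "https://en.wikipedia.org/wiki/DJ_Shadow"
versioned = "https://en.wikipedia.org/w/index.php?title=DJ_Shadow&oldid=787867397"

# The canonical link exists today; the identifier link is the hypothetical one.
links_on_generic_page = {"canonical": generic, "identifier": versioned}

choice = preferred_uris(generic, links_on_generic_page)
print("index:", choice["index"])
print("cite: ", choice["cite"])
```

The same page yields two different answers depending on who is asking, which is why one rel type cannot serve both purposes.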
Herbert insisted on a New Mexico specific example, so we'll consider the ubiquitous multi-page articles, designed to expand content to increase advertising revenue.  Of interest to us is page 5 of this particular article about TV continuity errors: http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5.  It uses rel="canonical" to tell search engines to strip off any common, superfluous arguments that might also be present (e.g., "&utm_source=...&utm_medium=...&utm_campaign=..."):

$ curl -i --silent "http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5" | grep canonical
<link rel="canonical" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/?page=5" />
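
For illustration, the normalization that this rel="canonical" asks search engines to perform can be approximated with Python's urllib.parse.  This is a sketch under assumptions: the function name strip_tracking is ours, dropping every argument that starts with "utm_" is a heuristic rather than anything coolimba.com documents, and the utm values below are made-up placeholders (the originals are elided above).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(uri, prefixes=("utm_",)):
    """Drop query arguments whose names start with a tracking prefix."""
    parts = urlsplit(uri)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(prefixes)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Placeholder utm values; only page=5 is meaningful to the site.
noisy = ("http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/"
         "?page=5&utm_source=x&utm_medium=y&utm_campaign=z")
print(strip_tracking(noisy))
```

Note that page=5 survives: the canonical URI still identifies one page of the article, not the article as a whole.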

Assuming for a moment that coolimba.com wanted to facilitate referencing of this page as part of an aggregation, it could include:

<link rel="identifier up" href="http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/" />

In this case, rel="up" also serves as a simple navigation function, if you choose to view these pages as a tree and not a list (if this is indeed a list, then "up" is probably not applicable).  But note that rel="up" would not be applicable in the Wikipedia (or even DOI) example(s) above.  Also note that rel="up" and rel="identifier" sharing the same URI is something of a coincidence: if a multi-page article has more than two "levels" then we would expect the URIs to diverge.
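
Since a single rel attribute can carry several space-separated relation types, a consumer has to split it before looking anything up.  Below is a minimal sketch of that step; the function name index_links and the list-of-pairs input are illustrative assumptions, and the "identifier up" link is the hypothetical one proposed above, not something coolimba.com actually serves.

```python
from collections import defaultdict


def index_links(links):
    """Map each relation type to the hrefs that carry it.

    `links` is a list of (rel attribute value, href) pairs; a rel value
    such as "identifier up" contributes its href under both tokens.
    """
    by_rel = defaultdict(list)
    for rel_value, href in links:
        for token in rel_value.lower().split():
            by_rel[token].append(href)
    return dict(by_rel)


ARTICLE = "http://www.coolimba.com/view/huge-tv-mistakes-no-one-noticed-c/"

page5_links = [
    ("canonical", ARTICLE + "?page=5"),  # present on the page today
    ("identifier up", ARTICLE),          # hypothetical addition
]

by_rel = index_links(page5_links)
for rel, hrefs in by_rel.items():
    print(rel, hrefs)
```

Here "identifier" and "up" happen to resolve to the same URI while "canonical" points elsewhere, matching the coincidence described above.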

In conclusion, SEO/indexing and referencing are different functions and thus require different rel types; cases where the target URIs overlap should be considered coincidences.  rel="canonical" is used to collapse multiple URIs that yield duplicative text into a single, preferred URI to facilitate indexing, and rel="identifier" is used to select a single URI from among multiple URIs that yield different text to facilitate referencing. 


--Michael & Herbert


P.S. To return to our original pop culture reference: "have fun storming the castle!"


* Note that rel="permalink" and rel="bookmark" (the former was never registered and ultimately supplanted by the latter) do different things and are not usable in HTTP Link headers; see the prior blog post for details.

2017-08-09 edit: See also this Twitter moment about rel="bookmark".   I'll try to turn this into a separate blog post in the future.
