Monday, November 7, 2016

2016-11-07: Linking to Persistent Identifiers with rel="identifier"

Do you remember hearing about that study that found that people who are "good" at swearing actually have a large vocabulary, refuting the conventional wisdom about a "poverty-of-vocabulary"?  The DOI (digital object identifier) for the 2015 study is:

http://dx.doi.org/10.1016/j.langsci.2014.12.003

But if you read about it in the popular press, such as the Independent or US News & World Report, you'll see that they linked to:

http://www.sciencedirect.com/science/article/pii/S038800011400151X

The problem is that although the DOI is the preferred link, browsers follow a series of redirects from the DOI to the ScienceDirect link, which is then displayed in the address bar of the browser, and that's the URI that most people are going to copy and paste when linking to the page.  Here's a curl session showing just the HTTP status codes and corresponding Location: headers for the redirection:

$ curl -iL --silent http://dx.doi.org/10.1016/j.langsci.2014.12.003 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS038800011400151X%3Fvia%253Dihub&key=072c950bffe98b3883e1fa0935fb56a6f1a1b364
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X
HTTP/1.1 200 OK
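
A chain like the one above can also be summarized programmatically. The following is a minimal Python sketch, not part of any publisher's tooling: it assumes the header dump has already been captured as text (for example from curl -iL), and the names redirect_chain and locating_uri are ours. Relative Location values are reported as-is rather than resolved.

```python
import re

# Matches status lines such as "HTTP/1.1 303 See Other".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})")


def redirect_chain(header_dump):
    """Pair each HTTP status code with the Location header that follows it."""
    hops = []
    for raw in header_dump.splitlines():
        line = raw.strip()
        status = STATUS_LINE.match(line)
        if status:
            hops.append([int(status.group(1)), None])
        elif hops and line.lower().startswith("location:"):
            # Header names are case-insensitive ("Location:" vs "location:").
            hops[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(hop) for hop in hops]


def locating_uri(header_dump):
    """Return the last Location seen before a 200 response, or None."""
    last = None
    for status, location in redirect_chain(header_dump):
        if location:
            last = location
        if status == 200:
            return last
    return None


# Abbreviated version of the session above (first and last hops only).
SAMPLE = """HTTP/1.1 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X
HTTP/1.1 200 OK
"""

if __name__ == "__main__":
    for status, location in redirect_chain(SAMPLE):
        print(status, location)
    print("locating URI:", locating_uri(SAMPLE))
```

Note that nothing in the chain itself lets a client get from the locating URI back to the DOI, which is the gap discussed in the rest of this post.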


Most publishers follow this model of a series of redirects to implement authentication, tracking, etc. While DOI use has made significant progress in scholarly literature, the final URL is often the one that is linked to instead of the more stable DOI (see the study by Herbert, Martin, and Shawn presented at WWW 2016 for more information).  Furthermore, while sometimes the mapping between the final URL and DOI is obvious (e.g., http://dx.doi.org/10.1371/journal.pone.0115253 --> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253), the above example shows that's not always the case.

Ad-hoc linking back to DOIs

One of the obstacles limiting the correct linking is that there is no standard, machine-readable method for the HTML from the final URI to link back to its DOI (and by "DOI" we also mean all other persistent identifiers, such as handles, purls, arks, etc.).  In practice, each publisher adopts its own strategy for specifying DOIs in <meta> HTML elements:

In http://link.springer.com/article/10.1007%2Fs00799-016-0184-4 we see:

<meta name="citation_publisher" content="Springer Berlin Heidelberg"/>
<meta name="citation_title" content="Web archive profiling through CDX summarization"/>
<meta name="citation_doi" content="10.1007/s00799-016-0184-4"/>
<meta name="citation_language" content="en"/>
<meta name="citation_abstract_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_fulltext_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_pdf_url" content="http://link.springer.com/content/pdf/10.1007%2Fs00799-016-0184-4.pdf"/>


In http://www.dlib.org/dlib/january16/brunelle/01brunelle.html we see:

<meta charset="utf-8" />
<meta id="DOI" content="10.1045/january2016-brunelle" />
<meta itemprop="datePublished" content="2016-01-16" />
<meta id="description" content="D-Lib Magazine" />


In http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253 we see:

<meta name="citation_doi" content="10.1371/journal.pone.0115253" />
...
<meta name="dc.identifier" content="10.1371/journal.pone.0115253" />


In https://www.computer.org/csdl/proceedings/jcdl/2014/5569/00/06970187-abs.html we see:

<meta name='doi' content='10.1109/JCDL.2014.6970187' />

And in http://ieeexplore.ieee.org/document/754918/ there are no HTML elements specifying the corresponding DOI.  Furthermore, HTML elements can only appear in HTML -- which means you can't provide Links for PDF, CSV, Zip, or other non-HTML representations.  For example, NASA uses handles for the persistent identifiers of the PDF versions of their reports:

$ curl -IL http://hdl.handle.net/2060/19940023070
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Expires: Thu, 03 Nov 2016 17:47:07 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 221
Date: Thu, 03 Nov 2016 17:47:07 GMT

HTTP/1.1 301 Moved Permanently
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Location: https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Content-Length: 984250
Content-Type: application/pdf


And the final PDF obviously cannot use HTML elements to link back to its handle.
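
To make the cost of these ad-hoc conventions concrete, here is a minimal Python sketch of what a client has to do today to recover a DOI from a landing page. It only knows the <meta> spellings seen in the examples above (citation_doi, dc.identifier, doi, and id="DOI"); the names DOI_KEYS, DoiMetaParser, and find_dois are hypothetical, and any publisher using a different spelling would be silently missed.

```python
from html.parser import HTMLParser

# <meta> name/id values observed in the examples above, lowercased.
DOI_KEYS = {"citation_doi", "dc.identifier", "doi"}


class DoiMetaParser(HTMLParser):
    """Collect the content of <meta> elements that appear to carry a DOI."""

    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        # Some sites use name="...", others id="..." (e.g., D-Lib Magazine).
        key = (attr.get("name") or attr.get("id") or "").lower()
        if key in DOI_KEYS and attr.get("content"):
            self.dois.append(attr["content"])


def find_dois(html):
    """Return the distinct DOI strings found in <meta> elements, in order."""
    parser = DoiMetaParser()
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.dois))


if __name__ == "__main__":
    page = """
    <meta name="citation_doi" content="10.1371/journal.pone.0115253" />
    <meta name="dc.identifier" content="10.1371/journal.pone.0115253" />
    """
    print(find_dois(page))
```

This approach returns nothing for a page with no such elements, and cannot even be attempted for a PDF response.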

To address these shortcomings, and in support of our larger vision of Signposting the Scholarly Web, we are proposing a new IANA link relation type, rel="identifier", that will support linking from the final URL in the redirection chain (AKA the "locating URI") back to the persistent identifier that ideally one would use to start the resolution.  For example, in the NASA example above the PDF would link back to its handle with the proposed Link header shown below:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier"
Content-Length: 984250
Content-Type: application/pdf


And in the Language Sciences example that we began with, the final HTTP response (which returns the HTML landing page) would use the Link header like this:

HTTP/1.1 200 OK
Last-Modified: Fri, 04 Nov 2016 00:36:50 GMT
Content-Type: text/html
X-TransKey: 11/03/2016 20:36:50 EDT#2847_006#2415#68.228.137.112
X-RE-PROXY-CMP: 1
X-Cnection: close
X-RE-Ref: 0 1478219810005195
Server: www.sciencedirect.com
P3P: CP="IDC DSP LAW ADM DEV TAI PSA PSD IVA IVD CON HIS TEL OUR DEL SAM OTR IND OTC"
Vary: Accept-Encoding, User-Agent
Expires: Fri, 04 Nov 2016 00:36:50 GMT
Cache-Control: max-age=0, no-cache, no-store
Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"
...

But it's not just the landing page that would link back to the DOI, but also the constituent resources that are also part of a DOI-identified object.  Below is a request and response for the PDF file in the Language Sciences example, and it carries the same Link: response header as the landing page:

$ curl -IL --silent "http://ac.els-cdn.com/S038800011400151X/1-s2.0-S038800011400151X-main.pdf?_tid=338820f0-a442-11e6-9f85-00000aab0f6b&acdnat=1478451672_5338d66f1f3bb88219cd780bc046bedf"
HTTP/1.1 200 OK
Accept-Ranges: bytes
Allow: GET
Content-Type: application/pdf
ETag: "047508b07a69416a9472c3ac02c5a9a01"
Last-Modified: Thu, 15 Oct 2015 08:11:25 GMT
Server: Apache-Coyote/1.1
X-ELS-Authentication: SDAKAMAI
X-ELS-ReqId: 67961728-708b-4cbb-af64-bb68f1da03ea
X-ELS-ResourceVersion: V1
X-ELS-ServerId: ip-10-93-46-150.els.vpc.local_CloudAttachmentRetrieval_prod
X-ELS-SIZE: 417655
X-ELS-Status: OK
Content-Length: 417655
Expires: Sun, 06 Nov 2016 16:59:44 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sun, 06 Nov 2016 16:59:44 GMT
Connection: keep-alive
Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"
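
On the client side, consuming the proposed header is straightforward. The following is a minimal Python sketch, assuming the Link header value has already been read from the response; it is a simplified reading of the Link header syntax (it assumes no "<" or quoted commas inside parameters), and identifier_links is a name we made up for illustration.

```python
import re

# A link-value is "<URI>" followed by parameters, up to the next "<".
LINK_VALUE = re.compile(r"<([^>]*)>([^<]*)")
# The rel parameter may be quoted (possibly holding several values) or bare.
REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,]+))', re.IGNORECASE)


def identifier_links(link_header):
    """Return the target URIs whose rel values include "identifier"."""
    targets = []
    for uri, params in LINK_VALUE.findall(link_header):
        match = REL_PARAM.search(params)
        if not match:
            continue
        # rel is a space-separated list of relation types.
        rels = (match.group(1) or match.group(2) or "").lower().split()
        if "identifier" in rels:
            targets.append(uri)
    return targets


if __name__ == "__main__":
    header = '<http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"'
    print(identifier_links(header))
```

Because the Link header is independent of the media type, the same function works for the HTML landing page and the PDF alike.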

At first glance there seem to be a number of existing rel types (some registered and some not) that would be suitable:
  • rel="canonical"
  • rel="alternate"
  • rel="duplicate"
  • rel="related"
  • rel="bookmark"
  • rel="permalink"
  • rel="shortlink"
It turns out they all do something different.  Below we explain why these rel types are not suitable for linking to persistent identifiers.

rel="canonical"

This would seem to be a likely candidate and it is widely used, but it actually exists for a different purpose: to "identify content that is either duplicative or a superset of the content at the context (referring) IRI." Quoting from RFC 6596:

If the preferred version of a IRI and its content exists at:

http://www.example.com/page.php?item=purse

Then duplicate content IRIs such as:

http://www.example.com/page.php?item=purse&category=bags
http://www.example.com/page.php?item=purse&category=bags&sid=1234

may designate the canonical link relation in HTML as specified in
[REC-html401-19991224]:

<link rel="canonical"
      href="http://www.example.com/page.php?item=purse">

In the representative cases shown above, the DOI, handle, etc. is neither duplicative nor a superset of the content.  For example, the URI of the NASA report PDF clearly bears some relation to its handle, but the PDF URI is neither duplicative nor a superset of the handle.  This is reinforced by the semantics of the "303 See Other" redirection, which indicates there are two different resources with two different URIs*.  rel="canonical" is ultimately about establishing primacy among the (possibly) many URI aliases for a single resource.  For SEO purposes, this avoids splitting PageRank.

Furthermore, publishers like Springer are already using rel="canonical" (see the Link header in the final response below) to specify a preferred URI in their chain of redirects:

$ curl -IL http://dx.doi.org/10.1007/978-3-319-43997-6_35
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
Expires: Mon, 31 Oct 2016 20:52:26 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 191
Date: Mon, 31 Oct 2016 20:40:48 GMT

HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=UTF-8
Location: http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
Server: Jetty(9.2.14.v20151106)
X-Environment: live
X-Origin-Server: 19t9ulj5bca
X-Vcap-Request-Id: 48d17c7e-2556-4cff-4b2b-0e6fbae94237
Content-Length: 0
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448562:07a49aef;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=d9cf189bedb640a9b5d55c9d0;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Link: <http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35>; rel="canonical"
Server: openresty
X-Environment: live
X-Origin-Server: 19ta3iq6v47
X-Served-By: core-internal.live.cf.private.springer.com
X-Ua-Compatible: IE=Edge,chrome=1
X-Vcap-Request-Id: 5a458b2c-de85-42cd-7157-022c440a9668
X-Vcap-Request-Id: 54b0e2dc-7766-4c00-4f95-d33bdb6c427a
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448766:c35e0847;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=1d67fdfb47ab4a5f94b43326e;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

 
And some publishers use it inconsistently.  In this Elsevier example, the content from http://dx.doi.org/10.1016/j.acra.2015.10.004 is indexed at three different URIs:


Even if we accept that the PubMed version is a different resource (i.e., hosted at NLM instead of Elsevier) and should have a separate URI, Elsevier still maintains two different URIs for this article:

http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract
http://www.sciencedirect.com/science/article/pii/S1076633215004535

The DOI resolves to the former URI (academicradiology.org), but it is the latter (sciencedirect.com) that carries the following in its HTML (and not in the HTTP response header):

<link rel="canonical" href="http://www.sciencedirect.com/science/article/pii/S1076633215004535">

Presumably this is to distinguish this URI from the various URIs that you get starting with http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 instead of the DOI:

$ curl -iL --silent http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectPrefsPerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1076633215004535%3Fvia%253Dihub&key=07077ac16f0a77a870586ac94ad3c000cfa1973f
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535
HTTP/1.1 200 OK


In summary, although "canonical" seems promising at first, the semantics are different from what we propose and publishers are already using it for internal linking purposes.  This eliminates "canonical" from consideration. 

rel="alternate" 

This rel type has been around for a while and has some reserved historical definitions for stylesheets and RSS/Atom, but the general semantics for "alternate" are to provide "an alternate representation of the current document."  In practice, this means surfacing different representations for the same resource, but varying in Content-Type (e.g., application/pdf vs. text/html) and/or Content-Language (e.g., en vs. fr).  Since a DOI, for example, is not simply a different representation of the same resource, "alternate" is removed from consideration.

rel="duplicate" 

RFC 6249 specifies how a resource can state that resources with different URIs are in fact byte-for-byte equivalent.  "duplicate" might be suitable for stating equivalence between the PDFs linked at both http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract and http://www.sciencedirect.com/science/article/pii/S1076633215004535, but we can't use it to link back to http://dx.doi.org/10.1016/j.acra.2015.10.004.

rel="related"

Defined in RFC 4287, "related" is probably the closest to what we propose but its semantics are purposefully vague.  A DOI is certainly related to its locating URI, but it is also related to a lot of other resources as well: the other articles in a journal issue, other publications by the authors, citing articles, etc. Using "related" to link to DOIs could be ambiguous, and would eventually lead to parsing the linked URI for strings like "dx.doi.org", "handle.net", etc. -- not what we want to encourage.
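
To illustrate why that kind of string matching is fragile, here is a deliberately naive Python sketch of the heuristic a client would be pushed toward if "related" were used. The host list and the function name looks_like_persistent_identifier are hypothetical; the point is that the list can never be complete.

```python
from urllib.parse import urlparse

# Host fragments a client might hard-code; inevitably incomplete.
PID_HOST_HINTS = ("dx.doi.org", "handle.net")


def looks_like_persistent_identifier(uri):
    """Guess from the host name alone whether a URI is a persistent identifier."""
    host = (urlparse(uri).hostname or "").lower()
    return any(host == hint or host.endswith("." + hint) for hint in PID_HOST_HINTS)


if __name__ == "__main__":
    for uri in (
        "http://dx.doi.org/10.1016/j.acra.2015.10.004",
        "http://hdl.handle.net/2060/19940023070",
        # A persistent identifier minted by the repository itself is missed.
        "https://arxiv.org/abs/1212.6177",
    ):
        print(uri, looks_like_persistent_identifier(uri))
```

A dedicated rel type lets the publisher state the relationship explicitly, so the client does not have to guess from the host name.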

rel="bookmark" 

We initially hoped this could mean "when you press <control-D>, use this URI instead of one in your address bar."  Unfortunately, "bookmark" is instead used to identify permalinks for different sections of the document that it appears in.  As a result, it's not even defined for Link: HTTP headers, and is thus eliminated from consideration.

rel="permalink" 

It turns out that "permalink" was intended for what we thought "bookmark" would be used for, but although it was proposed, it was never registered nor did it gain significant traction ("bookmark" was used instead).  It is most closely associated with the historical problem of creating deep links within blogs and as such we choose not to resurrect it for persistent identifiers.

rel="shortlink" 

We include this one mostly for completeness since the semantics arguably provide the opposite of what we want: instead of a link to a persistent identifier, it allows linking to a shortened URI.   Despite its widespread use, it is actually not registered.

The ecosystem around persistent identifiers is fundamentally different than that of shortened URIs even though they may look similar to the untrained eye.  Putting aside the preservation nightmare scenario of bit.ly going out of business or Twitter deprecating t.co, "shortlink" could be used to complement "identifier".  Revisiting the NASA example from above, the two rel types could be combined to link to both the handle and the nasa.gov branded shortened URI:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
      <http://go.nasa.gov/2fkvyya>; rel="shortlink"
Content-Length: 984250
Content-Type: application/pdf
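
On the server side, emitting such a combined header is mostly string assembly. Below is a minimal Python sketch, assuming the server already knows the persistent identifier and short URI for the resource it is serving; format_link_header is a hypothetical helper, and it does no escaping of the URIs it is given.

```python
def format_link_header(links):
    """Build a Link header value from (target_uri, rel) pairs.

    rel may be a single string or a list of relation types, which are
    joined with spaces inside one quoted rel parameter.
    """
    parts = []
    for uri, rels in links:
        if isinstance(rels, str):
            rels = [rels]
        parts.append('<{}>; rel="{}"'.format(uri, " ".join(rels)))
    return ", ".join(parts)


if __name__ == "__main__":
    value = format_link_header([
        ("http://hdl.handle.net/2060/19940023070", "identifier"),
        ("http://go.nasa.gov/2fkvyya", "shortlink"),
    ])
    print("Link: " + value)
```

The output is a single-line equivalent of the folded header shown above.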


Combining rel="identifier" with other Links

The "shortlink" example above illustrates that "identifier" can be combined with other rel types for more expressive resources.  Here we extend the NASA example further with rel="self":

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
 <http://go.nasa.gov/2fkvyya>; rel="shortlink",
 <http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf>; rel="self"
Content-Length: 984250
Content-Type: application/pdf


Now the HTTP response for the PDF is self-contained and unambiguously lists all of the appropriate URIs.  We could also combine rel="identifier" with version information.  arXiv.org does not issue DOIs or handles, but it does mint its own persistent identifiers.  Here we propose, using rel types from RFC 5829, how version 1 of an eprint could link both to version 2 (both the next and current version) as well as the persistent identifier (which we also know to be the "latest-version"):

$ curl -I https://arxiv.org/abs/1212.6177v1
HTTP/1.1 200 OK
Date: Fri, 04 Nov 2016 02:31:19 GMT
Server: Apache
ETag: "Tue, 08 Jan 2013 01:02:17 GMT"
Expires: Sat, 05 Nov 2016 00:00:00 GMT
Strict-Transport-Security: max-age=31536000
Set-Cookie: browser=68.228.137.112.1478226679112962; path=/; max-age=946080000; domain=.arxiv.org
Last-Modified: Tue, 08 Jan 2013 01:02:17 GMT
Link: <https://arxiv.org/abs/1212.6177>; rel="identifier latest-version",
      <https://arxiv.org/abs/1212.6177v2>; rel="successor-version",
      <https://arxiv.org/abs/1212.6177v1>; rel="self"
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=utf-8
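
Since one target can carry several relation types (as with "identifier latest-version" above), a client may want to index the links by rel before deciding which URI to cite. The following Python sketch assumes the header has already been split into (URI, rel string) pairs; index_by_rel and uri_to_cite are hypothetical names, and the preference order (identifier first, then self, then whatever the address bar shows) is our assumption rather than part of the proposal.

```python
from collections import defaultdict


def index_by_rel(links):
    """Map each relation type to the list of target URIs that carry it."""
    index = defaultdict(list)
    for uri, rel_string in links:
        # rel holds a space-separated list of relation types.
        for rel in rel_string.lower().split():
            index[rel].append(uri)
    return dict(index)


def uri_to_cite(index, address_bar_uri):
    """Prefer the persistent identifier, then "self", then the fallback."""
    for rel in ("identifier", "self"):
        if index.get(rel):
            return index[rel][0]
    return address_bar_uri


if __name__ == "__main__":
    # The three links from the arXiv response above.
    links = [
        ("https://arxiv.org/abs/1212.6177", "identifier latest-version"),
        ("https://arxiv.org/abs/1212.6177v2", "successor-version"),
        ("https://arxiv.org/abs/1212.6177v1", "self"),
    ]
    index = index_by_rel(links)
    print(sorted(index))
    print(uri_to_cite(index, "https://arxiv.org/abs/1212.6177v1"))
```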


The Signposting web site has further examples of how rel="identifier" can be used to express the relationship between the persistent identifiers, the "landing page", the "publication resources" (e.g., the PDF, PPT), and the combination of both the landing page and publication resources.  We encourage you to explore the analyses of existing publishers (e.g., Nature) and repository systems (e.g., DSpace, Eprints).

In summary, we propose rel="identifier" to standardize linking to DOIs, handles, and other persistent identifiers.  HTML <meta> tags can't be used as headers in HTTP responses, and existing rel types such as "canonical" and "bookmark" have different semantics.

We welcome feedback about this proposal, which we intend to eventually standardize with an RFC and register with IANA. Herbert will cover these issues at PIDapalooza, and we will include the slides here after the conference.

2016-11-10 Edit: Herbert's PIDapalooza slides are now available:




--Michael & Herbert



* Technically, a DOI is a "digital identifier of an object" rather than "identifier of a digital object", and thus there is not a representation associated with the resource identified by a DOI (i.e., not an information resource).  Relationships like "canonical", "alternate", etc. only apply to information resources, and thus are not applicable to most persistent identifiers.  Interested readers are encouraged to further explore the HTTPRange-14 issue.
