2024-11-13: The DOI URI Scheme: Utility or Branding?
Illustration by Pat Hochstenbach
A few days ago, Joe Wass published a blog post entitled "Falsehoods Programmers believe about DOIs." It’s an excellent resource for developers who work with DOIs and a sobering read for those who attribute magical powers to DOIs. For about 10 years, Joe was a developer at Crossref, the major DOI registration agency for scholarly communication, so chances are high that he knows what he’s writing about. One of the highlighted falsehoods pertains to the challenges involved when bots attempt to resolve DOIs, following their nose from https://doi.org/the-doi-name to eventually arrive at a landing page describing the scholarly artifact that has the-doi-name as its persistent identifier. Joe had previously described this bot Odyssey in detail in a Crossref blog post "URLs and DOIs: a complicated relationship" and now adds that “the landscape has only got more hostile to non-human web users”. We can safely assume that he’s referring to the heroic attempts by administrators of scholarly portals to keep AI bots from accessing the content they host. But even prior to this global content grab, bots had a very hard time resolving DOIs successfully. The 2020 paper "On the Persistence of Persistent Identifiers of the Scholarly Web" by Martin Klein and Lyudmila Balakireva documents the sorry state of affairs based on a thorough investigation. And, in a 2023 experiment in which HTTP URIs of the form https://doi.org/the-doi-name were automatically extracted from PDFs and subsequently resolved, Patrick Hochstenbach found, to his surprise, that 80% of the processing time was spent on resolving DOIs. As such, it should not come as a surprise that interoperability protocols that were recently introduced in the scholarly realm - Signposting, COAR Notify, Rioxx v3 - use the URI of the landing page of scholarly objects as the token for in-the-moment interactions and provide the DOI as auxiliary metadata for long-term referencing. Doing so avoids the DOI resolution headache for short-lived workflows in which long-term persistence is the least of worries.
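To make the "landing page as token, DOI as auxiliary metadata" pattern concrete, here is a minimal sketch of how a bot might pick the persistent identifier out of an HTTP Link header, in the way Signposting conveys it with the cite-as link relation. The function name links_by_rel, the sample header, and the DOI 10.1234/example are made up for illustration, and the regular expressions handle only simple, well-behaved headers; a production client would use a full Web Linking parser.

```python
import re

# Matches one link-value: the <target> and the parameter text that follows it,
# up to the start of the next <target>. Assumption: parameter values contain no "<".
LINK_VALUE = re.compile(r"<([^>]*)>([^<]*)")
# Matches rel="a b" (quoted, possibly several relation types) or rel=a (unquoted).
REL_PARAM = re.compile(r'\brel\s*=\s*(?:"([^"]*)"|([^\s;,]+))', re.IGNORECASE)


def links_by_rel(link_header: str) -> dict[str, list[str]]:
    """Group the link targets of a Link header value by relation type."""
    result: dict[str, list[str]] = {}
    for target, params in LINK_VALUE.findall(link_header):
        match = REL_PARAM.search(params)
        if match is None:
            continue  # a link without a rel parameter carries no relation to index
        for rel in (match.group(1) or match.group(2)).split():
            result.setdefault(rel.lower(), []).append(target)
    return result


if __name__ == "__main__":
    # Hypothetical header, as a landing page might return it.
    header = (
        '<https://doi.org/10.1234/example>; rel="cite-as", '
        '<https://example.org/meta.json>; rel="describedby"; type="application/json"'
    )
    links = links_by_rel(header)
    # The bot keeps working with the landing page URI it already has and merely
    # records the cite-as target for long-term referencing; it never resolves it.
    print(links.get("cite-as"))
```

The point of the sketch is what it leaves out: there is no request to doi.org anywhere in the workflow.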
When Joe shared his blog post on Mastodon, it led to a brief interaction in which Graham Klyne mentioned an effort to register doi: as a URI scheme. The IETF Datatracker page shows that the effort was launched by publishing an Internet Draft in May 2024. The draft has meanwhile reached its seventh version, which might suggest significant activity and interest. The associated GitHub repository tells another story: it shows 18 issues of which 4 are open, 14 are closed, and all but one were submitted by the author of the Internet Draft himself, Pierre-Anthony Lemieux of Sandflow Consulting LLC, a two-person digital media consulting company. A quick look-up in the IANA Registry for Uniform Resource Identifier (URI) Schemes reveals that a provisional registration of doi: as a URI scheme had taken place in 2020. At that time, the provisional registration was done by Frédéric Wang of Igalia. The brief 2020 registration record shows the effort was backed by the International DOI Foundation (IDF). Graham Klyne indicated that the IDF is also backing the current effort.
As a persistent identifier aficionado, I was aware of a previous effort that was started by the late Norman Paskin in 2002, but I was rather baffled not to have heard about this renewed attempt. Neither had my colleague Paul Walk of Antleaf, who makes a living as a technical consultant and developer working with Web technologies to support open access to research. It turned out that none of the other colleagues I immediately pinged - Michael L. Nelson, Martin Klein, and Patrick Hochstenbach - were aware of the effort, despite their long-term interest in persistent identifiers. This lack of awareness, even among the geeks, about a formal standardization effort pertaining to the most prevalent identifier used in scholarly communication is surprising to say the least, especially given the relentless advocacy for the use of persistent identifiers, for example, under the inescapable FAIR umbrella.
So, what is going on here? Why do DOIs need to be expressible using a dedicated doi: URI scheme, i.e., as a URI of the form doi:the-doi-name? The status quo seems satisfactory. In order to be usable on the Web, DOIs are routinely expressed as HTTP URIs of the form https://doi.org/the-doi-name. This practice has been in place for a long time and the Crossref DOI display guidelines started to officially recommend the HTTP approach in 2017. DOIs are also commonly displayed in print as doi:the-doi-name and in cases where this approach is used on the Web, the string is hyperlinked with https://doi.org/the-doi-name so as to make it clickable. Having two ways to express DOI names as URIs is bound to create confusion and goes against the existing HTTP-oriented recommendation that has gained broad traction in scholarly communication. Obviously, DOIs are used beyond scholarly communication, but the DOI Handbook confirms (Section 3.3: PRESENTATION FORMATS OF A DOI NAME) that the aforementioned practices also apply in other application domains. Furthermore, it's not clear what the relevance of a registered URI scheme is when it comes to how DOIs are typeset in journals: thus far, that is as the string doi:the-doi-name and post-registration it would be as the URI doi:the-doi-name.
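The two presentation forms discussed above are easy to convert into one another, which is part of why the status quo works. The following is a minimal sketch, not an official algorithm: the function names, the list of recognized prefixes (including the older dx.doi.org host), and the DOI 10.1234/example are assumptions made for illustration, and the only validation is that a prefix and a suffix are separated by a slash.

```python
from urllib.parse import quote, unquote

# Assumed set of HTTP(S) prefixes under which DOI names commonly appear.
HTTP_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)
PRINT_PREFIX = "doi:"


def doi_name(expression: str) -> str:
    """Return the bare DOI name from a print-style or HTTP-style expression."""
    text = expression.strip()
    lowered = text.lower()
    for prefix in HTTP_PREFIXES:
        if lowered.startswith(prefix):
            # In a URL the name may be percent-encoded, so decode it.
            text = unquote(text[len(prefix):])
            break
    else:
        if lowered.startswith(PRINT_PREFIX):
            text = text[len(PRINT_PREFIX):].strip()
    prefix_part, separator, suffix = text.partition("/")
    if not (prefix_part and separator and suffix):
        raise ValueError(f"not a DOI name: {expression!r}")
    return text


def to_https_uri(expression: str) -> str:
    """Express a DOI as an HTTP URI, percent-encoding characters such as '#'."""
    return "https://doi.org/" + quote(doi_name(expression), safe="/")


def to_print_form(expression: str) -> str:
    """Express a DOI in the doi:the-doi-name form used in print."""
    return PRINT_PREFIX + doi_name(expression)


if __name__ == "__main__":
    print(to_https_uri("doi:10.1234/example"))
    print(to_print_form("https://doi.org/10.1234/example"))
```

Note that the sketch treats doi:the-doi-name purely as a display string. Whether that string is also a registered URI makes no difference to the conversion, which is exactly the question the paragraph above raises.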
The current “doi” URI Scheme Internet Draft does not provide any substantial motivation for the registration of doi: as a URI scheme and only details HTTP-based resolution mechanisms. This is rather surprising because RFC2718 “Guidelines for new URL Schemes” and its successor RFC4395 “Guidelines and Registration Procedures for URI Schemes” state that new schemes must have demonstrable utility that is not available using existing schemes. RFC4395 puts it as follows: “New schemes ought to have utility to the Internet community beyond that available with already registered schemes.” It also states that “The scheme specification SHOULD discuss the utility of the scheme being registered.” Norman Paskin’s 2002 effort did address this aspect in a section entitled “Why Create a New URI Scheme for DOI?” The most pertinent statement in that section is “DOI is not bound to any Internet protocol and so requires its own dedicated URI scheme.” Paskin and his co-authors also include a section “Why Not Use a URN Namespace ID for DOI?” and provide as one of the justifications that the URN syntax does not support an optional query component and/or fragment identifier. Indeed, at the time of Paskin’s writing, RFC2141 “URN Syntax” did not support these; the 2017 RFC8141 “Uniform Resource Names (URNs)” that obsoletes RFC2141 does. Not that it matters because the current “doi” URI Scheme Internet Draft states “A DOI Name URI shall contain neither a query component nor a fragment component.”
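The constraint quoted from the Internet Draft is simple enough to check mechanically. The sketch below tests only that single rule using Python's generic URI splitting; it is not an implementation of the draft's full syntax, which is not reproduced here. The function name and the sample URIs are hypothetical.

```python
from urllib.parse import urlsplit


def has_query_or_fragment(uri: str) -> bool:
    """Report whether a doi: URI carries a query or fragment component.

    Only the one rule quoted from the draft is checked; everything else
    about the URI is accepted as-is.
    """
    parts = urlsplit(uri)
    if parts.scheme != "doi":  # urlsplit lowercases the scheme
        raise ValueError(f"not a doi: URI: {uri!r}")
    return bool(parts.query or parts.fragment)


if __name__ == "__main__":
    for candidate in (
        "doi:10.1234/example",             # allowed under the quoted rule
        "doi:10.1234/example?version=2",   # query component: not allowed
        "doi:10.1234/example#section-3",   # fragment component: not allowed
    ):
        print(candidate, has_query_or_fragment(candidate))
```

A side effect worth noticing: a DOI name that literally contains "?" or "#" would be read by a generic URI parser as introducing a query or fragment unless those characters are percent-encoded.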
It is interesting to further reflect on Paskin’s statement that DOI is not bound to any Internet protocol. Some, including myself, have observed an apparently long-standing aversion towards HTTP from some representatives of the IDF and from some of the originators of handles and the handle protocol (RFC3650, RFC3651, RFC3652). For example, resolving DOIs involves the handle system (DOIs are handles) and when discussing resolution of DOIs on the Web, the DOI Handbook refers to the facilitating system as “the HTTPS Proxy Server of the DOI System”. In the early days of the Web, it was not uncommon to regard HTTP merely as a replaceable transport protocol. As Michael Nelson and I observed in "Reminiscing About 15 Years of Interoperability Efforts", that isn’t really surprising because HTTP threw Gopher overboard and anonymous FTP had been the primary method of resource discovery and transfer. But HTTP avoidance is hardly justifiable anymore, two decades after the publication of Roy Fielding’s thesis (2000), the W3C’s "Architecture of the World Wide Web" (2004), and Tim Berners-Lee’s "Linked Data" (2006). Quite to the contrary, a choice for infrastructure based on any protocol other than the omnipresent and ever-evolving HTTP to achieve utility that can inherently be provided with HTTP is bound to eventually face a maintenance and sustainability nightmare.
Nevertheless, one would almost be tempted to read the current effort to register doi: as a URI scheme as a move towards a future in which identifiers of the form doi:the-doi-name are natively resolved by client applications using the handle protocol, even though the Internet Draft only mentions HTTP-based resolution. Native resolution of handles has received significant attention as part of the ongoing FAIR Digital Object effort, which has a webby (read HTTP) and non-webby track. The latter has a strong background in handles and the handle protocol, and has at its core the Digital Object Identifier Resolution Protocol (DO-IRP). The protocol specification states: “While the DO-IRP draws on the Handle System Protocol for basic technical matters, and is largely backwards compatible with it, DO-IRP is a neutral specification that is not tied to any specific implementation.” It goes on to say that the DO-IRP specification replaces the three handle protocol RFCs mentioned above. But registration of doi: as a URI scheme would not be particularly helpful in the FAIR Digital Object and DO-IRP context because, while many scholarly artifacts are assigned DOIs (remember: all DOIs are handles), many others are assigned handles that are not DOIs. In that context, registration of hdl: as a URI scheme would be significantly more helpful as it would cover handles including DOIs. But the URI scheme would then likely be hdl: instead of doi:, which raises the question: is the registration of the doi: URI scheme a mere branding exercise? We will only know if/when Pierre-Anthony Lemieux adds a section about the utility of a doi: scheme to his Internet Draft.
Acknowledgements: Many thanks to Patrick Hochstenbach, Martin Klein, Michael Nelson, and Paul Walk for feedback on a draft of this post.
(posted by MLN on behalf of HVDS; illustration by Pat Hochstenbach)