2015-07-24: ICSU World Data System Webinar #6: Web-Centric Solutions for Web-Based Scholarship

Earlier this week Herbert Van de Sompel gave a webinar for the ICSU World Data System entitled "Web-Centric Solutions for Web-Based Scholarship".  It's a short and simple review of some of the interoperability projects we've worked on since 1999, including OAI-PMH, OAI-ORE, and Memento.  He ends with a short nod to his simple but powerful "Signposting the Scholarly Web" proposal, but the slides in the appendix give the full description.

The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but fully adopting the semantics as part of the protocol.  Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now.  Some highlights include:
  • Although Google existed, it was not the hegemonic force that it is today, and contemporary search engines that did exist (e.g., AltaVista, Lycos) weren't that great (both in terms of precision and recall).  
  • The Deep Web was still a thing -- search engines did not reliably find obscure resources like scholarly resources (cf. our 2006 IEEE IC study "Search Engine Coverage of the OAI-PMH Corpus" and Kat Hagedorn's 2008 follow-up "Google Still Not Indexing Hidden Web URLs").
  • Related to the above, the focus in digital libraries was on repositories, not the web itself.  Everyone was sitting on an SQL database of "stuff" and HTTP was seen just as a transport over which to export the database contents.  This meant that the gateway script (ca. 1999, it was probably in Perl DBI) between the web and the database was the primary thing, not the database records or the resultant web pages (i.e., the web "resource").  
  • Focus on database scripts resulted in lots of people (not just us in OAI-PMH) tunneling ad-hoc/homemade protocols over HTTP.  In fairness, Roy Fielding's thesis defining REST only came out in 2000, and the W3C Web Architecture document was drafted in 2002 and finalized in 2004.  Yes, I suppose we should have sensed the essence of these documents in the early HTTP RFCs (2616, 2068, 1945) but... we didn't. 
  • The very existence of technologies such as SOAP (ca. 1998) nicely illustrates the prevailing mindset of HTTP as a replaceable transport. 
  • Technologies similar to OAI-PMH, such as RSS, were in flux and limited to 10 items (betraying their news syndication origin, which made them unsuitable for digital library applications).  
  • Full-text was relatively rare, so the focus was on metadata (see table 3 in the original UPS paper; every digital library description at the time distinguished between "records" and "records with full-text links").  Even if full-text was available, downloading and indexing it was an expensive operation for everyone involved -- bandwidth was limited and storage was expensive in 1999!  Sites like xxx.lanl.gov even threatened retaliation if you downloaded their full-text (today's text on that page is less antagonistic, but I recall the phrase "we fight back!").  Credit to CiteSeer for being an early digital library that was the first to use full-text (DL 1998).
Eventually Google Scholar announced they were deprecating OAI-PMH support, but the truth is they never really supported it in the first place.  It was just simpler to crawl the web, and the early focus on keeping robots out of the digital library had given way to making sure that they got into the digital library (e.g., Sitemaps).

The OAI-ORE and then Memento projects were more web-centric, as Herbert nicely explains in the slides, with OAI-ORE having a Semantic Web spin and Memento being more grounded in the IETF community.   As Herbert says at the beginning of the video, our perspective in 1999 was understandable given the practices at the time, but he goes on to say that he frequently reviews proposals about data management, scholarly communication, data preservation, etc. that continue to treat the web as a transport protocol over which the "real" protocol is deployed.  I would add that despite the proliferation of web APIs that claim to be RESTful, we're seeing a general retreat from REST/HATEOAS principles by the larger web community and not just the academic and scientific community.
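
To make the "transport" versus "semantics" distinction concrete, here is a minimal sketch (not taken from Herbert's slides) that only builds the two kinds of request and sends nothing over the network.  The host names, the record identifier, and the convention of appending the original URL to the TimeGate URL are assumptions for illustration; the `verb=GetRecord` arguments and the `Accept-Datetime` header are the parts that come from OAI-PMH and Memento respectively.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlencode

# Hypothetical endpoints, used purely for illustration.
OAI_BASE_URL = "http://repository.example.org/oai"
TIMEGATE_URL = "http://archive.example.org/timegate/"


def tunneled_request(identifier, metadata_prefix="oai_dc"):
    """OAI-PMH style: the 'real' protocol rides in the query arguments.

    HTTP only sees a GET on a single gateway URL; the verb and the
    record identifier mean nothing to generic web tools.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{OAI_BASE_URL}?{query}", {}


def web_centric_request(resource_url, when):
    """Memento style: the resource keeps its own URI and the desired
    time is expressed in an ordinary HTTP request header.

    `when` must be a timezone-aware datetime in UTC.
    """
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    # Assumed convention: this TimeGate takes the original URL as a path suffix.
    return TIMEGATE_URL + resource_url, headers


if __name__ == "__main__":
    print(tunneled_request("oai:example.org:1234"))
    print(web_centric_request("http://example.org/paper",
                              datetime(2010, 4, 1, tzinfo=timezone.utc)))
```

In the first function everything interesting is hidden behind one gateway URL; in the second, the thing being requested is still an ordinary web resource and the time dimension is negotiated the same way content type or language would be.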

In summary, our advice would be to fully embrace HTTP, since it is our community's Fortran and it's not going anywhere anytime soon.