2012-09-29: Data Curation, Data Citation, ResourceSync

During September 10-11, 2012, I attended the UNC/NSF workshop "Curating for Quality: Ensuring Data Quality to Enable New Science" in Arlington.  The structure of the workshop was to invite about 20 researchers involved with all aspects of data curation and solicit position papers on one of four broad topics:
  1. data quality criteria and contexts
  2. human and institutional factors
  3. tools for effective and painless curation
  4. metrics
Although the majority of the discussion was about science data, my position paper was about the importance of archiving the web: in short, treating the web itself as the corpus that should be retained for future research.  The pending workshop report will have a full list of participants and their papers, but in the meantime I've uploaded to arXiv my paper, "A Plan for Curating 'Obsolete Data or Resources'", which is a summary version of the slides I presented at the Web Archiving Cooperative meeting this summer.

To be included in the workshop report are the results of the various breakout sessions.  The sessions I took part in addressed questions such as: how contextual information should be archived with the data (cf. "preservation description information" and "knowledge base" from OAIS); how much of a university's institutional overhead goes to institutional repositories and archiving capability ("put everything in the cloud" is neither an informed nor an acceptable answer); and how to handle versioning and diff/patch in large data sets (tools like Galaxy and Google Refine were mentioned in the larger discussion).
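
As an aside on the diff/patch question, here is a minimal sketch of the kind of record-level delta a curation tool needs to compute between two versions of a keyed tabular data set; the file names and the "id" key column are hypothetical, and this is only an illustration of the problem, not a tool anyone proposed at the workshop.

```python
import csv

def load(path, key="id"):
    # Index each row of a CSV file by its (hypothetical) "id" column.
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def diff(old, new):
    # Report which records were added, removed, or changed between versions,
    # so only the delta (the "patch") needs to be stored or exchanged.
    added   = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed

if __name__ == "__main__":
    added, removed, changed = diff(load("dataset_v1.csv"), load("dataset_v2.csv"))
    print(len(added), "added;", len(removed), "removed;", len(changed), "changed")
```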

(2012-10-23 edit: the final workshop report is now available.)

A nice complement to the Data Curation workshop was the NISO "Tracking it Back to the Source: Managing and Citing Research Data" workshop in Denver on September 24.  This one-day workshop focused on how to cite and link to scientific data sets (a topic that came up several times in the UNC workshop as well).  While I applaud the move to make data sets first-class objects in the scholarly communication infrastructure, I always feel there is an unstoppable momentum to "solve" the problem by simply saying "use DOIs" (e.g., DataCite), while ignoring the harder issues: what exactly a DOI refers to (see: ORE Primer), how to version whatever it might point to (see: Memento), and the minor quibble that DOIs aren't actually URIs ("doi" does not appear in the IANA URI scheme registry).  In short, DOIs are a good start, but they just push the problem one level down instead of solving it.  Highlights from the workshop included a ResourceSync+Memento presentation from Herbert Van de Sompel and a "Data Equivalence" presentation by Mark Parsons of the NSIDC.
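
To make the URI quibble concrete, below is a minimal sketch (my illustration, not something presented at the workshop) of how a DOI actually gets dereferenced: because "doi" is not a registered URI scheme, clients rewrite the DOI name against the doi.org HTTP proxy and follow redirects from there.  The DOI used is the DOI Handbook's own example; whatever the redirect chain ends at today is exactly the kind of moving target the ORE and Memento questions are about.

```python
import urllib.request

doi = "10.1000/182"                  # a DOI name; "doi:10.1000/182" is not a valid URI
http_uri = "https://doi.org/" + doi  # the HTTP URI that is actually dereferenced

req = urllib.request.Request(http_uri, method="HEAD")
with urllib.request.urlopen(req) as resp:
    # urlopen follows the proxy's redirect chain; resp.url is the landing page
    # the DOI currently resolves to, which can change over time.
    print(doi, "->", resp.url)
```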

After the NISO workshop, there was a two-day ResourceSync working group meeting (September 25-26) in Denver.  We made a great deal of progress on the specification; the pre-meeting (0.1) version is no longer valid.  Many issues are still being considered and I won't cover the details here, but the main result is that the ResourceSync format will no longer be based on Sitemaps.  We were all disappointed to have to make that break, but Martin Klein did a nice set of experiments (to be released later) showing that, although Sitemaps are superficially suitable for the job, there are just too many areas where their primary focus, advertising URIs to search engines, inhibits the more nuanced use of advertising resources that have changed.
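
For readers who haven't looked at the Sitemap format, a rough sketch of the mismatch (my own illustration, not working group material): a Sitemap <url> entry carries only loc, lastmod, changefreq, and priority, so it can tell a crawler that a URI exists and roughly when it was last modified, but it has no standard way to say how a resource changed, or that it was created or deleted.  The example URI below is hypothetical.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SM_NS)

# Build a minimal, valid Sitemap entry: this is the whole vocabulary available.
urlset = ET.Element("{%s}urlset" % SM_NS)
url = ET.SubElement(urlset, "{%s}url" % SM_NS)
ET.SubElement(url, "{%s}loc" % SM_NS).text = "http://example.org/data/set1.csv"
ET.SubElement(url, "{%s}lastmod" % SM_NS).text = "2012-09-25"
ET.SubElement(url, "{%s}changefreq" % SM_NS).text = "weekly"  # a crawl hint only;
# there is no element for "what changed", "created", or "deleted".

print(ET.tostring(urlset, encoding="unicode"))
```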

--Michael
