- data quality criteria and contexts
- human and institutional factors
- tools for effective and painless curation
The workshop report will include the results of various breakout sessions. The sessions I took part in addressed questions such as: how contextual information should be archived with the data (cf. "preservation description information" and "knowledge base" from OAIS), how much of a university's institutional overhead goes to institutional repositories and archiving capability ("put everything in the cloud" is neither an informed nor an acceptable answer), and how to handle versioning and diff/patch in large data sets (tools like Galaxy and Google Refine were mentioned in the larger discussion).
(2012-10-23 edit: the final workshop report is now available.)
Tracking it Back to the Source: Managing and Citing Research Data" workshop in Denver on September 24. This one-day workshop focused on how to cite and link to scientific data sets (a topic that came up several times in the UNC workshop as well). While I applaud the move to make data sets first-class objects in the scholarly communication infrastructure, I always feel there is an unstoppable momentum to "solve" the problem by simply saying "use DOIs" (e.g., DataCite), while ignoring the hard issues of what exactly a DOI refers to (see: ORE Primer), how to version what it might point to (see: Memento), as well as the minor quibble that DOIs aren't actually URIs (look it up: "doi" is not in the registry). In short, DOIs are a good start, but they just push the problem one level down instead of solving it. Highlights from the workshop included a ResourceSync+Memento presentation from Herbert Van de Sompel and "Data Equivalence" by Mark Parsons of the NSIDC.
pre-meeting (0.1) version of the specification is no longer valid. Many issues are still being considered and I won't cover the details here, but the main result is that the ResourceSync format will no longer be based on Sitemaps. We were all disappointed to have to make that break, but Martin Klein did a nice set of experiments (to be released later) showing that, although Sitemaps are superficially suitable for the job, there were just too many areas where their primary focus of advertising URIs to search engines inhibited the more nuanced use of advertising resources that have changed.