Wednesday, March 27, 2013

2013-03-27: ResourceSync Meeting and JCDL 2013 PC Meeting

On March 21 & 22, members of the ResourceSync technical group met in Ann Arbor, Michigan to work on the 0.5 version of the ResourceSync specification.  In case you're not familiar, ResourceSync is a framework, intended to replace OAI-PMH, for specifying how a destination ("harvester" in PMH terms) can synchronize the web resources of a source ("repository" in PMH terms).  The source publishes a list of resources that it makes available via ResourceSync (which may be a subset of valid resources at the web site) using Sitemaps, with the idea that if you're already using Sitemaps then you are already minimally compliant, and the more advanced features of ResourceSync also use the Sitemap syntax for consistency.  Although the syntactic details are in flux, Herbert's presentation at the September 2012 NISO Forum is a good introduction to the framework, as are the two recent D-Lib Magazine articles (Sept/Oct 2012 and Jan/Feb 2013).
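To make the "already minimally compliant" point concrete, here is a minimal sketch of the plain Sitemap a source might already be publishing.  It uses only the standard Sitemap namespace and elements; the example.org URIs and the build_sitemap helper are my own illustrative assumptions, and no ResourceSync-specific extensions are shown because that syntax is still in flux.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(resources):
    """Build a plain Sitemap <urlset> from (uri, lastmod) pairs.

    build_sitemap is a hypothetical helper for illustration only;
    it is not part of any ResourceSync library.
    """
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical resources; a real source would list its own URIs.
print(build_sitemap([
    ("http://example.org/res1", "2013-03-21"),
    ("http://example.org/res2", "2013-03-22"),
]))
```

A destination that understands nothing beyond Sitemaps could still fetch each loc and use lastmod to decide what to re-fetch.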

Some important but nuanced results came from the March meeting, many of which are contained in the figure below (collaboratively drawn at the whiteboard and then Omnigraffled by Graham Klyne):



Some highlights:
  • The use of <sitemapindex> files in ResourceSync is now consistent.  The purpose of sitemap indexes is to provide pagination for large sitemaps (50K URIs or 10MB total), but those are engineering limitations; logically, two physical sitemaps listed in a sitemap index are a single sitemap.  In the 0.5 version of the ResourceSync specification sitemap indexes were used both for pagination and for specifying archive functionality and grouping capability lists.  That dual use has been removed and indexes are now used only for pagination (shown as the dashed boxes in each of the 4 columns in the figure above).
  • A resource set capability list can describe an entire site, or a site can have multiple resource set capability lists (e.g., the various collections in physics, mathematics, etc. in arXiv).  This ability to subdivide the site is analogous to "sets" in PMH.
  • The top part of the figure contains a few changes as well: the set of capability lists is now contained within a sitemap called a resource set list.  Although the number of capabilities for a single resource set list will never grow to need an index, it is possible that the number of resource set lists will grow to need an index (the dashed box at the very top of the figure).  An example would be creating a resource set for each category in a large wiki; there could easily be more than 50K such resource set capability lists.
  • You'll notice that only four capabilities are listed: Resource List (i.e., conventional Sitemaps), Change List, Resource Dump, and Change Dump (just a quick reminder that these capabilities are orthogonal and optionally implementable; for example you can implement a Change List without implementing a Resource List).  A discussion of archiving these capabilities will be moved to a (yet to be published) separate document, and will borrow heavily from Memento to describe archiving.  The subtle distinction is that to make Change Lists and Change Dumps useful, a source will likely have to provide some memory: a single change (memory=1) is not useful, and a list (memory=n) is required for a destination to make use of this capability.  But there is a distinction between the level of memory provided (i.e., ResourceSync semantics) and the possible archiving of these capabilities (i.e., Memento semantics).
  • Also to be described in a separate document will be the server push description of ResourceSync.  The current document will continue to only be client pull.  
  • The manifests inside the dumps (i.e., inside the zip file) are to be provided as a single Sitemap, even if there are > 50K URIs.  Providing indexes inside of a dump just isn't worth the trouble.  Also note that although zip files are mentioned by name in the specification, we've left the door open for other formats and encodings (e.g., .tar.gz, .tar.bz2).
  • Finally, even if you have only a single Resource Set List, it will be wrapped with a Resource Set List Index, so the /.well-known/resourcesync URI can point to the same thing over time.  Yes, that's not ideal, but if you don't do it that way, then adding your second Resource Set List (which you'll eventually do) is an even bigger problem than having an index with just a single member.
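The "indexes are only for pagination" point in the highlights above can be sketched in a few lines: one logical list of URIs is split into physical pages of at most 50K entries, and the index exists only to enumerate those pages.  This is a rough illustration under my own assumptions, not text from the specification: the paginate helper and the example.org URIs are hypothetical, and the 10MB size limit is ignored for simplicity.

```python
MAX_URIS_PER_SITEMAP = 50_000  # per-file URI limit from the Sitemap protocol


def paginate(uris, page_size=MAX_URIS_PER_SITEMAP):
    """Split one logical list into physical pages of at most page_size URIs.

    paginate is a hypothetical helper; the 10MB file-size limit is not
    checked here.
    """
    return [uris[i:i + page_size] for i in range(0, len(uris), page_size)]


# Hypothetical source with 120,000 resources.
uris = [f"http://example.org/res/{n}" for n in range(120_000)]
pages = paginate(uris)

# Three physical sitemaps, still one logical list; an index is needed
# only because there is more than one page.
print(len(pages), [len(p) for p in pages])  # 3 [50000, 50000, 20000]
```

The manifest inside a dump would skip this step entirely, since it is provided as a single Sitemap regardless of how many URIs it contains.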
We discussed a host of other changes, such as how to specify the temporal coverage of a Change List (see figure 1 in the specification -- does Change List 2's coverage begin at t3, or at the time of the first observed change, which might be > t3?), but I'll leave those details for the 0.6 version of the specification.

On Friday after the ResourceSync meeting, Rob, Martin, and I hopped in a rental car and drove from Ann Arbor to Chicago for the JCDL 2013 Program Committee meeting.  The program will be released soon, but it was a full weekend of reviewing, meta-reviewing, and planning.  Accepted for presentation were 28 long and 22 short papers (for an acceptance rate of ~29% for each category; exact numbers will be released later), and some large number of posters.  This year each paper had three regular reviews and two meta-reviewers.  Last year was the first year of meta-reviewing for JCDL (with a single meta-reviewer), and I think the quality of the reviews is better as a result of meta-reviewing.  Although it was a lot of work and I was anxious about which of my own papers made it in, this year Herbert and I weren't responsible for chairing the program, so things were much less stressful.

--Michael
