2017-12-31: ACM Workshop on Reproducibility in Publication

On December 7 and 8 I attended the ACM Workshop on Reproducibility in Publication in NYC as part of my role as a member of the ACM Publications Board and co-chair (with Alex Wade) of the Digital Library Committee.  The purpose of this workshop was to gather input from the various ACM SIGs about the approach to reproducibility and "artifacts", objects supplementary to the conventional publication process.  The workshop was attended by 50+ people, mostly from the ACM SIGs, but also representatives from other professional societies, repositories, and hosting services.  A collection of the slides presented at the workshop and a summary report are being worked on now, so this trip report is mostly my personal perspective on the workshop; I'll update with slides, summary, and other materials as they become available.

This was the third such workshop that had been held, but it was the first for me since I joined the Publications Board in September of 2017.  I have a copy of a draft report, entitled "Best Practices Guidelines for Data, Software, and Reproducibility in Publication" from the second workshop, but I don't believe that report is public so I won't share it here.

I believe it was from these earlier workshops that the ACM adopted its policy of including "artifacts" (i.e., software, data, videos, and other supporting materials) in the digital library.  At the time of this meeting the ACM DL had 600-700 artifacts.  To illustrate the ACM's approach to reproducibility and artifacts in the DL, below I show an example from ACM SIGMOD (each ACM SIG is implementing different approaches to reproducibility as appropriate within its community).

The first image below is a paper from ACM SIGMOD 2016, "ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks", which has the DOI URI of https://doi.org/10.1145/2882903.2882964.  This page also links to the SIGMOD guidelines for reproducibility.

Included under the "Source Materials" tab is a link to a zip file of the software and a separate README file in unprocessed markdown format.  What this page doesn't link to is the software page in the ACM DL that also has a separate DOI, https://doi.org/10.1145/3159287.  The software DOI does link back to the SIGMOD paper, but the SIGMOD paper does not appear to explicitly link to the software DOI (again, it links to just the zip and README).
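The one-way link described above can be sketched as a small check.  This is only an illustration: the `links` map is hand-built from what I observed on the two landing pages, not pulled from any ACM DL API, and the function names `one_way_links` and `doi_url` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hand-built link map (an assumption for illustration, not ACM DL metadata).
# Keys are DOIs; values are the DOIs each landing page explicitly links to.
links: Dict[str, Set[str]] = {
    # SIGMOD paper: links only to the zip file and README, not the software DOI
    "10.1145/2882903.2882964": set(),
    # Software record: links back to the SIGMOD paper
    "10.1145/3159287": {"10.1145/2882903.2882964"},
}


def one_way_links(link_map: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs where source links to target
    but target does not link back to source."""
    missing = []
    for source, targets in link_map.items():
        for target in targets:
            if source not in link_map.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI."""
    return "https://doi.org/" + doi


for source, target in one_way_links(links):
    print(f"{doi_url(source)} links to {doi_url(target)}, but not the reverse")
```

If the paper's page also listed the software DOI, `one_way_links` would return an empty list, which is the state a DL would presumably want for every paper and artifact pair.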

On that page I've also clicked on the "artifacts" button to produce a pop-up that explains the various "badges" that the ACM provides; a full description is also available at a separate page.  More tellingly, on this page there is a link to the software as it exists in GitHub.

In slight contrast to the SIGMOD example, the Graphics Replicability Stamp Initiative (GRSI) embraces GitHub completely, linking both to the repositories of the individuals (or groups) that wrote the code and to forks of the code within the GRSI account.  Of course, existing in GitHub is not the same as being archived (reminder: the fading of SourceForge and the closing of Google Code) and a DL has a long-term responsibility in hosting bits and not just linking to them (though to be fair, Git is bigger than GitHub and ACM could commit to git.acm.org).  On the other hand, as GRSI implicitly acknowledges, decontextualizing the code from the community and functions that the hosting service (in this case, GitHub) provides is not a realistic short- or mid-term approach either.  Resolving the tension between memory organizations (like ACM) and non-archival hosting services (like GitHub) is one of the goals of the ODU/LANL AMF-funded project ("To the Rescue of the Orphans of Scholarly Communication": slides, video, DSHR summary) and I hope to apply the lessons learned from the research project to the ACM DL.

One of the common themes was "who evaluates the artifacts?"  Initially, most artifacts are considered only for publications otherwise already accepted, and in most cases the evaluation is done non-anonymously by a different set of reviewers.  That adapts best to the current publishing process, but it is unresolved whether or not this is the ideal process -- if artifacts are to become true first class citizens in scholarly discourse (and thus the DL), perhaps they should be reviewed simultaneously with the paper submission.  Of course, the workload would be immense and anonymity (in both directions) would be difficult if not impossible.  Setting aside the issue of whether or not that is desirable, it would still represent a significant change to how most conferences and journals are administered.  Furthermore, while some SIGs have successfully implemented initial approaches to artifact evaluation with grad students and post-docs, it is not clear to me that this is scalable, and I'm not sure it sends the right message about the importance of the artifacts.

Some other resources of note:

The discussion of identifiers, and especially DOIs, is of interest to me because one of the points I made in the meeting and continued on Twitter can roughly be described as "DOIs have no magical properties".  No one actually claimed this, of course, but I did feel the discussion edging toward "just give it a DOI" (cf. getting DOIs for GitHub repositories).  I'm not against DOIs; rather, the short version of my caution is that currently there is a correlation between "archival properties" and "things we give DOIs to", but DOIs do not cause archival properties.

There was a fair amount of back channel discussion on Twitter with "#acmrepro"; I've captured the tweets during and immediately after the workshop in the Twitter moment embedded below.

I'll update this post as slides and the summary report become available.