2024-07-20: ACM Conference on Reproducibility and Replicability 2024 - Virtual Trip Report

Figure 1: Homepage of ACM REP ‘24

The ACM Conference on Reproducibility and Replicability (ACM REP) is a recently launched venue for computer and information scientists whose research concerns reproducibility and replicability. It covers practical issues in computational science and data exploration, emphasizing the importance of community collaboration in adopting new methods and tackling the challenges of this relatively young field. The conference is linked with the ACM Emerging Interest Group for Reproducibility and Replicability.

The 2024 ACM Conference on Reproducibility and Replicability (ACM REP ‘24) offered both in-person and virtual attendance to broaden participation. Held from June 18th to June 20th at the Inria Conference Center in Rennes, France, the conference featured tutorials, keynote speeches, paper presentations, and discussions on reproducibility in computational research. The program included four tutorial sessions (each divided into three to four parallel tracks), six paper presentation sessions, and a Gather Town poster session for virtual participants.

Day 1 (June 18th, 2024): Tutorial Sessions

The tutorial sessions were divided into four tracks, each covering three or four topics in parallel, focusing on different aspects of reproducibility.

The ‘EnOSlib: Distributed System Experiments at Your Fingertips’ tutorial was a three-hour workshop offering hands-on experience with the EnOSlib library and the Grid’5000 platform. The tutorial covered composing EnOSlib functions to build experimental artifacts and performing parameter sweeps over infrastructure settings, highlighting the benefits of these tools for reproducibility and replicability in distributed system experiments. While this knowledge may not be directly applicable to my current research, it will be beneficial in enhancing the reliability of my future research projects, particularly in studies involving complex distributed systems.
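
To illustrate the general pattern the tutorial demonstrated, here is a minimal sketch, in plain Python, of a parameter sweep over infrastructure settings. The cluster names, parameters, and the run_trial() helper are hypothetical placeholders rather than actual EnOSlib API calls.

    # Conceptual sketch of a parameter sweep over infrastructure settings, in the
    # spirit of the tutorial; the settings and run_trial() below are hypothetical
    # placeholders, not EnOSlib API calls.
    import itertools
    import json

    CLUSTERS = ["cluster-a", "cluster-b"]   # hypothetical Grid'5000-style clusters
    NODE_COUNTS = [2, 4, 8]                 # hypothetical deployment sizes
    PAYLOAD_KB = [64, 256]                  # hypothetical workload parameter

    def run_trial(cluster: str, nodes: int, payload_kb: int) -> dict:
        """Placeholder for reserving resources, deploying, and measuring one run."""
        return {"cluster": cluster, "nodes": nodes, "payload_kb": payload_kb,
                "latency_ms": None}  # a real trial would record actual measurements

    results = [run_trial(c, n, p)
               for c, n, p in itertools.product(CLUSTERS, NODE_COUNTS, PAYLOAD_KB)]

    # Persisting every (parameters, result) pair is what makes the campaign replayable.
    with open("sweep_results.json", "w") as f:
        json.dump(results, f, indent=2)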

Day 2 (June 19th, 2024): Keynotes and Paper Sessions

The second day started with introductory notes from the conference co-chairs, Dr. Jay Lofstead and Dr. Tanu Malik, followed by a keynote by Prof. Dr. Anne-Laure Boulesteix on "Replicable Empirical Machine Learning Research". The keynote highlighted the challenges in ensuring reliable results in machine learning studies. Prof. Boulesteix raised three questions:

  1. Are the results published in this field reliable? 
  2. When authors claim that their new method performs better than existing ones, should readers trust them? 
  3. Is an independent study likely to obtain similar results?

For all these questions, her answer was ‘not always’. Then she discussed the replication crisis in science, noting that issues like publication bias, cherry-picking, and poor experimental design often undermine the trustworthiness of machine learning research. Prof. Boulesteix urged the community to adopt better practices to improve the reliability of empirical evidence, using benchmark studies on high-dimensional biological data as examples of recent positive developments.

Then the first presentation session, themed “Provenance and Reproducibility” and chaired by Dr. Victoria Stodden, started with Nichole Boufford et al.’s work on “Computational Experiment Comprehension Using Provenance Summarization”. Their work proposed summarizing provenance graphs using large language models (LLMs) to create textual summaries. Their user study showed that these summaries are promising for experiment reproduction, and the qualitative results suggest future directions for reproducibility tools.
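
As a conceptual sketch of that idea (not the authors' implementation), one could serialize a small provenance graph into text and hand it to an LLM for summarization; the summarize_with_llm() function below is a hypothetical placeholder for whichever model or API would actually be called.

    # Conceptual sketch: turn a tiny provenance graph into a textual prompt and
    # ask an LLM to summarize it. Not the authors' implementation; the LLM call
    # is a hypothetical placeholder.
    import json

    provenance = {
        "nodes": [
            {"id": "raw.csv", "type": "file"},
            {"id": "clean.py", "type": "process"},
            {"id": "clean.csv", "type": "file"},
            {"id": "train.py", "type": "process"},
            {"id": "model.pkl", "type": "file"},
        ],
        "edges": [
            ("raw.csv", "clean.py"), ("clean.py", "clean.csv"),
            ("clean.csv", "train.py"), ("train.py", "model.pkl"),
        ],
    }

    def build_prompt(graph: dict) -> str:
        """Serialize the provenance graph into a summarization prompt."""
        return ("Summarize, in plain English, what this computational experiment "
                "did, based on its provenance graph:\n" + json.dumps(graph, indent=2))

    def summarize_with_llm(prompt: str) -> str:
        """Hypothetical placeholder; in practice this would call an actual LLM."""
        return "(LLM-generated summary would appear here)"

    print(summarize_with_llm(build_prompt(provenance)))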

The first session concluded with the presentation by Samuel Grayson et al. on “A Benchmark Suite and Performance Analysis of User-Space Provenance Collectors”.

The second presentation session, on “The Human Side of Reproducibility”, was chaired by Dr. Camille Maumet. Christian Gilbertson presented “Towards Evidence-Based Software Quality Practices for Reproducibility”. Next, Yantong Zheng presented “The Idealized Machine Learning Pipeline for Advancing Reproducibility in Machine Learning”.

The third presentation session, themed “Computational Experiment Preservation” and chaired by Dr. Arnaud Legrand, included two “Best Paper Finalist” presentations among its three papers:

  1. Longevity Of Artifacts In Leading Parallel And Distributed Systems Conferences: A Review Of The State Of The Practice In 2023. Quentin Guilloteau, Florina M. Ciorba, Millian Poquet, Dorian Goepp and Olivier Richard (slides) (Best Paper Finalist)
  2. Source Code Archival To The Rescue Of Reproducible Deployment. Ludovic Courtès, Timothy Sample, Simon Tournier and Stefano Zacchiroli
  3. The Impact Of Hardware Variability On Applications Packaged With Docker And Guix: A Case Study In Neuroimaging. Gaël Vila, Emmanuel Medernach, Inés Gonzalez Pepe, Axel Bonnet, Yohan Chatelain, Michaël Sdika, Tristan Glatard and Sorina Camarasu Pop (slides) (Best Paper Finalist)

As the final presenter of the third session, Dr. Sorina Camarasu-Pop discussed how different hardware setups affect the reproducibility of neuroimaging applications. Reproducibility in computational research is often challenged by software dependencies and numerical instability, particularly in complex fields like neuroimaging. The researchers used containerization tools, Docker and Guix, to create consistent environments for running their experiments. She discussed how they conducted a series of tests using Monte Carlo Arithmetic (MCA) to introduce controlled variations in the computations. Their findings showed that while Docker and Guix can produce consistent results on the same hardware, variations in hardware, software, and numerical computations lead to differences in results. Although of similar magnitudes, these variations are uncorrelated and can affect subsequent analyses. The presentation highlighted the importance of transparency in how software is compiled and the need to consider hardware differences to ensure reproducibility in scientific research.
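
To convey the intuition behind that methodology, here is a simplified stand-in for what an MCA-style study measures. Real MCA tooling instruments individual floating-point operations; this sketch (my own illustration, not the authors' code) only perturbs the inputs of a toy pipeline and reports how many significant digits survive across repeated randomized runs.

    # Simplified illustration of Monte Carlo Arithmetic-style variability analysis:
    # repeat a computation under tiny random perturbations and measure the spread.
    # Real MCA tools perturb individual floating-point operations; here we only
    # perturb the inputs of a toy pipeline.
    import numpy as np

    def perturb(x, rng, magnitude=2**-53):
        """Apply a tiny random relative perturbation, on the order of a double-precision ulp."""
        return x * (1.0 + rng.uniform(-magnitude, magnitude, size=x.shape))

    def noisy_pipeline(data, rng):
        """A toy 'analysis' whose result drifts slightly under input perturbation."""
        x = perturb(data, rng)
        return float(np.sum(x * np.sin(x)))

    rng = np.random.default_rng(0)
    data = np.linspace(0.0, 100.0, 1_000_000)
    results = np.array([noisy_pipeline(data, rng) for _ in range(30)])

    mean, std = results.mean(), results.std()
    # Rough count of significant digits shared across runs, as reported in MCA-style studies.
    sig_digits = -np.log10(std / abs(mean)) if std > 0 else np.inf
    print(f"mean={mean:.10g}, std={std:.3g}, ~{sig_digits:.1f} significant digits")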

The last session of the day, “Poster Lightning Talks”, was chaired by Dr. Jay Lofstead. Each of the five presenters was given two minutes to showcase the ‘big picture story’ of their research.

Day 3 (June 20th, 2024): Frameworks and Short Paper Session

The final day of the conference began with a keynote by Dr. Konrad Hinsen on "Reproducibility and Replicability of Computer Simulations". He addressed the progress and remaining challenges in computational reproducibility and replicability (R&R). Focusing on computer simulations, Dr. Hinsen explored the following questions and answers: 

  1. Should computer simulations be made reproducible? Why? “Yes. If I cannot reproduce your simulation, then I don’t know what you have simulated.”
  2. To the last bit, or on a "good enough" basis? “Bit for bit, because it is cheaper and more useful.”
  3. At what cost? “Zero, once we have suitable infrastructure and have adapted our code to it.”
  4. Can we ensure reproducibility without repeating lengthy computations? “Yes, it can be guaranteed by the infrastructure.”

The talk also examined whether replicability is more crucial than reproducibility, the current state of simulation replicability, and the obstacles hindering better replicability.

After the Gather Town poster presentation event, we had the fifth session, themed “Reproducibility Enhancing Frameworks” and chaired by Dr. Sameer Shende. Adhithya Bhaskar virtually presented the work “Reproscreener: Leveraging LLMs for Assessing Computational Reproducibility of Machine Learning Pipelines”. He introduced “Reproscreener”, a tool that automates the evaluation of reproducibility metrics in machine learning research. The presentation covered the challenges of ensuring reproducibility in machine learning, outlined existing guidelines and metrics, and detailed the architecture and empirical performance of Reproscreener. Using LLMs, the tool automates the assessment process by evaluating both manuscripts and code repositories, providing detailed feedback and reproducibility scores. The presentation also highlighted the ReproScore metric for an overall assessment of manuscript readiness, demonstrating the potential of automation and comprehensive metric evaluation for improving the reproducibility of machine learning studies.
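
As a rough illustration of what automated checks of this kind can look like (a hypothetical sketch, not Reproscreener's actual implementation, metrics, or LLM-based manuscript analysis), a minimal repository scan might score a handful of reproducibility indicators:

    # Hypothetical sketch of automated reproducibility checks on a code repository.
    # Not Reproscreener's implementation; just the kind of signals such tools score.
    from pathlib import Path

    CHECKS = {
        "has_readme": ["README.md", "README.rst", "README.txt"],
        "pins_dependencies": ["requirements.txt", "environment.yml", "poetry.lock"],
        "ships_container": ["Dockerfile", "Singularity", "apptainer.def"],
        "has_license": ["LICENSE", "LICENSE.md", "COPYING"],
    }

    def score_repository(repo: Path) -> dict:
        """Report which indicator files are present and a naive overall score in [0, 1]."""
        report = {name: any((repo / f).exists() for f in files)
                  for name, files in CHECKS.items()}
        report["score"] = sum(report.values()) / len(CHECKS)
        return report

    if __name__ == "__main__":
        import sys
        print(score_repository(Path(sys.argv[1] if len(sys.argv) > 1 else ".")))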

In the second and third presentations of the session, Michael Arbel showcased “MLXP: A Framework For Conducting Replicable Machine Learning Experiments in Python” and Guineng Zheng presented “LogFlux: A Software Suite For Replicating Results In Automated Log Parsing”, respectively. Notably, LogFlux enables large-scale log parsing studies independent of the original authors' code, promoting reproducibility and transparency in log parsing research.

In the final presentation session, the authors of the five accepted short papers were each given 18 minutes to showcase their work:

  • Statistical Comparison in Empirical Computer Science with Minimal Computation Usage by Timothée Mathieu and Philippe Preux
  • Embracing Deep Variability for Reproducibility and Replicability by Mathieu Acher et al.
  • Can Citations Tell Us About a Paper’s Reproducibility? A Case Study of Machine Learning Papers by Rochana R. Obadage, Sarah M. Rajtmajer, and Jian Wu
  • Evaluating Tools for Enhancing Reproducibility in Computational Scientific Experiments by Lázaro Costa et al.
  • Toward Evaluating the Reproducibility of Information Retrieval Systems with Simulated Users by Timo Breuer and Maria Maistro

As the third presenter of this sixth session, I virtually presented our short paper, "Can citations tell us about a paper's reproducibility? A case study of machine learning papers". The study was conducted with Dr. Sarah Rajtmajer and supervised by Dr. Jian Wu to assess whether downstream citations can indicate the reproducibility of machine learning papers. I discussed how we conducted the study using a pilot dataset of 41,244 citation contexts extracted from 130 ML papers. We also presented the correlation we observed between reproducibility scores and citation context sentiment, obtained by training machine learning models to classify citation contexts by their reproducibility sentiment. Our findings suggest that it is possible to statistically estimate the reproducibility of ML papers from downstream citation contexts, offering a potential surrogate for assessing reproducibility trends when direct reproducibility studies are not feasible. The attendees complimented the novelty of our work.
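
For readers curious about the general approach, the sketch below trains a toy text classifier over citation contexts labeled by reproducibility sentiment; the examples, features, and model (TF-IDF with logistic regression) are illustrative only and do not reflect our actual dataset or models.

    # Illustrative sketch: classify citation contexts by reproducibility sentiment.
    # The toy examples and pipeline are placeholders, not our study's actual setup.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    contexts = [
        "We reproduced their reported accuracy using the released code.",
        "Their public implementation made it easy to rerun all experiments.",
        "We could not reproduce the results; no code or data were provided.",
        "The paper omits key hyperparameters, so our reimplementation diverged.",
    ]
    labels = ["positive", "positive", "negative", "negative"]  # reproducibility sentiment

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(contexts, labels)

    print(clf.predict(["The authors' repository let us replicate every figure."]))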

Closing Ceremony

The closing ceremony of ACM REP ‘24 began with an award presentation, announcing the winners for Best Junior Presentation and Best Paper.

  • Best Junior Presentation Award: “Computational Experiment Comprehension Using Provenance Summarization” by Nichole Boufford, Joseph Wonsil, Adam Pocock, Jack Sullivan, Margo Seltzer and Thomas Pasquier (slides) (Best Paper Finalist)
  • Best Paper Award: “The Impact Of Hardware Variability On Applications Packaged With Docker And Guix: A Case Study In Neuroimaging” by Gaël Vila, Emmanuel Medernach, Inés Gonzalez Pepe, Axel Bonnet, Yohan Chatelain, Michaël Sdika, Tristan Glatard and Sorina Camarasu Pop

Attending ACM REP ‘24 was an inspiring experience that emphasized the importance of reproducibility in research. The sessions and discussions were insightful and motivating. We expect to incorporate these new ideas into practice in our future work. 


– Rochana R. Obadage
