2024-08-25: IMLS Grant Awarded on Preserving Open Access Datasets and Software for Sustained Computational Reproducibility
Figure 1: The schematic overview of the project, including main components (lower) and stakeholders (upper).
In collaboration with Dr. Sawood Alam at the Internet Archive, Dr. Edward Fox at Virginia Tech, and Bill Ingram at Virginia Tech Libraries, I am grateful to have received an award from the Institute of Museum and Library Services (IMLS) this year. The title of the grant is Preserving Open Access Datasets and Software for Sustained Computational Reproducibility. The total amount is about $564k. The official webpage of this award is https://www.imls.gov/grants/awarded/lg-256694-ols-24.
Recently, concerns about reproducibility have been raised in multiple academic disciplines, such as the Social and Behavioral Sciences, Biomedical and Life Sciences, and Computer and Information Sciences. Datasets and software packages are crucial resources in many research domains that require data analysis. Collberg and Proebsting found that a large fraction of works in Computer Science were not reproducible because the code and/or data were not available. In part due to advocacy for open science, an increasing number of authors choose to share datasets and software publicly. Further, the White House Office of Science and Technology Policy issued guidance in 2022 to make federally funded research freely available without delay. However, our recent research indicates that a substantial fraction of Open Access Datasets and Software (OADS) is not archived, which makes it harder for the academic and industrial communities to reproduce or replicate research outcomes. Therefore, it is urgent to identify and preserve endangered OADS resources for sustainable reproducibility.
The goal of this project is to formulate, report on, and solve foundational problems related to the value, status, trends, and preservability of OADS for publicly available scholarly papers and electronic theses and dissertations (ETDs), and thereby to enable and ensure progress toward sustainable computational reproducibility. To this end, we will focus on building machine learning models and datasets that cover three key aspects of OADS: availability (whether OADS-URLs appear in scholarly works), discoverability (whether OADS-URLs are alive on the web or in the archive), and accessibility (whether OADS are accessible through OADS-URLs). Figure 1 illustrates our main contributions, including validating the three aspects, the OADS Repo dataset to be built, analytical and prediction models, and dissemination platforms (CiteSeerX and Internet Archive).
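To make the three aspects concrete, here is a minimal Python sketch of how they relate to one another. It is only an illustration under stated assumptions, not the project's method: the project plans to use machine learning models, whereas this sketch uses a simple regular expression and caller-supplied checks. The names `extract_urls`, `assess`, `OadsStatus`, `is_live`, `is_archived`, and `can_download` are all hypothetical, and the sketch treats every URL as a candidate OADS-URL, which a real system would not.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a simple pattern for http(s) URLs in extracted full text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> List[str]:
    """Availability: collect the URLs that appear in a document's text."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in urls:         # keep first occurrence, preserve order
            urls.append(url)
    return urls


@dataclass
class OadsStatus:
    url: str
    available: bool     # the URL appears in the scholarly work
    discoverable: bool  # the URL is alive on the web or in an archive
    accessible: bool    # the resource can be obtained through the URL


def assess(
    url: str,
    is_live: Callable[[str], bool],       # hypothetical live-web check
    is_archived: Callable[[str], bool],   # hypothetical web-archive check
    can_download: Callable[[str], bool],  # hypothetical content check
) -> OadsStatus:
    """Combine the three aspects for a URL already found in a document."""
    discoverable = is_live(url) or is_archived(url)
    # A resource can only be accessible if its URL is discoverable.
    accessible = discoverable and can_download(url)
    return OadsStatus(url, True, discoverable, accessible)
```

The checks are passed in as functions so the sketch stays independent of any particular HTTP client or archive service; a caller could plug in a live-web request and a web-archive lookup of their choice.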
This project builds in part on a prior IMLS proposal on mining ETDs, during which we collected a corpus of more than 500K ETDs from institutional repositories hosted by U.S. university libraries. It also builds on our preliminary work:
- Salsabil et al. (2022): A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software
- Ajayi et al. (2023): A Study on Reproducibility and Replicability of Table Structure Recognition Methods
- Escamilla et al. (2023): It’s Not Just GitHub: Identifying Data and Software Sources Included in Publications
The proposed research project will have a national impact as the first investigation of computational reproducibility from the perspectives of availability, discoverability, and accessibility of OADS and linked resources, carried out by mining the full text of scholarly papers and ETDs. Because OADS-URLs are ubiquitous in scholarly papers and ETDs, the improved computational reproducibility could help users across many disciplines access OADS, and could inform best practices for preserving OADS for sustainable access.
-- Jian Wu