2022-02-23: One in Five arXiv Articles Reference GitHub
Starting in Fall 2021, I've had the opportunity to work on the CoSAI Project under the guidance of Dr. Martin Klein, Dr. Michael Nelson, and Dr. Michele Weigle. The CoSAI Project is working to preserve web-based scholarship, including source code. The goal of the project is to make the archival process more accessible to institutions by creating a curation workflow that facilitates it.
As part of the project, we wanted to find a set of code repository URIs that were referenced in scholarly publications. To do this, we decided to extract URIs from PDFs in the arXiv corpus, which now includes more than 2 million papers. We focused on a corpus of 1.56 million PDFs from April 2007 to November 2021.
During an internship at LANL in Summer 2021, Yasith Jayawardana created code that robustifies URIs found in PDFs. Part of that code extracts URIs from PDFs, using PyPDFium2 to extract annotated URIs and PyPDF2 to extract URIs in the text. I leveraged this part of Yasith's code to extract URIs from all PDF publications in our arXiv corpus. Some submissions had more than one version. For those, we determined that the latest version was likely representative of what the author intended and used it for our analysis. By extracting URIs from the PDFs, we found 4,039,772 in-scope URIs.
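The "keep only the latest version" step can be illustrated with a small sketch. This is not the project's actual code: it assumes, hypothetically, that each PDF is named "<arXiv id>v<version>.pdf" (for example "2101.00001v2.pdf"), and the function name latest_versions is made up for illustration.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed (hypothetical) naming scheme: "<arxiv_id>v<version>.pdf",
# e.g. "2101.00001v2.pdf". The real corpus layout may differ.
_VERSIONED = re.compile(r"^(?P<base>.+)v(?P<version>\d+)\.pdf$")


def latest_versions(filenames: Iterable[str]) -> Dict[str, str]:
    """Map each arXiv id to the filename of its highest-numbered version."""
    best: Dict[str, Tuple[int, str]] = {}
    for name in filenames:
        match = _VERSIONED.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        base = match.group("base")
        version = int(match.group("version"))  # compare numerically: v10 > v9
        if base not in best or version > best[base][0]:
            best[base] = (version, name)
    return {base: name for base, (_, name) in best.items()}


if __name__ == "__main__":
    files = ["2101.00001v1.pdf", "2101.00001v3.pdf", "0704.0001v1.pdf"]
    print(latest_versions(files))
```

Comparing versions as integers rather than strings matters once a submission reaches ten or more versions.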
Within the CoSAI team, we decided to focus on four Git Hosting Platforms (GHPs) at this stage in the project: GitHub, GitLab, SourceForge, and Bitbucket. I extracted the URIs that belonged to the four platforms from the list of all URIs in the arXiv corpus.
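A minimal sketch of how URIs might be assigned to the four platforms is shown here. It is an illustration under assumptions, not the method we actually used: the domain list, the function name classify_ghp, and the choice to count subdomains (such as gist.github.com) as belonging to the platform are all hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed mapping from registered domain to platform name.
GHP_DOMAINS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "sourceforge.net": "SourceForge",
    "bitbucket.org": "Bitbucket",
}


def classify_ghp(uri: str) -> Optional[str]:
    """Return the platform a URI belongs to, or None if it matches none."""
    # URIs pulled from PDF text often lack a scheme; add one so the
    # hostname can be parsed.
    if "://" not in uri:
        uri = "http://" + uri
    try:
        host = urlsplit(uri).hostname
    except ValueError:
        return None  # malformed URI
    if not host:
        return None
    host = host.lower()
    for domain, platform in GHP_DOMAINS.items():
        # Match the domain itself or any subdomain, but not look-alikes
        # such as "notgithub.com".
        if host == domain or host.endswith("." + domain):
            return platform
    return None


if __name__ == "__main__":
    for uri in ["https://github.com/user/repo", "www.example.org/page"]:
        print(uri, "->", classify_ghp(uri))
```

Matching on the parsed hostname rather than searching for a substring avoids counting URIs that merely mention a platform name in their path or query string.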
In the process of extracting URIs for the four repository platforms, we made a number of interesting observations. For starters, one would expect the prevalence of URIs in scholarly articles to have increased as scholarly ephemera are increasingly found on the Web. Our data confirm this: scholarly documents have increasingly included URIs, from an average of 1.0 URI per publication in 2007 to an average of 4.69 URIs per publication in 2021. The average number of in-scope URIs per publication is shown by the blue line in the line chart below. Klein et al. reported a similar trend in 2014. They looked at URI use in three corpora (arXiv, Elsevier, and PMC) from 1997 to 2012 and found that the number of URIs used in scholarly publications increased rapidly during that period.
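The per-year average described above could be computed along these lines. This is a rough sketch with a made-up function name and toy input, not our analysis code or data. It assumes one (year, URI count) record per publication, and that publications with zero in-scope URIs are included in the input if they should count toward the average.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_uris_per_year(records: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Average number of URIs per publication, grouped by year.

    Each record is (publication_year, number_of_in_scope_uris).
    """
    totals: Dict[int, int] = defaultdict(int)  # sum of URI counts per year
    counts: Dict[int, int] = defaultdict(int)  # number of publications per year
    for year, n_uris in records:
        totals[year] += n_uris
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(counts)}


if __name__ == "__main__":
    # Toy records for illustration only; these are not values from the study.
    toy = [(2007, 1), (2007, 1), (2021, 4), (2021, 6)]
    print(average_uris_per_year(toy))
```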