2022-02-23: One in Five arXiv Articles Reference GitHub



Starting in Fall 2021, I've had the opportunity to work on the CoSAI Project under the guidance of Dr. Martin Klein, Dr. Michael Nelson, and Dr. Michele Weigle. The CoSAI Project is working to preserve web-based scholarship, including source code. The goal of the project is to make archiving more accessible to institutions by creating a curation workflow that facilitates the process.

As part of the project, we wanted to find a set of code repository URIs that were referenced in scholarly publications. To do this, we decided to extract URIs from PDFs in the arXiv corpus, which now includes more than 2 million papers. We focused on a corpus of 1.56 million PDFs submitted from April 2007 to November 2021.

During an internship at LANL in Summer 2021, Yasith Jayawardana created code that robustifies URIs found in PDFs. Part of that code extracts URIs from PDFs, using PyPDFium2 to extract annotated URIs and PyPDF2 to extract URIs in the text. I leveraged this part of Yasith's code to extract URIs from all PDF publications in our arXiv corpus. Some submissions had more than one version. For those, we determined that the latest version was likely representative of what the author intended, so we used the latest version for our analysis. By extracting URIs from the PDFs, we found 4,039,772 in-scope URIs.
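To make the two steps above concrete, here is a minimal sketch of picking the latest version of each submission and pulling URIs out of already-extracted text. It is not Yasith's code and does not use PyPDFium2 or PyPDF2; it relies only on the Python standard library. The file-name pattern ("<identifier>v<version>.pdf"), the URI regular expression, and the function names are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed (hypothetical) file-name pattern: "<identifier>v<version>.pdf",
# e.g. "2101.00001v2.pdf". The real corpus layout may differ.
VERSIONED_NAME = re.compile(r"^(?P<base>.+?)v(?P<version>\d+)\.pdf$")

# Deliberately simple URI pattern for illustration; it only finds
# http(s) URIs and stops at whitespace, quotes, and angle brackets.
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def latest_versions(filenames: Iterable[str]) -> Dict[str, str]:
    """Map each base identifier to the file name of its highest version."""
    best: Dict[str, Tuple[int, str]] = {}
    for name in filenames:
        match = VERSIONED_NAME.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        base = match.group("base")
        version = int(match.group("version"))  # compare numerically, so v10 > v2
        if base not in best or version > best[base][0]:
            best[base] = (version, name)
    return {base: name for base, (_, name) in best.items()}


def extract_uris(text: str) -> List[str]:
    """Return http(s) URIs found in text, without trailing punctuation."""
    uris = []
    for match in URI_PATTERN.finditer(text):
        # Sentence punctuation often sticks to the end of a URI in prose.
        uris.append(match.group(0).rstrip(".,;:)]}"))
    return uris
```

Comparing versions as integers matters here: a plain string sort would rank "v2" above "v10".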

Within the CoSAI team, we decided to focus on four Git Hosting Platforms (GHPs) at this stage in the project: GitHub, GitLab, SourceForge, and Bitbucket. I extracted the URIs that belonged to the four platforms from the list of all URIs in the arXiv corpus. 
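A filter like this can be sketched by matching each URI's host name against the four platforms. The snippet below is an illustration under stated assumptions, not the project's actual code: the domain list, the choice to count subdomains such as gist.github.com, and the name classify_ghp are mine. Related domains such as github.io are not matched in this sketch.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed mapping from registered domain to platform name.
GHP_DOMAINS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "sourceforge.net": "SourceForge",
    "bitbucket.org": "Bitbucket",
}


def classify_ghp(uri: str) -> Optional[str]:
    """Return the platform a URI belongs to, or None if it matches none."""
    try:
        host = urlsplit(uri).hostname  # already lowercased by urlsplit
    except ValueError:
        return None  # malformed URI, e.g. a broken IPv6 literal
    if host is None:
        return None
    for domain, platform in GHP_DOMAINS.items():
        # Match the domain itself or any subdomain of it (an assumption).
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```

Matching on the parsed host name, rather than searching the whole URI for "github.com", avoids counting look-alike hosts or URIs that merely mention a platform in their path.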

In the process of extracting URIs for the four repository platforms, we made a number of interesting observations. For starters, one would expect URIs to become more prevalent in scholarly articles as scholarly ephemera is increasingly found on the Web. Our data supports this: as shown in the figure below, the average number of in-scope URIs per publication rose from 1.0 in 2007 to 4.69 in 2021. This average is shown by the blue line in the line chart below. Klein et al. reported the same trend in 2014. They looked at URI use in three corpora (arXiv, Elsevier, and PMC) from 1997 to 2012 and found that the number of URIs used in scholarly publications rapidly increased during that period.

As the number of URIs in publications increases over time, the number of GHP URIs follows.
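For readers who want to reproduce a per-year average of this kind on their own data, a minimal sketch follows. The input format (one (year, URI count) pair per publication) and the function name are assumptions, and the numbers in the usage note are toy values, not our results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_uris_per_year(records: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Average URI count per publication, grouped by year.

    records holds one (year, uri_count) pair per publication.
    """
    uri_totals: Dict[int, int] = defaultdict(int)
    publication_counts: Dict[int, int] = defaultdict(int)
    for year, uri_count in records:
        uri_totals[year] += uri_count
        publication_counts[year] += 1
    # Every year in uri_totals has at least one publication, so no division by zero.
    return {
        year: uri_totals[year] / publication_counts[year]
        for year in sorted(uri_totals)
    }
```

With the toy input [(2007, 1), (2007, 1), (2021, 4), (2021, 6), (2021, 5)], this returns {2007: 1.0, 2021: 5.0}. Publications with zero URIs need to be included in the input for the average to be per publication rather than per URI-containing publication.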

As the prevalence of URIs in general has increased, so has the number of references to repository platforms from 2007 to 2021. The figure below shows a closer view of the data shown above by focusing only on references to the repository platforms. The figure further shows that, in 2021, 1 in 5 publications contained a GitHub reference. References to GitHub have steadily risen from 2014 to 2021, while the frequency of references to the other three platforms has remained low. Across the arXiv corpus, GitLab was referenced by 2,648 URIs, Bitbucket was referenced by 3,525 URIs, and SourceForge was referenced by 9,412 URIs. GitHub was referenced far more than the other three platforms, with 215,621 URIs.

In 2021, one in five publications contained a URI to GitHub.

Additionally, while a strong majority of publications reference a given repository platform only once, a surprising number of publications reference a given platform's holdings more than once. For example, of the 125,711 publications that reference GitHub, 83,328 publications (66.3%) reference GitHub once, 42,383 publications (33.7%) reference GitHub more than once, and 3,757 publications (3.0%) reference GitHub more than five times. As shown in the figure below, all four platforms show the same long-tail distribution, with a majority of publications containing one reference to the platform.

Most publications reference a repository platform once. Note: There was a publication that referenced GitHub 896 times. However, it was excluded from the visualization to not skew the representation of the data. 
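The breakdown above (once, more than once, more than five times) can be computed from a list of per-publication reference counts. The sketch below is illustrative only; the function names and input format are assumptions. The percentage helper can also be used to re-derive the GitHub shares from the counts reported in this post.

```python
from collections import Counter
from typing import Dict, Iterable


def reference_distribution(refs_per_publication: Iterable[int]) -> Dict[str, int]:
    """Summarize how often publications reference a platform.

    refs_per_publication holds one positive count per publication
    that references the platform at least once.
    """
    histogram = Counter(refs_per_publication)
    total = sum(histogram.values())
    once = histogram[1]
    return {
        "total": total,
        "once": once,
        "more_than_once": total - once,
        "more_than_five": sum(n for refs, n in histogram.items() if refs > 5),
    }


def percent(part: int, whole: int) -> float:
    """Share of whole represented by part, as a percentage rounded to 0.1."""
    return round(100 * part / whole, 1)
```

For example, percent(83328, 125711) gives 66.3 and percent(42383, 125711) gives 33.7, matching the GitHub figures above.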


Scholarly publications are increasingly referencing the Web and software repository platforms. Referencing a URI implies that the content behind it is an essential part of the context of the scholarly publication. This underscores the importance of our goal of archiving scholarly ephemera, including the source code found in repository platforms.

These findings, while interesting on their own, raise other questions. Which code repositories are referenced in multiple publications? Are these publications written by the same author group or by different authors? Are the repositories that are referenced in the publications available on the live Web? Are the repositories available in the archives?

I'm looking forward to what we will uncover as we investigate the use of references to repository platforms in scholarly publications. 

- Emily Escamilla











