2023-08-31: The End of a Chapter

I joined the WS-DL (Web Science and Digital Libraries) Research Group at Old Dominion University in the Fall of 2021. On July 25, 2023, I successfully defended my thesis and will have officially earned my Master's Degree as of the end of August 2023. Before I continue, I would like to thank Dr. Nelson and Dr. Weigle, my advisors, and Dr. Wu, a member of my thesis committee, for their guidance and feedback throughout this process. This would not have been possible without them and many others! 

My Master's thesis was titled "Assessing the Prevalence and Archival Rate of URIs to Git Hosting Platforms in Scholarly Publications". Reference rot in scholarly publications has been well documented by studies including "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot" by Martin Klein et al. (WSDL alum) and "Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content" by Shawn Jones et al. (WSDL alum). Anecdotally, we as researchers have noticed an increase in the number of links to Git Hosting Platforms (GHPs) within publications. How does reference rot affect GHP URIs within scholarly publications? Before we could answer that question, we first needed to understand whether and how often scholars include GHP URIs in their work. We published our results in "The Rise of GitHub in Scholarly Publications". We extracted over 200k GHP URIs from 2.6 million publications, with 92% of GHP URIs being URIs to GitHub. We also found that the prevalence of GHP URIs has grown steadily and that one in five arXiv articles published in 2021 contained at least one GHP URI.
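
To give a rough feel for that first step, here is a minimal sketch of pulling GHP URIs out of a publication's text and computing each platform's share. It is not the code from the thesis (that lives in the repository linked later in this post). The regular expression, the hostname table, and the function names `extract_ghp_uris` and `platform_share` are all assumptions made for illustration, and the sketch only matches the four platforms' main hostnames.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed hostname table for the four platforms studied; project subdomains
# (for example, *.sourceforge.net) are deliberately not handled here.
GHP_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "sourceforge.net": "SourceForge",
    "bitbucket.org": "Bitbucket",
}

# Deliberately simple URI pattern (an assumption): http(s) followed by
# anything that is not whitespace, a quote, or a bracket.
URI_PATTERN = re.compile(r"https?://[^\s<>\"'()\[\]]+")


def extract_ghp_uris(text):
    """Return (platform, uri) pairs for every GHP URI found in the text."""
    found = []
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:")  # drop trailing sentence punctuation
        host = (urlparse(uri).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        platform = GHP_HOSTS.get(host)
        if platform is not None:
            found.append((platform, uri))
    return found


def platform_share(uris):
    """Return each platform's fraction of the extracted GHP URIs."""
    counts = Counter(platform for platform, _ in uris)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {platform: count / total for platform, count in counts.items()}
```

Running `extract_ghp_uris` over the text of each publication and feeding the combined list to `platform_share` would give a per-platform breakdown in the same spirit as the 92% GitHub figure above.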

Now that we knew that scholars were including GHP URIs, we wanted to investigate whether these URIs were still available on the live Web and whether these resources had been archived. We investigated the holdings of both Web archives (using MemGator) and Software Heritage to determine if a GHP URI had been archived. We found that 13.3% of GHP URIs are vulnerable, meaning that they are currently available on the live Web, but they have not been archived. Therefore, they are vulnerable to being lost if the URI ceases to be publicly available on the live Web.
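
The definition of "vulnerable" boils down to a simple rule, sketched below. This is only an illustration: the `UriStatus` record and its field names are hypothetical, the three boolean flags are assumed to have been filled in beforehand by separate live-Web and archive lookups, and every label other than "vulnerable" is my own shorthand rather than a term from the thesis.

```python
from dataclasses import dataclass


@dataclass
class UriStatus:
    """Hypothetical record of what is known about one GHP URI."""
    uri: str
    is_live: bool               # currently available on the live Web
    in_web_archive: bool        # at least one capture in a Web archive
    in_software_heritage: bool  # present in Software Heritage


def classify(status):
    """Label a URI by combining its live-Web and archival status."""
    archived = status.in_web_archive or status.in_software_heritage
    if status.is_live and not archived:
        return "vulnerable"  # the only label defined in the post
    if status.is_live:
        return "live and archived"  # illustrative label
    if archived:
        return "archived only"  # illustrative label
    return "not live and not archived"  # illustrative label


def vulnerable_rate(statuses):
    """Return the fraction of URIs that are live but not archived."""
    if not statuses:
        return 0.0
    vulnerable = sum(1 for s in statuses if classify(s) == "vulnerable")
    return vulnerable / len(statuses)
```

Treating a capture in either a Web archive or Software Heritage as "archived" mirrors the description above, where both sources were checked before a URI was counted as vulnerable.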

Our previous two research questions investigated GitHub, GitLab, SourceForge, and Bitbucket. But we knew that these were not the only platforms that scholars used to host their software and data products. What other platforms were scholars using? To answer this question, we teamed up with Dr. Wu and Lamia Salsabil to use their machine learning classifier to identify Open Access Data and Software (OADS) URIs from the corpora. Our results demonstrated the utility of a machine learning classifier in this application for two reasons: 1) we found more non-GHP OADS URIs than GHP URIs and 2) the non-GHP OADS URIs contained almost 50,000 unique hostnames. The prevalence of OADS URIs indicates the importance of capturing these URIs in addition to GHP URIs when analyzing URIs to scholarly products. The large number of unique hostnames demonstrates the need for a solution that identifies OADS URIs without relying on manual enumeration, which does not scale. The results of this study were published in "It’s Not Just GitHub: Identifying Data and Software Sources Included in Publications".

Lastly, a study on the relationship between GitHub repository engagement metrics and publication impact inspired us to analyze the relationship between GitHub repository engagement metrics and archival rate. We specifically looked at forks, subscribers, and stargazers individually in relation to archival rate in Web archives and Software Heritage. We found that there was a statistically significant relationship between each of the three engagement metrics and archival rate in both Web archives and Software Heritage. 

I have embedded the defense recording and the slides used in my defense presentation below. The full thesis document will be available in ODU Digital Commons. All source code and resulting data are publicly available at https://github.com/oduwsdl/Extract-URLs.

To wrap up this blog post, I want to share a few things I wish I had known when I began this program:

1. Your advisors aren't going to give you the answers

The whole point of research is that you are pushing the bounds of human knowledge, and that will eventually include your advisors' knowledge. They are there to guide your research and ask questions you never thought of. You won't know all of the answers to the questions they ask. But your job is to keep asking "why?". Did you find a cool trend? Find out what might have caused it. Did you find an anomaly? Dig into the data to find the source. Then, when it's time for your weekly meeting, show them what you found. More often than not, they'll ask a question or make a point you had not thought of. That's why they get paid the big bucks. And you will be better for it. If you keep asking "why?" and trying to find answers to their questions, you'll eventually end up with all the material you could ever want for your thesis.

2. Write EVERYTHING down

In case someone has not told you yet, write EVERYTHING down. Even that thing you think you will remember, the meeting notes you are sure you will not need, and the background to that code that you think is self-explanatory. Pretend that someone is coming along after you in 6 months who will need to understand everything you did. Because someone will, and that person is you. As you try to finish a conference submission or your thesis, you will thank yourself because you will not be spending 3 hours trying to figure out what that regular expression captures and why you put it there in the first place.

3. You're going to write... a lot

Before I started grad school, I heard someone talk about "paper season" and I am now fully convinced that they were lying. My entire degree was spent with the next conference submission in mind. As someone who has never enjoyed writing, I found that challenging. Through the course of my research, I never learned to like writing, but I became more and more convinced that my research had found cool things and I wanted to tell people about my results. My excitement over my results began to outweigh my dislike for writing. And, just as important, I found a process that worked for me. So, find your process. Find a coffee shop you like and start writing. When you think you are done, send it to others for their feedback. If any of my publications have been even remotely good, Talya Cooper, Dr. Weigle, Dr. Nelson, and Martin Klein are to thank. I know that writing is not my strength, so I often relied on their input to move the needle from "decent" to "great". Whatever process works for you, always create something you and all your coauthors are proud to put your names on.

4. Learn from your paper rejections

"Don't take a paper rejection personally". You have probably heard this before and, while true, I found it unrealistic. After I spent countless hours coding a solution, running results, creating figures, detailing the methodology, and staying up late to implement feedback, it was impossible to not be disappointed by a rejection from a conference. Instead, I chose to learn from it.

"How could they possibly have gotten that from what I said?" --> I need to clarify this point
"Did they even read my paper?" --> I need to state my conclusion more prominently

I learned how to use even the most frustrating comments to make my paper better. Yes, the comments and the rejection feel personal, but use them to make your work better. Then, when you do get that paper accepted, it will be even sweeter. 

5. Answer the call (for papers)

"You miss 100% of the shots you don't take". Do your advisors think you have an outside shot of getting accepted at that conference? Do you see a Call for Papers for a conference in your field? Go for it! 

Just like writing, presenting is a valuable and necessary way to communicate results to the research community. And, the more you practice, the better you will get. Conferences can give you the opportunity to not only present your work, but, also, to collaborate with other researchers in your field. 
So, take every opportunity you can get to present your work. In April 2022, I presented the early stages of my research virtually at the annual WSDL Research Expo. In June, I gave a prerecorded lightning talk at IIPC WAC 2022. In July, I gave a live, virtual presentation at WADL 2022. With each presentation, I became more comfortable presenting my work and fielding audience questions. These conferences also allowed me to learn about the work other researchers were doing in the field and gave me ideas for the direction of my research. 

In September 2022, I had the opportunity to present my research at TPDL 2022 in Padua, Italy. This was my first in-person conference and my first in-person presentation. In May 2023, I presented my work in an individual presentation and as part of a panel at IIPC WAC 2023 in Hilversum, Netherlands. Taking every opportunity to present my work early in my research made me more prepared to present my work at larger venues. 

So, submit your work to the call for papers. You never know if your work will be accepted if you never try. Plus, you may get a few trips to Europe out of it!

6. Surround yourself with great people

I lucked into this, but surrounding yourself with great people will help you to not only survive but to enjoy your time as a grad student. I somehow managed to be a part of both the best research group and the best project. For most of my time as a grad student, I worked in the WSDL lab with Tarannum, David, Himarsha, Caleb, and other WSDL members. Being able to share in the stress of a conference deadline or brainstorm with people working on similar problems made the not-so-fun parts of grad school so much more enjoyable.

I worked on the Collaborative Software Archiving for Institutions (CoSAI) Project. As part of this team, I worked with Martin Klein (WSDL alum), Lyuda Balakireva, Vicky Rampin, Talya Cooper, David Calano (current WSDL member), Dr. Weigle, and Dr. Nelson. There is something special about working with people who love their job and are passionate about their field; it helps you stay excited about what you are working on.

All of these people helped me to not only complete my Master's degree with their knowledge, feedback, and encouragement, but also made it an enjoyable experience. Grad school is hard, so find great people that make it fun!

- Emily Escamilla (@EmilyEscamilla_)