2025-10-10: An Internship Experience With the Internet Archive as a Google Summer of Code Contributor

 


In the summer of 2025, I was selected for Google Summer of Code (GSoC), a program that introduces new contributors to open source software development. I had the opportunity to contribute to the Internet Archive, an organization I have long admired for its efforts to preserve digital knowledge for all.


Many open source organizations participate in the program each year as mentoring organizations (2025 mentoring organizations), including the Internet Archive. As a GSoC contributor, I was mentored by Dr. Sawood Alam, Research Lead of the Wayback Machine and WS-DL alum. Over the coding period, our project focused on detecting social media content in TV news, specifically through logo and screenshot detection. My work as a contributor is documented in my previous blog post, while this post highlights the GSoC program and my experience in it.

Becoming a GSoC Contributor

GSoC is open to beginners in open source, students and non-students alike, who meet a few basic requirements: you must be at least 18 years old at the time of registration, be a student or a newcomer to open source, be eligible to work in your country of residence during the program, and not reside in a country currently under U.S. embargo. The application process involves exploring project ideas listed by mentoring organizations, drafting a proposal, and submitting it to Google for review. The project ideas are published on each organization’s page, and contributors can choose one (or more) of these ideas to develop into a proposal. Alternatively, you can propose your own project idea (the option I chose) that may be of interest to the organization you are applying to. Contributors are encouraged to share their drafts with mentors from the organization to get feedback before submitting to Google. Once accepted, contributors spend the summer coding under the guidance of a mentor.

Working on My Project

Information diffusion on social media has been widely studied, but little is known about how social media is referenced in traditional TV news. Our project addresses this gap by detecting social media logos and screenshots of user posts in TV news broadcasts.


My original proposal to GSoC involved training object detection and image classification models. However, we then pivoted to using large language models (LLMs), specifically ChatGPT-4o, for logo and screenshot detection. This change was worthwhile: we realized that LLMs could perform logo and screenshot detection with significantly less manual data labeling and setup than traditional machine learning approaches. It also taught me to stay flexible and adapt my methods when needed.


You can find the final work product here:

https://summerofcode.withgoogle.com/programs/2025/projects/j0CKIRCi


And a blog post on the technical details here:

https://ws-dl.blogspot.com/2025/09/2025-09-29-summer-project-as-google.html


This was my first time working with LLMs. I learned a lot, and I am still learning how to create effective prompts and integrate the model into a functional pipeline.


Beyond coding, GSoC taught me several valuable lessons. It is really important to stay flexible and to communicate regularly with your mentors. It is also crucial to prioritize your work, deferring non-critical tasks to future work so that you maintain steady progress. And of course, effective time management is key, since juggling work and life requires careful planning.

The Best Part

For me, the most exciting part of GSoC was working with the Internet Archive team. I had weekly meetings with my mentors: Dr. Sawood Alam, my assigned GSoC mentor, and Will Howes, a Software Engineer at the Internet Archive. Will was mentoring two other GSoC students who joined the same sessions. Both mentors were very helpful, very responsive on Slack, and always ready to offer advice when needed. The Internet Archive leadership, including Mark Graham, the Director of the Wayback Machine, and Roger Macdonald, the Founder of the TV News Archive, created a welcoming environment for contributors and always made sure we had the resources we needed.

Being added to the TV News Archive guest Slack channel and invited to join the weekly TV News Archive team meetings during the summer were great opportunities for me as a student researcher interested in this field. It was nice to observe how the team curates and preserves broadcast news content, and to learn about their ongoing projects.

Final Thoughts

GSoC was more than just a coding program; it was a huge opportunity for me to learn from great mentors and contribute to the open source community. I hope to stay involved with the Internet Archive and its team. The technical and collaborative experience I gained, especially from working with LLMs, boosted my skills and confidence as a student researcher. Finally, being selected as a GSoC contributor was a great experience, not to mention a notable addition to my resume, and I would definitely consider applying again.


~ Himarsha Jayanetti (HimarshaJ)

