2025-10-10: An Internship Experience With the Internet Archive as a Google Summer of Code Contributor
In the summer of 2025, I was selected for Google Summer of Code (GSoC), a program that introduces new contributors to open source software development. I had the opportunity to contribute to the Internet Archive, an organization I have long admired for its efforts to preserve digital knowledge for all.
Becoming a GSoC Contributor
GSoC is open to beginners in open source (students and non-students alike) who meet a few basic requirements: you must be at least 18 years old at the time of registration, be a student or a newcomer to open source, be eligible to work in your country of residence during the program, and not reside in a country currently under U.S. embargo. The application process begins by exploring project ideas listed by mentoring organizations, drafting a proposal, and submitting it to Google for review. The project ideas are published on each organization’s page, and contributors can choose one (or more) of these ideas to develop into a proposal. Alternatively, you can propose your own project idea (the option I chose) that may be of interest to the organization you are applying to. Contributors are encouraged to share their drafts with mentors from the organization to get feedback before submitting to Google. Once accepted, contributors spend the summer coding under the guidance of a mentor.
Working on My Project
Information diffusion on social media has been widely studied, but little is known about how social media is referenced in traditional TV news. Our project addresses this gap by detecting social media logos and screenshots of user posts in news broadcasts.
My original proposal to GSoC involved training object detection and image classification models. However, we then pivoted to using large language models (LLMs), specifically ChatGPT-4o, for logo and screenshot detection. This change was worthwhile, as we realized that LLMs could perform logo and screenshot detection with significantly less manual data labeling and setup than traditional machine learning approaches. It also taught me to stay flexible and adapt my methods when needed.
You can find the final work product here:
https://summerofcode.withgoogle.com/programs/2025/projects/j0CKIRCi
And a blog post on the technical details here:
https://ws-dl.blogspot.com/2025/09/2025-09-29-summer-project-as-google.html
This was my first time working with LLMs. I have learned a lot and am still learning about creating effective prompts and integrating the model into a functional pipeline.
Beyond coding, GSoC taught me several valuable lessons. It is really important to stay flexible and to communicate regularly with your mentors. It is also crucial to prioritize your work, deferring non-critical tasks to future work in order to maintain steady progress. And of course, effective time management is key, since juggling work and life requires careful planning.
The Best Part
Being added to the TV News Archive guest Slack channel and invited to join the weekly TV News Archive team meetings during the summer were great opportunities for me as a student researcher interested in this field. It was nice to observe how the team curates and preserves broadcast news content, and to learn about their ongoing projects.
Final Thoughts
GSoC was more than just a coding program: it was a huge opportunity for me to learn from great mentors and contribute to the open source community. I hope to stay involved with the Internet Archive and its team. The technical and collaborative skills I gained, especially from working with LLMs, boosted my confidence as a student researcher. Finally, being selected as a GSoC contributor was a great experience, not to mention a notable addition to my resume. I would definitely consider applying again.
~ Himarsha Jayanetti (HimarshaJ)