2025-09-29: Summer Project as a Google Summer of Code (GSoC) Contributor

This summer (summer of 2025), I got the opportunity to be a Google Summer of Code (GSoC) contributor. As a GSoC contributor, I worked with the TV News Archive at the Internet Archive under the mentorship of Dr. Sawood Alam, Research Lead of the Wayback Machine. Over the 12-week coding period, our project focused on detecting social media content in TV news, specifically through logo and post screenshot detection.

Introduction 

Information diffusion in social media has been well studied. For example, social and political scientists have tracked how social movements like #MeToo spread on social media and estimated the political leanings of social media users. One area that has yet to be studied, however, is how social media is referenced in conventional broadcast television news.


Our study aims to address this gap by analyzing TV news broadcasts for references to social media. These references can take many forms: on-screen text, verbal mentions by the hosts/speakers, or visual objects displayed on the screen (Figure 1). In this GSoC project, our focus was on the visual objects, specifically detecting social media logos and screenshots of user posts appearing in TV news.



Figure 1: Different representations of social media on TV news (On-screen text, verbal mentions by the hosts/speakers, and visual objects such as logos and screenshots)

Datasets

TV News Data

The Internet Archive’s TV News Archive provides access to over 2.6 million U.S. news broadcasts dating back to 2009. Each broadcast at the TV News Archive is uniquely identified by an episode ID that encodes the channel, show name, and timestamp.

For example, the episode ID:


FOXNEWSW_20230711_040000_Fox_News_at_Night 


corresponds to the show link: 

https://archive.org/details/FOXNEWSW_20230711_040000_Fox_News_at_Night


A show within the TV News Archive is divided into 1-minute clips, each accessible via start and end arguments in the URL path, where time is specified in seconds:


https://archive.org/details/FOXNEWSW_20230711_040000_Fox_News_at_Night/start/0/end/60 
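
Given an episode ID and a starting offset in seconds, these clip URLs can be constructed programmatically. The helper below is a minimal sketch based on the URL pattern above (the function name is ours):

def clip_url(episode_id: str, start_sec: int, duration: int = 60) -> str:
    """Build the archive.org URL for a 1-minute clip of a TV News Archive episode."""
    # Clips are addressed by start/end offsets (in seconds) appended to the episode path.
    return (f"https://archive.org/details/{episode_id}"
            f"/start/{start_sec}/end/{start_sec + duration}")

# Example: the second minute of the Fox News at Night episode shown above.
print(clip_url("FOXNEWSW_20230711_040000_Fox_News_at_Night", 60))
# -> https://archive.org/details/FOXNEWSW_20230711_040000_Fox_News_at_Night/start/60/end/120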


For our visual detection tasks, we used full-resolution frames extracted every second throughout each broadcast, provided by Dr. Kalev Leetaru at The GDELT Project.

Our Sample Data (Selected Episodes)

Between 2020 and 2024, we sampled one day per year during primetime hours (8–11pm) across three major cable news channels (Fox News, MSNBC, and CNN), resulting in 45 episodes.


After excluding 9 episodes consisting of documentaries or special programs (which fall outside the scope of regular prime-time news coverage), the final sample contained 36 news episodes. Of these, 15 were from Fox News, 12 from MSNBC, and 9 from CNN. Table 1 presents the full list of episodes, with the excluded episodes highlighted in red.

Gold Standard Dataset 

To create the gold standard dataset, we labeled each 1-second frame of every episode with the presence of a logo and/or screenshot, along with the corresponding social media platform name. The labeling process was facilitated by a previously compiled dataset, which I had created through manual review of TV news broadcasts. In that dataset, each 60-second clip was annotated for any social media references, including on-screen text, host mentions, or visual elements such as logos and screenshots. By cross-referencing these annotations, we constructed the gold standard dataset. The gold standard dataset includes only those frames that contain at least one social media reference (either a logo or a screenshot), rather than every second of an episode.


Below is a snippet of the gold standard for CNN episodes (Figure 2). 


Figure 2: A snippet of the gold standard for CNN episodes


Each row represents a single frame and is described by five columns. The filename format is episode ID-seconds.


For example,

CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg


CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon is the episode ID and 000136.jpg indicates the frame taken at the 136th second of that 60-minute episode.
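
Because the filename encodes both the episode ID and the second offset, it can be split programmatically. The snippet below is a small illustrative sketch (the function name is ours):

def parse_frame_filename(filename: str) -> tuple[str, int]:
    """Split 'episodeID-seconds.jpg' into the episode ID and the second offset."""
    stem = filename.rsplit(".", 1)[0]          # drop the .jpg extension
    episode_id, seconds = stem.rsplit("-", 1)  # the offset follows the last hyphen
    return episode_id, int(seconds)

episode_id, second = parse_frame_filename(
    "CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg")
# episode_id == "CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon", second == 136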


The Logo and Screenshot columns indicate their presence, while their Type columns specify the platform. 


For example, the entry CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg shows that the frame contains both a Twitter logo and a Twitter screenshot. 


The complete gold standard datasets for all three channels can be accessed via the following links:


CNN:  https://github.com/internetarchive/tvnews_socialmedia_mentions/blob/main/GoldStandardDataset/Labels_for_Images/gold_standard_images_cnn.csv


Fox News: https://github.com/internetarchive/tvnews_socialmedia_mentions/blob/main/GoldStandardDataset/Labels_for_Images/gold_standard_images_foxnews.csv


MSNBC:  https://github.com/internetarchive/tvnews_socialmedia_mentions/blob/main/GoldStandardDataset/Labels_for_Images/gold_standard_images_msnbc.csv

Logo and Screenshot Detection Process

We implemented a system to automatically detect social media logos and user post screenshots in television news frames using the ChatGPT API with GPT-4o, a multimodal model capable of processing both text and image input. The workflow is summarized below.

1. System Setup

We accessed the API of the GPT-4o model (using an access token provided by the Internet Archive) to process image frames and return structured text output.


Image-to-Text Pipeline:

  • Input: TV news frame image in .jpg format.

  • Output: Structured CSV file containing the fields:

    • Social Media Logo (Yes/No)

    • Logo Detection Confidence (0–1)

    • Social Media Logo Type (e.g., Instagram, Twitter (bird logo), X (X logo))

    • Social Media Post Screenshot (Yes/No)

    • Screenshot Detection Confidence (0–1)

    • Social Media Screenshot Type (e.g., Instagram, Twitter)

2. Image Preprocessing

We extracted the full-resolution frames (one per second) of each episode for processing. The raw frames were provided as one .tar file per episode.


Since video content such as TV news broadcasts often contains long segments of visually static or near-identical scenes, processing every extracted frame independently can introduce significant redundancy and is computationally expensive. To address this, we applied perceptual hashing to detect and eliminate duplicate or near-duplicate frames.


We used the Python library ImageHash (with average hashing) to reduce the number of frames that needed to be processed (code). To measure how close two frames were, we calculated the Hamming distance between their hashes: a low distance means the frames are almost identical, while a higher value means they differ more. By setting a threshold t (for example, t = 5, treating any two frames whose distance is ≤ 5 as duplicates), we were able to keep just one representative frame from each group of similar ones.


We evaluated multiple thresholds (t = 5, 4, 3) and also explored whether keeping the middle frame or the last frame from each group of near-duplicates made any difference in the results. While these choices did not significantly impact our initial findings, this aspect requires further investigation and will be considered as part of future work.


For the final configuration of the deduplication process, we used t = 3, a relatively strict threshold, to minimize the chance of discarding distinct, relevant content. Within each group, we retained the middle frame, guided by the intuition that the last frame of a group often coincides with transition boundaries (e.g., cuts, fades), whereas the middle frame is less likely to be affected.
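
As a rough illustration of this step, the sketch below groups consecutive frames whose average-hash Hamming distance is within t and keeps the middle frame of each group. It is a simplified stand-in for the linked script, not the exact implementation:

from PIL import Image
import imagehash

def deduplicate_frames(frame_paths, t=3):
    """Keep one representative (middle) frame from each run of near-duplicate frames."""
    kept, group = [], []
    prev_hash = None
    for path in sorted(frame_paths):
        h = imagehash.average_hash(Image.open(path))
        # A Hamming distance greater than t means this frame starts a new group.
        if prev_hash is not None and (h - prev_hash) > t:
            kept.append(group[len(group) // 2])  # middle frame of the finished group
            group = []
        group.append(path)
        prev_hash = h
    if group:
        kept.append(group[len(group) // 2])
    return kept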

3. Prompt Design and Iterations

To automatically detect social media logos and user post screenshots using the ChatGPT API, we designed a structured prompt. We iteratively refined it over seven versions (link to commits) to ensure strict and reproducible detection. The improvements introduced in each version are documented in the commit descriptions.


A major change was made from prompt version 3 (link to v3) to prompt version 4 (link to v4). The update narrowed the task to focus strictly on logo and user post screenshot detection. Previous versions included additional attributes such as textual mentions of social media, post context, and profile mentions, but version 4 and subsequent versions disregarded these elements, emphasizing visual detection only.


After several iterations and refinements based on the results of earlier versions, the final prompt we used was version 7 (link to v7).


The final version of the prompt instructed the model to output the following fields:

  1. Social Media Logo (Yes/No)

  2. Logo Detection Confidence (0–1)

  3. Social Media Logo Type

  4. Social Media Post Screenshot (Yes/No)

  5. Screenshot Detection Confidence (0–1)

  6. Social Media Screenshot Type


The final prompt reflects the following considerations:

  1. Scope of platforms

  • Only the following social media platforms were considered for detection:

Facebook, Instagram, Twitter (bird logo), X (X logo), Threads, TikTok, Truth Social, LinkedIn, Meta, Parler, Pinterest, Rumble, Snapchat, YouTube, and Discord. These are the platforms that appeared in the gold standard.

  • Explicitly excluded other platforms including messaging apps like WhatsApp or Messenger. 

 

  2. Logo detection rules

  • Only official graphical logos were counted; text mentions of platform names within the image were explicitly not considered logos.

  • Logos had to match the official design, color, and proportions.

  • Only clearly identifiable logos were counted; any partial, ambiguous, or unclear elements were excluded.

  • Specific rules were added for X: only the stylized ‘X’ logo of the social media platform was considered, excluding other uses of ‘X’ (e.g., the ‘X’ in the FOX News and Xfinity logos).

  3. Post screenshot rules

  • Instructed the model to mark only actual user post screenshots, not interface elements like buttons, menus, or platform logos.

  • Visual cues such as profile pictures, usernames, timestamps, reactions, and layout elements could indicate a screenshot. However, these features alone do not guarantee that the image is an actual post screenshot.


  4. Confidence Scores

For both logos and screenshots, we prompted the model to provide a confidence score from 0 to 1, indicating how certain it was about its detection. These scores were recorded but not yet used in the analysis; they will be considered in future work.
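
To make these considerations concrete, a condensed, illustrative prompt assembled from the rules above might look like the sketch below; the exact wording of the real v7 prompt is in the linked commits and differs from this paraphrase:

analysis_prompt = """Analyze this TV news frame.
Consider ONLY these platforms: Facebook, Instagram, Twitter (bird logo), X (X logo),
Threads, TikTok, Truth Social, LinkedIn, Meta, Parler, Pinterest, Rumble, Snapchat,
YouTube, Discord. Exclude messaging apps such as WhatsApp or Messenger.
Count only official graphical logos; text mentions of platform names are not logos.
Do not confuse the stylized 'X' logo with the X in the FOX News or Xfinity logos.
Mark a screenshot only for actual user posts, not interface elements or logos.
Answer in exactly this format:
Social Media Logo: Yes/No
Logo Detection Confidence: <number between 0 and 1>
Social Media Logo Type: <platform or N/A>
Social Media Post Screenshot: Yes/No
Screenshot Detection Confidence: <number between 0 and 1>
Social Media Screenshot Type: <platform or N/A>"""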

4. API Interaction

Each request consisted of a single user message containing:
1. The analysis prompt (text instructions)
2. The image (base64-encoded) as an inline image_url.

Figure 3 shows a snippet of the code used to encode images and send requests to the API (full code).

# Encode the frame as base64 so it can be sent inline alongside the analysis prompt.
with open(image_path, "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": analysis_prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"}}
        ]
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=1000,  # Set a reasonable max_tokens for response length
    temperature=0.2
)

content = response.choices[0].message.content
parsed_fields = parse_response(content)
successful_request = True
break  # If successful, exit the outer retry loop


Figure 3: A snippet of the code

5. Response Parsing and Output

After the API returns a response for each frame, we parse the model’s output into a CSV file for each episode containing all six fields as listed in the prompt design and iterations section. We used a flexible regex-based parser that extracts all fields reliably, even if the model’s formatting varies slightly (L93-L159 of code). 
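
A minimal sketch of such a parser is shown below; it tolerates small formatting variations (extra markdown, varying whitespace) but is a simplified stand-in for the linked implementation:

import re

FIELDS = [
    "Social Media Logo", "Logo Detection Confidence", "Social Media Logo Type",
    "Social Media Post Screenshot", "Screenshot Detection Confidence",
    "Social Media Screenshot Type",
]

def parse_response(content: str) -> dict:
    """Extract the six expected fields from the model's free-form reply."""
    parsed = {}
    for field in FIELDS:
        # Match 'Field name: value' anywhere in the reply, ignoring case and stray asterisks.
        match = re.search(rf"{re.escape(field)}\s*[:\-]\s*\**([^\n*]+)", content, re.IGNORECASE)
        parsed[field] = match.group(1).strip() if match else "N/A"
    return parsed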


Next, we cleaned the ChatGPT output (code). The script standardizes file paths and normalizes the binary columns (Social Media Logo and Social Media Post Screenshot) by converting variations of “Yes”, “No”, and “N/A” into a consistent format. It also normalizes platform names, replacing standalone “X” with “Twitter (X)” and updating Twitter bird logos to “Twitter”, to align with the labels in the gold standard dataset for evaluation.
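
The normalization can be expressed as a small set of mappings, as in the hedged sketch below (column and label names follow the description above; the actual cleaning script is linked):

import pandas as pd

def clean_output(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize Yes/No columns and platform names to match the gold standard labels."""
    yes_no_map = {"yes": "Yes", "no": "No", "n/a": "N/A"}
    for col in ["Social Media Logo", "Social Media Post Screenshot"]:
        df[col] = df[col].astype(str).str.strip().str.lower().map(yes_no_map).fillna(df[col])
    # Align platform labels: standalone "X" becomes "Twitter (X)", bird-logo Twitter becomes "Twitter".
    platform_map = {"X": "Twitter (X)", "Twitter (bird logo)": "Twitter"}
    for col in ["Social Media Logo Type", "Social Media Screenshot Type"]:
        df[col] = df[col].replace(platform_map)
    return df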


After cleaning each episode’s results, we combined them into a single CSV file per channel (code). The script iterates through all individual CSV files for a given channel and merges them into one consolidated CSV.
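
Concretely, the merge step can be as simple as globbing the per-episode files and concatenating them, as in this short sketch (paths and the function name are illustrative):

import glob
import pandas as pd

def combine_channel_csvs(channel_dir: str, output_csv: str) -> None:
    """Merge all per-episode result CSVs for a channel into a single file."""
    parts = [pd.read_csv(path) for path in sorted(glob.glob(f"{channel_dir}/*.csv"))]
    pd.concat(parts, ignore_index=True).to_csv(output_csv, index=False)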

6. Evaluation

We started small, restricting our initial tests to a single news episode: Fox News at Night With Shannon Bream, March 13, 2020, 8–9 PM (results). This allowed us to experiment with different prompts before scaling to the full dataset. Across these runs, we varied both the prompt structure (Prompt v1–v4) and the decoding temperature (0.0 and 0.2). The decoding temperature controls randomness in LLM output: lower values (such as 0.0 and 0.2) are more deterministic, while higher values are more creative. At temperature 0.0, decoding is essentially greedy, so the same input will likely produce the same output. For the final configuration, we used a temperature of 0.2 to allow some flexibility in interpreting edge cases without introducing instability.
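
For the evaluation itself, a minimal sketch of the frame-level metric computation is shown below, assuming the gold standard and model output have already been aligned by filename into parallel Yes/No label lists (the alignment and column handling are simplified):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute frame-level metrics for one detection task (logo or screenshot)."""
    # Treat "Yes" as the positive class.
    t = [label == "Yes" for label in y_true]
    p = [label == "Yes" for label in y_pred]
    return {
        "accuracy": accuracy_score(t, p),
        "precision": precision_score(t, p, zero_division=0),
        "recall": recall_score(t, p, zero_division=0),
        "f1": f1_score(t, p, zero_division=0),
    }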

Single Episode Evaluation Results

For logo detection, results from the single episode show a clear improvement across prompt versions. 


Prompt v1 (Runs 1–7): The baseline instruction set produced very high recall but extremely low precision, with many false positives (results). For example, in Run 4, the model achieved a recall of 0.9155 but precision of only 0.1167, yielding an overall F1-score of 0.2070.


Prompt v2 (Run 8): Refining the prompt substantially reduced false positives, increasing precision to 0.1700 while recall remained high at 0.9577 (results).


Prompt v3 (Run 9): Further refinements to the prompt yielded a significant improvement in balance: precision rose to 0.3571 while recall remained strong (0.9155), resulting in an F1-score of 0.5138 (results).


Prompt v4 (Run 10): Explicitly narrowing the scope to only logos and screenshots, without any questions related to additional context, improved our results drastically (results). This change increased precision (0.9315) while maintaining high recall (0.9577), producing near-perfect accuracy (0.9978) and a high overall F1-score (0.9444).


Table 2 shows the key results for logo detection. Results show a clear trajectory of improvement across prompt versions (v1–v4).


For screenshot detection (Table 3), performance was consistently perfect for this episode. The model maintained 100% accuracy, precision, recall, and F1-score across all versions. This also suggests that screenshot detection is a relatively straightforward task compared to logo detection, at least for this particular episode.



Version            | Accuracy | Precision | Recall | F1-score
Run 4 (Prompt v1)  | 0.8640   | 0.1167    | 0.9155 | 0.2070
Run 8 (Prompt v2)  | 0.9085   | 0.1700    | 0.9577 | 0.2887
Run 9 (Prompt v3)  | 0.9664   | 0.3571    | 0.9155 | 0.5138
Run 10 (Prompt v4) | 0.9978   | 0.9315    | 0.9577 | 0.9444


Table 2: Logo detection key results on a single episode (temperature = 0.2)



Version            | Accuracy | Precision | Recall | F1-score
Run 4 (Prompt v1)  | 1.0000   | 1.0000    | 1.0000 | 1.0000
Run 8 (Prompt v2)  | 1.0000   | 1.0000    | 1.0000 | 1.0000
Run 9 (Prompt v3)  | 1.0000   | 1.0000    | 1.0000 | 1.0000
Run 10 (Prompt v4) | 1.0000   | 1.0000    | 1.0000 | 1.0000


Table 3: Screenshot detection key results on a single episode (temperature = 0.2)


With appropriate constraints, the model could reliably perform logo detection, while screenshot detection required minimal intervention.

All Episodes Evaluation Results

After establishing stability with Prompt v4, we scaled to all 36 episodes across the three channels (results). Performance metrics for logo detection are provided in Table 4, and those for screenshot detection are shown in Table 5.


Channel  | Accuracy | Precision | Recall | F1-score
CNN      | 0.9912   | 0.1705    | 0.9565 | 0.2895
FOX News | 0.9903   | 0.5039    | 0.9324 | 0.6542
MSNBC    | 0.9931   | 0.5238    | 0.9649 | 0.6790

Table 4: Performance metrics for logo detection (Prompt version 4, all episodes)



Channel  | Accuracy | Precision | Recall | F1-score
CNN      | 0.9984   | 0.5405    | 0.8696 | 0.6667
FOX News | 0.9922   | 0.1750    | 1.0000 | 0.2979
MSNBC    | 0.9986   | 0.8022    | 0.9605 | 0.8743

Table 5: Performance metrics for screenshot detection (Prompt version 4, all episodes)


The results showed:

  • CNN: Very high recall (>0.95) but extremely low precision for logo detection (0.1705), leading to a weak F1-score (0.2895). This reflects over-detection: the model flagged many elements as logos.

  • Fox News: Precision and recall were more balanced (precision 0.5039, recall 0.9324), producing an F1 of 0.6542.

  • MSNBC: The best performer, with precision = 0.5238, recall = 0.9649, and F1 = 0.6790.


For screenshot detection, MSNBC again outperformed, with an F1 of 0.8743. CNN (F1 = 0.667) and Fox News (F1 = 0.298) were more prone to over-detection.


The same model and prompt performed better on MSNBC content. This may be related to differences in on-screen visual style, such as clearer or less ambiguous logo and screenshot cues, but this remains speculative and warrants further study.


We made further refinements to the prompt to improve precision: 


Prompt v5 (changes): This version of the prompt sets a fixed list of valid platforms, adds confidence scores for detections, and tightens logo detection rules with stricter visual checks.


Prompt v6 (changes): Explicit X logo rules were introduced, which reduced false positives. We further clarified the confidence score instructions to ensure consistent numeric outputs for all detections, and refined the screenshot criteria to include only user posts, reducing mislabeling; this marked the first substantial prompt change for screenshots, as their performance had previously been consistently stable. From Prompt v5 to Prompt v6, CNN saw a slight drop in logo precision but improved screenshot F1, while MSNBC showed minor gains in screenshot detection with stable logo performance.


Prompt v7 (changes): This final configuration produced the most stable results across channels. It simplifies the X logo rules by removing exceptions for black, white, or inverted colors, while keeping strict guidance to avoid confusable logos (the X in the Xfinity logo, the FOX News logo, or other stray X letters). It clarifies the confidence score questions to always require a numeric answer between 0 and 1, explicitly prohibiting “N/A” responses for consistency. It also explicitly states that OCR-detected platform names are not counted as logos.


Results from Prompt version 4 (Tables 4 and 5) to Prompt version 5 (Tables 6 and 7) show improved precision across all channels while maintaining high recall. Specifically:

  • CNN: logo F1 increased from 0.29 to 0.46.

  • Fox News: logo F1 improved from 0.65 to 0.73.

  • MSNBC: logo F1 rose from 0.68 to 0.78.

Screenshot detection remained stable. Overall, version 5 produced more balanced detections, reducing over-detection, particularly for logos (results).


Channel  | Accuracy | Precision | Recall | F1-score
CNN      | 0.9954   | 0.3056    | 0.9167 | 0.4583
FOX News | 0.9941   | 0.6595    | 0.8138 | 0.7286
MSNBC    | 0.9966   | 0.6800    | 0.9239 | 0.7834

Table 6: Performance metrics for logo detection (Prompt version 5, all episodes)


Channel  | Accuracy | Precision | Recall | F1-score
CNN      | 0.9982   | 0.5263    | 0.8696 | 0.6557
FOX News | 0.9926   | 0.1818    | 1.0000 | 0.3077
MSNBC    | 0.9986   | 0.7714    | 0.9474 | 0.8504

Table 7: Performance metrics for screenshot detection (Prompt version 5, all episodes)


Results from the final prompt version (Prompt version 7) are shown in Table 8 (for logos) and Table 9 (for screenshots). CNN shows an increase in logo F1 from 0.46 to 0.51 and screenshot F1 from 0.66 to 0.70. Fox News experiences a slight decrease in logo F1 (0.73 to 0.69) but an improvement in screenshot F1 (0.31 to 0.39). MSNBC achieves a logo F1 of 0.89 (up from 0.78) and a screenshot F1 of 0.91 (up from 0.85). This version achieved the best balance of precision and recall, particularly for MSNBC (results). However, this also shows that no single prompt configuration is optimal for all channels; some adjustments may be required to maximize performance per channel.


Channel  | Accuracy | Precision | Recall | F1-score
CNN      | 0.9965   | 0.3529    | 0.8889 | 0.5053
FOX News | 0.9926   | 0.5963    | 0.8298 | 0.6940
MSNBC    | 0.9980   | 0.8535    | 0.9306 | 0.8904

Table 8: Performance metrics for logo detection (Prompt version 7, all episodes)


Channel  | Accuracy | Precision | Recall | F1-score
CNN      | 0.9986   | 0.5789    | 0.8800 | 0.6984
FOX News | 0.9945   | 0.2456    | 1.0000 | 0.3944
MSNBC    | 0.9988   | 0.8496    | 0.9697 | 0.9057

Table 9: Performance metrics for screenshot detection (Prompt version 7, all episodes)


Overall, these results underscore the value of iterative prompt engineering, temperature tuning, and task-specific constraints in achieving high-quality, reproducible detection outcomes in multimedia content.

Future Work

Several directions remain for extending this work. 


Prompt Refinement and Channel-Specific Tuning: We will continue refining the analysis prompt to increase accuracy and consistency in detecting social media logos and user post screenshots. Early observations suggest that performance varies across channels (and programs), likely due to differences in how each presents social media visually. This indicates that channel- or program-specific prompt tuning could further enhance results.


Decoding Temperature Exploration: While our experiments primarily used low decoding temperatures (0.0 and 0.2), future work can explore a range of temperatures to evaluate whether controlled increases in randomness improve recall in edge cases without significantly raising false positives.


Frame Selection Strategies: We conducted preliminary observations using different Hamming distance thresholds (t=5,4,3) to group similar frames and experimented with selecting the first, middle, or last frame from each group. While these initial explorations provided some insights, they were not systematically analyzed. Future work will investigate the effects of different frame selection strategies to determine the optimal approach for reducing redundancy without losing relevant content.


Confidence Scores: The confidence scores for logos and screenshots (ranging from 0 to 1) were recorded but not yet utilized. Future work will explore integrating these scores into the analysis to weigh detections and potentially improve precision.


Dataset Expansion: Future work includes manually labeling additional episodes from multiple days of prime-time TV news to expand the gold standard dataset. This will uncover more instances of social media references. We will also be able to evaluate the performance of our logo and screenshot detection pipeline across diverse broadcast content.


Advertisement Filtering: With access to advertisement segments, we plan to exclude ad frames before the evaluation step. This should improve our results: the pipeline currently includes ads, so ChatGPT may flag social media references in ads that are not annotated in the gold standard. As a result, some apparent false positives are actually correct detections, highlighting the need to filter ads for accurate evaluation.


Complementary Detection Methods: In addition to logo and screenshot detection, future work will focus on other approaches, such as analyzing OCR-extracted text from video frames and analyzing closed-caption transcripts for social media references.


Compare Against Other Multimodal Models: We aim to explore other vision-language APIs, such as Google’s Gemini Pro to compare detection performance across different Large Language Models (LLMs).

Acknowledgement

I sincerely thank the Internet Archive and the Google Summer of Code program for providing this amazing opportunity. In particular, I would like to thank Sawood Alam, Research Lead, and Will Howes, Software Engineer, at the Internet Archive’s Wayback Machine for their guidance and mentorship. I also acknowledge Mark Graham, Director of the Wayback Machine at the Internet Archive, and Roger Macdonald, Founder of the Internet Archive’s TV News Archive, for their invaluable support. I am grateful to the TV News Archive team for welcoming me into their meetings, which allowed me to gain a deeper understanding of the archive and its work. I am especially grateful to Kalev Leetaru (Founder, The GDELT Project) for providing the necessary Internet Archive data, which were processed through the GDELT Project. Finally, I would like to thank my PhD advisors, Dr. Michele Weigle and Dr. Michael Nelson (Old Dominion University), and Dr. Alexander Nwala (William & Mary) for their continued guidance.


