2025-02-11: Tracking Political Trends Around the US Presidential Election


The Computer Science Graduate Society (CSGS) at Old Dominion University organized a hackathon from October 24 to November 1, 2024. The competition featured three teams in the master's category and three in the PhD category, each presenting innovative projects. Participants chose from five research topics provided by the organizing committee. Two teams from the Web Science and Digital Libraries (WSDL) research group participated in the hackathon: Binary Bandits (David Calano and Dominik Soos) and our team, Titans (Himarsha Jayanetti, Kritika Garg, and Kumushini Thennakoon).

We won the PhD category with our project, "Tracking Political Trends Around the US Presidential Election." This was a mini project completed within a limited time frame, making it a fast-paced challenge. Despite the constraints, our team tackled various obstacles in collecting and analyzing data. In this blog post, we provide an overview of our project and highlight its key contributions.

For more details, you can explore our GitHub repository containing the code, a detailed report, and the datasets used in our analysis.

 

Introduction and Motivation

Social media platforms served as both a public forum and a digital newsstand during the 2024 U.S. presidential election, making them essential for understanding the evolving political landscape. Platforms like X (formerly Twitter) offered a rich, real-time view of the public's sentiments, concerns, and conversations. The project leveraged data collected from trending hashtags on X in the days leading up to the election and used keyword analysis, sentiment analysis, and topic modeling techniques to capture public opinion, uncover emerging topics, and observe voters' sentiments.


Recent data was essential for capturing the dynamics of the political climate, as political discourse proved highly fluid: intense campaigning, policy announcements, and significant events shaped public opinion in near real time. Analyzing short-term data allowed us to pinpoint these rapid shifts, revealing immediate concerns and spontaneous reactions and possibly identifying pivotal moments that influence voter behavior after events like presidential debates. Immediate data also enabled the detection of micro-trends that could signal larger upcoming shifts in voter priorities, enhancing the relevance of our analysis as the election neared.


Large language models (LLMs) are an efficient way to analyze text data because they can process large amounts of text and understand language, sentiment, and context, which is important for accurately capturing public opinion. By using LLMs to analyze Twitter data, we can identify key discussion topics and see which issues matter most to voters. This study not only helps us understand voter sentiment in near real time but also shows how social media can be a powerful tool for tracking political trends.

Collecting Twitter Data

In this study, we collected data from X to gain insights into discussions and trends surrounding the 2024 U.S. presidential election. We chose X because it appeared to have the highest user engagement; we also looked into platforms like Facebook, but the election-related groups there showed little engagement. We focused on seven popular hashtags, each representing critical keywords that capture different dimensions of election-related discourse. We used a hashtag analytics website to identify the most trending election-related hashtags at the time of data collection and gathered data from X for the following seven hashtags:

1. #donald

2. #trump

3. #kamala

4. #harris

5. #presidentialelection

6. #pennsylvania

7. #russia


We used the Selenium Wire Python library to develop an automated web scraping tool to gather tweets. We collected tweets posted between 2024-10-21 and 2024-10-27 so that the most recent data would inform our understanding of the political situation around the 2024 US presidential election. Figure 1 shows the data extraction process from tweets. We captured the network traffic for the LATEST tab search query (for example, https://x.com/search?q=%23trump&src=typed_query&f=live), filtering for GraphQL URLs that contain tweet details in JSON format. From these responses, we extracted key metadata, including Tweet ID, Username, Text, and engagement metrics such as the number of Likes, Retweets (RTs), Quote Tweets (QTs), and Replies for each tweet.
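As an illustration of this capture step, the minimal sketch below shows how Selenium Wire can intercept the GraphQL responses while loading the LATEST tab for a hashtag. It assumes a Chrome session that is already logged in to X; the "SearchTimeline" URL filter and the parse_entries() helper are hypothetical placeholders, since X's GraphQL response schema is undocumented and changes often, and this is not a faithful reproduction of our hackathon script.

import json
from seleniumwire import webdriver
from seleniumwire.utils import decode

# Load the LATEST tab for one hashtag; assumes a Chrome profile with a valid X session.
driver = webdriver.Chrome()
driver.get("https://x.com/search?q=%23trump&src=typed_query&f=live")
# ... scroll the page here to trigger additional GraphQL requests ...

tweets = []
for request in driver.requests:
    # Keep only GraphQL responses that carry search results ("SearchTimeline" is an assumption).
    if request.response and "graphql" in request.url and "SearchTimeline" in request.url:
        body = decode(request.response.body,
                      request.response.headers.get("Content-Encoding", "identity"))
        payload = json.loads(body)
        # parse_entries() is a hypothetical helper that walks X's nested JSON and
        # pulls out the Tweet ID, Username, Text, and engagement metrics.
        tweets.extend(parse_entries(payload))

driver.quit()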


We encountered several challenges during data collection, the most significant being an insufficient volume of data for analysis. We spent the first couple of days of the hackathon attempting to collect data, but the stringent rate limits imposed by X made this process highly challenging. Our approach involved collecting tweets in reverse chronological order, starting from the most recent and working backward. However, due to the high volume of tweets for certain hashtags, we reached the rate limit before we could gather a full week's worth of data. As a result, our dataset became heavily skewed toward the most recent tweets, with the majority coming from a single day rather than a balanced distribution over time. These constraints ultimately restricted the depth of our insights and the robustness of our findings.


Had the hackathon allowed for a longer timeframe, we could have taken a more gradual approach to data collection. However, given the one-week constraint, we had to work with the data we could obtain within the limited time, balancing collection efforts with the need to progress to the analysis phase and complete the project on schedule. Despite these limitations, we believe our approach and findings can still provide valuable guidance for future studies exploring similar topics.

Figure 1: Overview of the data collection step to extract text data from tweets.

Dataset

We extracted 3,097 tweets in total across all seven hashtags. Table 1 shows the number of records we obtained for each hashtag. In the data pre-processing phase, we cleaned the collected tweets. First, we removed stopwords (e.g., "and," "the") using the Natural Language Toolkit (NLTK) Python library to emphasize the meaningful content of each tweet. Next, we removed special characters, including emojis and links, using regular expressions. We then removed duplicate tweets, keeping only the first instance of each. We also extracted the hashtags used in each tweet for later analysis, and recovered each tweet's date using TweetedAt. After this cleaning and preprocessing stage, we were left with 2,742 unique tweets.
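A minimal sketch of these cleaning steps is shown below, assuming the scraped tweets sit in a pandas DataFrame df with a "text" column (the column names are assumptions, not our exact schema).

import re
import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+", "", text)             # remove links
    text = re.sub(r"[^A-Za-z0-9#@\s]", "", text)    # remove emojis and special characters
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

df["hashtags"] = df["text"].str.findall(r"#\w+")    # keep the hashtags for later analysis
df["clean_text"] = df["text"].apply(clean_tweet)
df = df.drop_duplicates(subset="clean_text", keep="first")   # keep only the first instance of each tweet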


Hashtag                  Total records
#donald                       45
#trump                       589
#kamala                      619
#harris                      600
#presidentialelection        666
#pennsylvania                403
#russia                      175


Table 1: Hashtags and their respective tweet counts extracted from X.

Sentiment Analysis

To analyze sentiment on Twitter, we used VADER (Valence Aware Dictionary and sEntiment Reasoner), a tool well-suited to the informal and emotive nature of social media content. Tweets were categorized based on their VADER compound scores: positive for scores above 0.05, negative for scores below -0.05, and neutral for scores in between. This method provided a clear and balanced sentiment classification across the dataset. Figure 2 shows the sentiment distribution of our dataset.
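A sketch of this labeling step, using NLTK's VADER implementation and the thresholds above (the DataFrame columns are again assumptions):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()

def label_sentiment(text: str) -> str:
    # VADER's compound score ranges from -1 (most negative) to +1 (most positive).
    score = analyzer.polarity_scores(text)["compound"]
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"

df["sentiment"] = df["clean_text"].apply(label_sentiment)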



Figure 2: Sentiment distribution for overall data.


Our analysis focused on specific hashtags, such as #presidentialelection. As shown in Figure 3, the sentiment of tweets varied over time, with alternating periods of positive and negative sentiment. A notable peak occurred at 2:00 PM on October 25, 2024, driven by discussions about battleground state polling (e.g., Arizona and Wisconsin), rallies by candidates Kamala Harris and Donald Trump, celebrity endorsements, and concerns over election security. Two major national polls released that day revealed a neck-and-neck race between Harris and Trump, underscoring the competitiveness of the election.

Figure 3: Sentiment distribution of tweets over time for #presidentialelection.

Figure 4 illustrates the sentiment distribution over time for the aggregated #kamala and #harris tweets, showing periods where positive sentiment dominates, followed by periods with a higher proportion of negative sentiment. Expanding the dataset to include additional hashtags or topics, such as those related to Donald Trump, would enable a more comprehensive comparative analysis of public sentiment toward political figures and trending topics.

Figure 4: Sentiment distribution of aggregated tweets for #kamala and #harris hashtags over time.

Ideally, we would have performed similar temporal sentiment analysis on the major topics identified through topic modeling to understand the public sentiment on trending topics over time. However, due to limited data and time constraints during the hackathon, we showcased this analysis on trending hashtags instead.
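For reference, the hourly breakdowns behind Figures 3 and 4 can be produced with a simple aggregation like the sketch below, assuming a created_at column derived from TweetedAt and the VADER label from the previous step:

import pandas as pd
import matplotlib.pyplot as plt

df["created_at"] = pd.to_datetime(df["created_at"])

# Count positive/negative/neutral tweets per hour and plot them as stacked bars.
hourly = pd.crosstab(df["created_at"].dt.floor("H"), df["sentiment"])
hourly.plot(kind="bar", stacked=True, figsize=(12, 4))
plt.xlabel("Hour")
plt.ylabel("Number of tweets")
plt.tight_layout()
plt.show()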

Topic Modeling

We conducted topic modeling on Twitter data to uncover key themes related to the 2024 US presidential election. By categorizing tweets based on sentiment, we linked emotional tone with thematic content, providing a comprehensive view of public discourse.

Figure 5 shows an overview of the process to identify trending election-related topics among the public. For topic modeling, we utilized BERTopic, which leverages large language models (LLMs) to generate and fine-tune topic clusters. To address computational constraints, we incorporated vector databases for efficient data retrieval and applied techniques like 8-bit compression to reduce memory demands in models such as Llama 2. This enabled faster and more scalable analyses.

Figure 5: Overview of the topic modeling process to identify trending election-related topics among the public.

Tweets were transformed into numerical representations using Sentence Transformers, which efficiently generate embeddings for clustering. Using the BAAI/bge-small-en-v1.5 model, we pre-calculated these embeddings to streamline exploration and facilitate quick iteration over BERTopic’s hyperparameters. To address the challenges of high-dimensional data, we employed UMAP for dimensionality reduction, preserving both the local and global structure of the data, which is crucial for identifying clusters of semantically similar content. 
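The sketch below illustrates this embedding step under the assumptions above: pre-computing the BAAI/bge-small-en-v1.5 embeddings once so that BERTopic's hyperparameters can be iterated on without re-encoding the tweets.

from sentence_transformers import SentenceTransformer

# Encode every cleaned tweet once; the resulting matrix is reused across BERTopic runs.
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
docs = df["clean_text"].tolist()
embeddings = embedding_model.encode(docs, show_progress_bar=True)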

Our topic modeling pipeline involved UMAP for dimensionality reduction and HDBSCAN for clustering. This process included extracting embeddings with Sentence Transformers, reducing dimensionality with UMAP, clustering embeddings with HDBSCAN, and fine-tuning topic representations using Llama-3-8b-instruct. Tools such as KeyBERT and quantized LLMs via the LlamaCPP library were used to extract and refine topics.
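A condensed sketch of this pipeline is shown below. It wires the pre-computed embeddings into BERTopic with UMAP and HDBSCAN and uses KeyBERTInspired as the representation model; the specific hyperparameters, and the step that would refine topic labels with a quantized Llama model through BERTopic's LlamaCPP representation, are assumptions rather than our exact configuration.

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from hdbscan import HDBSCAN
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
representation_model = KeyBERTInspired()   # a quantized Llama model (bertopic.representation.LlamaCPP) could be swapped in here

topic_model = BERTopic(
    embedding_model=embedding_model,       # from the previous snippet
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
print(topic_model.get_topic_info().head(10))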

The results revealed emerging themes, characterized by keyword lists that highlighted public sentiment and trending election discussions.


Topics

Our analysis explored trending topics across positive, negative, and neutral sentiment tweets, revealing key themes tied to public discourse during the 2024 U.S. presidential election.

Positive Sentiment Topics: 

Figure 6 shows the trending topics related to the positive sentiment tweets. Positive sentiment tweets primarily focused on:

  1. Voter Engagement: Discussions on political campaigns, election news, and efforts to mobilize voters for Trump.

  2. Yosha Crypto Presale: Promotions and discussions about the cryptocurrency Yosha, highlighting tokenomics and investment opportunities.

Figure 6: Trending topics related to the positive sentiment tweets. The positive tweets are mostly in favor of Donald Trump. 


Negative Sentiment Topics: 

Figure 7 shows the trending topics related to the negative sentiment. Negative sentiment tweets were centered on:

  1. Politics and Violence: Conversations about political events involving violence or threats.

  2. U.S. Politics: Discussions about elections, debates, and policy issues.

  3. Israel-Iran Conflict: Focus on tensions between Israel and Iran, with implications for U.S. foreign policy.

Figure 7: Trending topics related to the negative sentiment.


Neutral Sentiment Topics: 

Figure 8 shows the trending topics related to neutral sentiment. Neutral sentiment tweets captured a broader range of themes, including:

  1. Kamala Harris: Analysis of her role in the election, policies, and public perception, often compared to Donald Trump.

  2. Local Discussions: Political engagement in states like Pennsylvania, alongside cultural and historical discussions.

  3. Geopolitical Tensions: Conversations about U.S. foreign policy and its global implications.

Figure 8: Trending topics related to neutral sentiment.


Observations and Challenges

During our analysis, we observed that some topics, particularly in positive and negative sentiment data, formed large clusters. This happens because the embedding process, using models like LLaMA, tends to group similar data points tightly together. When visualized in lower dimensions using tools like UMAP, these clusters become even more condensed, making it harder for HDBSCAN to break them into smaller, distinct groups. This often results in a single large cluster that combines closely related topics, especially when the text data shows minimal variation in context. To address this, we could fine-tune the parameters of the modeling process. For instance, adjusting UMAP settings to focus more on local relationships or allowing HDBSCAN to detect smaller clusters could help uncover unique subtopics. However, overly aggressive adjustments might lead to over-clustering, where redundant topics emerge, as seen in cases like “Kamala Harris” and “Kamala Harris Discussion,” which essentially represent the same theme.
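As an illustration of the kind of adjustment described above (the specific values are arbitrary), a smaller n_neighbors makes UMAP favor local structure, and a smaller min_cluster_size combined with leaf-based cluster selection lets HDBSCAN split one large cluster into subtopics:

from hdbscan import HDBSCAN
from umap import UMAP

# Emphasize local neighborhoods so tightly packed points are not collapsed into one region.
umap_model = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric="cosine")

# Allow smaller clusters and prefer the leaves of the cluster tree over large merged clusters.
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5,
                        cluster_selection_method="leaf", prediction_data=True)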

Another challenge stemmed from the dataset’s size. With only 3,000 tweets divided into three subsets by sentiment, the model had limited data to detect diverse topics effectively. Splitting the dataset further reduced its ability to capture trends and nuances. To overcome this, we combined the subsets and reran the topic modeling on the entire dataset. This approach yielded 36 distinct subtopics and provided a clearer picture of the themes within the data as shown in Figure 10. By analyzing the full dataset, we gained a more comprehensive understanding of public discussions, their structure, and the relationships between trending election-related themes.

Figure 10: Topics on Twitter related to the US election 2024.

Conclusions

Our project, Tracking Political Trends Around the US Presidential Election, offered a small-scale glimpse into how political discourse unfolded on the social media platform Twitter (X) in the days leading up to the 2024 U.S. presidential election. While our dataset was limited due to web scraping challenges, the analysis of 2,742 unique tweets across seven election-related hashtags provided preliminary insights into sentiment shifts and dominant discussion themes. 


Our analysis showed shifts in public sentiment driven by key political events. Notably, a peak on October 25, 2024 reflected discussions on battleground state polling, campaign rallies, celebrity endorsements, and election security concerns, highlighting the competitiveness of the race. Our topic modeling approach using BERTopic uncovered dominant themes, including debates on foreign policy, concerns over election integrity, and economic issues, demonstrating the power of large language models in processing and categorizing political discussions. 


While our analysis was limited by the short timeframe of the hackathon, the results highlight the potential for real-time social media monitoring to capture political trends and voter sentiment. We believe that future research could build on our project by analyzing not just text but images, videos, and engagement metrics to better understand how political narratives spread. Our work lays the groundwork for studying the intersection of social media and politics, highlighting the value of computational methods in tracking public opinion.



We are grateful to the Computer Science Graduate Society (CSGS) at Old Dominion University for organizing this hackathon. We are honored to have our efforts recognized with Apple Watches as prizes!



-- Kritika Garg (@kritika_garg), Kumushini Thennakoon (@KumushiniT), and Himarsha Jayanetti (@HimarshaJ)

