2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

"Team Turtle" at Archives Unleashed in Washington, DC
(from left to right: N. Chah, S. Marti, M. Aturban, and I. Amin)

The first presidential debate (H. Clinton v. D. Trump) took place last Monday, September 26, 2016, at Hofstra University, New York. The questions covered topics such as the economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York).

This reminded me of the work we did at the second Archives Unleashed Hackathon, held at the Library of Congress in Washington, DC. I worked with "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites were maintained by the candidates or their political parties (e.g., www.georgewbush.com, www.johnkerry.com, www.gop.com, and www.democrats.org) or by newspapers such as www.washingtonpost.com and www.nytimes.com. They were crawled on the days around election day (November 2, 2004). The goal of the project was to investigate two questions: "How many times did each candidate mention each state?" and "What topics were they talking about?"

At this event, we had limited time (two days) to finish our project and present our findings by the end of the second day. Fortunately, we managed to complete three main steps: (1) extract plain text from the ARC files, (2) apply several techniques to extract named entities and topics, and (3) build a visualization tool to present the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., the Internet Archive's Heritrix writes what it finds on the Web into ARC files of 100MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, each containing hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then we wrote several Scala scripts for Apache Spark to iterate over all the ARC files and generate a clean textual version of each page (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content, as shown below (we hid the content of web pages due to copyright issues). The results were collected into a single TSV file.
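Our actual extraction ran in Warcbase with Scala on Spark, and those scripts are on GitHub. For readers who just want to see the shape of this step, here is a minimal Python sketch (standard library only, not our hackathon code) that strips HTML tags and builds one TSV row per page with the five fields listed above. The function names and sample values are hypothetical, and reading the ARC records themselves is left out.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each page fits on a single TSV line.
    return " ".join(" ".join(parser.parts).split())


def to_tsv_row(page_id, crawl_date, uri, html):
    """Build [ID, crawl date, domain, full URI, text] for one archived page."""
    domain = urlparse(uri).netloc
    return [page_id, crawl_date, domain, uri, html_to_text(html)]


def write_tsv(rows, out):
    """Write rows to an open text file (or any file-like object) as TSV."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
```

Joining text fragments with a space keeps words from adjacent elements apart, at the cost of an occasional stray space before punctuation, which does not matter much for the tokenization that follows.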

[2] Extract named entities and topics

We used Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:

NLTK to tokenize text
Stemming and stop-word removal, with TF-IDF weighting
Gensim and Latent Dirichlet Allocation for topic modeling
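To give a feel for the preprocessing, here is a rough standard-library Python sketch of tokenization, stop-word removal, and TF-IDF weighting. It is a stand-in, not what we ran: it uses a regular expression instead of NLTK's tokenizer, a tiny made-up stop-word list, and no stemming, and it stops short of the Gensim LDA step.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "on",
              "is", "are", "we", "will"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf is the term count divided by the document length;
    idf is log(N / df), so a term found in every document scores 0.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty pages
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Terms that appear on every page get a weight of zero, which is the point of the weighting: words that are common across the whole collection say little about what a particular page is about.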

After applying the above techniques, we aggregated the results into a text file that served as input to the visualization tool (described in step [3]). Part of the results is shown in the table below.

State	Candidate	Frequency of mentioning the state	The most important topic
Mississippi	Kerry	85	Iraq
Mississippi	Bush	131	Energy
Oklahoma	Kerry	65	Jobs
Oklahoma	Bush	85	Retirement
Delaware	Kerry	53	Colleges
Delaware	Bush	2	Other
Minnesota	Kerry	155	Jobs
Minnesota	Bush	303	Colleges
Illinois	Kerry	86	Iraq
Illinois	Bush	131	Health
Georgia	Kerry	101	Energy
Georgia	Bush	388	Tax
Arkansas	Kerry	66	Iraq
Arkansas	Bush	42	Colleges
New Mexico	Kerry	157	Jobs
New Mexico	Bush	384	Tax
Indiana	Kerry	132	Tax
Indiana	Bush	43	Colleges
Maryland	Kerry	94	Jobs
Maryland	Bush	213	Energy
Louisiana	Kerry	60	Iraq
Louisiana	Bush	262	Tax
Texas	Kerry	195	Terrorism
Texas	Bush	1108	Tax
Tennessee	Kerry	69	Tax
Tennessee	Bush	134	Teacher
Arizona	Kerry	77	Iraq
Arizona	Bush	369	Jobs

...
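An aggregated file like this is easy to load for further analysis. The sketch below assumes a tab-separated layout with shortened column names (State, Candidate, Frequency, Topic), which are my choice for illustration; the four sample rows are copied from the table above.

```python
import csv
import io

# Sample rows copied from the table; the short header names are assumed.
SAMPLE = (
    "State\tCandidate\tFrequency\tTopic\n"
    "Texas\tKerry\t195\tTerrorism\n"
    "Texas\tBush\t1108\tTax\n"
    "Delaware\tKerry\t53\tColleges\n"
    "Delaware\tBush\t2\tOther\n"
)


def load_results(handle):
    """Read the TSV into {state: {candidate: {"mentions": int, "topic": str}}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    by_state = {}
    for row in reader:
        by_state.setdefault(row["State"], {})[row["Candidate"]] = {
            "mentions": int(row["Frequency"]),
            "topic": row["Topic"],
        }
    return by_state


def top_mentioner(by_state, state):
    """Return the candidate who mentioned the given state most often."""
    candidates = by_state[state]
    return max(candidates, key=lambda name: candidates[name]["mentions"])


if __name__ == "__main__":
    results = load_results(io.StringIO(SAMPLE))
    for state in results:
        print(state, top_mentioner(results, state))
```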

[3] Interactive US map

We decided to build an interactive US map using D3.js. As shown below, the color of each state indicates the winning party (red for Republican and blue for Democratic), while the size of each bubble indicates how many times the candidate mentioned the state. The visualization required us to supply some information manually, such as the winning party for each state. In addition, we entered latitude and longitude coordinates to position the bubbles on the map (two circles for each state, one per candidate). Hovering over a bubble shows the most important topic mentioned by the candidate. If you would like to interact with the map, visit http://www.cs.odu.edu/~maturban/hackathon/.
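The map itself is written in D3.js, but the data preparation behind it can be sketched in Python. The example below is only an illustration under assumptions: the field names, the 40-pixel maximum radius, and the square-root scaling (so that bubble area, rather than radius, tracks the number of mentions) are mine and may differ from what the deployed map does. Latitude and longitude are omitted.

```python
import json
import math

PARTY_COLORS = {"Republican": "red", "Democratic": "blue"}


def bubble_radius(mentions, max_mentions, max_radius=40.0):
    """Scale the radius with the square root of the count (assumed scaling)."""
    if max_mentions <= 0:
        return 0.0
    return max_radius * math.sqrt(mentions / max_mentions)


def build_bubbles(rows, winners):
    """Turn result rows into records that a D3.js script could consume.

    rows:    dicts with "state", "candidate", "mentions", "topic"
    winners: {state: winning party}, entered by hand as in the post
    """
    max_mentions = max(r["mentions"] for r in rows)
    return [
        {
            "state": r["state"],
            "candidate": r["candidate"],
            "topic": r["topic"],  # shown when hovering over the bubble
            "state_fill": PARTY_COLORS[winners[r["state"]]],
            "radius": round(bubble_radius(r["mentions"], max_mentions), 2),
        }
        for r in rows
    ]


def to_json(rows, winners):
    """Serialize the bubble records as JSON text."""
    return json.dumps(build_bubbles(rows, winners), indent=2)
```

With the Texas rows from the table (Bush 1108, Kerry 195) and Texas marked as Republican, the Bush bubble gets the full radius and the Kerry bubble a proportionally smaller one.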

So exciting to see use of our LC web archives in use at the #hackarchives ! So inspiring! pic.twitter.com/DXksowhrn2
— Abbie Grotke (@agrotke) June 15, 2016

Looking at the map might help us answer the research questions, but it also raises others, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because these are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, I think this is a useful time to revisit the topic, as we are close to the 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could be applied again to see what newspapers say about this event.

--Mohamed Aturban


Web Science and Digital Libraries Research Group
