Monday, October 3, 2016

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?


"Team Turtle" at Archives Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban, and I. Amin)
The first presidential debate (H. Clinton v. D. Trump) took place last Monday, September 26, 2016, at Hofstra University, New York. The questions covered topics such as the economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York).

This reminded me of the work we did at the second Archives Unleashed Hackathon, held at the Library of Congress in Washington DC. I worked with "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites were maintained by the candidates or their political parties (e.g., www.georgewbush.com, www.johnkerry.com, www.gop.com, and www.democrats.org) or by newspapers such as www.washingtonpost.com and www.nytimes.com. They were crawled on the days around election day (November 2, 2004). The goal of the project was to investigate two questions: "How many times did each candidate mention each state?" and "What topics were they talking about?"

At this event, we had limited time (two days) to finish our project and present our findings by the end of the second day. Fortunately, we made it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., Internet Archive's Heritrix writes what it finds on the Web in ARC files of 100MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, each containing hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an open-source platform for managing web archives. We started by installing Warcbase following its installation instructions. Then, we wrote several Scala scripts for Apache Spark to iterate over all ARC files and generate a clean textual version of each page (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content (we do not show the content of web pages here due to copyright issues). Results were collected into a single TSV file.
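To make the shape of this step concrete, here is a minimal sketch of the per-page cleanup and the TSV output. It is not our Warcbase/Scala code: it uses only the Python standard library, assumes the page's ID, crawl date, URI, and raw HTML have already been read out of the ARC file, and the function names (`html_to_text`, `page_to_row`, `write_tsv`) are hypothetical.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Remove HTML tags and collapse all whitespace to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def page_to_row(page_id, crawl_date, uri, html):
    """Build one record: ID, crawl date, domain name, full URI, plain text."""
    domain = urlparse(uri).netloc
    return [page_id, crawl_date, domain, uri, html_to_text(html)]


def write_tsv(rows, out):
    """Write records to an open text file as tab-separated values."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
```

Collapsing whitespace in `html_to_text` also removes tabs and newlines from the page text, so each page stays on a single TSV line.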

[2] Extract named entities and topics

We used Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:
After applying the above techniques, the results were aggregated in a text file that was used as input to the visualization tool (described in step [3]). Part of the results is shown in the table below.

State        Candidate  Frequency of mentioning the state  The most important topic
Mississippi  Kerry      85                                 Iraq
Mississippi  Bush       131                                Energy
Oklahoma     Kerry      65                                 Jobs
Oklahoma     Bush       85                                 Retirement
Delaware     Kerry      53                                 Colleges
Delaware     Bush       2                                  Other
Minnesota    Kerry      155                                Jobs
Minnesota    Bush       303                                Colleges
Illinois     Kerry      86                                 Iraq
Illinois     Bush       131                                Health
Georgia      Kerry      101                                Energy
Georgia      Bush       388                                Tax
Arkansas     Kerry      66                                 Iraq
Arkansas     Bush       42                                 Colleges
New Mexico   Kerry      157                                Jobs
New Mexico   Bush       384                                Tax
Indiana      Kerry      132                                Tax
Indiana      Bush       43                                 Colleges
Maryland     Kerry      94                                 Jobs
Maryland     Bush       213                                Energy
Louisiana    Kerry      60                                 Iraq
Louisiana    Bush       262                                Tax
Texas        Kerry      195                                Terrorism
Texas        Bush       1108                               Tax
Tennessee    Kerry      69                                 Tax
Tennessee    Bush       134                                Teacher
Arizona      Kerry      77                                 Iraq
Arizona      Bush       369                                Jobs
...
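The counting behind a table like this can be sketched roughly as follows. This is a simplified illustration, not our actual pipeline: it matches state names with a regular expression instead of using Stanford NER, and it assumes (hypothetically) that a page is attributed to a candidate by the domain it was crawled from. The `STATES` list is cut short for the example, and `DOMAIN_TO_CANDIDATE` and `count_state_mentions` are illustrative names.

```python
import re
from collections import Counter

# Shortened list for illustration; a real run would list all US states.
STATES = ["Texas", "New Mexico", "Ohio"]

# Assumption: pages are attributed to a candidate by their domain name.
DOMAIN_TO_CANDIDATE = {
    "www.georgewbush.com": "Bush",
    "www.gop.com": "Bush",
    "www.johnkerry.com": "Kerry",
    "www.democrats.org": "Kerry",
}

# Longest names first so multi-word states are tried before shorter ones.
_STATE_RE = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(STATES, key=len, reverse=True))
    + r")\b"
)


def count_state_mentions(rows):
    """Count mentions per (state, candidate) over (domain, text) pairs."""
    counts = Counter()
    for domain, text in rows:
        candidate = DOMAIN_TO_CANDIDATE.get(domain)
        if candidate is None:
            continue  # page from a domain not tied to either candidate
        for state in _STATE_RE.findall(text):
            counts[(state, candidate)] += 1
    return counts
```

Plain name matching is cruder than NER: it cannot tell "Washington" the state from the city or the person, which is one reason to prefer a tagger for the real analysis.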

[3] Interactive US map

We decided to build an interactive US map using D3.js. On the map, the state color indicates the winning party (i.e., red for Republican and blue for Democratic), while the size of each bubble indicates how many times the state was mentioned by the candidate. The visualization required us to provide some information manually, such as the winning party for each state. In addition, we supplied coordinates (latitude and longitude) to position the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate. To interact with the map, visit http://www.cs.odu.edu/~maturban/hackathon/.
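As a rough illustration of the data such a map needs, the sketch below turns the aggregated counts into one bubble record per state and candidate. It is a hypothetical preparation step written in Python, not our D3.js code. The square-root radius scaling (so that bubble area, not radius, grows with the count), the longitude offset that separates a state's two circles, the field names, and the approximate Texas coordinates are all assumptions made for the example; the two counts come from the table above.

```python
import json
import math


def bubble_radius(count, max_count, max_radius=40.0):
    """Scale radius with sqrt(count) so that area is proportional to count."""
    if max_count <= 0:
        return 0.0
    return max_radius * math.sqrt(count / max_count)


def build_bubbles(mentions, state_info, offset=0.8):
    """Create two bubble records per state, one for each candidate."""
    max_count = max((m["count"] for m in mentions), default=0)
    bubbles = []
    for m in mentions:
        info = state_info[m["state"]]
        # Shift the two circles apart so they do not sit on top of each other.
        shift = -offset if m["candidate"] == "Kerry" else offset
        bubbles.append({
            "state": m["state"],
            "candidate": m["candidate"],
            "topic": m["topic"],  # shown when hovering over the bubble
            "winner": info["winner"],  # drives the state color
            "lat": info["lat"],
            "lon": info["lon"] + shift,
            "radius": bubble_radius(m["count"], max_count),
        })
    return bubbles


if __name__ == "__main__":
    # Coordinates are approximate and only for illustration.
    state_info = {"Texas": {"lat": 31.0, "lon": -99.0, "winner": "Republican"}}
    mentions = [
        {"state": "Texas", "candidate": "Kerry", "count": 195, "topic": "Terrorism"},
        {"state": "Texas", "candidate": "Bush", "count": 1108, "topic": "Tax"},
    ]
    print(json.dumps(build_bubbles(mentions, state_info), indent=2))
```

The resulting JSON could then be loaded by the page that draws the map.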


Looking at the map might help us answer the research questions, but it also raises other questions, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because these are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, this topic seems timely as we approach the 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could be applied again to see what newspapers say about this event.


--Mohamed Aturban
