|"Team Turtle" in Archive Unleashed in Washington DC|
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)
In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.
 Extract textual data from ARC files:
ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., Internet Archive's Heritrix writes what it finds on the Web in ARC files of 100MB each). ARC is the predecessor format to the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then, we wrote several Apache Spark's Scala scripts to be able to iterate over all ARC files and generate a clean textual version (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content as shown below (we hid the content of web pages due to copyright issues). Results were collected into a single TSV file.
We used Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:
- NLTK to tokenize text
- Stemming and removing stop words (involving TF-IDF weighting)
- Gensim and Latent Dirichlet Allocation for topic modeling
|State||Candidate||Frequency of mentioning |
|The most important |
|New Mexico||Kerry|| |
|New Mexico||Bush|| |
 Interactive US map
We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic) while the size of the bubbles indicates how many times the state was mentioned by the candidate. The visualization required us to provide more information manually like the winning party for each state. In addition, we inserted different locations, latitude and longitude, to locate the bubbles on the map (two circles for each state). By hovering over the bubbles, the most important topic mentioned by the candidate will be shown. If you are interested to interact with the map, visit (http://www.cs.odu.edu/~maturban/hackathon/).
So exciting to see use of our LC web archives in use at the #hackarchives ! So inspiring! pic.twitter.com/DXksowhrn2— Abbie Grotke (@agrotke) June 15, 2016
Looking at the map might help us answer the research questions, but it might raise other questions, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because they are always considered as "red" states? On the other hand, it is clear that they paid more attention to other "swing" states like Colorado and Florida. Finally, I would say that it might be useful to introduce this topic at this time as we are close to the next 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could apply again to see what newspapers say about this event.