2016-06-27: Archives Unleashed 2.0 Web Archive Hackathon Trip Report

Members from WSDL who participated in the Hackathon 2.0

Last week, June 13-15, 2016, six members of the Web Science and Digital Libraries group (WSDL) from Old Dominion University had the opportunity to attend Archives Unleashed 2.0 at the Library of Congress in Washington, DC. This event is a follow-up to Archives Unleashed (Web Archive Hackathon 1.0), held in March 2015 at the University of Toronto Library, Toronto, Ontario, Canada. We (Mat Kelly, Alexander Nwala, John Berlin, Sawood Alam, Shawn Jones, and Mohamed Aturban) met with other participants from various countries and with different backgrounds, including librarians, historians, and computer scientists. The main goal of this event is to build tools for web archives and to support an ongoing community with a common vision of how to access and extract data from web collections.

This event was made possible with generous support from the National Science Foundation, the Social Sciences and Humanities Research Council of Canada, the University of Waterloo’s Department of History, the David R. Cheriton School of Computer Science and the University of Waterloo, and the School of Communication and Information at Rutgers University.

The event was organized by Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (assistant professor, Department of History, University of Waterloo), Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo), Nicholas Worby (Government Information & Statistics Librarian, University of Toronto), and Nathalie Casemajor (assistant professor, Department of Social Sciences, University of Québec). Here are some details about different activities over the three days of Hackathon 2.0.

Day 1 (June 13, 2016)

Our evening gathering on the first day of Hackathon 2.0 was at Gelman Library, George Washington University. Its purpose was for (1) participants to briefly introduce themselves and their areas of research, and (2) multiple groups to form around different Hackathon projects. To form groups, all participants were encouraged to write a few words on three separate sticky notes describing a general topic they were interested in (e.g., topic modeling, extracting metadata, studying tweets, and analysis of Supreme Court nominations), the kind of dataset they wanted to work on (e.g., collected tweets or a dataset from the 2004/2008 elections), and what they wished to accomplish with the selected dataset.

Just a few of the tiny topics my group has been discussing so far for #hackarchives #webarchives #ethics pic.twitter.com/yOOiI6gUEv
— Jessica Ogden (@jessogden) June 14, 2016

Participants then grouped sticky notes with similar ideas together. After that, the initial groups were formed, and every group was given a few minutes to introduce its project idea. At the end of the first day, we all went to a restaurant to have dinner and socialize. Here is a list of the groups formed initially on the first day after the brainstorming session; I will explain later in some detail what each group accomplished:

Group Name	Members
Twitter Political Organization	Allie Kosterich, Nich Worby, John Berlin, Laura Wrubel, Gregory Wiedeman
Mojitos	Nathalie Casemajor, Federico Nanni, Alexander Nwala, Sylvain Rocheleau, Jana Hajzlerova, Petra Galuscakova
Museum	Ed Summers, Emily Maemura, Sawood Alam, Jefferson Bailey
I Know What You Hid Last Summer	Mat Kelly, Shawn Walker, Keesha Henderson, Jaimie Murdock, Jessica Ogden, Ramine Tinati, Niko Tsakalakis
The Supremes	Nicholas Taylor, Ian Milligan, Jimmy Lin, Patrick Rourke, Todd Suomela, Andrew Weber
Team Turtle	Mohamed Aturban, Niel Chah, Steve Marti, Imaduddin Amin
Counter-Terrorism	Daniel Kerchner, Emily Gade
Campaign: Origins	Allison Hegel, Debra A. Riley-Huff, Justin Littman, Shawn M. Jones, Kevin Foley, Ericka Menchen-Trevino, Nick Bennett, Louise Keen

Day 2 (June 14, 2016)

Colleen Shogan, from the Library of Congress, declared the Hackathon open. Colleen mentioned that researchers who have questions about politics, history, or any other aspect of cultural memory really need specialists like us to help them access data available in different repositories such as the Internet Archive and the Library of Congress. She emphasized the importance of such events and finally thanked the people who made this event possible, including the organizers and the steering committee.

At LOC: "Promote the capacity for cultural memory", of which web archives are an important part #hackarchives pic.twitter.com/g3gyFsZXtl
— Shawn M. Jones (@shawnmjones) June 14, 2016

Matthew Weber presented the agenda of the day, including presentations, a brief tour of the Library of Congress, and revising the groups formed the day before. Matthew gave an example from his past dissertation work illustrating how difficult it is to use web archives to answer research questions without building tools. He stated that the purpose of this ongoing community is to build a common vision for web archive tool development, to help access and extract data, and to uncover important stories from web archives. Finally, Matthew listed several kinds of datasets available for the participants to work on. The 2004, 2008, and 2010 election data and the Supreme Court nominations are examples of such datasets.

.@docmattweber details what he hopes will be accomplished in the next couple of days at #hackArchives pic.twitter.com/yKu1VaQOfn
— Mat Kelly (@machawk1) June 14, 2016

Ian Milligan introduced Warcbase (slides), which was developed by a team of five historians, three computer scientists, and a network scholar. Ian showed how slow it is to browse web archives in the traditional way, by entering a URL in the Wayback Machine (remember that having to know the URL in advance limits what you can find in the archives). Warcbase goes beyond that: it can be used to access, extract, manage, and process data from WARC files (e.g., extracting names, locations, plain text, URIs, and more from WARC files and generating different formats like network graphs or metadata in JSON). Warcbase supports filtering data based on dates, domain names, languages, etc. In addition, Warcbase is scalable, which means it may run on a laptop, a powerful desktop, or a cluster. Users may run Warcbase through command line tools as well as an interactive web-based interface.

@ianmilligan1 enlightening us on warcbase #hackarchives pic.twitter.com/FdS2wiq8QF
— Matthew Weber (@docmattweber) June 14, 2016

#hackArchives @ianmilligan1 and "Moving Beyond the Wayback Machine for scholarly access" at https://t.co/yzash1xF1C pic.twitter.com/DOK7nacEJA
— Shawn M. Jones (@shawnmjones) June 14, 2016

#hackarchives using Spark Notebook to help folks who aren’t comfortable on the command line analyze data https://t.co/nWcnOCdZly
— Kate Zwaard (@kzwa) June 14, 2016

Jefferson Bailey and Vinay Goel from the Internet Archive presented the Archive Research Services Workshop. Jefferson mentioned that the Internet Archive focuses on collecting web resources and providing access to those collections. The Internet Archive does not allow researchers to access its infrastructure for intensive research such as data mining. The Internet Archive has huge web collections (about 13 terabytes), and it collects about a billion URIs a week. Jefferson also indicated that WARC files are huge and difficult to work with. Moreover, researchers might request huge collections in WARC format but end up using only a small portion. For those reasons, the Internet Archive is trying to support specific research questions: instead of providing data in WARC format, it will give users access to datasets in different formats like CDX, which consists of metadata about the original WARC files. Other formats include the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). Such formats yield much smaller datasets. For example, CDX is only one percent of the size of the WARC files.

"Data Mining Web Archives" presented by @jefferson_bail & @vinaygo discussing supporting research #hackarchives pic.twitter.com/PVVjaK03yx
— Shawn M. Jones (@shawnmjones) June 14, 2016

The @internetarchive team of @vinaygo @jefferson_bail giving a great overview of research services. #hackarchives pic.twitter.com/ol8xzA8yyU
— Ian Milligan (@ianmilligan1) June 14, 2016

Now @jefferson_bail on the three derivative datasets, WAT, LGA, WANEs. Good rundown here: https://t.co/CrDsrGeeQm. #HackArchives
— Ian Milligan (@ianmilligan1) June 14, 2016

Next, Vinay Goel from the Internet Archive continued on the same topic (Archive Research Services Workshop). Vinay gave a quick overview of ArchiveSpark, a tool that can help the community search and filter Internet Archive collections (e.g., by date, MIME type, and HTTP response code). A research paper about ArchiveSpark was accepted and will be presented at JCDL 2016.
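To make the kind of filtering described above concrete, here is a minimal sketch in plain Python. It is not ArchiveSpark and does not use its API. It assumes CDX lines in the common space-separated layout whose first five fields are URL key, 14-digit timestamp, original URL, MIME type, and HTTP response code; the sample lines, the `Capture` type, and the function names are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Capture(NamedTuple):
    urlkey: str      # canonicalized URL key
    timestamp: str   # 14-digit capture time: YYYYMMDDhhmmss
    original: str    # URL as it was crawled
    mimetype: str
    status: str      # HTTP response code, kept as text


def parse_cdx(lines: Iterable[str]) -> Iterator[Capture]:
    """Yield one Capture per data line, skipping the ' CDX ...' header line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):
            continue
        fields = line.split()
        if len(fields) >= 5:  # ignore lines too short to hold the five fields
            yield Capture(*fields[:5])


def filter_captures(
    captures: Iterable[Capture],
    year: Optional[str] = None,
    mimetype: Optional[str] = None,
    status: Optional[str] = None,
) -> Iterator[Capture]:
    """Keep only captures matching every criterion that was supplied."""
    for c in captures:
        if year is not None and not c.timestamp.startswith(year):
            continue
        if mimetype is not None and c.mimetype != mimetype:
            continue
        if status is not None and c.status != status:
            continue
        yield c


# Made-up sample data (hypothetical URLs, digests, and file names).
SAMPLE = """ CDX N b a m s k r M S V g
org,example)/ 20040102030405 http://example.org/ text/html 200 AAAA - - 1234 0 a.warc.gz
org,example)/logo.png 20040102030406 http://example.org/logo.png image/png 200 BBBB - - 2345 1234 a.warc.gz
org,example)/old 20081104000000 http://example.org/old text/html 404 CCCC - - 345 3579 b.warc.gz
"""

hits = list(
    filter_captures(
        parse_cdx(SAMPLE.splitlines()),
        year="2004",
        mimetype="text/html",
        status="200",
    )
)
```

Because only the small metadata records are read, a filter like this can narrow a collection down before any of the much larger WARC content is touched.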

Now @vinaygo giving a quick overview of ArchiveSpark, which you can check out at https://t.co/DWr7ADXHGW. Great user platform! #hackArchives
— Ian Milligan (@ianmilligan1) June 14, 2016

Archive Spark was announced by @vinaygo for web archive research #hackarchives https://t.co/nPK6oGU0Ud pic.twitter.com/IZE0IxNU3o
— Shawn M. Jones (@shawnmjones) June 14, 2016

I really think the ArchiveSpark approach to leveraging CDXes for quick retrieval is bang on. @helgeho @vinaygo #hackArchives
— Ian Milligan (@ianmilligan1) June 14, 2016

ArchiveSpark: start with CDXs and get more data from archives as necessary. WARCBASE: Load everything from beginning & filter. #hackarchives
— Justin Littman (@justin_littman) June 14, 2016

Abigail Grotke and Andrew Weber introduced Library of Congress Data Resources. Abigail and Andrew are members of the web archiving team at the Library of Congress. They indicated that most of the crawling is done at the Internet Archive. The Library of Congress has been collecting web resources for more than 16 years and has made some collections available on the web. These collections are indexed, searchable (though not with full-text search), and accessible through the Wayback Machine. In addition, the Library of Congress archive supports Memento. Most of the collections cannot be accessed on the web due to copyright issues and permissions policy, so researchers must be physically present at the Library of Congress to access them.

Now @agrotke @atweber on the @librarycongress collections - thematic and event-based. #hackArchives pic.twitter.com/69xvT4rIXI
— Ian Milligan (@ianmilligan1) June 14, 2016

.@atweber and @agrotke discuss how LOC collects web archives, thematically & based on events #hackarchives pic.twitter.com/aaAq8pLRTo
— Jill Reilly James (@jillreillyjames) June 14, 2016

web content harvested by @librarycongress under range of rights conditions; complicates researcher data access, says @agrotke #hackarchives
— Nicholas Taylor (@nullhandle) June 14, 2016

.@atweber full of ideas for how we can use these collections, especially the Supreme Court nominations one... can't wait! #hackarchives
— Ian Milligan (@ianmilligan1) June 14, 2016

During the coffee break, we had the opportunity to take a short tour of the Library of Congress Jefferson Building, which was the first building in DC with electricity and an elevator in use.

Touring the Jefferson building - really glad #hackArchives can be here (before heading to Madison building soon). pic.twitter.com/pXjNPo7pwe
— Ian Milligan (@ianmilligan1) June 14, 2016

Touring the Jefferson building built in 1897 and first DC building with electricity @librarycongress #hackarchives pic.twitter.com/ngOncisgbf
— Shawn M. Jones (@shawnmjones) June 14, 2016

After the coffee break, each group briefly explained its project idea and the kind of dataset it was going to use. At this point, some participants moved to other groups where they found more interesting ideas.

Final batch of team forming at #hackArchives - wheelin' and dealin' in the shadow of Congress. pic.twitter.com/CPcV6pi1Fn
— Ian Milligan (@ianmilligan1) June 14, 2016

2 teams at #hackarchives using election tweets captured by @SocialFeedMgr.
— Justin Littman (@justin_littman) June 14, 2016

Over lunch, we listened to five-minute lightning talks. Nicholas Taylor (Stanford University) introduced WASAPI; Jefferson Bailey (Internet Archive) gave a short talk about Researcher Services; Ericka Menchen-Trevino (American University) presented Web Historian; Nathalie Casemajor, Petra Galuscakova, and Sylvain Rocheleau briefly explained NUAGE; Alexander Nwala (Old Dominion University) introduced Generating Collections for Stories and Events; and finally John Berlin (Old Dominion University) presented Are Wails Electric?

. @acnwala presents "Generating Collection for Stories and Events" as a lightning talk #hackarchives pic.twitter.com/dkytHbr9B9
— Shawn M. Jones (@shawnmjones) June 14, 2016

. @johnaberlin is presenting WAIL: detailed here https://t.co/pm6MKJNEEb #hackarchives pic.twitter.com/uVFsPw8KgM
— Shawn M. Jones (@shawnmjones) June 14, 2016

After that, the groups were assigned to different rooms based on the equipment each team needed for its project. Each group met and had the opportunity to work for about 5 hours on its project ideas (with a 30-minute coffee break after the first 2 hours). At the end of the second day, we all came together around 6 PM, and each group's representative gave an update on the team's progress. Then, all participants were invited to dinner.

team turtle working with @librarycongress election data #hackarchives pic.twitter.com/WtdcKg0kDb
— Jaime Mears (@JaimeMears) June 14, 2016

#hackArchives rooms are pretty quiet as people are hard at work pic.twitter.com/YUFYOjtuCn
— Abbie Grotke (@agrotke) June 14, 2016

Day 2 wrap up from the formidable teams @librarycongress #hackarchives 1 day to go to complete their inquiries! pic.twitter.com/fQwKRuob1T
— Jaime Mears (@JaimeMears) June 14, 2016

Day 3 (June 15, 2016)

Most of the last day was reserved for the groups to work intensively on their projects and produce their final results. From breakfast at 8:30 AM in the Madison Atrium until the end of the day at 6:30 PM, we worked on our projects, pausing only for the coffee break and lunch. Some participants gave five-minute lightning talks during lunch. The sound did not carry well in the Madison Atrium: Justin Littman stood on a chair to deliver his talk, yet he still could not be heard clearly. For this reason, I will only briefly mention what those talks were about.

tracking the provenance of a tweet with @justin_littman - creation, collection, and selection #hackarchives pic.twitter.com/1gfxZUNawC
— Jaime Mears (@JaimeMears) June 15, 2016

Laura Wrubel, Daniel Kerchner, and Justin Littman from George Washington University presented an introduction to the new Social Feed Manager, a sampling of research projects supported by Social Feed Manager, and the provenance of a tweet (as inspired by web archiving). Sawood Alam from Old Dominion University introduced MemGator – A Memento Aggregator CLI and Server in Go. Jaimie Murdock from Indiana, Polygraphic and Polymathic presented Into Thomas Jefferson’s Mind. Finally, Mat Kelly from Old Dominion University gave a short talk about Exploring Aggregation of Personal, Private, and Institutional Web Archives.

2 more awesome presentations from the memento team - always impressive research #hackarchives @ibnesayeed @machawk1 pic.twitter.com/zAqerrqtT7
— Matthew Weber (@docmattweber) June 15, 2016

thx to the social feed manager team for sharing their work at #hackarchives @liblaura @justin_littman @DanKerchner pic.twitter.com/l2Pj0E39yR
— Matthew Weber (@docmattweber) June 15, 2016

Final presentations

At the end of the day, each group presented the findings of the project it had been working on over the previous two days:

#hackarchives and we’re on to to the final presentations!! pic.twitter.com/TVBQvhCIfF
— Matthew Weber (@docmattweber) June 15, 2016

Mojitos (Slides)

The team's goal was to detect and track the events discussed by media outlets at opposite poles of Cuban news coverage. This was done by processing news data from the state-controlled Cuban outlet Granma and from el Nuevo Herald, a Florida-based outlet that caters to a Cuban audience.

team mojito at #hackarchives - challenges of analyzing the cuban web domain —> “spanish!” but well done! pic.twitter.com/Q9mBmBP2J1
— Matthew Weber (@docmattweber) June 15, 2016

Campaign: Origins (Slides)

Using tweets with #election2016, @realDonaldTrump, and @HillaryClinton, this team searched for narratives using the content of the web pages linked to from these tweets, rather than just the tweets themselves. The tweets were collected on June 14 - 15. The team's intention is to use the Internet Archive's Save Page Now feature to capture the web pages as they are tweeted so that such a study can be repeated on a larger set of tweets in the future. They produced the following streamgraph.

... and our last #hackarchives team looked at where the topics in election-based Twitter conversations originate! pic.twitter.com/wXsG2sJtBD
— Ian Milligan (@ianmilligan1) June 15, 2016

The Supremes (Slides and more details)

This group analyzed web archive data about the Supreme Court nominations of Justice Alito and Justice Roberts, provided in ARC format by the Library of Congress. The dataset is 92 GB and contains 2.2 million pages about both Alito and Roberts. The team's goal was to explore and analyze the data and to generate further research questions. They used Warcbase to extract datasets from the ARC files. In addition, Warcbase can produce files in a format that can be opened directly in other platforms such as Gephi.

The 'Supremes,' who dug into the Roberts and Alito Supreme Court nominations. #hackArchives pic.twitter.com/4Xvx7znpoz
— Ian Milligan (@ianmilligan1) June 15, 2016

I Know What You Hid Last Summer (Slides)

The team took Twitter datasets from the UK and Canadian Parliament members, identified the deleted tweets, noted which tweets contained links, checked if those links died after the tweet was deleted, and tried to derive meaning from the deletion. Further visualization was also done.

Team I Know What You Hid Last Summer -> analysis of deleted content from UK Parliament pages #hackarchives pic.twitter.com/JhywfCBUhF
— Matthew Weber (@docmattweber) June 15, 2016

Museum

This team analyzed CDX files from the Internet Archive's IMLS Museums crawl, which consists of over 219 million captures. They also used the Museum Universe Data File from IMLS to enrich their findings. They evaluated the proportion of the various content types (such as images or PDFs) that were crawled, and they quantified the term frequencies in the URLs of each content type. Additionally, they presented the distribution of domain names in the collection as a hierarchical chart (a treemap). Part of their analysis is published on GitHub.
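As a rough illustration of the two measurements described above (content-type proportions and URL term frequencies per content type), here is a small Python sketch. It is not the team's code; the `(url, mimetype)` pairs are invented sample data standing in for fields read from CDX records, and splitting URL paths into lowercase alphabetic tokens is just one assumed way to define a "term".

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def content_type_proportions(captures):
    """Return {mimetype: share of all captures} for (url, mimetype) pairs."""
    counts = Counter(mime for _, mime in captures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mime: n / total for mime, n in counts.items()}


def url_term_frequencies(captures):
    """Return {mimetype: Counter of lowercase alphabetic tokens in URL paths}."""
    terms = defaultdict(Counter)
    for url, mime in captures:
        path = urlparse(url).path.lower()
        terms[mime].update(re.findall(r"[a-z]+", path))
    return terms


# Invented sample captures (hypothetical museum site).
captures = [
    ("http://museum.example/exhibits/dinosaurs.html", "text/html"),
    ("http://museum.example/images/dinosaurs.jpg", "image/jpeg"),
    ("http://museum.example/images/map.jpg", "image/jpeg"),
    ("http://museum.example/visit/map.pdf", "application/pdf"),
]

proportions = content_type_proportions(captures)
terms = url_term_frequencies(captures)

for mime, share in sorted(proportions.items()):
    print(mime, f"{share:.0%}", terms[mime].most_common(3))
```

Working from URLs and MIME types alone is what makes this feasible on CDX files when the WARC files themselves are too large to process.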

Next up: a team tackling the IMLS universe - American museums. WARCs too big, had to do CDX analysis. #hackarchives pic.twitter.com/LOZ5OcxDjf
— Ian Milligan (@ianmilligan1) 15 June 2016

imls museum data set team findings #hackarchives pic.twitter.com/klD7ecxluJ
— Jaime Mears (@JaimeMears) June 15, 2016

Counter-Terrorism (Slides)

This team collected 383,527 tweets (between 2013 and 2016) from 1,153 accounts of suspected extremists. Approximately 300 people are associated with these accounts. The tweets are in a mix of English and Arabic. The goal is to identify ISIS supporters by running the tweets through an ideology classifier.

Team Counter-Terrorism: small team —> powerful analysis of ISIS supporters & their tweets #hackarchives pic.twitter.com/6wluMLPkef
— Matthew Weber (@docmattweber) June 15, 2016

Team Turtle (Slides)

The team used a dataset from the 2004 Presidential Election provided by the Library of Congress. The dataset was collected on the day before the election, on election day, and on the day after the election. The goal of this team was to answer questions like (1) if one candidate spends more time talking about issues related to a particular state than the other candidate does, would this lead him to win the state? (2) would candidates give more time to the "swing" states than to others? and (3) what is the most important topic for each state? The dataset was available in ARC format, and Warcbase was used to extract text from those files. After that, the dataset was analyzed using techniques like the Stanford NER tagger to tag places, people, and organizations, and the LDA model and TF-IDF to identify topics. Finally, the team produced an interactive visualization using D3.js.
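To show how TF-IDF can surface the terms that distinguish one state's text from the others, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the team's pipeline: it uses one common TF-IDF variant (term frequency normalized by document length, times the natural log of the number of documents over the document frequency), and the per-state token lists are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score every term in every document with tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(term_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(term_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Invented token lists standing in for text extracted per state.
docs = {
    "ohio": "jobs jobs economy vote".split(),
    "florida": "medicare seniors vote".split(),
    "iowa": "farm ethanol farm vote".split(),
}

scores = tf_idf(docs)
for state in docs:
    print(state, top_terms(scores[state], k=2))
```

A term that appears in every document (here "vote") scores zero, which is why TF-IDF is useful for separating state-specific topics from campaign-wide vocabulary.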

Hack complete! Highs-five all around.#teamturtle #hackarchives pic.twitter.com/VzEpC9dHdd
— Steve Marti (@stevemarti25) June 15, 2016

So exciting to see use of our LC web archives in use at the #hackarchives ! So inspiring! pic.twitter.com/DXksowhrn2
— Abbie Grotke (@agrotke) June 15, 2016

Twitter Political Organization (Slides)

The team created a timeline of mentions in candidate tweets to donations for the Service Employees International Union (SEIU) on Twitter, as well as a graph of retweets per day for the candidates. They also performed sentiment analysis (with a Naive Bayes classifier) on the candidates' tweets in an attempt to see whether donation amounts over time correlated with how positively or negatively the candidates tweeted.
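For readers unfamiliar with Naive Bayes sentiment classification, the following Python sketch shows the basic idea. The report does not describe the team's implementation or training data, so everything here is an assumption: a multinomial Naive Bayes model with add-one smoothing, whitespace tokenization, and a tiny invented training set.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior, then add smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples, not real tweets.
train_texts = [
    "proud to stand with workers",
    "great rally thank you",
    "terrible deal for workers",
    "a sad and failing plan",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("thank you workers"))
print(model.predict("sad failing deal"))
```

Once each tweet has a label, the labels can be aggregated per day and compared against a donation timeline, which is the kind of correlation the team was looking for.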

first up, team Twit Political Orgs #hackarchives w a great analysis of trump pol tweets pic.twitter.com/FsMWU0rwn9
— Matthew Weber (@docmattweber) June 15, 2016

After all the groups presented their work, Jimmy Lin announced ArchivesUnleashed Inc. It is a Delaware non-profit corporation aiming to create knowledge around the scholarly use of web archives. The board of directors of this new organization includes:

Ian Milligan (assistant professor, Department of History, University of Waterloo)
Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University)
Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo)
Nathalie Casemajor (assistant professor, Department of Social Sciences, University of Québec)
Nicholas Worby (Government Information & Statistics Librarian, University of Toronto)

.@lintool announcing our new Delaware non-profit corporation - ArchivesUnleashed Inc. Seriously. #hackarchives pic.twitter.com/3WfksgIgQS
— Ian Milligan (@ianmilligan1) June 15, 2016

The winning team

Ian Milligan announced the winning team, Counter-Terrorism (congratulations to Daniel Kerchner and Emily Gade). In addition, the top four teams (Counter-Terrorism, Team Turtle, I Know What You Hid Last Summer, and Mojitos) were selected to present their work at the next day's event, Saving the Web.

Congrats to @DanKerchner and Emily Kaleh Gade on a winning #hackarchives project with tweets related to terrorism. pic.twitter.com/GplPRBDocf
— Laura Wrubel (@liblaura) June 15, 2016

... and #hackarchives comes to an end with closing remarks by Robert Shaffer, 24th Law Librarian of Congress. pic.twitter.com/2EEs2a4aTG
— Ian Milligan (@ianmilligan1) June 15, 2016

--Mohamed Aturban
