2018-05-15: Archives Unleashed: Toronto Datathon Trip Report


The Archives Unleashed team (pictured below) hosted a two-day datathon, April 26-27, 2018, at the University of Toronto’s Robarts Library. This time around, Shawn Jones and I were selected to represent the Web Science and Digital Libraries (WSDL) research group from Old Dominion University. This event was the first in a series of four planned datathons to give researchers, archivists, computer scientists, and many others the opportunity to get hands-on experience with the Archives Unleashed Toolkit (AUT) and provide valuable feedback to the team. The AUT facilitates analysis and processing of web archives at scale, and the datathons are designed to help participants find ways to incorporate these tools into their own workflows. Check out the Archives Unleashed team on Twitter and their website to find other ways to get involved and stay up to date with the work they’re doing.

Archives Unleashed datathon organizers (left to right): Nich Worby, Ryan Deschamps, Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz

Day 1 - April 26, 2018
Ian Milligan kicked off the event by talking about why these datathons are so important to the Archives Unleashed project team. For the project to succeed, the team needs to build a community, create a common vision for web archiving tool development, avoid black-box systems that nobody really understands, and equip the community with these tools so it can work as a collective.


Many presentations, conversations, and tweets during the event indicated that working with web archives, particularly WARC files, can be messy, intimidating, and really difficult. The AUT tries to simplify the process by breaking it down into four parts (a brief code sketch follows the list):
  1. Filter - focus on a date range, a single domain, or specific content
  2. Analyze - extract information that might be useful such as links, tags, named entities, etc.
  3. Aggregate - summarize the analysis by counting, finding maximum values, averages, etc.
  4. Visualize - create tables from the results or files for use in external applications, such as Gephi
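
For a sense of what that pipeline looks like in practice, here is a minimal sketch adapted from the kind of example found in the AUT documentation, assuming a Spark shell with the toolkit loaded (so the Spark context sc is already defined) and a hypothetical local archive file named example.warc.gz:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.warc.gz", sc) // load the web archive
  .keepValidPages()                   // filter: keep only valid HTML pages
  .map(r => ExtractDomain(r.getUrl))  // analyze: pull the domain from each URL
  .countItems()                       // aggregate: count pages per domain
  .saveAsTextFile("domain-counts/")   // visualize: export for external tools
```

Each line corresponds to one of the four steps above, which is a big part of the toolkit's appeal: the conceptual model and the code line up almost one to one.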


We were encouraged to use the AUT throughout the event to go through the process of filtering, analyzing, aggregating, and visualizing for ourselves. Multiple datasets were preloaded for us onto powerful virtual machines supplied by Compute Canada, in an effort to maximize the time spent working with the AUT instead of fiddling with settings and data transfers.



Now that we knew the who, what, and why of the datathon, it was time to create our teams and get to work. We wrote down research questions (pink), datasets (green), and technologies/techniques (yellow) we were interested in using on sticky notes and posted them on a whiteboard. Teams began to form naturally from the discussion, though not very quickly, until Jimmy and Ian stepped in to keep things moving.

I worked with Jayanthy Chengan, Justin Littman, Shawn Walker, and Russell White. We wanted to use the #neveragain tweet dataset to see if we could filter out spam links and create a list of better quality seed URLs for archiving. Our main goal was to use the AUT without relying on other tools that we may have already been familiar with. Many of us had never even heard of Scala, the language that AUT is written in. We had all worked through the homework leading up to the datathon, but it still took us a few hours to get over the initial jitters and become productive.

Scala was a point of contention among many participants. Why not use Python or another language that more people are familiar with and can easily interface with existing tools? Jimmy had an answer ready, as he did for every question thrown at him over the course of the event.

Around 5pm, it was time for dinner at the Duke of York. Rather than trying to get everyone up and running on their local machines, my team decided to enjoy dinner and come back fresh for Day 2.



Day 2 - April 27, 2018 
Day 2 began with what felt like an epiphany for our team:



In reality, it was more like:


 

Either way, we learned from the hiccups of the first day and began working at a much faster pace. All of the teams worked right up until the deadline to turn in slides, with a few coffee breaks and lightning talks sprinkled throughout. I'll include more information on the lightning talks and team presentations as they become available.

Lightning Talks
  • Jimmy Lin led a brainstorming session about moving the AUT from RDDs to DataFrames (a rough sketch of the difference appears after this list). Samantha Fritz posted a summary of the feedback, where you can join the discussion.
  • Nick Ruest talked about Warclight, a tool that helps with discovery within a WARC collection. He showed off a demo of it after giving us a little background information.
  • Shawn Jones presented the five minute version of a blog post he wrote last year that talks about summarizing web archive collections.
  • Justin Littman presented TweetSets, a service that allows a user to derive their own Twitter dataset from existing ones. You can filter by tweet attributes such as text, hashtags, mentions, date created, etc.
  • Shawn Walker talked about the idea of using something similar to a credit score to warn users, in real time, of the risk that content they're viewing may be misinformation.
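
Since the RDD-versus-DataFrames question came up repeatedly, here is a hypothetical contrast in plain Spark (not the AUT's actual API, which was still being designed at the time) showing the same domain count both ways, with a made-up input file of one domain per line:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder().appName("rdd-vs-df").getOrCreate()

// RDD style (how the AUT worked at the time): anonymous tuples and
// positional fields like _1 and _2
val rddCounts = spark.sparkContext
  .textFile("domains.txt")
  .map(d => (d, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

// DataFrame style: named columns and SQL-like operators, which are easier
// to read and let Spark optimize the query plan
val dfCounts = spark.read.text("domains.txt") // single column named "value"
  .groupBy("value")
  .count()
  .orderBy(desc("count"))
```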

At 3:30pm, Ian introduced the teams and we began final presentations right on time.

Team Make Tweets Great Again (Shawn Jones' team) used a dataset of tweets sent to @realdonaldtrump between June 2017 and the time of the event, along with tweets containing #MAGA from June to October 2017. A few of the questions they had were:

  • As a Washington insider falls from grace, how quickly do those active in #MAGA and @realDonaldTrump shift allegiance?
  • Did sentiment change towards Bannon before and after he was fired by Trump?

They used positive or negative sentiment (emojis and text-based analysis) as an indicator of shifting allegiance towards a person. There was a decline in the sentiment rating for Steve Bannon when he was fired in August 2017, but the real takeaway is that people really love the 😂 emoji. Shawn worked with Jacqueline Whyte Appleby and Amanda Oliver. Jacqueline decided to focus on Bannon for the analysis, Amanda came up with the idea to use emojis, and Shawn used twarc to gather the information they would need.


Team Pipeline Research used datasets made up of WARC files of pipeline activism and Canadian political party pages, along with tweets (#NoASP, #NoDAPL, #StopKM, #KinderMorgan). From the datasets, they were able to generate word clouds, identify the most frequently used image, perform link analysis between pages, and analyze the frequency of hashtags in the tweets. Through the analysis process, they also discovered that some URLs had made it into the collection erroneously.
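
The link analysis piece maps naturally onto the AUT. A sketch along the lines of the toolkit's documented link-structure example, using a hypothetical WARC file name, could produce an edge list for a tool like Gephi:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Build (crawl date, source domain, target domain) triples from page links,
// count how often each edge occurs, and keep edges seen more than five times
RecordLoader.loadArchives("pipeline-activism.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(link =>
    (r._1, ExtractDomain(link._1), ExtractDomain(link._2))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("pipeline-links/")
```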



Team Spam Links (my team) used a dataset of tweets with various hashtags related to the Never Again/March for Our Lives movement. The question we wanted to answer was: “What is the best algorithm for extracting quality seed URLs from social media data?” We created a Top 50 list of URLs tweeted in the unfiltered dataset and coded them as relevant, not relevant, or indeterminate. We then came up with multiple factors to filter the dataset by (users with/without the default Twitter profile picture, with/without a bio in the profile, user follower counts, including/excluding retweets, etc.) and generated a new Top 50 list for each filter. The original Top 50 list was then compared to each of the filtered Top 50 lists.
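
Our notebooks aren't linked here, but conceptually a single filtering pass looked like the following Spark sketch; the file name and follower threshold are made up, and the field names follow the standard Twitter API JSON:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("seed-urls").getOrCreate()

// Tweets as line-delimited JSON, one tweet object per line (assumed layout)
val tweets = spark.read.json("neveragain-tweets.jsonl")

val top50 = tweets
  .filter(col("user.default_profile_image") === false) // drop default-avatar accounts
  .filter(col("user.followers_count") > 100)           // assumed follower cutoff
  .filter(col("retweeted_status").isNull)              // exclude retweets
  .select(explode(col("entities.urls.expanded_url")).as("url"))
  .groupBy("url")
  .count()
  .orderBy(desc("count"))
  .limit(50)

top50.show(50, truncate = false)
```

Each variation on the filters produced a new Top 50 list to compare against the unfiltered baseline.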


We didn’t find a significant change in the rankings of the spam URLs, but we think that’s because there just weren’t that many in the dataset’s Top 50 to begin with. Running these experiments against other datasets and expanding the Top 50 to the Top 100 or more would likely yield more conclusive results. Special thanks to Justin and Shawn Walker for getting us started and doing the heavy lifting, Russell for coding all of the URLs, and Jayanthy for figuring out Scala with me.


Team BC Teacher Labour was the final group of the day, and they used a dataset from Archive-It about the British Columbia Teachers’ Labour Dispute. While exploring the dataset with the AUT, they created word clouds comparing word frequencies across multiple domains and network graphs showing named entities and how they relate to each other, among other visualizations. The most interesting visual they created was an animated GIF that rapidly cycled through the primary image from each memento, giving a good overview of the types of images in the collection.
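
The toolkit doesn't build the GIF itself, but a rough sketch of the extraction step, using the AUT's matchbox helpers and a hypothetical file name, might take the first image link on each page as a naive stand-in for its primary image:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("bc-teachers.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getUrl, ExtractImageLinks(r.getUrl, r.getContentString)))
  .filter(r => r._2.nonEmpty)    // keep pages that contain at least one image
  .map(r => (r._1, r._2.head))   // first image link as a naive "primary" pick
  .saveAsTextFile("primary-images/")
```

A separate script would then fetch those image URLs and stitch them together into the animated GIF.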



Team Just Kidding, There’s One More Thing was a team of one: Jimmy Lin. Jimmy was busy listening to feedback about Scala vs. Python and working on his own secret project. He created a new branch of the AUT running in a Python environment, enabling some of the things people were asking for at the beginning of Day 1. Awesome.



After Jimmy’s surprise, the organizers and teams voted for the winning project. All of the projects were great, but there can only be one winner and that was Team Make Tweets Great Again! I’m still convinced there’s a correlation between the number of emojis in their presentation, their team name, and the number of votes they received but 🤷🏻‍♂️. Just kidding 😂, your presentation was 🔥. Congratulations 🎊 to Shawn and his team! 

I’m brand new to the world of web archiving and this was my first time attending an event like this, so I had some trepidation leading up to the first day. However, I quickly discovered that the organizers and participants, regardless of skill level or background, were there to learn and willing to share their own knowledge. I would highly encourage anyone, especially if you’re in a situation similar to mine, to apply for the Vancouver datathon that was announced at the end of Day 2 or one of the 2019 datathons taking place in the United States.

Thanks again to the organizers (Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz, Ryan Deschamps, and Nich Worby), their partners, and the University of Toronto for hosting us. Looking forward to the next one!

- Brian Griffin
