Friday, November 30, 2018

2018-11-30: Archives Unleashed: Vancouver Datathon Trip Report

The Archives Unleashed Datathon #Vancouver was a two-day event from November 1 to November 2, 2018, hosted by the Archives Unleashed team in collaboration with Simon Fraser University Library and Key, SFU's big data initiative. This was the second in a series of Archives Unleashed datathons funded by The Andrew W. Mellon Foundation, and the first time that I, Mohammed Nauman Siddique of the Web Science and Digital Libraries (WS-DL) research group at Old Dominion University, traveled to a datathon.

Day 1

The event kicked off with Ian Milligan welcoming all the participants to the Archives Unleashed Datathon #Vancouver, followed by welcome speeches from Gwen Bird, University Librarian at SFU, and Peter Chow-White, Director and Professor at the GeNA lab. After the welcome, Ian talked about the Archives Unleashed Project: why we care about web archives, the purpose of organizing the datathons, and the roadmap for future datathons.
Ian's talk was followed by Nick Ruest walking us through the details of the Archives Unleashed Toolkit and the Archives Unleashed Cloud. For more information about the Archives Unleashed Toolkit and Cloud services, you can follow the team on Twitter or check their website.
For the purpose of the datathon, Nick had already loaded all the datasets onto six virtual machines provided by Compute Canada. We were given twelve dataset options, courtesy of the University of Victoria, the University of British Columbia, Simon Fraser University, and the British Columbia Institute of Technology.
Next, the floor was open for us to decide our projects and form teams. We arranged our individual choices on the whiteboard, with information about the dataset we wanted to use in blue, the tools we intended to use in pink, and the research questions we cared about in yellow. Teams formed quickly based on the datasets and the purpose of each project. The first team, led by Umar Quasim, wanted to work on the ubc-bc-wildfires dataset, a collection of webpages related to wildfires in British Columbia, to understand and find relationships between the events and media articles about wildfires. The second team, led by Brenda Reyes Ayala, wanted to work on improving the quality of archived pages using the uvic-anarchist-archives dataset. The third team, led by Matt Huculak, wanted to investigate the politics of British Columbia using the uvic-bc-2017-candidates dataset. The fourth team, led by Kathleen Reed, wanted to work on the ubc-first-nations-indigenous-communities dataset to investigate the history of First Nations indigenous communities and its discourse in the media.

I worked with Matt Huculak, Luis Menese, Emily Memura, and Shahira Khair on the British Columbia candidates dataset. Thanks to Nick, we had already been provided with derivative files for our dataset: a list of all the captured domain names with their archival counts, the text extracted from all the WARC files with basic file metadata, and a Gephi network-graph file. This was the first time the Archives Unleashed team had provided participating teams with derivative files, which saved us the hours of wait time we would otherwise have spent extracting all that information from the dataset's WARC files. We continued to work on our project through the day, with a break for lunch. Ian moved around the room to check on all the teams, motivate us with his light humor, and provide any help we needed to get going on our projects.

Around 4 pm, the floor opened for the Day 1 talk session. It started with Emily Memura (PhD student at the University of Toronto) presenting her research on understanding the use and impact of web archives. Emily's talk was followed by Matt Huculak (Digital Scholarship Librarian at the University of Victoria), who talked about the challenges libraries face in creating web collections using Archive-It. He emphasized the use of regular expressions in Archive-It and the problems they pose for non-technical librarians and web archivists. Nick Ruest presented Warclight, the latest service framework released by the Archives Unleashed team, followed by a working demo. Last but not least, I presented my research on Congressional deleted tweets: why we care about deleted tweets, the difficulties involved in curating the dataset for Members of Congress, and results on the distribution of deleted tweets across the multiple services that can be used to track them.

We called it a day at 4:30 pm, only to meet again for dinner at 5 pm at the Irish Heather in downtown Vancouver. At dinner, Nick, Carl Cooper, Ian, and I had a long conversation ranging from politics to archiving to libraries. After dinner, we parted ways to meet again fresh the next day.

Day 2

The morning of Day 2 at Vancouver greeted us with a clear view of the mountains across Vancouver harbour, a perfect start to the day. We continued on our project with the occasional distraction of taking pictures of the beautiful view that lay in front of us. We brainstormed over our Gephi network-graph and bubble-chart visualizations to understand the relationships between all the URLs in our dataset, and categorized the captured URLs into political party URLs, social media URLs, and the rest. While reading the list of crawled domains in the dataset, we discovered a heavy bias towards a particular domain, which accounted for approximately 510k of the approximately 540k mementos. The standout domain was owned by Brian Taylor, who ran as an independent candidate. To investigate the reason behind that bias, we parsed and analyzed the status codes from the response headers of each WARC record. We realized that of the approximately 540k mementos, only about 10k had status code 200 OK; the rest were 301s, 302s, or 404s. Examining all the URLs crawled for that domain led us to the conclusion that it was a calendar trap for crawlers.
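The status-code tally we ran over the response headers can be sketched in plain Python. This is a minimal illustration, not the exact code we ran: in practice the header blocks would be read out of the WARC files with a library such as warcio, and the sample responses below are hypothetical stand-ins.

```python
from collections import Counter

def tally_status_codes(header_blocks):
    """Count HTTP status codes from raw response header blocks.

    Each block is the header portion of an HTTP response as stored in a
    WARC response record, e.g. b"HTTP/1.1 404 Not Found\r\n...".
    """
    counts = Counter()
    for block in header_blocks:
        # The status line is the first CRLF-terminated line of the block.
        status_line = block.split(b"\r\n", 1)[0]
        parts = status_line.split()
        # parts[1] is the numeric status code, e.g. b"301".
        if len(parts) >= 2 and parts[1].isdigit():
            counts[parts[1].decode()] += 1
    return counts

# Hypothetical sample mirroring what we saw: mostly redirects and 404s.
sample = [
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n",
    b"HTTP/1.1 301 Moved Permanently\r\nLocation: /cal\r\n\r\n",
    b"HTTP/1.1 302 Found\r\nLocation: /cal?d=1\r\n\r\n",
    b"HTTP/1.1 404 Not Found\r\n\r\n",
    b"HTTP/1.1 404 Not Found\r\n\r\n",
]
print(tally_status_codes(sample))
# Counter({'404': 2, '200': 1, '301': 1, '302': 1})
```

Aggregating codes this way is what made the calendar trap visible: a healthy collection is dominated by 200s, while ours was overwhelmingly redirects and 404s.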

Word frequency counts for the most relevant topics in the BC Candidates dataset

During lunch, we had three talks scheduled for Day 2. The first speaker, Umar Quasim from the University of Alberta, talked about the current status of web archiving in their university library and discussed some of their future plans. The second presenter, Brenda Reyes Ayala, Assistant Professor at the University of Alberta, talked about measuring archival damage and the metrics to evaluate it, which she had discussed in her PhD dissertation. Lastly, Samantha Fritz talked about the future of the Archives Unleashed Toolkit and Cloud service, mentioning that starting in 2019, computations using the Archives Unleashed Toolkit will be a paid service.

Team BC 2017 Politics

We were the first to present, starting with a talk about the BC Candidates dataset at our disposal and the different visualizations we had used to understand it. We talked about the relationships between different URLs and their connections, and highlighted the crawler trap issue: approximately 510k of the dataset's 540k mementos came from a single domain, the result of a calendar crawler trap that became evident on analyzing all the URLs crawled for it. Of the roughly 510k mementos crawled from that domain, only six were 302s and seven were 200s, while the rest returned a status code of 404. In a nutshell, we had a meager seven mementos with useful information out of approximately 510k crawled for this domain, and the full dataset of approximately 540k mementos held only about 10k with relevant information. Based on our brainstorming over the two days, we summarized lessons learned and advice for future historians curating seeds for collections on Archive-It.

Team IDG

Team IDG started off by talking about the difficulty of settling on a final dataset, walking us through the different datasets they tried before choosing the one used in their project (ubc-hydro-cite-c). They presented visualizations of the top keywords by frequency count and of the relationships between keywords. They also highlighted the problem of extracting text from tables and talked about their solution, then walked us through the steps involved in plotting their events on a map: starting from the table of processed text, they geocoded their dataset and plotted it onto a map showing the occurrences of the events. They also showed how the events evolved over time by plotting a timeline onto the map.

Team Wildfyre

Team Wildfyre opened their talk with a description of their dataset and the other datasets used in their project. They talked about their research questions and the tools they used, and presented multiple visualizations showing top keywords, top named entities, and a geocoded map of the events. They also had a heat map of the distribution of captures across the domain names in their dataset. They pointed out that even when analyzing named entities in the wildfire dataset, the most talked-about entity during these events was Justin Trudeau.

Team Anarchy

Team Anarchy had split their project into two smaller projects. The first, undertaken by Ryan Deschamps, was about finding linkages between all the URLs in the dataset. He presented a concentric-circles graph showing the linkage between pages from depth 0 to 5, and found that following links from the base URL out to depth 5 led to a spam or a government website in most cases. He also talked about the challenges of extracting images from the WARC files and comparing them with their live-web counterparts. The second project, undertaken by Brenda, was about capturing archived pages and measuring their degree of difference from the live versions. She showed multiple examples with varying degrees of difference between archived pages and their live counterparts.

Once the presentations were done, Ian asked us all to write out our votes, with the winner decided by popular vote. Congratulations to Team IDG for winning the Archives Unleashed Datathon #Vancouver. For closing comments, Nick talked about what to take away from these events and how to build a better web archiving research community. After all the suspense, the next edition of the Archives Unleashed Datathon was announced.

More information about the Archives Unleashed Datathon #WashingtonDC can be found on their website or by following the Archives Unleashed team on Twitter.

This was my first time at an Archives Unleashed Datathon. I went with the idea of meeting, all under one roof, the researchers, librarians, and historians who propel the web archiving research domain. The organizers strike a balance by bringing different research communities, with diverse backgrounds and experience, together with the web archiving community. It was an eye-opening trip for me: I learned from my fellow participants about their work, how libraries build collections for web archives, and the difficulties and challenges they face. Thanks to Carl Cooper, Graduate Trainee at the Bodleian Libraries, Oxford University, for strolling around downtown Vancouver with me. I am really excited and look forward to attending the next edition of the Archives Unleashed Datathon in Washington, DC.

View of Downtown Vancouver
Thanks again to the organizers (Ian Milligan, Rebecca Dowson, Nick Ruest, Jimmy Lin, and Samantha Fritz), their partners, and the SFU Library for hosting us. Looking forward to seeing you all at future Archives Unleashed datathons.

Mohammed Nauman Siddique
