2020-04-30: Archives Unleashed: New York Datathon Report (From Home Edition)




The Archives Unleashed Datathon is a two-day event hosted by the Archives Unleashed team where participants from different research backgrounds collaborate to explore web archive collections. The fourth Archives Unleashed datathon, organized in partnership with Columbia University Libraries, was supposed to take place in New York City. However, as COVID-19 cases began to rise, the organizers had to make the tough decision to cancel the New York datathon.



In the same email that brought the news of the cancellation, Ian Milligan also mentioned the possibility of holding the event online through Zoom and Slack. Within a few weeks, the Archives Unleashed team designed an online version of the datathon and provided us with a new schedule. Samantha Fritz wrote a detailed report on how the team moved this event online. I appreciate the efforts of Samantha Fritz, who stayed connected with us participants throughout this transition and answered all of our questions about the event and the canceled travel expenses.



The online datathon was designed to be flexible; the mantra was “to go with the flow”. It was considerate of the difficulties of working from home. Seven organizers and 13 researchers from different backgrounds joined the datathon. Over the years, members of the Web Science and Digital Libraries research group (WS-DL) at Old Dominion University (ODU) have participated in several of the Archives Unleashed datathons (Vancouver 2018, Toronto 2018, London 2017, San Francisco 2017, Toronto-II 2016, Toronto-I 2016). This year, I (Kritika Garg) had the honor of representing the WS-DL research group.

Day 1 (March 26, 2020)

The day started with my phone buzzing with the datathon reminder. I got ready to attend the first Zoom meeting of the day. Ian Milligan started the datathon with an introductory presentation. He introduced the organizing team and briefly described the datathon’s objective, the importance of web archives, and WARC files. We were informed that we would be using Google Colab notebooks and Compute Canada VMs for computation. We also got an overview of the Archives Unleashed Project, which has been developing tools like the Archives Unleashed Toolkit and the Archives Unleashed Cloud to make it easier for users to access and analyze web archive collections. At the end of the introductory presentation, participants introduced themselves and talked about what they find interesting about web archives.
Afterward, Samantha Fritz and Alexander Thurman introduced the Ivy Plus datasets and Columbia University datasets (https://github.com/archivesunleashed/notebooks/tree/master/datathon-nyc#datasets). We got to work with the derivatives of these web archive collections thanks to Nick Ruest, who made them easily accessible through Dataverse and Zenodo. Nick Ruest gave us a demo on how to set up Google Colab notebooks. The organizers provided us with two sample Archives Unleashed notebooks: a PySpark notebook and a text analysis notebook.


The team formation process was very creative. Jimmy Lin explained how the teams would be formed. The AU team had prepared a Google Sheet with three color-coded columns: technique (red), question (yellow), and dataset (blue). The participants were encouraged to write down their ideas in these columns. For example, we wrote the dataset we were interested in exploring in the blue column, research questions that we could ask of a given dataset in the yellow column, and techniques we would use to analyze that dataset in the red column. After an idea was written in a cell of the sheet, participants interested in working on that idea left comments on the cell. The comments ranged from a simple “+1” indicating interest to descriptions of the contribution a participant would like to make. This step got team formation started, but the actual teams were formed later on Slack. Once ideas were flowing and clusters of participants had formed around them, we moved to the datathon's Slack channel for further communication. Each team that had enough members and a concrete research question created a separate Slack channel for its own communication. The ideas that could not gather enough interested participants were abandoned, and those participants migrated to other groups. Through this novel process of team formation, we ended up with four teams of 3-4 members each.


I worked with Kae Bara Kratcha, Francis Kayiwa, and Wei Yin on the Global Webcomics Web Archive. Our work was inspired by Kae’s idea of analyzing the presence of queer webcomics within the archived webcomics collection. Working with a team online was a great experience. We used different tools and techniques to manage working from home efficiently. We used Zoom and Slack for communication and a shared Google Doc to document our work. Francis inspired us to use the Pomodoro technique to manage our time. We started with the Google Colab notebooks provided by the organizers. These notebooks came pre-equipped with the Archives Unleashed Toolkit and made it very easy to import the derivative files generated by the Archives Unleashed Cloud. We focused on the text_analysis notebook first and tried to understand how it works. Our dataset was 30 GB in compressed form. When we tried to load it into the notebook, we got a warning that we were running out of space. Even when we were able to fully load the dataset, we were left with too little space to perform any kind of computation.

Later, at the Zoom “check-in” meeting, we learned that all the teams were facing a similar problem with the Colab notebooks. We learned that Google had recently capped usage limits and hardware availability and introduced a new paid monthly service called Colab Pro. This was why we were being restricted when trying to load the data into a Colab notebook. Ian suggested that we work with the smaller standard set of web archive derivative files produced by the Archives Unleashed Cloud and explore this dataset locally on our own machines. After the Google Colab disaster, everyone called it a day. We gave ourselves the task of exploring the smaller derivative files and meeting the next morning with a fresh mind and perspective.

Day 2 (March 27, 2020)

The next morning, we all met again over Zoom. Each team summarized its work, describing what it had found by exploring the dataset overnight and what it planned to do going forward. Ian and Nick were constantly troubleshooting our problems. They provided us with resources and answered our questions on Slack, even after hours. Day 2 of the datathon went by in a blink, as everyone rushed to answer their proposed research questions. Each team played around with Gephi, the network files, and the text datasets included in the standard derivatives. All the teams worked hard to gather data and create visualizations that could be presented in the final meeting.

Everyone met once again over Zoom at midday to discuss the progress so far. Ian Milligan informed everyone when and how the final presentations would happen. We were asked to submit our slides around half an hour before the final presentations. After the meeting, everyone worked on finalizing and submitting their presentations over the next hour. At 3:30 PM, the final presentations began.


Team 1 analyzed the Latin American and Caribbean Contemporary Art Web Archive collections. They explored the network data with Gephi. They reported that WordPress is a widely used platform among the artists, followed by Vimeo. They also found that red is the color most frequently used by the artists.



Team 2 worked with the Contemporary Composers Web Archives collections. They used tools such as PowerShell, Jupyter notebooks, and GREL functions to explore their dataset. They visualized the network graph using Gephi. They discovered that social media domains are prominent within the network, receiving a higher number of incoming links than other domains.
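To make the idea of a domain's prominence concrete, here is a minimal Python sketch that counts how many incoming links each domain receives in a link graph. It is only an illustration and not Team 2's actual workflow (they used Gephi); the function name `in_degree` and the sample edge list are hypothetical.

```python
from collections import Counter


def in_degree(edges):
    """Count incoming links per domain from (source, target) pairs."""
    counts = Counter()
    for _source, target in edges:
        counts[target] += 1
    return counts


# Hypothetical edge list: each pair means "source domain links to target domain".
sample_edges = [
    ("composer-a.example", "twitter.com"),
    ("composer-b.example", "twitter.com"),
    ("composer-a.example", "youtube.com"),
    ("composer-b.example", "composer-a.example"),
]

# Domains sorted from most to least linked-to.
ranking = in_degree(sample_edges).most_common()
print(ranking)
```

In this made-up sample, the social media domain ends up at the top of the ranking because two different sites link to it.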


Our team played around with the Global Webcomics Web Archive collection. Our interest was in exploring the queer content present within this dataset. We mainly used the UNIX command line to perform text analysis and Gephi to explore the network data. We presented a comparison between the top webcomics domains and the top queer webcomics domains. We also calculated the most frequently used queer words and found that "gay" and "polyamory" were the top two. We used these term frequencies to build a word cloud. We evaluated our top queer domains manually and found that some of them do not contain queer content but merely use queer words to attract traffic. We feel it would be interesting to perform image analysis on this collection, as that would reveal the real content of these webcomics.
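For readers who would like to try a similar term count, here is a minimal Python sketch of the idea. We did this step with UNIX command-line tools, so this is an equivalent illustration rather than our actual commands; the function name `term_frequencies`, the sample texts, and the short vocabulary list are all hypothetical.

```python
import re
from collections import Counter


def term_frequencies(texts, vocabulary):
    """Count how often each vocabulary term appears across the texts."""
    vocab = {term.lower() for term in vocabulary}
    counts = Counter()
    for text in texts:
        # Lowercase the text and split it into simple word tokens.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in vocab:
                counts[token] += 1
    return counts


# Hypothetical page texts standing in for the extracted full-text derivative.
sample_pages = [
    "A gay webcomic about found family.",
    "Polyamory and gay romance in space.",
    "A comic about cats.",
]
vocabulary = ["gay", "polyamory", "queer"]

freqs = term_frequencies(sample_pages, vocabulary)
print(freqs.most_common(2))
```

The resulting counts can then be passed to any word cloud tool that accepts a term-to-frequency mapping.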


Team 4: Stonewall  
Team 4 analyzed the Stonewall 50 Web Archives collections. They wanted to explore the top stories around the 50th anniversary of the 1969 Stonewall uprising. They focused on the corpus of Making Gay History mini-episodes. While analyzing the text, they found that the Stonewall collection has around 29K 🌈 rainbow emojis.



The completion of the final presentations marked the official end of the AU datathon. With that, everyone said their goodbyes and thanked one another for the amazing learning experience.

This fourth Archives Unleashed datathon was a unique one. Throughout the datathon, we were graced by surprise guests such as cute kids, cats, and dogs. I am glad I had the pleasure of being part of this unique experience, and hopefully I will get to meet these people in person sooner rather than later. I applaud the organizers for their hard work and their dedication to continuing this event online. The event felt like a bit of normalcy amidst the chaos. It brought me joy to work and collaborate with such a wonderful community.

-- Kritika Garg (@kritika_garg)

