2022-06-27: Computation + Journalism 2022 Trip Report

The 2022 Computation + Journalism (C+J) Symposium was held in-person at Columbia University and online, June 9-11. The proceedings are now available with links to PDFs of the refereed submissions. I virtually attended several of the sessions on Thursday and Friday, but unfortunately missed the Saturday sessions. After the success of the fully online C+J 2021, the organizers again used the ohyay.co platform for the online attendees and speakers, and in general it worked great again. My only complaint is that the speakers' slides were sometimes too small because of the space given to speaker and panel video and other decorative elements (or maybe they were just too small to view on my laptop screen). Most of the sessions were parallel, so like last year, I had a little trouble deciding which sessions to attend; they all contained interesting work. I didn't take comprehensive notes, but I'll briefly link to the work from some of the sessions that I was able to catch. There were also some tweets about the conference using the #cj2022 hashtag.

Thursday, June 9 was workshop day. 

In the first set of workshops, I attended "AI for Everyone: Learnings from the Local News Challenge". This was a series of reports from the first cohort of the NYC Media Lab's AI and Local News Challenge. First up was Zhouhan Chen from NYU presenting Information Tracer. This is a nice interface to explore information spread across multiple platforms. You can search by URL, hashtag, or string and see how it's been spread across Facebook, Reddit, Twitter, YouTube, and Gab. This includes viewing the retweet and reply networks that have been detected. I found it interesting that the tool includes the relatively new Gab platform and also that it includes information about how the queried item is shared in Facebook groups. A Python API library for Information Tracer is available on GitHub. 
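To make that concrete, here is a rough sketch of what querying a cross-platform spread tracker like Information Tracer might look like; the endpoint, parameters, and response fields are placeholders I invented for illustration, not the tool's documented API (the real Python library is on GitHub).

```python
# Hypothetical sketch of querying a cross-platform spread tracker; the
# endpoint, parameters, and response fields are invented placeholders,
# not Information Tracer's documented API.
import requests

API_BASE = "https://tracker.example.org/api"   # placeholder endpoint
API_TOKEN = "YOUR_TOKEN"                       # placeholder credential

def spread_by_url(url):
    """Ask how a given URL has spread across platforms."""
    resp = requests.get(
        f"{API_BASE}/search",
        params={"query": url, "type": "url", "token": API_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

results = spread_by_url("https://example.com/some-news-story")
for platform in ("twitter", "facebook", "reddit", "youtube", "gab"):
    posts = results.get(platform, [])
    print(f"{platform}: {len(posts)} matching posts")
```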

Jessica Davis from Gannett next presented "Localizer: Impactful journalism at scale". The goal of this work is to help create localized news stories for distribution. The author would write a story template and then use natural language generation to insert local statistics. Jessica offered an example story from the Detroit Free Press with local housing market statistics. 
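The core idea, a reusable story template filled in with local statistics, can be sketched in a few lines of Python; the template text and numbers below are invented for illustration and are not Gannett's actual Localizer system.

```python
# Minimal sketch of the "story template + local statistics" idea behind
# Localizer; the template text and data values are invented for illustration.
from string import Template

story_template = Template(
    "Home prices in $county County rose $pct_change% over the past year, "
    "with a median sale price of $$${median_price}, according to $source."
)

# In a real pipeline these numbers would come from a local data feed,
# one row per market being localized.
local_markets = [
    {"county": "Wayne", "pct_change": 4.2, "median_price": "215,000",
     "source": "local MLS data"},
    {"county": "Oakland", "pct_change": 3.1, "median_price": "310,000",
     "source": "local MLS data"},
]

for market in local_markets:
    print(story_template.substitute(market))
```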

Christopher Brennan presented Overtone (at Substack), which provides a score that tries to assess the quality of local news. It uses original reporting, sourcing, and perspectives as features factoring into the score. 
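Overtone's actual model wasn't detailed in the talk, but the general idea of rolling article-level features into a single quality score can be sketched generically; the weights and feature values below are assumptions for illustration only, not Overtone's scoring.

```python
# Generic illustration of combining article features into a single quality
# score; the features, weights, and scale are assumptions, not Overtone's
# actual model.
def quality_score(features, weights=None):
    """Weighted combination of per-article feature scores in [0, 1]."""
    weights = weights or {
        "original_reporting": 0.5,
        "sourcing": 0.3,
        "perspectives": 0.2,
    }
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

article = {"original_reporting": 0.9, "sourcing": 0.6, "perspectives": 0.4}
print(f"quality score: {quality_score(article):.2f}")   # 0.71
```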

Swapneel Mehta from NYU wrapped up the workshop session by presenting SimPPL, a tool for newsrooms that helps to monitor engagement with news stories, and also provides information about how articles from other news sources perform so that newsrooms can see what others might be doing differently. The goal is to allow newsrooms to develop better strategies for distributing and sharing content. 

The second workshop I attended focused on tools from the Observatory on Social Media (OSoMe) at Indiana University, "Identifying social media manipulation with OSoMe tools". This was led by Kai-Cheng Yang and Christopher Torres-Lugo. They demoed several of OSoMe's popular social media tools:
  • Botometer - the highly popular tool that provides a score representing the likelihood that a given account is a bot
  • BotAmp - a new tool that helps users see what type of information bots are trying to amplify by comparing likely bot activity in two sets of tweets. A user can input two different queries and BotAmp will compare bot activity on the results (or the user's home timeline)
  • Hoaxy - a tool that visualizes the network of how a tweet spreads based on a hashtag, with nodes colored by the account's Botometer score.
  • Network Tool - similar to Hoaxy in that it allows for exploration of how information spreads on Twitter, but includes different data. A user can specify start and end dates and explore the network based on retweets/quotes, mentions/replies, or co-occurrence. The dataset comes from the OSoMe decahose archive with 30 billion tweets from the past 3 years.
OSoMe has made APIs available for several of these tools through RapidAPI (Decahose Archive API, Botometer Pro API, Hoaxy API).
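For example, checking a single account through the Botometer Pro API with the botometer-python package looks roughly like the sketch below; the keys are placeholders, and the exact fields in the response depend on the API version.

```python
# Sketch of checking one account with the Botometer Pro API via the
# botometer-python package; keys below are placeholders, and the response
# structure may differ across API versions.
import botometer

rapidapi_key = "YOUR_RAPIDAPI_KEY"
twitter_app_auth = {
    "consumer_key": "YOUR_TWITTER_CONSUMER_KEY",
    "consumer_secret": "YOUR_TWITTER_CONSUMER_SECRET",
}

bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key=rapidapi_key,
                          **twitter_app_auth)

result = bom.check_account("@OSoMe_IU")
# The Complete Automation Probability (CAP) is one summary of how likely
# the account is to be automated; fall back to the full result otherwise.
print(result.get("cap", result))
```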

Friday, June 10 was the first day of paper and invited sessions. 

In the "Conflict Reporting" Invited Session, Charlotte Godart from Bellingcat presented "Mapping Incidents of Civilian Harm in Ukraine". Related to our group's research interests, she talking about the importance of archiving because conflict content disappears. They used Mnemonic for forensic archiving, which hashes the links and saves the data on a server. They also developed their own archiver, auto-archiver, that utilizes Google Sheets. There are several archivers that are used, including one for the Wayback Machine, using their Save Page Now service. The results of their work are available at https://ukraine.bellingcat.com/, providing maps and context of incidents of civilian harm in Ukraine. Journalists from Texty.org.ua next presented "How data journalism is responding to the war: What can satellite images say? How do we detect disinformation? What other data can we use?". Roman Kulchynskyj, Peter Bodnar, and Illya Samoilovych discussed Russian disinformation and how their organization tracks this. They have an English Twitter account that summarizes their data journalism, @Textyorgua_Eng

The next session I attended was a papers session on "News Analysis". The first paper, "Studying Local News at Scale", was presented by Marianne Aubin Le Quéré from Cornell Tech, with co-authors Ting-Wei Chiang and Mor Naaman. The presentation was based on their paper "Understanding Local News Social Coverage and Engagement at Scale during the COVID-19 Pandemic" from ICWSM 2022. They gathered a large dataset of local news (available at https://github.com/stechlab/local-news-dataset, along with an interactive exploration website). Using this data, they analyzed local news coverage of COVID-19 in relation to county political affiliation and compared local vs. national effects. This work is related to Alexander Nwala's (WS-DL alum) work on the Local Memory Project (paper at JCDL 2017), which can be used to build collections of stories from local news sources. The next paper in the session was "Mining Questions for Answers: How COVID-19 Questions Differ Globally" by Jenna Sherman, Smriti Singh, Scott Hale, and Darius Kazemi from the Meedan Digital Health Lab and the Harvard T.H. Chan School of Public Health. They used a database of Bing search queries to investigate how people in different parts of the world were searching for information about COVID-19.
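To make that kind of query analysis concrete, here is a hypothetical sketch of grouping COVID-19 questions by country and topic; the file name, columns, and keyword lists are assumptions, not the authors' actual Bing dataset or method.

```python
# Hypothetical sketch of grouping COVID-19-related search queries by country
# and topic; file name, columns, and keyword categories are assumptions.
import pandas as pd

queries = pd.read_csv("covid_queries_sample.csv")   # assumed columns: country, query

topics = {
    "symptoms": ["symptom", "fever", "cough"],
    "vaccines": ["vaccine", "vaccination", "dose"],
    "treatment": ["treatment", "cure", "medicine"],
}

def label_topic(query):
    q = query.lower()
    for topic, keywords in topics.items():
        if any(k in q for k in keywords):
            return topic
    return "other"

queries["topic"] = queries["query"].apply(label_topic)
print(queries.groupby(["country", "topic"]).size().unstack(fill_value=0))
```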

The last paper I saw in this session was "Vulnerable Visualizations: How Data Visualizations Are Used to Promote Misinformation Online" by Maxim Lisnic, Marina Kogan, and Alexander Lex from the University of Utah. They defined a "vulnerable visualization" as a legitimate visualization that could be used to support a common misconception. They studied the use of visualization in misinformation and found that explicit manipulation of visualizations (i.e., creating false or misleading charts on purpose) was not the main factor in visual misinformation online. They reported that at least half of the data visualizations shared by COVID skeptics were existing charts from government agencies and news articles. The problem is that these skeptics were taking objectively true data visualizations and misinterpreting them, possibly willfully. This plays into the phenomenon of confirmation bias, where we interpret data to align with our existing beliefs. The authors proposed that one way to make visualizations less vulnerable to this type of manipulation is to specify additional variables on the chart itself that would directly refute the misconception.
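A toy example of that proposal: annotate the chart with the variable (here, a per-capita rate) that heads off the misreading the raw counts invite. The numbers below are invented.

```python
# Toy illustration of the paper's suggestion: add a variable to the chart
# itself that pre-empts a common misreading. Raw counts alone could suggest
# two groups are equally at risk, so per-capita rates are annotated directly
# on the bars. The numbers are invented.
import matplotlib.pyplot as plt

groups = ["Group A", "Group B"]
event_counts = [500, 450]           # similar raw counts...
populations = [1_000_000, 100_000]  # ...but very different group sizes

fig, ax = plt.subplots()
bars = ax.bar(groups, event_counts)
for bar, count, pop in zip(bars, event_counts, populations):
    rate = 100_000 * count / pop
    ax.annotate(f"{rate:.0f} per 100k",
                (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                ha="center", va="bottom")
ax.set_ylabel("Raw event count")
ax.set_title("Raw counts with per-capita rates annotated")
plt.show()
```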

During this session time, I also checked in on a parallel session, "AI and Investigations", with Bahareh Heravi (Univ of Surrey), Julia Angwin (The Markup), Meredith Broussard (NYU), and Hilke Schellmann (NYU). Julia talked about one of The Markup's pieces on Google (Google’s Top Search Result? Surprise! It’s Google), which I saw a couple of days later being cited by John Oliver on Last Week Tonight.

The final session I attended was an Invited Session on "COVID Reporting". The first presentation was "How The Economist estimated the pandemic's true death toll" by Sondre Ulvund Solstad and Martín González. They computed a metric of excess deaths, deaths observed minus deaths expected, to highlight the impact of COVID-19 around the world. They used data from countries with available mortality data to help predict excess deaths for countries without it (like China). Their work produced excess death estimates that are 3-4x the officially reported numbers. The code for the model they developed is available at https://github.com/TheEconomist/covid-19-the-economist-global-excess-deaths-model.
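As a back-of-the-envelope illustration of the excess-deaths metric (not The Economist's model, which is linked above), the calculation is simply observed deaths minus an expected baseline:

```python
# Back-of-the-envelope version of the excess-deaths metric: observed deaths
# minus deaths expected from a pre-pandemic baseline. The Economist's actual
# model (linked above) is far more sophisticated; these numbers are invented.
observed_deaths = 1_250_000                           # deaths recorded during the period
baseline_deaths = [1_010_000, 1_020_000, 1_030_000]   # prior-year totals

expected_deaths = sum(baseline_deaths) / len(baseline_deaths)
excess_deaths = observed_deaths - expected_deaths

print(f"expected: {expected_deaths:,.0f}")
print(f"excess:   {excess_deaths:,.0f}")
print(f"excess as a multiple of officially reported COVID deaths "
      f"(assuming 80,000 reported): {excess_deaths / 80_000:.1f}x")
```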
Next was "Managing the challenges of data reporting and visualization in year three of the COVID-19 pandemic at The New York Times" presented by Lisa Waananen Jones, Aliza Aufrichtig, and Tiff Fehr. This was a fascinating behind-the-scenes look at how they collect data and produce the NY Times COVID-19 case charts that everyone I know has depended on. They showed some of the initial charts that were developed in early Feb/Mar/Apr 2020 and how they evolved into the tracker charts and other visualizations we have today.
One thing they encountered, which I'd also heard about in an interview with someone from the CDC, is the patchwork state of health statistics in the US. The NY Times folks gathered data from individual counties, sometimes manually, because some counties reported data in non-standard formats, like infographics and PDFs. All of their data is available at https://github.com/nytimes/covid-19-data/, which is a great resource. They built hundreds of scrapers to grab the data from the different counties and even had to build monitoring for the scrapers to catch them when they inevitably broke (likely due to formatting changes on the websites). I hope some of these presentations will be made available later, because my data visualization students would greatly benefit from hearing about real-world challenges with gathering, cleaning, and analyzing real data.
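In the same spirit as the scraper monitoring they described, a generic health check for a fleet of county scrapers might look like the sketch below; the scraper functions and thresholds are invented, not the NY Times system.

```python
# Generic sketch of monitoring a fleet of county scrapers for breakage;
# the scraper functions and checks here are invented for illustration.
import datetime

def check_scraper(name, fetch):
    """Run one scraper and flag results that look broken."""
    try:
        rows = fetch()
    except Exception as exc:
        return [f"{name}: scraper raised {exc!r}"]
    problems = []
    if not rows:
        problems.append(f"{name}: returned no rows (site format change?)")
    latest = max((r["date"] for r in rows), default=None)
    if latest and latest < datetime.date.today() - datetime.timedelta(days=3):
        problems.append(f"{name}: no data newer than {latest}")
    return problems

def fake_county_scraper():
    # Stand-in for a real scraper that parses a county health dashboard.
    return [{"date": datetime.date.today(), "cases": 12}]

for issue in check_scraper("Example County", fake_county_scraper):
    print("ALERT:", issue)
```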
The final presentation I attended was "The Covid Financial Crisis" by Nick Thieme from The Atlanta Journal-Constitution (now with The Baltimore Banner). He used a machine learning approach to investigate bankruptcies in Georgia due to COVID-19. He noted the enormous amount of work it took to turn the publicly available data into something that could be analyzed.

Even though I wasn't able to attend all of the sessions, everything I saw was super interesting, and many of the works combined my own interests in web archiving, social media, visualization, disinformation, and current events. This is definitely a venue that I will continue to track.

-Michele