2020-08-10: StoryGraph, reading the news for three years, a look at the past and the future

Fig. 1: Three connected components from three different StoryGraphs, representing three different news stories. The first connected component represents the news story about North Korea considering firing missiles at Guam, the second, the royal wedding of Prince Harry and Meghan, and the third, AG William Barr's release of his summary of the Mueller Report.

A lot has happened in three years. We've seen threats of war, hurricanes Harvey/Irma/Maria, upsets in elections, a royal wedding, an impeachment, a pandemic, etc. For all these stories and many more, for three years, every 10 minutes, StoryGraph has been reading the news, generating news similarity graphs, and quantifying the level of attention news stories receive. August 8, 2020 marked three years since StoryGraph went live. In this blog post, I take a retrospective look at the studies StoryGraph has enabled and promise multiple services and studies for the future.

StoryGraph's Past: studies and services 

StoryGraph generates beautiful and interesting news similarity graphs like the figure above. But the most valuable part of StoryGraph is the news data it saves every 10 minutes and the RSS feeds it archives.

Archived RSS and News datasets

Every 10 minutes, StoryGraph extracts the first five news articles from the RSS feeds of 17 left, center, and right US news organizations. Next, it saves the RSS feeds to the Internet Archive. This means that for three years, StoryGraph has been archiving the RSS feeds (e.g., Washington Examiner RSS on August 9, 2020 at 3:54 PM ET) of these 17 news organizations, resulting in up to 2,680,560 mementos (archived copies) of publicly accessible RSS feeds. The count was calculated as follows: 17 US news organizations × 144 captures per day × 365 days per year × 3 years = 2,680,560 RSS captures. This count assumes that every request to save an RSS feed succeeds, which is not the case, so the actual count should be under 2.6 million. The archiving process of StoryGraph is the first effort I know of that persistently preserves the RSS feeds of these news organizations, and it provides a valuable dataset for other researchers.
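The arithmetic behind that upper bound can be reproduced in a few lines. This is only a sketch of the calculation quoted above; the `expected_mementos` helper and its `success_rate` parameter are hypothetical, added to show how failed save requests would lower the total, and no actual success rate is known from this post.

```python
# Upper bound on archived RSS captures, using the figures quoted in the text.
NEWS_ORGANIZATIONS = 17
CAPTURE_INTERVAL_MINUTES = 10
DAYS_PER_YEAR = 365
YEARS = 3

# One capture every 10 minutes gives 144 captures per day.
captures_per_day = (24 * 60) // CAPTURE_INTERVAL_MINUTES

# Assumes every save request succeeds, so this is a ceiling, not a measurement.
max_mementos = NEWS_ORGANIZATIONS * captures_per_day * DAYS_PER_YEAR * YEARS


def expected_mementos(success_rate: float) -> int:
    """Hypothetical helper: scale the ceiling by an assumed success rate (0 to 1)."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return round(max_mementos * success_rate)


print(captures_per_day)  # 144
print(max_mementos)      # 2680560
```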

In addition to the saved RSS feeds, every 10 minutes, StoryGraph saves news data locally. For each news article, it saves (e.g., news data captured on August 9, 2020 at 3:54 PM ET) the plaintext, entities, title, publication date, etc. The information from the news dataset, specifically the entities, is used to generate news similarity graphs. Figure 2 below shows an extract from a single news similarity graph. StoryGraph's longitudinal news dataset consists of graphs covering about 13,402,800 news article records (17 US news organizations × 5 news articles per news source × 144 captures per day × 365 days per year × 3 years).

Fig. 2: Illustration of some fields from a single StoryGraph news similarity graph JSON file showing the URL (link) of 1 out of 85 news articles, entities (e.g., PERSON, LOCATION, DATE), plaintext, etc. StoryGraph saves this data every 10 minutes for 85 news articles.
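To make the idea of an entity-based similarity graph concrete, here is a minimal sketch. It is not StoryGraph's actual algorithm: the Jaccard overlap measure, the 0.3 threshold, the record layout (`link` and `entities` keys), and the toy articles are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(first: set, second: set) -> float:
    """Share of entities two articles have in common (0 when both are empty)."""
    if not first and not second:
        return 0.0
    return len(first & second) / len(first | second)


def build_similarity_graph(articles, threshold=0.3):
    """Connect two articles when their entity overlap meets the threshold.

    Returns an adjacency dict mapping each article link to its neighbors.
    The threshold is an assumed value, not one taken from StoryGraph.
    """
    graph = {article["link"]: set() for article in articles}
    for first, second in combinations(articles, 2):
        if jaccard(first["entities"], second["entities"]) >= threshold:
            graph[first["link"]].add(second["link"])
            graph[second["link"]].add(first["link"])
    return graph


# Toy records; real StoryGraph files hold 85 articles with richer fields.
articles = [
    {"link": "https://example.com/a", "entities": {"Guam", "North Korea", "missile"}},
    {"link": "https://example.com/b", "entities": {"Guam", "North Korea", "Pacific"}},
    {"link": "https://example.com/c", "entities": {"Prince Harry", "Meghan"}},
]

graph = build_similarity_graph(articles)
for link, neighbors in graph.items():
    print(link, sorted(neighbors))
```

In this toy input the first two articles share two of four distinct entities and become neighbors, while the third stays isolated.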

365 dots in 2018 and 2019

Leveraging StoryGraph's longitudinal news dataset, we conducted two studies to identify the top news story every day in 2018 and in 2019. According to StoryGraph, in 2018, the Kavanaugh hearings were the top news story, with a Giant Connected Component (GCC) average degree of 25.85 on September 27, 2018. In 2019, the top news story was AG William Barr's release of his principal conclusions of the Mueller Report, with a GCC average degree of 22.93 on March 24, 2019.

Fig. 3: Top news stories for 365 days in 2018. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all 144 story graphs for a given day. The x-axis represents time; the y-axis represents the average degree of the selected GCC.

Fig. 4: 365 dots in 2019: top news stories for 365 days in 2019. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all 144 story graphs for a given day. The x-axis represents time; the y-axis represents the average degree of the selected GCC.
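The quantity plotted in these figures can be illustrated with a small sketch. It assumes a graph stored as an adjacency dict and uses the standard definition of average degree (sum of node degrees divided by node count, i.e., 2E/N); the function names and toy graph are hypothetical, and StoryGraph's own implementation may differ.

```python
from collections import deque


def connected_components(graph):
    """Return the connected components of an undirected adjacency dict."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:  # breadth-first traversal from the start node
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def average_degree(graph, component):
    """Sum of node degrees divided by the number of nodes (2E / N)."""
    return sum(len(graph[node]) for node in component) / len(component)


def top_component(graph):
    """Component with the largest average degree in one graph."""
    return max(connected_components(graph), key=lambda c: average_degree(graph, c))


# Toy graph: a triangle (a, b, c), a pair (d, e), and an isolated node (f).
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}

best = top_component(toy_graph)
print(sorted(best), average_degree(toy_graph, best))
```

Repeating this over all 144 graphs of a day and keeping the maximum would give one dot per day under these assumptions.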

StoryGraphToolkit (sgtk)

StoryGraphToolkit is a software suite created to query the StoryGraph longitudinal news dataset. Currently, the toolkit provides two utilities. The first utility, data, retrieves news similarity graphs for a defined date range. The second utility, maxgraph, retrieves the graph that contains the connected component with the largest average degree, in other words, the connected component associated with the "biggest news story" of the day (e.g., the three stories in Figure 1). maxgraph also clusters multiple connected components (each a representation of a news story) that belong to the same news story. The biggest news story according to maxgraph can be defined either as the story (collection of connected components) containing the connected component with the tallest peak (largest average degree), or as the longest story, the one that persists the longest (e.g., 10 hours) through the day.
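The two definitions of "biggest story" can be contrasted with a short sketch. Since sgtk is unreleased, nothing here reflects its real interface: the observation tuples, story labels, and function names are hypothetical, and the sketch assumes the clustering step has already assigned each connected component to a story.

```python
from collections import defaultdict

CAPTURE_INTERVAL_MINUTES = 10  # StoryGraph generates one graph every 10 minutes


def tallest_story(observations):
    """Story whose connected component reaches the highest average degree."""
    peaks = defaultdict(float)
    for story_id, _capture, avg_degree in observations:
        peaks[story_id] = max(peaks[story_id], avg_degree)
    return max(peaks, key=peaks.get)


def longest_story(observations):
    """Story that appears in the most captures during the day."""
    captures = defaultdict(set)
    for story_id, capture, _avg_degree in observations:
        captures[story_id].add(capture)
    return max(captures, key=lambda story_id: len(captures[story_id]))


def duration_minutes(observations, story_id):
    """Approximate time in the news: distinct captures times the interval."""
    seen = {capture for sid, capture, _ in observations if sid == story_id}
    return len(seen) * CAPTURE_INTERVAL_MINUTES


# Hypothetical (story_id, capture_index, average_degree) tuples for one day.
observations = [
    ("hearing", 0, 5.0),
    ("hearing", 1, 12.5),
    ("storm", 0, 3.0),
    ("storm", 1, 3.5),
    ("storm", 2, 4.0),
]

print(tallest_story(observations))              # hearing
print(longest_story(observations))              # storm
print(duration_minutes(observations, "storm"))  # 30
```

The toy data shows why the two definitions can disagree: a story can spike briefly with a high average degree while another persists longer at a lower level.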

The toolkit has not been released for public use since the StoryGraph dataset is not public yet. See the section titled "Public dataset and software" for plans to publish the StoryGraph dataset and toolkit.

SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration)

As part of his PhD research, Shawn M. Jones has been developing multiple important tools for producing and visualizing stories through the intelligent sampling of representative mementos from Web archive collections under the umbrella of the Dark and Stormy Archives Project (DSA, @StormyArchives). 

Shawn led the effort to develop SHARI (tech report) by combining StoryGraph, Hypercane, ArchiveNow, and Raintale. Hypercane is a DSA tool that provides multiple methods for selecting the "best" mementos. ArchiveNow, a tool developed by Dr. Mohamed Aturban, provides a single access point to push resources into six public Web archives for preservation.

SHARI combines these tools to visualize the biggest story for a given date. First, maxgraph returns the graph that represents the biggest news story for the day. Second, Hypercane uses ArchiveNow to save the URLs from the biggest news story. Third, from the list of URLs, Hypercane identifies the most common terms, entities, and the highest quality images for social media storytelling. Finally, Raintale (another DSA tool) uses the output of Hypercane to create a visualization of the biggest news story for the day (e.g., Figure 5).

Fig. 5: SHARI's biggest news story for June 1, 2020, highlighting protests surrounding the death of George Floyd and clashes between protesters and law enforcement. This SHARI story was generated from a single connected component (with the largest average degree) from a single graph (out of the multiple graphs that belong to the same story cluster) selected by sgtk's maxgraph.

StoryGraph's Future: studies and services

I believe we have only scratched the surface of the potential StoryGraph's longitudinal dataset offers. Consider the following promises for future services and resources.

Adding context to graphs

I envision services that add context to graphs. Currently, news similarity graphs are primarily processed as isolated slices of news activity. Adding context to graphs involves gluing together multiple slices that belong to the same news story. For example, the Coronavirus pandemic cannot be represented by just one StoryGraph since it has been in the news for multiple months. If we select and visualize a single StoryGraph (e.g., StoryGraph on March 23, 2020 at 5 AM EDT), there is currently no way of telling how long (e.g., hours, days, months) the story has been in the news. This is a hard problem because news stories are organic and evolve, which makes clustering them difficult. Adding context to graphs could help address questions such as: When did the Coronavirus pandemic begin to receive serious coverage from the media? How did the coverage change over time? Sgtk's maxgraph's story clustering capability is a step toward adding context to graphs.
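One simple way to think about gluing slices together is to chain connected components from consecutive graphs when they share enough articles. The sketch below is only an illustration under stated assumptions: it is not how maxgraph clusters stories, and the Jaccard overlap on article URLs, the 0.5 threshold, and the placeholder URLs ("u1", "v1", ...) are all hypothetical.

```python
def overlap(first: set, second: set) -> float:
    """Jaccard overlap between two sets of article URLs."""
    if not first and not second:
        return 0.0
    return len(first & second) / len(first | second)


def chain_stories(slices, threshold=0.5):
    """Link components in consecutive slices into stories.

    `slices` is a time-ordered list; each slice is a list of components,
    and each component is a set of article URLs. A component extends the
    best-matching story that was active in the previous slice, or starts
    a new story when no match meets the (assumed) threshold.
    """
    stories = []  # each story is a list of (slice_index, component)
    for index, components in enumerate(slices):
        for component in components:
            best_story, best_score = None, 0.0
            for story in stories:
                last_index, last_component = story[-1]
                if last_index != index - 1:
                    continue  # story was not active in the previous slice
                score = overlap(component, last_component)
                if score > best_score:
                    best_story, best_score = story, score
            if best_story is not None and best_score >= threshold:
                best_story.append((index, component))
            else:
                stories.append([(index, component)])
    return stories


# Three toy slices: one story drifts across all three, two others are brief.
slices = [
    [{"u1", "u2", "u3"}, {"v1", "v2"}],
    [{"u2", "u3", "u4"}],
    [{"u3", "u4", "u5"}, {"w1", "w2"}],
]

stories = chain_stories(slices)
for story in stories:
    print([index for index, _ in story])
```

Note that the first and last components of the long story share only one article, which is the kind of gradual drift that makes story clustering hard; chaining only consecutive slices is one possible design choice, and it would split a story that disappears for a single capture.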

Public dataset and software

It has always been our intent to make the StoryGraph news similarity dataset publicly available. The current concern is whether we can share large datasets (e.g., Figure 2) consisting of the plaintext of news articles without violating copyright. We welcome feedback from legal experts or those with relevant experience. After resolving the copyright problem, we will release the full news similarity dataset and the StoryGraphToolkit to query it. We welcome researchers to use this dataset.

@StoryGraphBot

The "Bot" in "StoryGraphBot" (the Twitter account of StoryGraph) has not been realized. So far, tweets from that account have come from a human (me), not a Twitter bot. I envision implementing a bot that tracks developing news stories and tweets updates reporting how they evolve. Please follow @StoryGraphBot to receive those updates.

-- Alexander C. Nwala (@acnwala, @storygraphbot)
