2021-05-10: Chronicling the life-cycle of top news stories with StoryGraphBot

Fig. 1: Story Attention Dynamics chart illustrating the life-cycle of two top news stories from May 18, 2018 -- May 19, 2018. Each line (red or blue) represents a top news story. The x-axis represents time while the y-axis represents the average degree of Connected Components (representation of story). Within our window of observation, the Santa Fe High School Shooting story received peak attention on Friday May 18, 2018 at 4:40PM. This attention then waned, with the lowest point coinciding with the rise of a new story, the Royal Wedding of Prince Harry and Meghan Markle.

News stories are born expected or unexpected, big or small, compete for attention with sibling stories or enjoy the spotlight alone, live short or long lives, and exit through death or hibernation. 

Since August 2017, every 10 minutes, StoryGraph has been quantifying the attention given to news stories. In the past three years, we have seen threats of war, hurricanes Harvey/Irma/Maria, upsets in a senate election in Alabama, a royal wedding, an impeachment, a pandemic, upsets in two senate elections in Georgia, another impeachment, etc. For all these and many other stories, StoryGraph has been generating news similarity graphs which encase stories in Connected Components (CCs). The current interface of StoryGraph is useful for exploring the attention given to stories at specific times, or finding what stories were "trending" at a specific time. However, it is incapable (since this was not the design intent) of presenting news stories as a continuum: a collection of CCs across multiple graphs. As a result, we lose the context of the life-cycle of stories when browsing with StoryGraph. To address this limitation, we are happy to announce the deployment of @StoryGraphBot.
StoryGraphBot is a Twitter bot that runs every hour, tracking top news stories and creating tweet threads (collections of tweets) that report updates (rising/falling/same attention) of the stories. For example, the first of 13 tweets from the thread for the Royal Wedding story announces the breaking of the story ("Breaking story..."). The tweet also includes a representative title (Inside Prince Harry and Meghan Markle's royal wedding - ABC News) selected from the CC node with the highest degree. All tweets in the thread report the level of attention (attention score, AKA average degree) garnered by the story and its age at the time the tweet was posted. For example, the first tweet reported an attention score of 5.00 and an age of 1 hour, while the second tweet reported an update: a rising attention score of 5.40.
 
Recall that news stories are represented as Connected Components which reside in graphs. The attention given to a news story can be approximated by the average degree (y-axis in Fig. 1) of its Connected Component. StoryGraphBot stitches the multiple states of a story, represented by all its CCs (dots in Fig. 1) from different graphs, into stories (red/blue lines in Figs. 1, 2, & 5). As a result, the tweet threads created by StoryGraphBot provide an elegant summary of the life-cycle of a top news story, helping us see preludes, developments, and the various states (rising/falling/same) of attention received by the news story from 17 media organizations.
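To make the attention score concrete, here is a minimal sketch of how the average degree and the highest-degree node of a Connected Component could be computed from an edge list. It is not StoryGraph's actual code: the function names, the toy edge list, and the alphabetical tie-breaking rule are all assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(edges):
    """Group the nodes of an undirected graph into connected components."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components, adjacency


def average_degree(component, adjacency):
    """Average number of edges per node within one component."""
    return sum(len(adjacency[n]) for n in component) / len(component)


def representative_node(component, adjacency):
    """Node with the highest degree (ties broken alphabetically, an assumption)."""
    return max(sorted(component), key=lambda n: len(adjacency[n]))


# Toy graph: articles a-d form one story, x-y form another.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("x", "y")]
components, adjacency = connected_components(edges)
story = max(components, key=len)
print(average_degree(story, adjacency), representative_node(story, adjacency))
```

In this toy graph the larger component has four articles and four edges, so its average degree is 2.0, and article "a" (degree 3) would supply the representative title.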
Fig. 2: Story Attention Dynamics chart illustrating the life-cycle of two top news stories from January 6, 2021 -- January 9, 2021. The red line represents the Storming of the US Capitol story which was preceded by the Georgia Senate Election. This story lived for three days before resulting in the birth of the second top story, the Democrats discuss Second Impeachment of President Trump.

Even though StoryGraphBot runs hourly as time progresses, it can create threads for past stories. This capability enables us to replay the news cycle, in other words, to travel back in time to see how the news unfolded hour by hour. For example, Fig. 2 is a retrospective look at the Storming of the US Capitol story which was preceded by updates from the Georgia senate elections. The same figure also illustrates StoryGraph's capability for tracking top stories that live beyond a single day.
Top story definition and Activation degree
Given a collection of news stories represented by their respective CCs, there are two different ways the top news story can be selected. We could select the story with the highest average degree (magnitude) or select the story which lived longest (duration). StoryGraphBot uses the magnitude criterion since it tracks stories hourly. Also, using the duration criterion requires waiting for the end of the day to see what news story lived the longest. Shawn Jones's SHARI system already does this and tweets StoryGraphBot's top stories of the previous day.

Once a news story is selected as the top news story, if its average degree reaches a pre-defined threshold of 4.0 (the activation degree, shown as vertical lines in the figures), StoryGraphBot starts posting tweets for the story. This means StoryGraphBot would never create threads for a top news story that does not attain an average degree of at least 4.0 over the course of its lifetime. We determined the threshold of 4.0 empirically; in our observation, it marks the sweet spot where a story is no longer small but not yet big.

Once a news story becomes a top news story, it could lose that status to a newer and taller top story (a story with a higher average degree). However, StoryGraphBot does not abandon ex-top stories; it continues to report developments even after they have lost their status.
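The selection and activation rules above can be sketched as follows. This is only an illustration under stated assumptions, not the bot's actual implementation: the dictionary layout (`id`, `avg_degree`), the function names, and the use of `>=` for the activation check (the post says both "exceeds" and "at least 4.0") are ours.

```python
ACTIVATION_DEGREE = 4.0  # threshold stated in the post


def select_top_story(stories):
    """Pick the story with the highest average degree (magnitude criterion)."""
    if not stories:
        return None
    return max(stories, key=lambda s: s["avg_degree"])


def update_tracking(tracked, stories):
    """Start tracking the current top story once it reaches the activation degree.

    Stories already in `tracked` are never removed here, mirroring the rule
    that ex-top stories keep receiving updates.
    """
    top = select_top_story(stories)
    if top is not None and top["avg_degree"] >= ACTIVATION_DEGREE:
        tracked.add(top["id"])
    return tracked


tracked = set()
# Hour 1: the first story is on top and above the threshold.
update_tracking(tracked, [{"id": "story-1", "avg_degree": 5.2},
                          {"id": "story-2", "avg_degree": 3.1}])
# Hour 2: a taller story takes over; the first one stays tracked.
update_tracking(tracked, [{"id": "story-1", "avg_degree": 3.0},
                          {"id": "story-2", "avg_degree": 5.0}])
print(sorted(tracked))
```

After the second hour both stories are tracked: the newer story took over the top spot, while the earlier one keeps its thread.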

Decoding StoryGraphBot's Tweet/Thread structure
Recall that StoryGraphBot creates a tweet thread to report the breaking and developments of top news stories. The first tweet (parent tweet) of every thread announces the breaking of a new top story, while all subsequent tweets report developments over the life-cycle of the story.
Fig. 3: Seven different components of a StoryGraphBot tweet.

Fig. 3 labels the main components of a tweet thread posted by StoryGraphBot. The following is a summary of the seven main components:
  1. Story publication date: The story publication date (YYYY-MM-DD) refers to the date the news story was collected by StoryGraph, not the date the story was published by StoryGraphBot.
  2. Average degree of CC and meter: Recall that we represent news stories as Connected Components. The average degree is simply the average number of edges (lines) per node (news article) in the graph (our graph in this case is the Connected Component). An average degree of 4.0 means that, on average, every news article is connected to 4 other news articles. This simple metric helps quantify how connected a CC is, and we use it to approximate the level of attention given to a news story. One might ask, how big or small is an attention score (average degree) of 4.0? We provide a meter to address this question. The meter, represented by empty (○) and filled (●) dots, provides further perspective about the level of attention received by a top news story. The number of filled dots in the meter approximates the story's level of attention. So a story with an average degree greater than 4.0 but less than 5.0 would have a meter with 4 filled dots (|●●●●○○○○○○○○○○○○○○○○○○○○|). The meter can represent a maximum average degree of 24. Even though stories with average degrees exceeding 24 are possible, they are very rare.
  3. Approximate age: Just as the name implies, the approximate age refers to how long the story has persisted.
  4. Hashtags: There are currently two possible hashtags used to tag parent tweets of StoryGraphBot, namely, #SGBotBreakingNews and #SGBotTimetravel. The former is used to tag tweets for which the story publication date is the same as the tweet publication date; otherwise the latter is used.
  5. Representative title: Each top news story needs a name or a title to identify it. Since StoryGraphBot runs automatically, we do not have the luxury of employing human editors to craft precise titles for top news stories. Consequently, StoryGraphBot selects a representative title to label top stories and their respective update stories. The title given to a story is extracted from a single news article from the set of news article nodes that form the CC of the news story. This single article is the node that has the highest degree, in other words, the news article that has the most entities in common with other news articles. The title of a story could change as the story evolves. For example, the representative titles of the Storming of the US Capitol top story (red line) in Fig. 2 evolve as follows:
    01. Election Updates 2021: Read The Latest News On Georgia Senate Races, Electoral College Count...
    02. Election Updates 2021: Read The Latest News On Georgia Senate Races, Electoral College Count...
    03. Raphael Warnock Projected To Win Georgia Senate Seat
    ...
    13. Chuck Schumer Tells Americans Help Is On The Way Now That Democrats Control The Senate
    14. Mike Pence won’t throw out electoral votes for Joe Biden in Senate election certification
    16. US Capitol lockdown: Trump supporters storm the US Capitol
    
  6. Link to parent graph containing CC: Every story is represented by a Connected Component and every Connected Component is a child of a parent graph. Links to parent graphs are included in every tweet reporting the breaking or an update of a top news story. 
  7. Story update attention level: As a story develops over the course of its life-cycle, the attention it receives changes. StoryGraphBot reports the status of the attention received by a top news story relative to its previous state. The attention could rise (rising), fall (falling), or remain the same (same).
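Three of the components above (the meter, the hashtag, and the attention status) follow simple rules that can be sketched in a few lines. The sketch below is an illustration, not the bot's source code: the function names are hypothetical, and rounding the average degree down to get the number of filled dots is an assumption inferred from the example in item 2.

```python
import math
from datetime import date

METER_SIZE = 24  # maximum average degree the meter can represent


def render_meter(avg_degree):
    """Render the dot meter; filled dots = average degree rounded down (assumed)."""
    filled = min(METER_SIZE, max(0, math.floor(avg_degree)))
    return "|" + "●" * filled + "○" * (METER_SIZE - filled) + "|"


def pick_hashtag(story_date, tweet_date):
    """Breaking-news tag when the story and tweet dates match, else time travel."""
    if story_date == tweet_date:
        return "#SGBotBreakingNews"
    return "#SGBotTimetravel"


def attention_status(previous, current):
    """Compare the current attention score with the previous state."""
    if current > previous:
        return "rising"
    if current < previous:
        return "falling"
    return "same"


print(render_meter(4.5))
print(pick_hashtag(date(2021, 1, 6), date(2021, 5, 10)))
print(attention_status(5.00, 5.40))
```

With these rules, a story replayed months after its publication date gets the #SGBotTimetravel tag, and the jump from 5.00 to 5.40 mentioned earlier is reported as rising.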
Limitations of StoryGraphBot
Firstly, StoryGraphBot relies on a clustering algorithm that clusters multiple news articles into individual news stories. The idea of clustering is based on the assumption that each news article discusses a single news topic, but this is not always the case because a single news article can discuss multiple topics. Since our clustering algorithm assumes each article belongs to a single topic bucket, this assumption could lead to impurities: different topics within a single story. Additionally, our clustering algorithm clusters entities (e.g., people, locations, organizations) resident in the news articles. It is possible for news articles about different topics to share a common set of entities, also resulting in impurities within stories. Impurities manifest as stories with multiple topics. For example, the tweet thread below features a story with two topics described by these titles: Trump Loses Again As Facebook Oversight Board Upholds Ban and Liz Cheney chooses truth over power -- a lonely path in Trump's GOP.
Fig. 4 further illustrates the impurities problem during clustering by showing that the CC of the story can be partitioned into two communities discussing two different topics: House GOP considers replacing Liz Cheney and Facebook Oversight Board Upholds Trump Ban. Representative titles for stories with multiple topics are often misleading since the title only accounts for one of the multiple topics of the story.
 
Fig. 4: A story (represented by its connected component) in which different topics are being discussed. Representative titles for stories with multiple topics are often misleading since the title only accounts for one of the multiple topics of the story.
 
Secondly, even though StoryGraphBot clusters stories across multiple days, it is possible for a story to be fragmented for various reasons which are beyond the scope of this blog to explain. As Fig. 5 illustrates, the multi-day 2020 US Presidential Election story was fragmented into two parts (part 1 and part 2).

Fig. 5: StoryGraphBot can cluster top news stories across multiple days such as the 2020 US Presidential Election which was fragmented into two parts. However, a story can be fragmented (e.g., Election part 1 and Election part 2) for various reasons which are beyond the scope of this blog to explain.

Thirdly, in the current implementation of StoryGraphBot, once a top story is being tracked, we do not store any information about the second or third top stories. This is problematic because if the second top story becomes the new top story, we lose the early CCs for this story because they existed during the reign of the initial top story. This phenomenon is reflected in the sequential layout of the lines (e.g., Fig. 1), which incorrectly implies that the second top news story (blue line) existed only after the first (red line) stopped. We recognized this flaw shortly after deploying StoryGraphBot and will address it soon.

Fourthly, Twitter's 280-character limit restricts the amount of information we can include in a tweet. Consequently, we had to abbreviate, summarize, or entirely remove additional text to conform to this restriction. All of these sacrifice some of the clarity of the thread.

Potential utility of StoryGraphBot
Threads produced by StoryGraphBot could be useful in Journalism/Media research/education. Additionally, they could be a useful source for Web Archiving endeavors that require quality seeds to generate collections that memorialize important stories and events. 
 
We welcome suggestions about how to improve StoryGraphBot and encourage you to follow the Twitter account - @StoryGraphBot.

2021-06-08 Edit:
We have addressed the third limitation of StoryGraphBot: tracking one top story at a time. Currently, StoryGraphBot tracks the top five stories simultaneously, to avoid missing early states (CCs) of lower-ranked top stories that eventually become the new top story. However, it should be noted that Twitter threads are created only for stories that achieve an average degree of 4.0 sometime during their life-cycle. Fig. 6 illustrates this difference.
Fig. 6: Unlike the previous version, the current version of StoryGraphBot can track k top stories simultaneously, to avoid missing early CCs if for example, the second or third top story eventually becomes the new top story.
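As a rough sketch of the idea behind this change, the snippet below records the states of the k highest-ranked stories at every run and only afterwards decides which of them qualify for a thread. It is an assumption-laden illustration rather than the deployed code: the data layout and the function names are hypothetical, and only `heapq.nlargest` is a real standard-library call.

```python
import heapq


def top_k_stories(stories, k=5):
    """Return the k stories with the highest average degree, tallest first."""
    return heapq.nlargest(k, stories, key=lambda s: s["avg_degree"])


def record_states(history, stories, k=5):
    """Append the current attention score of each top-k story to its history."""
    for story in top_k_stories(stories, k):
        history.setdefault(story["id"], []).append(story["avg_degree"])
    return history


def stories_with_threads(history, activation_degree=4.0):
    """Stories that reached the activation degree at some point in their life."""
    return {sid for sid, scores in history.items()
            if max(scores) >= activation_degree}


history = {}
record_states(history, [{"id": "a", "avg_degree": 6.0},
                        {"id": "b", "avg_degree": 2.5}])
record_states(history, [{"id": "a", "avg_degree": 3.0},
                        {"id": "b", "avg_degree": 4.5}])
print(history["b"], sorted(stories_with_threads(history)))
```

Because story "b" was already being recorded while "a" was on top, its early state (2.5) is still available when it later crosses the activation degree.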
 
 -- Kritika Garg (@kritika_garg) and Alexander C. Nwala (@acnwala)
