2020-04-01: SHARI: StoryGraph Hypercane ArchiveNow Raintale Integration -- Combining WS-DL Tools For Current Events Storytelling

This screenshot from DSA Puddles demonstrates the story produced by the SHARI process for the largest StoryGraph component on March 23, 2020. Here we see news stories discussing the COVID-19 pandemic on that day.

My research focuses on summarizing existing web archive collections through social media storytelling. For this effort, we developed Raintale to tell the stories produced by a selection of mementos. Collections exist at various web archives, like Archive-It and the UK Web Archive. As shown by Klein et al., we can build collections of mementos by conducting focused crawling of web archives. Raintale works well for these cases involving existing mementos, but what if we want to make a story about live web resources, like current events from the news?

Nwala's StoryGraph for March 23, 2020. Here we see edges connecting the largest connected component - the biggest story of the day.
Nwala's StoryGraph for March 23, 2020, showing how one can highlight individual nodes and view their titles and sources.

WS-DL members have addressed other parts of the problem. Alexander Nwala’s research has centered on finding seeds within search engine result pages (SERPs), social media stories, and news feeds. As part of his news research, Nwala developed StoryGraph, a tool that analyzes multiple news sources every hour and automatically determines the news story or stories that dominate the media landscape at that time. Mohamed Aturban developed ArchiveNow, a tool that accepts live web URI-Rs and submits them to web archives to produce memento URI-Ms. I partnered with Alexander Nwala to discuss how to tie StoryGraph together with tools from the Dark and Stormy Archives Toolkit to produce stories summarizing the biggest StoryGraph story of a given day. To honor the WS-DL tools used to generate these stories, I named this the StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) process.
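The captions above describe the biggest story of the day as the largest connected component of the StoryGraph. As a rough illustration of that idea only, the sketch below finds the largest connected component of a small undirected graph. It is not StoryGraph's code: the function name, the edge-list input, and the sample node labels are all assumptions made for this example, and how StoryGraph decides which articles to connect is not covered here.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the set of nodes in the largest connected component.

    `edges` is an iterable of (node, node) pairs; this input format is an
    assumption for illustration, not StoryGraph's data model.
    """
    # Build an undirected adjacency map.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    largest = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return largest


# Hypothetical article identifiers: three linked articles and one separate pair.
sample_edges = [("a", "b"), ("b", "c"), ("d", "e")]
biggest_story = largest_connected_component(sample_edges)
```

In this toy graph the three linked articles form the biggest story, while the separate pair would be a smaller story.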

To experience these SHARI stories, we invite you to visit the DSA Puddles web site. That site contains the stories produced from this and other experiments of the Dark and Stormy Archives Project. For updates on the DSA Puddles site and other DSA projects, follow @StormyArchives on Twitter and @DarkAndStormyArchives on Facebook.

The DSA Puddles web site demonstrates stories produced by the Dark and Stormy Archives Project. In this screenshot of the current Page 2, the top row consists of StoryGraph's biggest story of the day. The bottom two rows contain links leading to example stories generated for our CIKM study. In the future, we will publish other types of stories to this site.

The SHARI process

The SHARI process for producing DSA web archive stories from StoryGraph.

StoryGraph-Hypercane-ArchiveNow-Raintale Integration (SHARI) is a storytelling process for automatically creating stories summarizing news for a day. Two of its components are not yet released: the StoryGraph Toolkit and Hypercane. SHARI consists of the following steps, shown in the diagram above:
  1. with the StoryGraph Toolkit, query the StoryGraph service for the rank r story of the day
  2. submit these URI-Rs to Hypercane's hc identify mementos command to convert URI-Rs into URI-Ms
  3. generate an entity report from the content of those URI-Ms by running Hypercane's hc report entities command
  4. generate a terms report from the content of those URI-Ms by running Hypercane's hc report terms command that itself calls Nwala's sumgram as a library
  5. generate an image report by processing all images referenced in these URI-Ms by running Hypercane's hc report image-data command
  6. order the URI-Ms by publication date with Hypercane's hc order pubdate-else-memento-datetime command
  7. consolidate the reports from steps #3 - #5 and the URI-M list from #6 to generate Raintale data in JSON format via Hypercane's hc synthesize raintale-story command
  8. run Raintale's tellstory command to generate a Jekyll HTML file for the day's rank r story, based on inputs that include the JSON story data from step #7
  9. publish to GitHub Pages
The script that brings this all together is available on GitHub.
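The flow of data through the steps above can be sketched as a chain of plain functions. This is a minimal sketch and not the SHARI script itself: every function name, URL, and data shape below is hypothetical, and each stub only stands in for a real command (the StoryGraph Toolkit, Hypercane's hc commands, Raintale's tellstory) whose actual options are not shown here. The one piece of real logic is the ordering idea from step 6: sort by publication date, and fall back to the memento-datetime when no publication date is known.

```python
from datetime import datetime


def query_storygraph(day, rank):
    """Step 1 stand-in: return URI-Rs for the rank-r story of the day."""
    return ["https://example.com/late-story", "https://example.org/early-story"]


def identify_mementos(urirs):
    """Step 2 stand-in: pair each URI-R with a made-up URI-M and datetimes."""
    return [
        # No publication date was found for this one.
        {"urim": "https://archive.example/20200323120000/" + urirs[0],
         "pubdate": None,
         "memento_datetime": datetime(2020, 3, 23, 12, 0)},
        {"urim": "https://archive.example/20200323130000/" + urirs[1],
         "pubdate": datetime(2020, 3, 23, 9, 30),
         "memento_datetime": datetime(2020, 3, 23, 13, 0)},
    ]


def sort_key(memento):
    """Use the publication date if known, else the memento-datetime."""
    if memento["pubdate"] is not None:
        return memento["pubdate"]
    return memento["memento_datetime"]


def order_mementos(mementos):
    """Step 6 idea: publication date first, memento-datetime as fallback."""
    return sorted(mementos, key=sort_key)


def synthesize_story(title, ordered, reports):
    """Step 7 stand-in: merge the reports and ordered URI-Ms into one dict."""
    return {
        "title": title,
        "reports": reports,
        "urims": [m["urim"] for m in ordered],
    }


def build_story(day, rank=1):
    urirs = query_storygraph(day, rank)
    mementos = identify_mementos(urirs)
    # Steps 3-5 (entity, term, and image reports) are stubbed as empty lists.
    reports = {"entities": [], "terms": [], "images": []}
    ordered = order_mementos(mementos)
    return synthesize_story(f"Rank {rank} story for {day}", ordered, reports)


story = build_story("2020-03-23")
```

Here the memento without a publication date is placed by its memento-datetime, so the article published at 09:30 sorts ahead of it.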

This process works because each component tries to be loosely coupled, have high cohesion, have explicit interfaces, and engage in information hiding. StoryGraph does not need to know about GitHub Pages to make this work. Each command passes data in the expected format to the next. For example, the StoryGraph Toolkit provides URI-Rs to Hypercane. Hypercane does not need to know about how StoryGraph generated them. Raintale receives story data in a JSON formatted file; it does not need to know that Hypercane produced it. MementoEmbed only works with single mementos, whereas Raintale can consider how to assemble the whole story. The diagram below indicates what each tool contributes to the story.
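To make the point about explicit interfaces concrete, here is a small sketch in which a consumer renders a story from JSON without knowing which tool wrote it. The field names (`title`, `elements`, `type`, `value`), the function names, and the URLs are assumptions chosen for illustration; they are not presented as Raintale's or Hypercane's actual format.

```python
import json


def produce_story_json():
    """Producer side (hypothetical): any tool could emit this JSON."""
    story = {
        "title": "Biggest story of 2020-03-23",
        "elements": [
            {"type": "link", "value": "https://archive.example/1/https://example.com/a"},
            {"type": "link", "value": "https://archive.example/2/https://example.org/b"},
        ],
    }
    return json.dumps(story)


def render_story(story_json):
    """Consumer side: depends only on the JSON shape, not on the producer."""
    story = json.loads(story_json)
    lines = [story["title"]]
    for element in story["elements"]:
        if element["type"] == "link":
            lines.append("* " + element["value"])
    return "\n".join(lines)


rendered = render_story(produce_story_json())
```

Because the renderer only reads the agreed-upon fields, the producer could be swapped for a different tool, or a hand-written file, without changing the consumer.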

A diagram displaying how each of these tools contributes to SHARI stories.


StoryGraph is a valuable resource that I believe has additional unrealized potential. While developing the SHARI process, I experimented with interesting dates from Nwala's "365 dots in 2018" and "365 dots in 2019." We are not only able to create stories for today or yesterday, but all the way back to August 8, 2017, when StoryGraph was first created. As seen below, we can see how the world has evolved each year on StoryGraph's birthday.

On StoryGraph's date of birth, the news was reporting on North Korea's nuclear weapons.
On StoryGraph's first birthday, the news was discussing the results of several US Congressional and gubernatorial primaries and other elections taking place on that date.

On StoryGraph's second birthday, the news was discussing the shootings in El Paso and Dayton, and their aftermath.

To get a perspective from another set of dates, we can see how the world has evolved each year on my birthday, shown below.

On my birthday in 2018, the biggest story of the day was about US President Trump's animosity toward the FBI's investigation of him.
On my birthday in 2019, the biggest story of the day was about presumptive Democratic presidential primary candidate Joe Biden choosing a running mate long before the Democratic primaries.

This year, the news on my birthday was about COVID-19 and Trump's response to the crisis.

SHARI produces a familiar yet novel method of viewing news for a day in the past. It is different from other storytelling services like Wakelet because SHARI is entirely automated. The stories produced by SHARI are different from services like Google News or Flipboard because a user did not customize the story topics. Because StoryGraph samples content from multiple sides of the political spectrum, the SHARI process can provide a visualization of articles not tied to one interest area or even a single side's terminology. Historians, journalists, and other researchers could use this method to get a glimpse of the biggest story on a given day.

SHARI is not without its issues. While it is clear how to use StoryGraph to produce the biggest news story of the day, we're still discussing how to produce and render the second biggest, third biggest, and other news stories for a given day. Some resources are skipped in the SHARI process, and it tries to complete its story despite this. For a variety of reasons, ArchiveNow cannot create mementos from some live web pages. Sometimes mementos are still being preserved by an archive and hence do not have the proper headers to be evaluated later in the process. Sometimes MementoEmbed unearths images that were never preserved and thus SHARI cannot evaluate them for the story. We are still working on fixing issues, such as better stopword choices for sumgrams, story page performance, and implementing the lazy loading of images in the final story. Our eventual goal is to have the SHARI process produce these StoryGraph stories weekly or possibly even daily.
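The skip-and-continue behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how Hypercane or ArchiveNow handle failures internally: `archive_all`, `fake_archive`, and the example URLs are hypothetical, and the fake archiver simply raises an error for one page to simulate a resource that cannot be preserved.

```python
def archive_all(urirs, archive_one):
    """Try to create a memento for each URI-R; record failures and move on."""
    mementos = []
    skipped = []
    for urir in urirs:
        try:
            mementos.append(archive_one(urir))
        except Exception as error:
            # Keep going so the story can still be built from what succeeded.
            skipped.append((urir, str(error)))
    return mementos, skipped


def fake_archive(urir):
    """Hypothetical archiver: fails for one page, succeeds for the rest."""
    if "unarchivable" in urir:
        raise RuntimeError("archive refused the capture")
    return "https://archive.example/20200323000000/" + urir


urirs = [
    "https://example.com/a",
    "https://example.com/unarchivable",
    "https://example.org/b",
]
mementos, skipped = archive_all(urirs, fake_archive)
```

Keeping a list of skipped resources alongside the successful mementos makes it possible to report what a story is missing instead of failing the whole run.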

The DSA Puddles site exists to showcase the stories produced by the Dark and Stormy Archives Project. The input for these stories could be Archive-It collections, or it might come from other sources, like StoryGraph. StoryGraph stories all use the same Jekyll and Raintale templates. Stories for other data sources may need different templates to help users better understand their content. Below are examples of other types of stories that exist on the site.

Here is an example story told with browser thumbnails, an example from our CIKM 2019 paper.
This is a reproduction of one of Yasmin AlNoamany's original human generated stories from Storify, re-created by submitting data from her original experiment to Hypercane and Raintale.
Embeds of the Tweets produced by Raintale as featured in the blog post where we introduced Raintale.

The SHARI process is possible due to the attempts by these tools to engage in loose coupling, high cohesion, explicit interfaces, and information hiding. Parts of the process would not be possible without the Memento standard. Most of these tools are available now, and the Dark and Stormy Archives project will release Hypercane later this year. SHARI is one of many different tool combinations possible with the output of the WS-DL group and the Memento standard. How can we improve the stories produced by SHARI? What other combinations can we build?


-- Shawn M. Jones