2020-06-03: Hypercane Part 1: Intelligent Sampling of Web Archive Collections


This image by NASA is licensed under NASA's Media Usage Guidelines

Yasmin AlNoamany experimented with summarizing a web archive collection by choosing a small number of exemplars and then visualizing them with social media storytelling. This is in contrast to approaches that try to account for all members of the collection. When I took over the Dark and Stormy Archives (DSA) project from her in 2017, the goal was to improve upon her excellent work. Her existing code relied heavily upon the Storify platform to render its stories, but Storify was discontinued in May 2018. We discovered that other platforms rendered mementos poorly, so we developed MementoEmbed to render individual surrogates and later Raintale to render whole stories. We discovered that cards are probably the best surrogate for stories. We now publish stories to the DSA-Puddles web site on a regular basis. Up to this point, we have relied upon sources such as Nwala's StoryGraph or human selection to generate the list of mementos rendered in our stories. Document selection is key to the entire process. What tool can we rely on to automate the selection of mementos for these stories and other purposes? Hypercane.

The goal of the DSA project: to summarize a web archive collection by selecting a small number of exemplars and then visualize them with social media storytelling techniques.

How do you sample mementos? What if there are thousands to choose from? What if we only want a small subset, like 28, for our story?
Hypercane does this work for you.
This image by Obsidian Soul is licensed under CC BY-SA 3.0.

Humans can choose mementos from a collection, but doing so is difficult if they are unfamiliar with the collection. In the appendix of the preprint version of Social Cards Probably Provide For Better Understanding of Web Archive Collections, we detail how difficult it is to understand a web archive collection with the existing Archive-It interface. The issue is scale. Most web archive collections consist of thousands of documents. In that same work, we show that most collections contain insufficient metadata to assist users in making choices about which mementos to sample. Depending on the algorithm chosen, Hypercane takes into account the structural features of the collection and the content of the collection's mementos to make its decisions.
The story for Archive-It Collection Novel Coronavirus (COVID-19), rendered by Raintale, but with input provided by Hypercane.

Hypercane automatically generates a sample from a collection of mementos. The screenshot above shows a story generated from an Archive-It collection. It was rendered by Raintale, but the striking image, entities, sumgrams, title, curator, and mementos were all discovered and selected by Hypercane. Hypercane is the entry point for the Dark and Stormy Archives' automated storytelling process. It relies upon other components of the DSA toolkit to provide it with information, but it performs the decision-making for memento selection, story striking image selection, metadata, and more.

Hypercane is the entry point to the storytelling process that includes the other tools in the DSA Toolkit.

Hypercane was designed to be as modular as possible so that storytellers can employ existing algorithms or build their own. This modularity creates complexity, so we also balance this complexity with features that group common functionality together. Hypercane cannot be described in a single blog post, so I will discuss it in three parts.

Usage

All items in the DSA Toolkit were designed for automation. With this in mind, Hypercane is a command line application, allowing it to be called in scripts and other automation workflows. The Hypercane command is hc.

It can be installed via pip as well as with Docker. More information on installing Hypercane can be found in its GitHub repository.

Inputs and Outputs

Hypercane supports several types of input across all of its commands. An input type is supplied with the -i argument. An output file is supplied with the -o argument. For each input type, the -a argument specifies the collection identifier or the file containing the input.

All Hypercane commands accept these values for the -i argument:
  • archiveit - an Archive-It collection
  • mementos - a tab-separated file containing a list of mementos identified by their URI-Ms
  • timemaps - a tab-separated file containing a list of TimeMaps identified by their URI-Ts
  • original-resources - a tab-separated file containing a list of live web resources identified by their URI-Rs
A hypothetical Hypercane workflow shows a user providing a list of TimeMap URI-Ts as input to a sample command. The output list of memento URI-Ms from that command can be used as input to subsequent commands, which can in turn feed others.
Most Hypercane commands will output a tab-separated file containing a list of mementos identified by URI-M. All Hypercane commands will also accept this file format, allowing most Hypercane commands to feed data into each other. As shown in the hypothetical diagram above, we can execute a sample action using one algorithm and feed the result into another sample action using a different algorithm. The resulting mementos can then be fed into an order action before finally feeding the result into a synthesize action to generate input for a Raintale story containing only those mementos. Many combinations of actions are possible.
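Because these commands share the same tab-separated URI-M format, a workflow is just a series of file-to-file transformations. The Python sketch below is my own illustration of that data flow, not Hypercane's actual code; the stage functions passed to `chain` are hypothetical placeholders for actions like sample or order:

```python
import csv

def read_urims(path):
    """Read URI-Ms from the first column of a tab-separated file."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f, delimiter="\t") if row]

def write_urims(urims, path):
    """Write URI-Ms to a tab-separated file, one per row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for urim in urims:
            writer.writerow([urim])

def chain(path_in, path_out, *stages):
    """Apply each stage (a function from a URI-M list to a URI-M list)
    in order, mirroring how one hc command's output file feeds the next."""
    urims = read_urims(path_in)
    for stage in stages:
        urims = stage(urims)
    write_urims(urims, path_out)
```

In Hypercane itself each stage is a separate hc invocation, which means the intermediate files can be inspected, edited, or archived between steps.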

Actions

Hypercane commands consist of one action with arguments that affect how that action is executed. Below is an example of Hypercane's main focus: the sample action.

# hc sample true-random -i archiveit -a 2950 -o myoutput.tsv -k 10

2020-04-23 09:28:15,154 [INFO] hypercane.actions.sample: Starting random sampling of URI-Ms.
2020-04-23 09:28:15,154 [INFO] hypercane.identify: processing input for type archiveit
2020-04-23 09:28:15,154 [INFO] hypercane.identify: discovering mementos for input type archiveit
2020-04-23 09:28:49,982 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:02,289 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,315 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,339 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,824 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,854 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,944 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,041 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,121 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,191 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,366 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,385 [INFO] hypercane.identify: discovered 30529 URIs
2020-04-23 09:29:04,391 [INFO] hypercane.actions.sample: Executing select true random algorithm
2020-04-23 09:29:04,392 [INFO] hypercane.actions.sample: Writing sampled URI-Ms out to myoutput.tsv
2020-04-23 09:29:04,392 [INFO] hypercane.utils: attempting to write 10 URIs with resource data to myoutput.tsv
2020-04-23 09:29:04,393 [INFO] hypercane.utils: fieldnames will be ['URI-R']
2020-04-23 09:29:04,398 [INFO] hypercane.actions.sample: Done sampling.

That command asks Hypercane to execute the sample action to randomly sample 10 mementos from Archive-It collection 2950 and write the output to myoutput.tsv.
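Conceptually, true-random sampling is straightforward: once the URI-Ms have been discovered, draw k of them uniformly without replacement. A minimal Python sketch of the idea (an illustration, not Hypercane's source):

```python
import random

def true_random_sample(urims, k, seed=None):
    """Draw k URI-Ms uniformly at random without replacement,
    the idea behind 'hc sample true-random ... -k 10'.
    If the input holds fewer than k URI-Ms, return them all."""
    if k >= len(urims):
        return list(urims)
    return random.Random(seed).sample(urims, k)
```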

Hypercane supports the following actions:
  • sample - for sampling from the collection with a preset algorithm
  • synthesize - for converting the input into other formats, like WARC, Raintale story JSON, or a directory of boilerplate-free text files
  • report - for generating datasets and other reports based on the documents specified in input, such as lists of entities, image scores, or collection metrics
  • identify - for generating lists of URI-Ms, URI-Ts, or URI-Rs from the documents specified in the input
  • filter - for filtering the documents specified in the input based on some criteria, such as removing duplicate documents
  • cluster - for clustering the documents specified in the input
  • score - for computing scores for the documents specified in the input based on some scoring algorithm, like BM25 or AlNoamany's scoring function
  • order - for sorting the documents specified in the input based on some criteria, such as publication date or the scores supplied by the score action
Information on each action is available via the --help argument:

# hc report --help

'hc report' is used print reports about web archive collections

    Supported commands:
    * metadata - for discovering the metadata associated with seeds
    * image-data - for generating a report of the images associated with the mementos found in the input
    * terms - generates corpus term frequency, probability, document frequency, inverse document frequency, and corpus TF-IDF for the terms in the collection
    * entities - generates corpus term frequency, probability, document frequency, inverse document frequency, and corpus TF-IDF for the named entities in the collection
    * seed-statistics - calculates metrics on the original resources discovered from the input
    * growth - calculates metrics based on the growth of the TimeMaps discovered from the input

    Examples:
    
    hc report metadata -i archiveit -a 8788 -o 8788-metadata.json -cs mongodb://localhost/cache

    hc report entities -i mementos -a memento-file.tsv -o entity-report.json

    hc report seed-statistics -i original-resources -a urirs.tsv -o seedstats.json

This blog post will focus on the sample and report actions.

Caching

To ensure that Hypercane commands can be executed in any order, there is no initial "load collection" command that must be executed first. Instead, Hypercane relies on caching to ease the load on web archives. It supports the common HTTP_PROXY and HTTPS_PROXY environment variables, so it can operate behind a caching proxy, and it also provides a fast database cache of its own.

All commands support the -cs (for cache storage) option that allows us to specify where to store the cache.

For example, to use a MongoDB cache with one of the above commands, do the following:

# hc sample true-random -i mementos -a selected-mementos.tsv \
     -o random-mementos.tsv -k 5 \
     -cs mongodb://localhost/MyCacheDB

Instead of using the -cs option with each command, one can also specify the database's URL with the HC_CACHE_STORAGE environment variable. For brevity, we will omit this option for the rest of this post.

Currently only MongoDB is supported for caching, but I am working with Sawood Alam to support more standard caches in the future.
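The effect of the cache can be illustrated with a toy stand-in. In the sketch below (my own illustration; Hypercane actually stores this data in MongoDB), an in-memory dictionary plays the role of the database: the first command to request a URI pays the download cost, and every later request reuses the stored copy:

```python
class MementoCache:
    """A toy URI-keyed cache illustrating the idea behind -cs.
    The fetch argument is any function mapping a URI to its content."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._store = {}          # stand-in for the MongoDB collection
        self.hits = 0
        self.misses = 0

    def get(self, uri):
        """Return cached content for uri, fetching it only once."""
        if uri in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[uri] = self._fetch(uri)
        return self._store[uri]
```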

Sampling


Hypercane's primary goal is to automatically produce an intelligent sample from a web archive collection. This web archive collection might be formally defined, like an Archive-It collection, or it might be informal, like a list of URI-Ms. Hypercane currently supports the following algorithms with the sample action:
  • true-random - samples k mementos from the input, randomly
  • filtered-random - removes off-topic mementos and near-duplicates, then samples k mementos from the remainder, randomly
  • dsa1 - executes an updated version of Yasmin AlNoamany's original sampling algorithm
  • alnoamany - another way to execute dsa1
Hypercane's sample action can execute an updated version of AlNoamany's Algorithm, which we have named DSA1.

Below are some examples of executing Hypercane's sample action on different inputs.

To filter out the off-topic and near-duplicate mementos from a list of URI-Ts in the file timemap-list.tsv and return a file filtered-mementos.tsv containing a list of 20 randomly sampled URI-Ms:
# hc sample filtered-random -i timemaps -a timemap-list.tsv -o filtered-mementos.tsv -k 20
To employ AlNoamany's Algorithm on Archive-It collection 13529 and write the output to selected-mementos.tsv:
# hc sample dsa1 -i archiveit -a 13529 -o selected-mementos.tsv
To randomly sample 5 mementos from the output of the previous command:
# hc sample true-random -i mementos -a selected-mementos.tsv -o random-mementos.tsv -k 5
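The filtered-random idea can be sketched in a few lines. In the illustration below, exact-duplicate detection via content hashing stands in for Hypercane's real off-topic and near-duplicate detection, which is considerably more sophisticated; `pages` maps each URI-M to its text:

```python
import hashlib
import random

def filtered_random_sample(pages, k, seed=None):
    """Illustrate 'filtered-random': drop mementos whose content
    duplicates an earlier memento, then randomly sample k of the
    survivors. pages maps URI-M -> document text."""
    seen = set()
    survivors = []
    for urim, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            survivors.append(urim)
    k = min(k, len(survivors))
    return random.Random(seed).sample(survivors, k)
```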
Hypercane makes heavy use of the Memento protocol to discover TimeMaps from original resources, and mementos from TimeMaps. If original resources are specified as an input, Hypercane will use the Memento protocol to first search for a memento and then fall back to calling ArchiveNow to mint new mementos if none exist for those resources. Thus, one can supply a list of URI-Rs and get a sample of URI-Ms from those URI-Rs.

This command supplies a list of original resource URI-Rs in a file named webpage-list.tsv. Hypercane will discover the associated mementos, sample them with AlNoamany's Algorithm, and then return the reduced set of memento URI-Ms in the file selected-mementos.tsv.

# hc sample dsa1 -i original-resources -a webpage-list.tsv -o selected-mementos.tsv
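Memento TimeMaps are typically served in the link format of RFC 7089. The simplified parser below is an illustration of how URI-Ms can be pulled out of such a TimeMap; it assumes the common one-link-per-line serialization, and production code should use a full link-format parser:

```python
import re

def urims_from_timemap(timemap_text):
    """Extract URI-Ms from a link-format TimeMap (RFC 7089) by
    keeping entries whose rel value includes 'memento'. Assumes
    one link per line, as archives commonly serialize TimeMaps."""
    urims = []
    for entry in timemap_text.split(",\n"):
        uri = re.search(r"<([^>]+)>", entry)
        rel = re.search(r'rel="([^"]*)"', entry)
        if uri and rel and "memento" in rel.group(1).split():
            urims.append(uri.group(1))
    return urims
```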
We will be adding more algorithms as we continue our research. Development on DSA2 is underway. My dissertation will involve evaluating several candidates to see which produces the best stories for different types of collections.

Reporting

Hypercane produces a few reports for use in storytelling and rudimentary collection analysis.
  • metadata - the metadata scraped from an Archive-It collection - output is a JSON file
  • image-data - provides information about all embedded images discovered in the input and ranks them so Raintale has a striking image for the story - output is a JSON file
  • seed-statistics - calculates metrics on the original resources discovered in the input, as mentioned in The Many Shapes of Archive-It - output is a JSON file
  • growth - calculates metrics on the collection growth, as mentioned in The Many Shapes of Archive-It - output is a JSON file
  • terms - provides all terms discovered in the input, including their frequency, document frequency, probability, and corpus-wide TF-IDF; if the --sumgrams option is also supplied then some of these features do not make sense, so only document frequency and term rate are provided - output is a tab-delimited file
  • entities - provides a list of all entities discovered in the input, including frequency, probability, and corpus-wide TF-IDF - output is a tab-delimited file
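The corpus statistics behind the terms and entities reports follow standard information-retrieval definitions. The sketch below shows one plausible formulation over tokenized documents; Hypercane's exact tokenization and formulas may differ:

```python
import math
from collections import Counter

def term_report(docs):
    """Compute per-term corpus frequency, probability, document
    frequency, IDF, and corpus-wide TF-IDF for a list of documents."""
    tf = Counter()   # term frequency across the whole corpus
    df = Counter()   # number of documents containing each term
    for doc in docs:
        tokens = doc.lower().split()
        tf.update(tokens)
        df.update(set(tokens))
    total = sum(tf.values())
    n_docs = len(docs)
    report = {}
    for term in tf:
        idf = math.log(n_docs / df[term])
        report[term] = {
            "frequency": tf[term],
            "probability": tf[term] / total,
            "document frequency": df[term],
            "idf": idf,
            "corpus tfidf": tf[term] * idf,
        }
    return report
```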
To acquire the metadata for collection 13529:
# hc report metadata -i archiveit -a 13529 -o metadata.json
To generate the named entities, their frequency, probability, document frequency, and corpus-wide TF-IDF for the mementos specified in story-mementos.tsv:
# hc report entities -i mementos -a story-mementos.tsv -o entities.json
To generate a list of sumgrams, along with metrics, for those same mementos:
# hc report terms -i mementos -a story-mementos.tsv -o sumgrams.json --sumgrams
To generate a report on all embedded images within those mementos, including ranking them by their features:
# hc report image-data -i mementos -a story-mementos.tsv -o imagedata.json
To generate a collection growth curve for Archive-It collection 13529 like the one below, run:
# hc report growth -i archiveit -a 13529 -o 13529-growthstats.json --growth-curve-file 13529.png
After running this command, the file 13529-growthstats.json will contain some statistics about the curve and 13529.png will contain a curve like the one shown below. Alsum et al. demonstrated the efficacy of these curves with samples from web archives in Profiling web archive coverage for top-level domain and content language. In an effort to understand the differences between collections, we adapted the growth curve concept to full web archive collections in The Many Shapes of Archive-It.
A growth curve for IIPC's Archive-It collection 13529 about COVID-19.

From this curve, we see that new resources (green line) are being added throughout the collection's lifespan and mementos (red line) are being captured on a regular basis. As we described in The Many Shapes of Archive-It, most collection curators specify all original resources at the beginning of the collection's life and thus their green line is shifted into the upper left corner. The visualization above demonstrates a lot of curatorial involvement with collection 13529 over its life. In contrast, the curve below comes from a different COVID-19 collection created by Kansas City Public Libraries. Note how that collection's curve shows a burst in activity later in its life, leading to the green line being shifted farther into the lower right quadrant.

A growth curve for Kansas City Public Library's Archive-It collection 13734, also about COVID-19.
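The data behind these curves can be derived from each memento's original resource (URI-R) and memento-datetime. The sketch below is a simplified rendering of the idea, not Hypercane's implementation; it produces the cumulative counts that the red (mementos) and green (original resources) lines plot:

```python
def growth_points(observations):
    """Given (memento_datetime, urir) pairs, return a list of
    (datetime, cumulative_mementos, cumulative_unique_urirs) points,
    the kind of data behind a collection growth curve."""
    points = []
    seen_urirs = set()
    mementos = 0
    for mdt, urir in sorted(observations):
        mementos += 1
        seen_urirs.add(urir)
        points.append((mdt, mementos, len(seen_urirs)))
    return points
```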


Hypercane can accept these reports as input to hc synthesize raintale-story in order to augment the input that will be rendered with Raintale.

Tying it all together to automatically generate a story summarizing Archive-It collection 13529


We will cover the synthesize action in more detail in the next blog post in this series, but here we provide a simplified example of how to combine the Hypercane (hc) and Raintale (tellstory) commands together to generate the story above for Archive-It collection 13529.
# cat create-story-for-archiveit-13529.sh
#!/bin/bash

export HC_CACHE_STORAGE=mongodb://localhost/cache13529

hc sample dsa1 -i archiveit -a 13529 -o story-mementos.tsv

hc report metadata -i archiveit -a 13529 -o metadata.json

hc report entities -i mementos -a story-mementos.tsv -o entities.json

hc report terms -i mementos -a story-mementos.tsv -o sumgrams.json --sumgrams

hc report image-data -i mementos -a story-mementos.tsv -o imagedata.json

hc synthesize raintale-story -i mementos -a story-mementos.tsv \
     --imagedata imagedata.json --termdata sumgrams.json \
     --entitydata entities.json --collection_metadata metadata.json \
     --title "Archive-It Collection" -o raintale-story.json

tellstory -i raintale-story.json --storyteller template \
     --story-template raintale-templates/archiveit-collection-template1.html \
     --generated-by "AlNoamany's Algorithm" \
     -o story-post.html

We combined these by way of a shell script. We could have combined all of this together into one Hypercane command, but the separation of commands allows for more flexible summarization and storytelling. The user has the option of analyzing the output of one of these commands separately (e.g., with Unix tools like grep, sort, or head) before reintegrating its output back into the process.

Summary and Comparisons

Hypercane is the latest tool in the DSA toolkit. It provides us with the ability to submit an Archive-It collection identifier, a list of mementos, a list of TimeMaps, or a list of original resources and sample a smaller number from them. It does so via the sample action. It also provides reports that are useful for storytelling via the report action. We are still adding new functionality, developing its documentation, and addressing defects, so please report issues to our issue tracker.

Hypercane's input is either an Archive-It collection identifier or a file containing URIs. The goal of Hypercane is to allow us to generate a list of URI-Ms for future use. Its reports are geared toward generating stories and evaluating the output. It does not support WARCs as input. For WARCs, tools such as ArchiveSpark and the Archives Unleashed Toolkit allow users to perform domain analysis, conduct more in-depth image analysis, and generate many different reports based on WARC content. The primary output of the Archives Unleashed Toolkit is aggregated data, not URI-Ms. Even though it generates some reports for storytelling and evaluation purposes, Hypercane's primary output is the URI-Ms of HTML pages. Hypercane was built with the assumption that the user wants to produce a sample of URI-Ms from a collection for which they only have web access. Thus Hypercane can be run on any public Archive-It collection and any web archive that supports the Memento protocol.

Hypercane is not a search engine because its goal is to help users produce a sample from a collection without knowing much about it a priori. It does not build or maintain indices and is not a replacement for Solr or anything like it. Producing the sample above for collection 13529 took more than four hours, 75% of which was downloading content, so Hypercane is not designed to respond quickly or provide anything close to real-time computing. Because Hypercane is archive-aware, it provides insights not possible with other search engine tools, but its focus is sampling for summarization not searching.

Hypercane was designed with modularity and interoperability in mind. It makes heavy use of the Memento protocol to identify and discover archived web pages within one archive or among several archives. Without the Memento Protocol this kind of exploration would not be possible. In the next blog post I will highlight how to use Hypercane with Raintale, Archives Unleashed Toolkit and other tools.

-- Shawn M. Jones
Acknowledgements:
I would like to thank Ian Milligan and Nick Ruest for their feedback on this blog post.
