2020-06-03: Hypercane Part 1: Intelligent Sampling of Web Archive Collections
This image by NASA is licensed under NASA's Media Usage Guidelines |
Yasmin AlNoamany experimented with summarizing a web collection by choosing a small number of exemplars and then visualizing them with social media storytelling. This is in contrast to approaches that try to account for all members of the collection. When I took over the Dark and Stormy Archives (DSA) project from her in 2017, the goal was to improve upon her excellent work. Her existing code relied heavily upon the Storify platform to render its stories, but Storify was discontinued in May 2018. We discovered that other platforms rendered mementos poorly, so we developed MementoEmbed to render individual surrogates and later Raintale to render whole stories. We discovered that cards are probably the best surrogate for stories. We now publish stories to the DSA-Puddles web site on a regular basis. Up to this point, we have relied upon sources such as Nwala's StoryGraph or human selection to generate the list of mementos rendered in our stories. Document selection is key to the entire process. What tool can we rely on to automate the selection of mementos for these stories and other purposes? Hypercane.
The goal of the DSA project: to summarize a web archive collection by selecting a small number of exemplars and then visualize them with social media storytelling techniques. |
How do you sample mementos? What if there were thousands to choose among? What if we only want a small subset like 28 for our story? Hypercane does this work for you. This image by Obsidian Soul is licensed under CC BY-SA 3.0. |
Humans can choose mementos from a collection, but doing so is difficult if they are unfamiliar with the collection. In the appendix of the preprint version of Social Cards Probably Provide For Better Understanding of Web Archive Collections, we detail how difficult it is to understand a web archive collection with the existing Archive-It interface. The issue is scale. Most web archive collections consist of thousands of documents. In that same work, we show that most collections contain insufficient metadata to assist users in making choices about which mementos to sample. Depending on the algorithm chosen, Hypercane takes into account the structural features of the collection and the content of the collection's mementos to make its decisions.
The story for Archive-It Collection Novel Coronavirus (COVID-19), rendered by Raintale, but the input was provided by Hypercane. |
Hypercane automatically generates a sample from a collection of mementos. The screenshot above shows a story generated from an Archive-It collection. It was rendered by Raintale but the striking image, entities, sumgrams, title, curator, and mementos were all discovered and selected by Hypercane. It is the entry point for the Dark and Stormy Archives' automated storytelling process. It relies upon other components of the DSA toolkit to provide it with information, but it performs the decision-making for memento selection, story striking image selection, metadata, etc.
Hypercane is the entry point to the storytelling process that includes the other tools in the DSA Toolkit. |
Hypercane was designed to be as modular as possible so that storytellers can employ existing algorithms or build their own. This modularity creates complexity, so we also balance this complexity with features that group common functionality together. Hypercane cannot be described in a single blog post, so I will discuss it in three parts.
- Hypercane Part 1: Intelligent Sampling of Web Archive Collections - this post
- Hypercane Part 2: Synthesizing Output For Other Tools - here I will discuss how Hypercane can generate output for Archives Unleashed Toolkit, Raintale, Gensim, and other tools
- Hypercane Part 3: Building Your Own Algorithms - this is where we highlight Hypercane's identify, filter, cluster, score, and order actions
Usage
All items in the DSA Toolkit were designed for automation. With this in mind, Hypercane is a command line application, allowing it to be called in scripts and other automation workflows. The Hypercane command is `hc`. It can be installed via pip as well as with Docker. More information on installing Hypercane can be found in its GitHub repository.
Inputs and Outputs
Hypercane supports several types of input across all of its commands. An input type is supplied with the `-i` argument. An output file is supplied with the `-o` argument. For each input type, the `-a` argument specifies the collection identifier or the file containing the input. All Hypercane commands accept these values for the `-i` argument:
- `archiveit` - an Archive-It collection
- `mementos` - a tab-separated file containing a list of mementos identified by their URI-Ms
- `timemaps` - a tab-separated file containing a list of TimeMaps identified by their URI-Ts
- `original-resources` - a tab-separated file containing a list of live web resources identified by their URI-Rs
Most Hypercane commands will output a tab-separated file containing a list of mementos identified by URI-M. All Hypercane commands will also accept this file format, allowing most Hypercane commands to feed data into each other. As shown in the hypothetical diagram above, we can execute a `sample` action using one algorithm and feed the result into another `sample` action using a different algorithm. The resulting mementos can then be fed into an `order` action before finally feeding the result into a `synthesize` action to generate input for a Raintale story containing only those mementos. Many combinations of actions are possible.
Actions
Hypercane commands consist of one action with arguments that affect how that action is executed. Below is an example of Hypercane's main focus: the `sample` action.
# hc sample true-random -i archiveit -a 2950 -o myoutput.tsv -k 10
2020-04-23 09:28:15,154 [INFO] hypercane.actions.sample: Starting random sampling of URI-Ms.
2020-04-23 09:28:15,154 [INFO] hypercane.identify: processing input for type archiveit
2020-04-23 09:28:15,154 [INFO] hypercane.identify: discovering mementos for input type archiveit
2020-04-23 09:28:49,982 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:02,289 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,315 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,339 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,824 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,854 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:03,944 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,041 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,121 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,191 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,366 [WARNING] urllib3.connectionpool: Connection pool is full, discarding connection: wayback.archive-it.org
2020-04-23 09:29:04,385 [INFO] hypercane.identify: discovered 30529 URIs
2020-04-23 09:29:04,391 [INFO] hypercane.actions.sample: Executing select true random algorithm
2020-04-23 09:29:04,392 [INFO] hypercane.actions.sample: Writing sampled URI-Ms out to myoutput.tsv
2020-04-23 09:29:04,392 [INFO] hypercane.utils: attempting to write 10 URIs with resource data to myoutput.tsv
2020-04-23 09:29:04,393 [INFO] hypercane.utils: fieldnames will be ['URI-R']
2020-04-23 09:29:04,398 [INFO] hypercane.actions.sample: Done sampling.
That command asks Hypercane to execute the `sample` action to randomly sample 10 mementos from Archive-It collection 2950 and write the output to `myoutput.tsv`.
Hypercane supports the following actions:
- `sample` - for sampling from the collection with a preset algorithm
- `synthesize` - for converting the input into other formats, like WARC, Raintale story JSON, or a directory of boilerplate-free text files
- `report` - for generating datasets and other reports based on the documents specified in the input, such as lists of entities, image scores, or collection metrics
- `identify` - for generating lists of URI-Ms, URI-Ts, or URI-Rs from the documents specified in the input
- `filter` - for filtering the documents specified in the input based on some criteria, such as removing duplicate documents
- `cluster` - for clustering the documents specified in the input
- `score` - for computing scores for the documents specified in the input based on some scoring algorithm, like BM25 or AlNoamany's scoring function
- `order` - for sorting the documents specified in the input based on some criteria, such as publication date or the scores supplied by the `score` action
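Because most actions read and write the same tab-separated URI-M format, they compose freely. The sketch below illustrates that idea in Python; it is not Hypercane's actual code, and `sample_action` and `order_action` are hypothetical stand-ins for the real `sample` and `order` actions.

```python
import csv
import random

def read_urims(path):
    """Read the URI-M column from a tab-separated file."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f, delimiter="\t") if row]

def write_urims(urims, path):
    """Write one URI-M per row to a tab-separated file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for urim in urims:
            writer.writerow([urim])

def sample_action(in_path, out_path, k):
    """Stand-in for a sample action: choose k URI-Ms at random."""
    write_urims(random.sample(read_urims(in_path), k), out_path)

def order_action(in_path, out_path):
    """Stand-in for an order action: here, simply sort lexicographically."""
    write_urims(sorted(read_urims(in_path)), out_path)
```

Because every step consumes and produces the same file shape, the output of `sample_action` can feed `order_action` directly, mirroring how the real `hc` commands feed each other.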
More information about each action and its commands is available via the `--help` argument:
# hc report --help
'hc report' is used print reports about web archive collections
Supported commands:
* metadata - for discovering the metadata associated with seeds
* image-data - for generating a report of the images associated with the mementos found in the input
* terms - generates corpus term frequency, probability, document frequency, inverse document frequency, and corpus TF-IDF for the terms in the collection
* entities - generates corpus term frequency, probability, document frequency, inverse document frequency, and corpus TF-IDF for the named entities in the collection
* seed-statistics - calculates metrics on the original resources discovered from the input
* growth - calculates metrics based on the growth of the TimeMaps discovered from the input
Examples:
hc report metadata -i archiveit -a 8788 -o 8788-metadata.json -cs mongodb://localhost/cache
hc report entities -i mementos -a memento-file.tsv -o entity-report.json
hc report seed-statistics -i original-resources -a urirs.tsv -o seedstats.json
This blog post will focus on the
sample
and report
actions.
Caching
To ensure that Hypercane commands can be executed in any order, there is no initial "load collection" command that must be executed first. Instead, Hypercane relies on caching to ease the load on web archives. Hypercane supports the common `HTTP_PROXY` and `HTTPS_PROXY` environment variables, but it also provides a fast database cache.
All commands support the `-cs` (for cache storage) option that allows us to specify where to store the cache. For example, to use a MongoDB cache with one of the above commands, do the following:
# hc sample true-random -i mementos -a selected-mementos.tsv \
-o random-mementos.tsv -k 5 \
-cs mongodb://localhost/MyCacheDB
Instead of using the `-cs` option with each command, one can also specify the database's URL with the `HC_CACHE_STORAGE` environment variable. For brevity, we will omit this option for the rest of this post. Currently only MongoDB is supported for caching, but I am working with Sawood Alam to support more standard caches in the future.
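The interplay between the option and the environment variable can be sketched as follows. This is an assumed resolution order for illustration, and `resolve_cache_storage` is a hypothetical helper, not part of Hypercane's API.

```python
import os

def resolve_cache_storage(cli_value=None, environ=None):
    """Hypothetical helper: an explicit command-line value wins, then the
    HC_CACHE_STORAGE environment variable, then no cache is configured."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    return environ.get("HC_CACHE_STORAGE")  # None means no cache configured
```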
Sampling
Hypercane's primary goal is to automatically produce an intelligent sample from a web archive collection. This web archive collection might be formally defined, like an Archive-It collection, or it might be informal, like a list of URI-Ms. Hypercane currently supports the following algorithms with the `sample` action:
- `true-random` - randomly samples k mementos from the input
- `filtered-random` - removes off-topic mementos and near-duplicates, then randomly samples k mementos from the remainder
- `dsa1` - executes an updated version of Yasmin AlNoamany's original sampling algorithm
- `alnoamany` - another way to execute `dsa1`
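The difference between pure random sampling and filtered random sampling can be sketched in a few lines. This is illustrative only: the tuple format and the `filtered_random` function are assumptions for this example, and Hypercane's real off-topic and near-duplicate detection is far more sophisticated than an exact fingerprint match.

```python
import random

def filtered_random(mementos, k, seed=None):
    """mementos: list of (urim, on_topic, content_fingerprint) tuples.
    Drop off-topic mementos, drop near-duplicates (approximated here by
    an identical fingerprint), then sample k of the survivors at random."""
    seen = set()
    candidates = []
    for urim, on_topic, fingerprint in mementos:
        if not on_topic or fingerprint in seen:
            continue  # filtered out before sampling
        seen.add(fingerprint)
        candidates.append(urim)
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))
```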
Hypercane's sample action can execute an updated version of AlNoamany's Algorithm, which we have named DSA1. |
Below are some examples of executing Hypercane's `sample` action on different inputs. To filter out the off-topic and near-duplicate mementos from a list of URI-Ts in the file `timemap-list.tsv` and return a file `filtered-mementos.tsv` containing a list of 20 randomly sampled URI-Ms:
# hc sample filtered-random -i timemaps -a timemap-list.tsv -o filtered-mementos.tsv -k 20
To employ AlNoamany's Algorithm on Archive-It collection 13529 and write the output to `selected-mementos.tsv`:
# hc sample dsa1 -i archiveit -a 13529 -o selected-mementos.tsv
To randomly sample 5 mementos from the output of the previous command:
# hc sample true-random -i mementos -a selected-mementos.tsv -o random-mementos.tsv -k 5
Hypercane makes heavy use of the Memento protocol to discover TimeMaps from original resources, and mementos from TimeMaps. If original resources are specified as an input, Hypercane will use the Memento protocol to first search for a memento and then fall back to calling ArchiveNow to mint new mementos if none exist for those resources. Thus, one can supply a list of URI-Rs and get a sample of URI-Ms from those URI-Rs.
This command supplies a list of original resource URI-Rs in a file named `webpage-list.tsv`. Hypercane will discover the associated mementos, sample them with AlNoamany's Algorithm, and then return the reduced set of memento URI-Ms in the file `selected-mementos.tsv`:
# hc sample dsa1 -i original-resources -a webpage-list.tsv -o selected-mementos.tsv
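The fallback behavior for original resources can be sketched as follows. This is a simplified illustration: `lookup_timemap` and `mint_memento` are hypothetical stand-ins for a Memento protocol TimeMap lookup and an ArchiveNow request; they are not Hypercane's actual functions.

```python
def mementos_for_urir(urir, lookup_timemap, mint_memento):
    """Return URI-Ms for an original resource; if the archives hold none,
    fall back to minting a new memento for that URI-R."""
    urims = lookup_timemap(urir)  # Memento protocol: URI-R -> TimeMap -> URI-Ms
    if urims:
        return urims
    return [mint_memento(urir)]   # e.g., an ArchiveNow-style request
```

Applied over a whole list of URI-Rs, this is what lets a `sample` action accept `original-resources` input and still emit URI-Ms.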
We will be adding more algorithms as we continue our research. Development on DSA2 is underway. My dissertation will involve evaluating several candidates to see which produces the best stories for different types of collections.
Reporting
Hypercane produces a few reports for use in storytelling and rudimentary collection analysis:
- `metadata` - the metadata scraped from an Archive-It collection - output is a JSON file
- `image-data` - provides information about all embedded images discovered in the input and ranks them so Raintale has a striking image for the story - output is a JSON file
- `seed-statistics` - calculates metrics on the original resources discovered in the input, as mentioned in The Many Shapes of Archive-It - output is a JSON file
- `growth` - calculates metrics on the collection growth, as mentioned in The Many Shapes of Archive-It - output is a JSON file
- `terms` - provides all terms discovered in the input, including their frequency, document frequency, probability, and corpus-wide TF-IDF; if the `--sumgrams` option is also supplied then some of these features do not make sense, so only document frequency and term rate are provided - output is a tab-delimited file
- `entities` - provides a list of all entities discovered in the input, including frequency, probability, and corpus-wide TF-IDF - output is a tab-delimited file
To scrape the metadata for Archive-It collection 13529 and save it to `metadata.json`:
# hc report metadata -i archiveit -a 13529 -o metadata.json
To generate the named entities, their frequency, probability, document frequency, and corpus-wide TF-IDF for the mementos specified in `story-mementos.tsv`:
# hc report entities -i mementos -a story-mementos.tsv -o entities.json
To generate a list of sumgrams, along with metrics, for those same mementos:
# hc report terms -i mementos -a story-mementos.tsv -o sumgrams.json --sumgrams
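As a rough illustration of the measures in the `terms` report, the sketch below computes corpus term frequency, probability, document frequency, IDF, and a corpus-wide TF-IDF for a tokenized corpus. This is my own simplified formulation (with a natural-log IDF); Hypercane's actual computation may differ.

```python
import math
from collections import Counter

def term_metrics(documents):
    """documents: list of token lists, one per document.
    Returns a dict mapping each term to its corpus-level measures."""
    tf = Counter(t for doc in documents for t in doc)        # corpus term frequency
    total = sum(tf.values())
    df = Counter(t for doc in documents for t in set(doc))   # document frequency
    n = len(documents)
    return {
        term: {
            "frequency": tf[term],
            "probability": tf[term] / total,
            "document frequency": df[term],
            "idf": math.log(n / df[term]),
            "corpus tfidf": tf[term] * math.log(n / df[term]),
        }
        for term in tf
    }
```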
To generate a report on all embedded images within those mementos, including ranking them by their features:
# hc report image-data -i mementos -a story-mementos.tsv -o imagedata.json
To generate a collection growth curve for Archive-It collection 13529 like the one below, run:
# hc report growth -i archiveit -a 13529 -o 13529-growthstats.json --growth-curve-file 13529.png
After running this command, the file `13529-growthstats.json` will contain some statistics about the curve and `13529.png` will contain a curve like the one shown below. Alsum et al. demonstrated the efficacy of these curves with samples from web archives in Profiling web archive coverage for top-level domain and content language. In an effort to understand the differences between collections, we adapted the growth curve concept to full web archive collections in The Many Shapes of Archive-It.
A growth curve for IIPC's Archive-It collection 13529 about COVID-19. |
From this curve, we see that new resources (green line) are being added throughout the collection's lifespan and mementos (red line) are being captured on a regular basis. As we described in The Many Shapes of Archive-It, most collection curators specify all original resources at the beginning of the collection's life and thus their green line is shifted into the upper left corner. The visualization above demonstrates a lot of curatorial involvement with collection 13529 over its life. In contrast, the curve below comes from a different COVID-19 collection created by Kansas City Public Libraries. Note how that collection's curve shows a burst in activity later in its life, leading to the green line being shifted farther into the lower right quadrant.
A growth curve for Kansas City Public Library's Archive-It collection 13734, also about COVID-19. |
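The data behind such curves can be sketched as follows. This is an assumed formulation for illustration, not Hypercane's implementation: at each capture the cumulative memento count increments, while the cumulative count of distinct original resources grows only when a previously unseen URI-R appears.

```python
def growth_points(captures):
    """captures: (memento-datetime, URI-R) pairs, one per memento.
    Returns chronological (datetime, cumulative mementos,
    cumulative distinct URI-Rs) points for plotting a growth curve."""
    seen_urirs = set()
    points = []
    for count, (dt, urir) in enumerate(sorted(captures), start=1):
        seen_urirs.add(urir)  # only new URI-Rs grow the seed line
        points.append((dt, count, len(seen_urirs)))
    return points
```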
Hypercane can accept these reports as input to `hc synthesize raintale-story` in order to augment the input that will be rendered with Raintale.
Tying it all together to automatically generate a story summarizing Archive-It collection 13529
I will discuss the `synthesize` action in more detail in the next blog post in this series, but here we provide a simplified example of how to combine the Hypercane (`hc`) and Raintale (`tellstory`) commands to generate the story above for Archive-It collection 13529.
# cat create-story-for-archiveit-13529.sh
#!/bin/bash
export HC_CACHE_STORAGE=mongodb://localhost/cache13529
hc sample dsa1 -i archiveit -a 13529 -o story-mementos.tsv
hc report metadata -i archiveit -a 13529 -o metadata.json
hc report entities -i mementos -a story-mementos.tsv -o entities.json
hc report terms -i mementos -a story-mementos.tsv -o sumgrams.json --sumgrams
hc report image-data -i mementos -a story-mementos.tsv -o imagedata.json
hc synthesize raintale-story -i mementos -a story-mementos.tsv \
--imagedata imagedata.json --termdata sumgrams.json \
--entitydata entities.json --collection_metadata metadata.json \
--title "Archive-It Collection" -o raintale-story.json
tellstory -i raintale-story.json --storyteller template \
--story-template raintale-templates/archiveit-collection-template1.html \
--generated-by "AlNoamany's Algorithm" \
-o story-post.html
Because Hypercane's intermediate output is a simple tab-separated file, one can also process it with standard text tools (e.g., `grep`, `sort`, or `head`) before reintegrating its output back into the process.
Summary and Comparisons
Hypercane is the latest tool in the DSA toolkit. It provides us with the ability to submit an Archive-It collection identifier, a list of mementos, a list of TimeMaps, or a list of original resources and sample a smaller number of mementos from them. It does so via the `sample` action. It also provides reports that are useful for storytelling via the `report` action. We are still adding new functionality, developing its documentation, and addressing defects, so please report issues to our issue tracker.
Hypercane's input is either an Archive-It collection identifier or a file containing URIs. The goal of Hypercane is to allow us to generate a list of URI-Ms for future use. Its reports are geared toward generating stories and evaluating the output. It does not support WARCs as input. For WARCs, tools such as ArchiveSpark and the Archives Unleashed Toolkit allow users to perform domain analysis, perform more in-depth image analysis, and generate many different reports based on WARC content. The primary output of the Archives Unleashed Toolkit is aggregated data, not URI-Ms. Even though it generates some reports for storytelling and evaluation purposes, Hypercane's primary output is the URI-Ms of HTML pages. Hypercane was built with the assumption that the user wants to produce a sample of URI-Ms from a collection for which they only have web access. Thus Hypercane can be run on any public Archive-It collection and any web archive that supports the Memento protocol.
Hypercane is not a search engine because its goal is to help users produce a sample from a collection without knowing much about it a priori. It does not build or maintain indices and is not a replacement for Solr or anything like it. Producing the sample above for collection 13529 took more than four hours, 75% of which was downloading content, so Hypercane is not designed to respond quickly or provide anything close to real-time computing. Because Hypercane is archive-aware, it provides insights not possible with other search engine tools, but its focus is sampling for summarization not searching.
Hypercane was designed with modularity and interoperability in mind. It makes heavy use of the Memento protocol to identify and discover archived web pages within one archive or among several archives. Without the Memento Protocol this kind of exploration would not be possible. In the next blog post I will highlight how to use Hypercane with Raintale, Archives Unleashed Toolkit and other tools.
-- Shawn M. Jones
Acknowledgements:
I would like to thank Ian Milligan and Nick Ruest for their feedback on this blog post.