2020-06-17: Hypercane Part 3: Building Your Own Algorithms

This image by NASA is licensed under NASA's Media Usage Guidelines

In Part 1, we introduced Hypercane, a tool for automatically sampling mementos from web archive collections. Web archive collections consist of thousands of documents, and humans need tools to intelligently select mementos for a given purpose. Hypercane's goal is to supply us with a list of memento URI-Ms derived from the input we provide. In Part 2, I highlighted how Hypercane's synthesize action converts its input into other formats like JSON for Raintale stories, WARCs for Archives Unleashed Toolkit, or boilerplate-free files for Gensim. This post focuses on the primitive advanced actions that make up Hypercane's sampling algorithms. We can mix and match different primitives to arrive at the sample that best meets our needs.

The DSA project's goal is to summarize a web archive collection by selecting a small number of exemplars and then visualize them with social media storytelling techniques. Hypercane performs this sampling, Raintale renders the visualization of the summary.

Our roadmap of Hypercane posts is as follows:

Hypercane Part 1: Intelligent Sampling of Web Archive Collections - an introduction to Hypercane
Hypercane Part 2: Synthesizing Output For Other Tools - how Hypercane can generate output for Archives Unleashed Toolkit, Raintale, Gensim, and other tools
Hypercane Part 3: Building Your Own Algorithms - this post

Here we cover Hypercane's advanced actions:

identify - for producing one type of Memento Protocol object (e.g., URI-R) from another (e.g., URI-M)
filter - filters the documents from the input based on some criteria
cluster - clusters the documents from the input based on a given algorithm and features and returns a file containing URI-Ms and their cluster assignments
score - computes scores for the documents based on some scoring algorithm and returns a file containing their URI-Ms and scores
order - sorts the documents based on some criteria, such as publication date or the scores supplied by score

The goal of Hypercane is to easily allow users to execute these commands from shell scripts or batch files in the order required for their task.

Hypercane's Advanced Actions

Where synthesize (Part 2) converts Hypercane outputs into other formats, Hypercane's advanced actions all output a tab-separated list of URLs. Except for identify, all advanced actions output a tab-separated file containing a list of URI-Ms. This tab-separated file contains multiple fields. As shown below, the first field indicates the type of Memento object at the given URL: URI-M, URI-T, or URI-R. The subsequent fields indicate which clusters and scores a prior run of Hypercane had assigned to these documents.

cat 13529-ranked.tsv


URI-M   Cluster Rank---DSA1-Score
http://wayback.archive-it.org/13529/20200407171321/http://52325544.nhd.weebly.com/      30~~~0  1.135
http://wayback.archive-it.org/13529/20200221185335/http://brazil.shafaqna.com/EN/AL/6353898     0~~~0   1.2437288647413784
http://wayback.archive-it.org/13529/20200221195316/http://caglecartoons.com/topic/coronavirus   2~~~0   1.1840936088558152
http://wayback.archive-it.org/13529/20200319104520/http://chuangcn.org/2020/02/social-contagion/        16~~~0  1.2866246806893566
http://wayback.archive-it.org/13529/20200329220102/http://cityofwesthaven.com/332/Coronavirus-Disease-2019-COVID-19/    26~~~0  1.23802216122549
http://wayback.archive-it.org/13529/20200314211444/http://cmajnews.com/2020/01/30/coronavirus-1095846/  11~~~0  1.435299877600979
http://wayback.archive-it.org/13529/20200315030839/http://cmajnews.com/2020/02/20/covid19-testing-1095852/      14~~~0  1.4351469251741895
http://wayback.archive-it.org/13529/20200221192410/http://coronavirus.fr/       1~~~0   1.135
... the other 1911 lines omitted for brevity ...

Hypercane only processes the fields appropriate to the action requested. If the clusters and scores are present in the input but not needed by the requested action, then Hypercane will ensure that they remain attached to their URI-M in the output. Thus, if a user executes a cluster action followed by a score action and a filter action, then the filtered list of URI-Ms will still retain their cluster assignments and scores for future use.

identify

The identify action produces one type of Memento Protocol object from another. With identify, a user can submit a list of URI-Ts and get the full list of URI-Ms or URI-Rs discovered in those TimeMaps. They can also submit a list of URI-Ms and produce the corresponding URI-Ts or URI-Rs. Hypercane discovers this information via the Memento Protocol.

A simplified view of the processes used by the identify action to discover one Memento resource type from another.

To generate a list of URI-Ms from a file timemap-list.tsv containing a list of TimeMaps, a user types the following:
# hc identify mementos -i timemaps -a timemap-list.tsv -o memento-list.tsv
Likewise, for the opposite (URI-Ts from mementos):
# hc identify timemaps -i mementos -a memento-list.tsv -o timemap-list.tsv
For a list of original resources corresponding to a list of URI-Ms:
# hc identify original-resources -i mementos -a memento-list.tsv -o original-resource-list.tsv
As a convenience, Hypercane also accepts an Archive-It collection identifier as input. Hypercane calls the AIU library to scrape Archive-It collection pages, discover the collection's seeds, and construct their URI-Ts. Once URI-Ts have been discovered, Hypercane then converts them to the appropriate Memento protocol object. To generate a list of all seed mementos from Archive-It collection 13529: # hc identify mementos -i archiveit -a 13529 -o seed-mementos.tsv
Sometimes seed mementos are not enough and we want to discover the deep mementos within a collection. The identify action accepts an optional (and experimental) --crawl-depth parameter with a number specifying the depth to crawl. If this is specified, then Hypercane will invoke Scrapy to crawl the input to the given depth and then employ the Memento Protocol to discover the desired output. I thank Mohamed Aturban for his experience and insight into this process. I also adopted ideas from the focused crawls run by Klein et al. Each diagram below contains the four input types at the top.

A flowchart demonstrating how Hypercane produces a list of URI-Ms from a crawl with one of the different input types shown at the top. This flowchart documents how hc identify mementos functions when we use the --crawl-depth argument.

A flowchart demonstrating how Hypercane produces a list of URI-Rs from a crawl with one of the different input types shown at the top. This flowchart documents how hc identify original-resources functions when we use the --crawl-depth argument.

A flowchart demonstrating how Hypercane produces a list of URI-Ts from a crawl with one of the different input types shown at the top. This flowchart documents how hc identify timemaps functions when we use the --crawl-depth argument.

Most Hypercane commands have a preferred Memento Protocol object. For example, filtering for off-topic mementos requires TimeMaps, whereas scoring requires mementos. For flexibility and extensibility, we do not want to prescribe different default inputs for different commands. Thus, the functionality of identify is built into all other Hypercane commands, allowing a user with a list of one type to still perform the desired action (e.g., submitting URI-Rs to score mementos). There are situations when identify must be separately executed. For example, the SHARI process requires an explicit call to identify to produce URI-Ms from URI-Rs because StoryGraph Toolkit provides URI-Rs as output. It is also conceivable that a user might want to generate a list of Memento Protocol objects for a third-party tool.

filter

Filtering is key to building our sample. With filter an archivist can include or exclude mementos based on some criteria. The filter commands do not remove mementos from the collection. Each command outputs a list of URI-Ms identifying the mementos that meet that criteria.

Examples of some of the many different scenarios where a page can go off topic.
This Figure is borrowed from Jones et al.'s The Off-Topic Memento Toolkit.

For example, as shown above, sometimes web resources deviate from the topic of the original collection and crawlers continue to capture these off-topic pages. To identify these off-topic pages, Hypercane leverages the Off-Topic Memento Toolkit (OTMT) as a library. By executing the command below we produce a file named ontopic.tsv that contains only mementos that are on-topic for a collection.

# hc filter include-only on-topic -i timemaps -a timemap-file.tsv -o ontopic.tsv

Thumbnails of duplicate mementos, grouped by color. Mementos outlined in red are the same, green are the same, etc.
This figure is borrowed from AlNoamany et al.'s Generating Stories From Archived Collections.

Web archive collections are different from other types of document collections because they contain many different versions of the same web page. These differences provide insight into how a given source changes over time. We only want to keep the changed content. As shown in the Figure above, a crawler will continue to repeatedly capture the same content and we must filter out these duplicates to create an intelligent sample for storytelling. We can do so with the following Hypercane command:

# hc filter exclude near-duplicates -i mementos -a memento-file.tsv -o ontopic.tsv
Hypercane's filter action accepts either include-only and exclude as subactions. Many of the include-only commands have exclude counterparts. For example, hc filter include-only on-topic has a counterpart command of hc filter exclude off-topic.

Hypercane also supports filtering by language and the highest scoring memento in each cluster. We recently added experimental filters to discover mementos whose content contains a given pattern, mementos whose URI-R contains a given pattern, and mementos whose memento-datetime fits within a given range.

cluster

Sipos, Li, Zhang, and AlNoamany employed clustering as part of their summarization algorithms. Clustering is also how StoryGraph discovers its stories and how our SHARI process finds the biggest story of the day.

Hypercane supports AlNoamany's time slice clustering with the following command:

# hc cluster time-slice -i mementos -a mementos.tsv -o sliced-mementos.tsv
Hypercane supports AlNoamany's Simhash-based clustering with this command:

# hc cluster dbscan -i mementos -a mementos.tsv -o clustered.tsv --feature tf-simhash
Hypercane provides experimental commands for clustering a collection based on topic modeling with Latent Dirichlet Allocation (LDA). It supports commands for clustering a collection by DBSCAN and K-Means with memento-datetime as a feature.

score

Scoring is also an essential staple of summarization. Scoring by algorithms such as PageRank and BM25 help us with search results. Sentence scoring is a common component of summarization. Scoring is used by Mihalcea's TextRank, Dolan's news summarization work, LexRank, and Luhn's automatic "literature abstracts." AlNoamany's Algorithm also incorporates a scoring function for mementos.

Currently, Hypercane supports AlNoamany's scoring function as:

# hc score dsa1-score -i mementos -a memento-list.tsv -o scored-mementos.tsv
AlNoamany's scoring function is

where D_m is the Memento Damage score for the memento, M_l is the path depth of the URI-R, and M_c is a score applied based on the type of web page from Padia's 2012 work — e.g., news sources score 0.7 while social media scores 0.5. The w_d, w_l, and w_c terms are weights. The list of web sites fitting into Padia's categories has changed since 2012. Hypercane currently scores social media domains based on a list from Adobe, image sharing sites based on a list from Wikipedia, video sharing sites based on another list from Wikipedia, and news websites based on lists from Pew Research and W3Newspapers. We intend to make these lists configurable in the future.

Hypercane currently has an experimental BM25 implementation. We are researching additional scoring functions for different types of stories.

order

Sorting content is critical for conveying meaning. For storytelling, we often want articles to flow in chronological order. Hypercane leverages the newspaper3k library to discover the publication date of a given memento. If newspaper3k cannot find a publication date, Hypercane falls back to the memento-datetime of the resource. To order the mementos this way, we execute the following command.

# hc order pubdate-else-memento-datetime -i mementos -a memento-list.tsv -o ordered-mementos.tsv
Other types of stories may require different ordering. We also support ordering URI-Ms by memento-datetime. If the output of this command is fed into the synthesize command detailed in Part 2, then hc synthesize preserves this order.

Constructing Algorithms with Hypercane's Advanced Actions

Archive-It Collection 13529: Novel Coronavirus (COVID-19) as visualized on the DSA Puddles website as a subset of 34 exemplar mementos selected by the DSA toolkit's implementation of AlNoamany's Algorithm.

AlNoamany's Algorithm provides the first path toward producing an intelligent sample from a web archive collection. We visualized this intelligent sample of 34 exemplars chosen from more than 23,000 mementos as the story above. The goal is that a user can get the gist of a collection from this visualization.

AlNoamany's Algorithm for storytelling. Hypercane implements the steps from "exclude off-topic pages" to "order pages by time" as a series of advanced actions. Raintale implements the "visualize" step.

Above is AlNoamany's diagram of her algorithm for storytelling. Below we show the series of Hypercane commands that implement each part of this algorithm. The output of each command feeds into the next. For example, the english-only.tsv file generated by hc filter include-only languages serves as input to hc cluster time-slice.

cat 13529-with-dsa1.sh


hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv

hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv

hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv

hc filter include-only languages --lang en -i mementos -a non-duplicates.tsv -o english-only.tsv

hc cluster time-slice -i mementos -a english-only.tsv -o sliced.tsv

hc cluster dbscan -i mementos -a sliced.tsv -o sliced-and-clustered.tsv

hc score dsa1-scoring -i mementos -a sliced-and-clustered.tsv -o scored.tsv

hc filter include-only highest-score-per-cluster -i mementos -a scored.tsv -o highest-scored.tsv

To further reuse our existing work, when a user executes the hc sample dsa1 command documented in Part 1, they are asking Hypercane to execute this series of advanced actions as a shell script.

The goal of Hypercane was not merely to implement AlNoamany's Algorithm; but also to give archivists and us a tool to customize summarizations. An archivist could easily only include Japanese documents, or skip the hc cluster dbscan step altogether. AlNoamany's Algorithm gives us a guide of what is initially possible in this problem space, but it is by no means the only sampling solution.

We implemented a filtered random algorithm that borrows parts of AlNoamany's Algorithm and then randomly chooses from the result by using the hc sample true-random command mentioned in Part 1.

cat 13529-with-filtered-random.sh


hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv

hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv

hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv

hc sample true-random -i mementos -a non-duplicates.tsv -o filtered-random-sample.tsv

A user can also request this combination by executing hc sample filtered-random, which calls a shell script that executes these commands in the background.

Additionally, AlNoamany described four different types of possible stories with web archive collections:

Fixed Page Fixed Time (FPFT) - the sample consists of mementos representing different versions of the same URI-R rendered by different user agents (e.g., mobile vs. desktop)
Fixed Page Sliding Time (FPST) - the sample consists of mementos from the same URI-R over time, similar to the types of visualizations/stories produced by TMVis and What Did It Look Like
Sliding Page Fixed Time (SPFT) - the sample consists of mementos captured during a specific time range within the collection, similar to our SHARI process, which produces an intelligent sample of StoryGraph's output from a specific date
Sliding Page Sliding Time (SPST) - the sample consists of mementos drawn from any time and any URI-R in the collection

So far, we have focused on SPST stories. We can leverage other filter commands to tell other types of stories. In the example below, we use the experimental hc filter include-only near-datetime command to leverage parts of AlNoamany's Algorithm to tell an SPFT story for collection 13529 that only includes mementos with Memento-Datetimes on March 19, 2020.

cat 13529-with-dsa1-and-timefilter.sh


hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv

hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv

hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv

hc filter include-only languages --lang en -i mementos -a non-duplicates.tsv -o english-only.tsv

hc filter include-only near-datetime \
    -i mementos -a english-only.tsv \
    --start-datetime 2020-03-19T00:00:00 \
    --end-datetime 2020-03-19T23:59:00 \
    -o time-filtered.tsv

hc cluster time-slice -i mementos -a time-filtered.tsv -o sliced.tsv

hc cluster dbscan -i mementos -a sliced.tsv -o sliced-and-clustered.tsv

hc score dsa1-scoring -i mementos -a sliced-and-clustered.tsv -o scored.tsv

hc filter include-only highest-score-per-cluster -i mementos -a scored.tsv -o highest-scored.tsv

We can tell an FPST story with the experimental hc filter include-only containing-url-pattern command to tell a story about the Pakistani government's changing response to COVID-19.

cat 13529-with-dsa1-and-urlfilter.sh


hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv

hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv

hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv

hc filter include-only languages --lang en -i mementos -a non-duplicates.tsv -o english-only.tsv

hc filter include-only containing-url-pattern \
    -i mementos -a english-only.tsv \
    --urir-pattern "http://covid.gov.pk/" \
    -o urir-filtered.tsv

hc cluster time-slice -i mementos -a urir-filtered.tsv -o sliced.tsv

hc cluster dbscan -i mementos -a sliced.tsv -o sliced-and-clustered.tsv

hc score dsa1-scoring -i mementos -a sliced-and-clustered.tsv -o scored.tsv

hc filter include-only highest-score-per-cluster -i mementos -a scored.tsv -o highest-scored.tsv

With many of these combinations, other types of stories are possible. For example, an archivist can exclude all pages containing references to Donald Trump by inserting hc filter exclude containing-pattern --pattern "Donald Trump" at the desired point in their summarization process.

Hypercane's Place in Storytelling with Web Archives

The DSA Toolkit automated storytelling process relies upon Hypercane for its input and Raintale for its output, with additional data provided by OTMT, AIU, and MementoEmbed.

As I thought more about how my dissertation research would unfold, it became clear that storytelling with web archives could loosely leverage the Model-view-adapter pattern. Each component communicates with each other through explicit interfaces.

The Model. Consider web archives to be a datastore and the Memento Protocol to be its access interface.
The Adapter. Retrieve data from the datastore and filter it for the view.

Hypercane discovers resources with the Memento Protocol and intelligently samples them for Raintale. It uses AIU to discover input for Archive-It collections and the Off-Topic Memento Toolkit in one of its filters.
MementoEmbed answers Raintale's requests for information about the specific mementos fed to it.

The View. Raintale renders the view: the story summarizing the collection.

The components of the DSA Toolkit, loosely placed into the Model-view-adapter pattern.

This separation of concerns allows each component to focus on one problem area. Hypercane intelligently samples from collections. MementoEmbed extracts information about individual mementos. Raintale renders the information for many mementos. Raintale knows nothing of the Memento Protocol and relies upon the tools in the Adapter to give it the information it needs.

The separation of concerns also allowed me to focus on one problem at a time during my dissertation research. With AIU, I was able to discover The Many Shapes of Archive-It and present those results at iPres 2018. The Off-Topic Memento Toolkit, also presented at iPres 2018, only focused on the off-topic problem. MementoEmbed, and an early version of Raintale, allowed me to determine that Social Cards Probably Provide Better Understanding of Web Archive Collections, as presented at CIKM 2019. Hypercane will play a role in future research.

And this separation of concerns allows us to use each component individually to solve problems. Raintale can render stories based on human selection. We can generate a single card with MementoEmbed for a blog post. We explore collections with Hypercane without having to render them as stories. As I covered in Part 2, Hypercane can synthesize WARCs for Archives Unleashed Toolkit (AUT). Thus we can have Hypercane sample from a large collection and then do more in depth analysis with AUT on that smaller sample.

We extend this idea of separation of concerns to Hypercane itself, by making the sample action, documented in Part 1, a wrapper for the smaller advanced actions documented in this post. Through explicit interfaces and information hiding, all actions can be combined in a new and potent ways.

Summary and Future Work

Hypercane is the latest component of the Dark and Stormy Archives toolkit. It allows us to intelligently sample from web archive collections. Hypercane provides multiple actions, allowing a user to tailor their intelligent sample and synthesize it into other formats. Hypercane's list of actions is inspired by other summarization and information retrieval work. The list is as follows:

sample - runs an existing sampling algorithm against the collection of mementos
report - generates reports about the collection of mementos
synthesize - converts a collection of mementos into formats for other tools
identify - discovers a list of one type of Memento Protocol resources from another
filter - filter mementos that meet a given criteria
cluster - cluster mementos with a given algorithm and features
score - score mementos with the given function
order - sort mementos based on the given criteria

In Part 1 of this series of blog posts, we introduced the sample action and demonstrated how a user can produce a random sample or a sample with AlNoamany's Algorithm. We also showed how the report action can produce a variety of reports on the collection.

In Part 2, we described how the synthesize action can create outputs for Raintale for storytelling. We also generated WARCs for use with Archives Unleashed Toolkit. Finally, we produced a directory of boilerplate-free files for processing with Gensim.

In this blog post, we covered the advanced actions that we use to develop algorithms. With these advanced actions an archivist can customize their intelligent sample outside of the sample command. All of these actions accept lists of URI-Ms so they can feed into one another in different ways. They can even feed back into the commands mentioned in Parts 1 and 2.

Despite a 3-part series of blog posts, Hypercane is quite incomplete. In addition to addressing defects, we need to provide documentation. We also have plans to add a wide range of functionality in the future. By leveraging the ability to synthesize output for different tools, we can evaluate many of our algorithm ideas before they become advanced actions. A new combination of actions forms a new sampling algorithm. Our goal is to make sure that the added functionality is useful to our research and useful to the archivists creating these samples.

New filters are on the horizon, such as filtering based on the output of the values from score. We intend to expand the features available to the k-means and dbscan commands provided by the cluster action. We are currently researching how to score mementos based on their images, sentences, and metadata. And we intend to add new ways to order mementos by features like Carbon Date. Some combination of actions will become the algorithms we name DSA2, DSA3, ... DSAn.

Hypercane is another complementary tool for the web archivist's toolkit. We initially conceived it to summarize web archive collections for storytelling, but along the way realized that we might want to reuse these samples with other tools as well. In what other ways could you use an intelligent sample? What gems will you discover in a web archive collection?

-- Shawn M. Jones

Search This Blog

Web Science and Digital Libraries Research Group