2020-06-17: Hypercane Part 3: Building Your Own Algorithms
This image by NASA is licensed under NASA's Media Usage Guidelines |
In Part 1, we introduced Hypercane, a tool for automatically sampling mementos from web archive collections. Web archive collections consist of thousands of documents, and humans need tools to intelligently select mementos for a given purpose. Hypercane's goal is to supply us with a list of memento URI-Ms derived from the input we provide. In Part 2, I highlighted how Hypercane's
synthesize
action converts its input into other formats like JSON for Raintale stories, WARCs for Archives Unleashed Toolkit, or boilerplate-free files for Gensim. This post focuses on the primitive advanced actions that make up Hypercane's sampling algorithms. We can mix and match different primitives to arrive at the sample that best meets our needs.
The DSA project's goal is to summarize a web archive collection by selecting a small number of exemplars and then visualize them with social media storytelling techniques. Hypercane performs this sampling, Raintale renders the visualization of the summary. |
Our roadmap of Hypercane posts is as follows:
- Hypercane Part 1: Intelligent Sampling of Web Archive Collections - an introduction to Hypercane
- Hypercane Part 2: Synthesizing Output For Other Tools - how Hypercane can generate output for Archives Unleashed Toolkit, Raintale, Gensim, and other tools
- Hypercane Part 3: Building Your Own Algorithms - this post
identify
- for producing one type of Memento Protocol object (e.g., URI-R) from another (e.g., URI-M)filter
- filters the documents from the input based on some criteriacluster
- clusters the documents from the input based on a given algorithm and features and returns a file containing URI-Ms and their cluster assignmentsscore
- computes scores for the documents based on some scoring algorithm and returns a file containing their URI-Ms and scoresorder
- sorts the documents based on some criteria, such as publication date or the scores supplied byscore
Hypercane's Advanced Actions
Wheresynthesize
(Part 2) converts Hypercane outputs into other formats, Hypercane's advanced actions all output a tab-separated list of URLs. Except for identify
, all advanced actions output a tab-separated file containing a list of URI-Ms. This tab-separated file contains multiple fields. As shown below, the first field indicates the type of Memento object at the given URL: URI-M, URI-T, or URI-R. The subsequent fields indicate which clusters and scores a prior run of Hypercane had assigned to these documents.
cat 13529-ranked.tsv
URI-M Cluster Rank---DSA1-Score
http://wayback.archive-it.org/13529/20200407171321/http://52325544.nhd.weebly.com/ 30~~~0 1.135
http://wayback.archive-it.org/13529/20200221185335/http://brazil.shafaqna.com/EN/AL/6353898 0~~~0 1.2437288647413784
http://wayback.archive-it.org/13529/20200221195316/http://caglecartoons.com/topic/coronavirus 2~~~0 1.1840936088558152
http://wayback.archive-it.org/13529/20200319104520/http://chuangcn.org/2020/02/social-contagion/ 16~~~0 1.2866246806893566
http://wayback.archive-it.org/13529/20200329220102/http://cityofwesthaven.com/332/Coronavirus-Disease-2019-COVID-19/ 26~~~0 1.23802216122549
http://wayback.archive-it.org/13529/20200314211444/http://cmajnews.com/2020/01/30/coronavirus-1095846/ 11~~~0 1.435299877600979
http://wayback.archive-it.org/13529/20200315030839/http://cmajnews.com/2020/02/20/covid19-testing-1095852/ 14~~~0 1.4351469251741895
http://wayback.archive-it.org/13529/20200221192410/http://coronavirus.fr/ 1~~~0 1.135
... the other 1911 lines omitted for brevity ...
cluster
action followed by a score
action and a filter
action, then the filtered list of URI-Ms will still retain their cluster assignments and scores for future use.
identify
Theidentify
action produces one type of Memento Protocol object from another. With identify
, a user can submit a list of URI-Ts and get the full list of URI-Ms or URI-Rs discovered in those TimeMaps. They can also submit a list of URI-Ms and produce the corresponding URI-Ts or URI-Rs. Hypercane discovers this information via the Memento Protocol.A simplified view of the processes used by the identify action to discover one Memento resource type from another. |
To generate a list of URI-Ms from a file
timemap-list.tsv
containing a list of TimeMaps, a user types the following:
# hc identify mementos -i timemaps -a timemap-list.tsv -o memento-list.tsv
Likewise, for the opposite (URI-Ts from mementos):
# hc identify timemaps -i mementos -a memento-list.tsv -o timemap-list.tsv
For a list of original resources corresponding to a list of URI-Ms:
# hc identify original-resources -i mementos -a memento-list.tsv -o original-resource-list.tsv
As a convenience, Hypercane also accepts an Archive-It collection identifier as input. Hypercane calls the AIU library to scrape Archive-It collection pages, discover the collection's seeds, and construct their URI-Ts. Once URI-Ts have been discovered, Hypercane then converts them to the appropriate Memento protocol object. To generate a list of all seed mementos from Archive-It collection 13529:
# hc identify mementos -i archiveit -a 13529 -o seed-mementos.tsv
Sometimes seed mementos are not enough and we want to discover the deep mementos within a collection. The
identify
action accepts an optional (and experimental) --crawl-depth
parameter with a number specifying the depth to crawl. If this is specified, then Hypercane will invoke Scrapy to crawl the input to the given depth and then employ the Memento Protocol to discover the desired output. I thank Mohamed Aturban for his experience and insight into this process. I also adopted ideas from the focused crawls run by Klein et al. Each diagram below contains the four input types at the top.
Most Hypercane commands have a preferred Memento Protocol object. For example, filtering for off-topic mementos requires TimeMaps, whereas scoring requires mementos. For flexibility and extensibility, we do not want to prescribe different default inputs for different commands. Thus, the functionality of
identify
is built into all other Hypercane commands, allowing a user with a list of one type to still perform the desired action (e.g., submitting URI-Rs to score mementos). There are situations when identify
must be separately executed. For example, the SHARI process requires an explicit call to identify
to produce URI-Ms from URI-Rs because StoryGraph Toolkit provides URI-Rs as output. It is also conceivable that a user might want to generate a list of Memento Protocol objects for a third-party tool.
filter
Filtering is key to building our sample. Withfilter
an archivist can include or exclude mementos based on some criteria. The filter commands do not remove mementos from the collection. Each command outputs a list of URI-Ms identifying the mementos that meet that criteria.Examples of some of the many different scenarios where a page can go off topic. This Figure is borrowed from Jones et al.'s The Off-Topic Memento Toolkit. |
For example, as shown above, sometimes web resources deviate from the topic of the original collection and crawlers continue to capture these off-topic pages. To identify these off-topic pages, Hypercane leverages the Off-Topic Memento Toolkit (OTMT) as a library. By executing the command below we produce a file named
ontopic.tsv
that contains only mementos that are on-topic for a collection.
# hc filter include-only on-topic -i timemaps -a timemap-file.tsv -o ontopic.tsv
Thumbnails of duplicate mementos, grouped by color. Mementos outlined in red are the same, green are the same, etc. This figure is borrowed from AlNoamany et al.'s Generating Stories From Archived Collections. |
Web archive collections are different from other types of document collections because they contain many different versions of the same web page. These differences provide insight into how a given source changes over time. We only want to keep the changed content. As shown in the Figure above, a crawler will continue to repeatedly capture the same content and we must filter out these duplicates to create an intelligent sample for storytelling. We can do so with the following Hypercane command:
# hc filter exclude near-duplicates -i mementos -a memento-file.tsv -o ontopic.tsv
Hypercane's filter action accepts either
include-only
and exclude
as subactions. Many of the include-only
commands have exclude
counterparts. For example, hc filter include-only on-topic
has a counterpart command of hc filter exclude off-topic
.
Hypercane also supports filtering by language and the highest scoring memento in each cluster. We recently added experimental filters to discover mementos whose content contains a given pattern, mementos whose URI-R contains a given pattern, and mementos whose memento-datetime fits within a given range.
cluster
Sipos, Li, Zhang, and AlNoamany employed clustering as part of their summarization algorithms. Clustering is also how StoryGraph discovers its stories and how our SHARI process finds the biggest story of the day.Hypercane supports AlNoamany's time slice clustering with the following command:
# hc cluster time-slice -i mementos -a mementos.tsv -o sliced-mementos.tsv
Hypercane supports AlNoamany's Simhash-based clustering with this command:
# hc cluster dbscan -i mementos -a mementos.tsv -o clustered.tsv --feature tf-simhash
Hypercane provides experimental commands for clustering a collection based on topic modeling with Latent Dirichlet Allocation (LDA). It supports commands for clustering a collection by DBSCAN and K-Means with memento-datetime as a feature.
score
Scoring is also an essential staple of summarization. Scoring by algorithms such as PageRank and BM25 help us with search results. Sentence scoring is a common component of summarization. Scoring is used by Mihalcea's TextRank, Dolan's news summarization work, LexRank, and Luhn's automatic "literature abstracts." AlNoamany's Algorithm also incorporates a scoring function for mementos.Currently, Hypercane supports AlNoamany's scoring function as:
# hc score dsa1-score -i mementos -a memento-list.tsv -o scored-mementos.tsv
AlNoamany's scoring function is
where Dm is the Memento Damage score for the memento, Ml is the path depth of the URI-R, and Mc is a score applied based on the type of web page from Padia's 2012 work — e.g., news sources score 0.7 while social media scores 0.5. The wd, wl, and wc terms are weights. The list of web sites fitting into Padia's categories has changed since 2012. Hypercane currently scores social media domains based on a list from Adobe, image sharing sites based on a list from Wikipedia, video sharing sites based on another list from Wikipedia, and news websites based on lists from Pew Research and W3Newspapers. We intend to make these lists configurable in the future.
Hypercane currently has an experimental BM25 implementation. We are researching additional scoring functions for different types of stories.
order
Sorting content is critical for conveying meaning. For storytelling, we often want articles to flow in chronological order. Hypercane leverages the newspaper3k library to discover the publication date of a given memento. If newspaper3k cannot find a publication date, Hypercane falls back to the memento-datetime of the resource. To order the mementos this way, we execute the following command.# hc order pubdate-else-memento-datetime -i mementos -a memento-list.tsv -o ordered-mementos.tsv
Other types of stories may require different ordering. We also support ordering URI-Ms by memento-datetime. If the output of this command is fed into the
synthesize
command detailed in Part 2, then hc synthesize
preserves this order.
Constructing Algorithms with Hypercane's Advanced Actions
Archive-It Collection 13529: Novel Coronavirus (COVID-19) as visualized on the DSA Puddles website as a subset of 34 exemplar mementos selected by the DSA toolkit's implementation of AlNoamany's Algorithm. |
AlNoamany's Algorithm provides the first path toward producing an intelligent sample from a web archive collection. We visualized this intelligent sample of 34 exemplars chosen from more than 23,000 mementos as the story above. The goal is that a user can get the gist of a collection from this visualization.
AlNoamany's Algorithm for storytelling. Hypercane implements the steps from "exclude off-topic pages" to "order pages by time" as a series of advanced actions. Raintale implements the "visualize" step. |
Above is AlNoamany's diagram of her algorithm for storytelling. Below we show the series of Hypercane commands that implement each part of this algorithm. The output of each command feeds into the next. For example, the
english-only.tsv
file generated by hc filter include-only languages
serves as input to hc cluster time-slice
.
cat 13529-with-dsa1.sh
hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv
hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv
hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv
hc filter include-only languages --lang en -i mementos -a non-duplicates.tsv -o english-only.tsv
hc cluster time-slice -i mementos -a english-only.tsv -o sliced.tsv
hc cluster dbscan -i mementos -a sliced.tsv -o sliced-and-clustered.tsv
hc score dsa1-scoring -i mementos -a sliced-and-clustered.tsv -o scored.tsv
hc filter include-only highest-score-per-cluster -i mementos -a scored.tsv -o highest-scored.tsv
To further reuse our existing work, when a user executes the
hc sample dsa1
command documented in Part 1, they are asking Hypercane to execute this series of advanced actions as a shell script.
The goal of Hypercane was not merely to implement AlNoamany's Algorithm; but also to give archivists and us a tool to customize summarizations. An archivist could easily only include Japanese documents, or skip the
hc cluster dbscan
step altogether. AlNoamany's Algorithm gives us a guide of what is initially possible in this problem space, but it is by no means the only sampling solution.
We implemented a filtered random algorithm that borrows parts of AlNoamany's Algorithm and then randomly chooses from the result by using the
hc sample true-random
command mentioned in Part 1.
cat 13529-with-filtered-random.sh
hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv
hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv
hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv
hc sample true-random -i mementos -a non-duplicates.tsv -o filtered-random-sample.tsv
hc sample filtered-random
, which calls a shell script that executes these commands in the background.
Additionally, AlNoamany described four different types of possible stories with web archive collections:
- Fixed Page Fixed Time (FPFT) - the sample consists of mementos representing different versions of the same URI-R rendered by different user agents (e.g., mobile vs. desktop)
- Fixed Page Sliding Time (FPST) - the sample consists of mementos from the same URI-R over time, similar to the types of visualizations/stories produced by TMVis and What Did It Look Like
- Sliding Page Fixed Time (SPFT) - the sample consists of mementos captured during a specific time range within the collection, similar to our SHARI process, which produces an intelligent sample of StoryGraph's output from a specific date
- Sliding Page Sliding Time (SPST) - the sample consists of mementos drawn from any time and any URI-R in the collection
hc filter include-only near-datetime
command to leverage parts of AlNoamany's Algorithm to tell an SPFT story for collection 13529 that only includes mementos with Memento-Datetimes on March 19, 2020.
cat 13529-with-dsa1-and-timefilter.sh
hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv
hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv
hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv
hc filter include-only languages --lang en -i mementos -a non-duplicates.tsv -o english-only.tsv
hc filter include-only near-datetime \
-i mementos -a english-only.tsv \
--start-datetime 2020-03-19T00:00:00 \
--end-datetime 2020-03-19T23:59:00 \
-o time-filtered.tsv
hc cluster time-slice -i mementos -a time-filtered.tsv -o sliced.tsv
hc cluster dbscan -i mementos -a sliced.tsv -o sliced-and-clustered.tsv
hc score dsa1-scoring -i mementos -a sliced-and-clustered.tsv -o scored.tsv
hc filter include-only highest-score-per-cluster -i mementos -a scored.tsv -o highest-scored.tsv
We can tell an FPST story with the experimental
hc filter include-only containing-url-pattern
command to tell a story about the Pakistani government's changing response to COVID-19.
cat 13529-with-dsa1-and-urlfilter.sh
hc identify timemaps -i archiveit -a 13529 -o timemaps.tsv
hc filter include-only on-topic -i timemaps -a timemaps.tsv -o ontopic.tsv
hc filter exclude near-duplicates -i mementos -a ontopic.tsv -o non-duplicates.tsv
hc filter include-only languages --lang en -i mementos -a non-duplicates.tsv -o english-only.tsv
hc filter include-only containing-url-pattern \
-i mementos -a english-only.tsv \
--urir-pattern "http://covid.gov.pk/" \
-o urir-filtered.tsv
hc cluster time-slice -i mementos -a urir-filtered.tsv -o sliced.tsv
hc cluster dbscan -i mementos -a sliced.tsv -o sliced-and-clustered.tsv
hc score dsa1-scoring -i mementos -a sliced-and-clustered.tsv -o scored.tsv
hc filter include-only highest-score-per-cluster -i mementos -a scored.tsv -o highest-scored.tsv
hc filter exclude containing-pattern --pattern "Donald Trump"
at the desired point in their summarization process.
Hypercane's Place in Storytelling with Web Archives
The DSA Toolkit automated storytelling process relies upon Hypercane for its input and Raintale for its output, with additional data provided by OTMT, AIU, and MementoEmbed. |
As I thought more about how my dissertation research would unfold, it became clear that storytelling with web archives could loosely leverage the Model-view-adapter pattern. Each component communicates with each other through explicit interfaces.
- The Model. Consider web archives to be a datastore and the Memento Protocol to be its access interface.
- The Adapter. Retrieve data from the datastore and filter it for the view.
- Hypercane discovers resources with the Memento Protocol and intelligently samples them for Raintale. It uses AIU to discover input for Archive-It collections and the Off-Topic Memento Toolkit in one of its filters.
- MementoEmbed answers Raintale's requests for information about the specific mementos fed to it.
- The View. Raintale renders the view: the story summarizing the collection.
The components of the DSA Toolkit, loosely placed into the Model-view-adapter pattern. |
This separation of concerns allows each component to focus on one problem area. Hypercane intelligently samples from collections. MementoEmbed extracts information about individual mementos. Raintale renders the information for many mementos. Raintale knows nothing of the Memento Protocol and relies upon the tools in the Adapter to give it the information it needs.
The separation of concerns also allowed me to focus on one problem at a time during my dissertation research. With AIU, I was able to discover The Many Shapes of Archive-It and present those results at iPres 2018. The Off-Topic Memento Toolkit, also presented at iPres 2018, only focused on the off-topic problem. MementoEmbed, and an early version of Raintale, allowed me to determine that Social Cards Probably Provide Better Understanding of Web Archive Collections, as presented at CIKM 2019. Hypercane will play a role in future research.
And this separation of concerns allows us to use each component individually to solve problems. Raintale can render stories based on human selection. We can generate a single card with MementoEmbed for a blog post. We explore collections with Hypercane without having to render them as stories. As I covered in Part 2, Hypercane can synthesize WARCs for Archives Unleashed Toolkit (AUT). Thus we can have Hypercane sample from a large collection and then do more in depth analysis with AUT on that smaller sample.
We extend this idea of separation of concerns to Hypercane itself, by making the
sample
action, documented in Part 1, a wrapper for the smaller advanced actions documented in this post. Through explicit interfaces and information hiding, all actions can be combined in a new and potent ways.
Summary and Future Work
Hypercane is the latest component of the Dark and Stormy Archives toolkit. It allows us to intelligently sample from web archive collections. Hypercane provides multiple actions, allowing a user to tailor their intelligent sample and synthesize it into other formats. Hypercane's list of actions is inspired by other summarization and information retrieval work. The list is as follows:sample
- runs an existing sampling algorithm against the collection of mementosreport
- generates reports about the collection of mementossynthesize
- converts a collection of mementos into formats for other toolsidentify
- discovers a list of one type of Memento Protocol resources from anotherfilter
- filter mementos that meet a given criteriacluster
- cluster mementos with a given algorithm and featuresscore
- score mementos with the given functionorder
- sort mementos based on the given criteria
sample
action and demonstrated how a user can produce a random sample or a sample with AlNoamany's Algorithm. We also showed how the report
action can produce a variety of reports on the collection.
In Part 2, we described how the
synthesize
action can create outputs for Raintale for storytelling. We also generated WARCs for use with Archives Unleashed Toolkit. Finally, we produced a directory of boilerplate-free files for processing with Gensim.
In this blog post, we covered the advanced actions that we use to develop algorithms. With these advanced actions an archivist can customize their intelligent sample outside of the
sample
command. All of these actions accept lists of URI-Ms so they can feed into one another in different ways. They can even feed back into the commands mentioned in Parts 1 and 2.
Despite a 3-part series of blog posts, Hypercane is quite incomplete. In addition to addressing defects, we need to provide documentation. We also have plans to add a wide range of functionality in the future. By leveraging the ability to
synthesize
output for different tools, we can evaluate many of our algorithm ideas before they become advanced actions. A new combination of actions forms a new sampling algorithm. Our goal is to make sure that the added functionality is useful to our research and useful to the archivists creating these samples.New filters are on the horizon, such as filtering based on the output of the values from
score
. We intend to expand the features available to the k-means
and dbscan
commands provided by the cluster
action. We are currently researching how to score
mementos based on their images, sentences, and metadata. And we intend to add new ways to order
mementos by features like Carbon Date. Some combination of actions will become the algorithms we name DSA2, DSA3, ... DSAn.
Hypercane is another complementary tool for the web archivist's toolkit. We initially conceived it to summarize web archive collections for storytelling, but along the way realized that we might want to reuse these samples with other tools as well. In what other ways could you use an intelligent sample? What gems will you discover in a web archive collection?
-- Shawn M. Jones
Comments
Post a Comment