2018-07-02: The Off-Topic Memento Toolkit

Inspired by AlNoamany's work in "Detecting off-topic pages within TimeMaps in Web archives", I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018, and those results are now available as a preprint.

Web archive collections are created with a specific purpose in mind. A curator will supply seeds for the collection and create multiple versions of these seeds in order to study the evolution of a web page over time. This is valuable for following the changes in an organization or the events in a news story. Unfortunately, these seeds sometimes go off-topic with respect to the curator's intent. Because web archive crawling software has no way to know that a page is off-topic, these mementos are added to the collection. Below I list a few examples of off-topic pages within Archive-It collections.

This memento from the Human Rights collection at Archive-It created by the Columbia University Libraries is off-topic. The page ceased to be available at some point and produced this "404 Page Not Found" response with a 200 HTTP status.

This memento from the Egypt Revolution and Politics collection at Archive-It created by the American University in Cairo is off-topic. The web site began having database problems.

It is important to note that the OTMT does not delete potentially off-topic mementos, but rather only flags them for curator review. Detecting such mementos allows us to exclude them from consideration or flag them for deletion by some downstream tool, which is important to our collection summarization and storytelling efforts. The OTMT detects these mementos using a variety of different similarity measures. One could also use the OTMT to detect and study off-topic mementos.

Installing the software

The OTMT requires Python 3.6. Once you have met that requirement, install OTMT by typing:


# pip install otmt

This installs the necessary libraries and provides the system with a new detect-off-topic command.

A simple run

To perform an off-topic run with the software on Archive-It collection 1068, type:


# detect-off-topic -i archiveit=1068 -tm cosine,bytecount -o myoutputfile.json

This will find all URI-Rs (seeds) related to Archive-It collection 1068, download their timemaps (URI-Ts), download the mementos within each timemap, process those mementos via the cosine and bytecount similarity measures, and write the results in JSON format out to a file named myoutputfile.json.

The JSON output looks like the following.

Each URI-T serves as a key containing all URI-Ms within that timemap. In this example the timemap at URI-T http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/ contains several mementos. For brevity, we are only showing results for the memento at http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/.

The key "timemap measures" contains all measures run against the memento. In this case I used the two measures "cosine" and "bytecount". Each measure entry indicates which preprocessing has been performed against that memento (e.g., stemmed, tokenized, and removed boilerplate). Under "comparison score" is that measure's score. Under "topic status" is a verdict on whether or not the memento is on or off-topic. Finally, the "overall topic status" indicates if any of the measures determined that the memento is off-topic.
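As a rough illustration of how one might consume this output, the following Python sketch walks the structure described above and lists the mementos flagged as off-topic. The nesting (URI-T, then URI-M, then the keys named above) follows the description in this post. The example.org URIs, the scores, and the literal status strings "on-topic" and "off-topic" are assumptions made for illustration; they are not output copied from the tool.

```python
import json

# Hypothetical sample shaped like the description above. The URIs, scores,
# and status strings are invented for illustration, not real OTMT output.
SAMPLE = """
{
  "https://archive.example.org/timemap/http://example.org/": {
    "https://archive.example.org/20130307084848/http://example.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.05, "topic status": "off-topic"},
        "bytecount": {"comparison score": -0.1, "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}
"""


def off_topic_mementos(results, off_label="off-topic"):
    """Yield (URI-T, URI-M, flagging measures) for each flagged memento.

    off_label is an assumed status string; adjust it to match real output.
    """
    for urit, mementos in results.items():
        for urim, record in mementos.items():
            if record.get("overall topic status") != off_label:
                continue
            measures = record.get("timemap measures", {})
            flagging = sorted(
                name
                for name, data in measures.items()
                if data.get("topic status") == off_label
            )
            yield urit, urim, flagging


results = json.loads(SAMPLE)
flagged = list(off_topic_mementos(results))
for urit, urim, measures in flagged:
    print(urim, "flagged by", ", ".join(measures))
```

Keeping the status label as a parameter means the same loop can be pointed at a real output file once the actual strings are known.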

The OTMT uses an input-measure-output architecture. This way the tool separates the concerns of input (e.g., how to process Archive-It collection 1068 for mementos) from measure (e.g., how to process these mementos using cosine and byte count similarity measures) and output (e.g., how to produce the output in JSON format and write it to the file myoutputfile.json). This architecture is extensible, providing interfaces allowing for more input types, measures, and output types to be added in the future.
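The separation of concerns can be pictured with a small Python sketch. Everything in it is hypothetical: the function names, the in-memory "collection", and the way the byte count score is computed are assumptions chosen to show three swappable stages, not the OTMT's actual interfaces or formulas.

```python
import json
from typing import Callable, Dict, List

# All names here are hypothetical and only illustrate three swappable stages.
TimeMaps = Dict[str, List[str]]  # {timemap URI: [memento text, ...]}


def make_dict_reader(collection: TimeMaps) -> Callable[[], TimeMaps]:
    """Input stage: wrap an in-memory collection as a reader."""
    return lambda: collection


def bytecount_change(first: str, other: str) -> float:
    """Measure stage: relative change in byte count versus the first memento.

    This formula is an assumption, not necessarily how OTMT scores byte count.
    """
    first_size = len(first.encode("utf-8"))
    other_size = len(other.encode("utf-8"))
    return (other_size - first_size) / first_size


def json_writer(scores: Dict[str, List[float]]) -> str:
    """Output stage: serialize the scores as JSON text."""
    return json.dumps(scores, indent=2, sort_keys=True)


def run(reader, measure, writer) -> str:
    """Wire the stages together; any stage can be replaced independently."""
    timemaps = reader()
    scores = {
        urit: [measure(texts[0], text) for text in texts[1:]]
        for urit, texts in timemaps.items()
    }
    return writer(scores)


reader = make_dict_reader({"urit-1": ["aaaa", "aa", "bbbb"]})
report = run(reader, bytecount_change, json_writer)
print(report)
```

Because `run` only depends on the shape of each stage, a CSV writer or a different measure could be passed in without touching the other two.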

The -i (for specifying the input) and -o (for specifying the output) options are the only required options. The following sections detail the different command line options available to this tool.

Input and Output

The input type is supplied by the -i option. OTMT currently supports the following input types:

an Archive-It collection ID (keyword: archiveit)
one or more TimeMap URIs (URI-T) (keyword: timemap)
one or more WARCs (keyword: warc)

An output file is supplied by the -o option. Output types are specified by the -ot option. OTMT currently supports the following output types:

JSON as shown above (the default) (keyword: json)
a comma-separated file consisting of the same content found in the JSON file (keyword: csv)

To specify multiple WARCs, list them after the warc keyword, separated by commas, like so:


# detect-off-topic -i warc=mycrawl1.warc.gz,mycrawl2.warc.gz -o myoutputfile.json

Likewise, for multiple TimeMaps, list them with the timemap argument and separate their URI-Ts with commas, like so:


# detect-off-topic -i timemap=https://archive.example.org/urit/http://example.org,https://archive.example.org/urit/http://example2.org -o myoutputfile.json

To use the comma-separated file format instead of JSON, use the -ot option as follows:


# detect-off-topic -i archiveit=3936 -o myoutputfile.csv -ot csv

For better processing, we want to eliminate any interference from HTML and JavaScript associated with archive-specific branding. In the case of TimeMaps and Archive-It collections, raw mementos will be downloaded where available. While any TimeMap may be specified for processing, raw mementos are preferred as they do not contain the additional banner information and other augmentations supplied by many web archives. These augmentations may skew the off-topic results. Currently, only raw mementos from Archive-It are detected and processed. WARC files, of course, are "raw" by their nature, so removing web-archive augmentations like banners is not needed for WARC files.

Measures

OTMT supports the following measures with the -tm (for "timemap measure") option:

Cosine Similarity of document vectors informed by TF-IDF with scikit-learn (default) (keyword: cosine)
Word Count (keyword: wordcount)
Byte Count (keyword: bytecount)
Simhash on the raw memento content (keyword: raw_simhash)
Simhash on the term frequencies of the raw memento content (keyword: tf_simhash)
Jaccard Distance (keyword: jaccard)
Sørensen-Dice Distance (keyword: sorensen)
Cosine similarity of document vectors informed by Latent Semantic Indexing with Gensim (keyword: gensim_lsi)

Each of these measures considers the first memento in a TimeMap to be on-topic and evaluates all other mementos in that TimeMap against that first memento.
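To illustrate the "compare everything against the first memento" idea, here is a minimal Python sketch using the textbook definitions of Jaccard distance and Sørensen-Dice distance over sets of terms. The tokenizer, the helper names, and the sample texts are assumptions for illustration; the OTMT's own preprocessing (such as boilerplate removal and stemming) is not reproduced here.

```python
import re


def tokens(text):
    """Lowercase word tokens as a set (a stand-in for real preprocessing)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical term sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def sorensen_dice_distance(a, b):
    """1 - 2|A ∩ B| / (|A| + |B|); 0.0 means identical term sets."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / total


def compare_to_first(memento_texts, distance):
    """Score every later memento against the first one in the TimeMap."""
    first = tokens(memento_texts[0])
    return [distance(first, tokens(text)) for text in memento_texts[1:]]


# Invented sample texts standing in for the mementos of one TimeMap.
timemap = [
    "human rights report on refugees",
    "human rights report on refugees and housing",
    "404 page not found",
]
jaccard_scores = compare_to_first(timemap, jaccard_distance)
dice_scores = compare_to_first(timemap, sorensen_dice_distance)
print(jaccard_scores)
print(dice_scores)
```

In this toy data the second memento shares most of its terms with the first and scores a small distance, while the "404" page shares none and scores the maximum distance of 1.0.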

A threshold can be supplied on the command line along with its measure. For example, to use Jaccard with a threshold of 0.15, separate the measure name and the threshold value with an equals sign, like so:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15

Multiple measures can also be used, separated by commas. For example, to use jaccard and cosine similarity, type the following:


# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15,cosine=0.10
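To make the syntax concrete, here is a minimal Python sketch of how a value for -tm could be split into measures and thresholds. The function parse_measures and the choice to represent "no threshold given" as None are hypothetical and are not taken from the OTMT source; only the measure keywords come from the list above.

```python
# Measure keywords as listed in this post.
KNOWN_MEASURES = {
    "cosine", "wordcount", "bytecount", "raw_simhash",
    "tf_simhash", "jaccard", "sorensen", "gensim_lsi",
}


def parse_measures(argument, known_measures=KNOWN_MEASURES):
    """Parse 'jaccard=0.15,cosine' into {'jaccard': 0.15, 'cosine': None}.

    None stands for "fall back to that measure's default threshold".
    This parser is an illustrative assumption, not OTMT's implementation.
    """
    parsed = {}
    for item in argument.split(","):
        name, separator, value = item.strip().partition("=")
        if name not in known_measures:
            raise ValueError(f"unknown measure: {name!r}")
        parsed[name] = float(value) if separator else None
    return parsed


with_thresholds = parse_measures("jaccard=0.15,cosine=0.10")
with_defaults = parse_measures("cosine,bytecount")
print(with_thresholds)
print(with_defaults)
```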

The default thresholds for these measures have been derived from testing using a gold standard dataset of on-topic and off-topic mementos originally generated by AlNoamany. This dataset is now available at: https://github.com/oduwsdl/offtopic-goldstandard-data/. We used this dataset as a standard and selected thresholds that produced the best F₁ score for each measure. I will present the details of how we arrived at these thresholds at iPres 2018. Our study is available as a preprint on arXiv.
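The general idea of choosing a threshold by its F₁ score can be sketched in a few lines of Python. This is not the procedure from our study: the scores and labels below are invented toy data, the candidate grid is arbitrary, and the rule "off-topic when the score exceeds the threshold" is an assumption that suits a distance measure (a similarity measure would flip the comparison).

```python
def f1_score(predicted, actual):
    """F1 = 2PR / (P + R), treating off-topic (True) as the positive class."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(p and a for p, a in pairs)
    false_pos = sum(p and not a for p, a in pairs)
    false_neg = sum(a and not p for p, a in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, is_off_topic, candidates):
    """Return the (threshold, F1) pair with the highest F1.

    Assumed direction: a memento is predicted off-topic when its
    distance score is greater than the threshold.
    """
    best = (None, -1.0)
    for threshold in candidates:
        predicted = [score > threshold for score in scores]
        f1 = f1_score(predicted, is_off_topic)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Invented toy data: distance scores with gold-standard labels.
scores = [0.10, 0.20, 0.35, 0.80, 0.90, 0.95]
labels = [False, False, False, True, True, True]
candidates = [i / 20 for i in range(20)]

threshold, f1 = best_threshold(scores, labels, candidates)
print(threshold, f1)
```

On this toy data any threshold between the highest on-topic score and the lowest off-topic score separates the two groups perfectly, so the sweep reports an F₁ of 1.0.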

Other options

Optionally, one may also change the working directory (-d) and the logging file (-l). By default, the software uses the directory /tmp/otmt-working for its work and logs to the screen via stdout.

The Future

I am still researching several features that will make it into future releases. I have separated the capabilities into library modules for use with future Python applications, but the code is currently volatile and I expect changes to come in the following months as new features are added and defects are fixed.

The software does not currently offer an algorithm utilizing the Web-based kernel function specified in AlNoamany's paper. This algorithm augments terms from the memento with terms from search engine result pages (SERPs), pioneered by Sahami and Heilman. Due to the sheer number of mementos to be evaluated by the OTMT and Google's policy on blocking requests to its SERPs, I will likely not implement this feature unless it is requested by the community.

I am also interested in the concept of "collection measures". I created the "timemap measures" key in the JSON output to differentiate one set of measure results from another eventual category of collection-wide measures that would test each memento against the topic of an entire collection. Preliminary work using the Jaccard Distance in this area was not fruitful, but I am considering other ideas.

The Off-Topic Memento Toolkit is available at https://github.com/oduwsdl/off-topic-memento-toolkit. Please give it a try and report any issues encountered and features desired. Although developed with an eye toward Archive-It collections, we hope to increase its suitability for all themed collections of archived web pages, such as personal collections created with webrecorder.io.

-- Shawn M. Jones
