2018-07-02: The Off-Topic Memento Toolkit
Inspired by AlNoamany's work in "Detecting off-topic pages within TimeMaps in Web archives," I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018, and those results are now available as a preprint.
Web archive collections are created with a specific purpose in mind. A curator will supply seeds for the collection and create multiple versions of these seeds in order to study the evolution of a web page over time. This is valuable for following the changes in an organization or the events in a news story. Unfortunately, depending on the curator's intent, sometimes these seeds go off-topic. Because web archive crawling software has no way to know that a page is off-topic, these mementos are added to the collection. Below I list a few examples of off-topic pages within Archive-It collections.
This memento from the Human Rights collection at Archive-It created by the Columbia University Libraries is off-topic. The page ceased to be available at some point and produced this "404 Page Not Found" response with a 200 HTTP status.
This memento from the Egypt Revolution and Politics collection at Archive-It created by the American University in Cairo is off-topic. The web site began having database problems.
It is important to note that the OTMT does not delete potentially off-topic mementos, but rather only flags them for curator review. Detecting such mementos allows us to exclude them from consideration or flag them for deletion by some downstream tool, which is important to our collection summarization and storytelling efforts. The OTMT detects these mementos using a variety of similarity measures. One could also use the OTMT simply to study off-topic mementos.
Installing the software
The OTMT requires Python 3.6. Once you have met that requirement, install OTMT by typing:
# pip install otmt
This installs the necessary libraries and provides the system with a new detect-off-topic command.
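If you want to keep the toolkit and its dependencies isolated from other Python packages, one common (though entirely optional) approach is to install it inside a virtual environment first; the environment name below is arbitrary:
# python3.6 -m venv otmt-env
# source otmt-env/bin/activate
# pip install otmt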
A simple run
To perform an off-topic run with the software on Archive-It collection 1068, type:
# detect-off-topic -i archiveit=1068 -tm cosine,bytecount -o myoutputfile.json
This will find all URI-Rs (seeds) related to Archive-It collection 1068, download their TimeMaps (URI-Ts), download the mementos within each TimeMap, process those mementos with the specified similarity measures, and write the results in JSON format to a file named myoutputfile.json.
The JSON output looks like the following.
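What follows is an illustrative reconstruction rather than verbatim tool output: the keys come from the description below, but the exact nesting, preprocessing flags, and score values are assumptions.

{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {
          "stemmed": true,
          "tokenized": true,
          "removed boilerplate": true,
          "comparison score": 0.11,
          "topic status": "off-topic"
        },
        "bytecount": {
          "stemmed": false,
          "tokenized": false,
          "removed boilerplate": false,
          "comparison score": 0.42,
          "topic status": "on-topic"
        }
      },
      "overall topic status": "off-topic"
    }
  }
}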
Each URI-T serves as a key containing all URI-Ms within that timemap. In this example the timemap at URI-T http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/ contains several mementos. For brevity, we are only showing results for the memento at http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/.
The key "timemap measures" contains all measures run against the memento. In this case I used the two measures "cosine" and "bytecount". Each measure entry indicates which preprocessing has been performed against that memento (e.g., stemmed, tokenized, and removed boilerplate). Under "comparison score" is that measure's score. Under "topic status" is a verdict on whether or not the memento is on or off-topic. Finally, the "overall topic status" indicates if any of the measures determined that the memento is off-topic.
The OTMT uses an input-measure-output architecture. In this way the tool separates the concerns of input (e.g., how to acquire the mementos of Archive-It collection 1068), measure (e.g., how to score those mementos with the cosine and byte count similarity measures), and output (e.g., how to serialize the results as JSON and write them to myoutputfile.json). This architecture is extensible, providing interfaces that allow more input types, measures, and output types to be added in the future.
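To make that separation of concerns concrete, here is a minimal, hypothetical sketch of such a pipeline in Python. The class names and methods are illustrative only and are not otmt's actual modules or API.

# Hypothetical input-measure-output pipeline; NOT otmt's actual API.
import json

class ArchiveItInput:
    """Input stage: resolve a collection ID to TimeMaps and their mementos."""
    def __init__(self, collection_id):
        self.collection_id = collection_id
    def get_timemaps(self):
        # A real implementation would discover the seeds, download each
        # TimeMap, and fetch the mementos it lists.
        return {}  # {urit: {urim: memento_text}}

class ByteCountMeasure:
    """Measure stage: score each memento in a TimeMap (placeholder scoring)."""
    def score(self, timemaps):
        results = {}
        for urit, mementos in timemaps.items():
            results[urit] = {urim: len(text) for urim, text in mementos.items()}
        return results

class JSONOutput:
    """Output stage: serialize the results to a file."""
    def write(self, results, filename):
        with open(filename, "w") as out:
            json.dump(results, out, indent=2)

# Each stage can be swapped independently (another archive, measure, or format).
timemaps = ArchiveItInput(1068).get_timemaps()
results = ByteCountMeasure().score(timemaps)
JSONOutput().write(results, "myoutputfile.json")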
The -i (for specifying the input) and -o (for specifying the output) options are the only required options. The following sections detail the different command line options available to this tool.
Input and Output
The input type is supplied by the -i option. OTMT currently supports the following input types:
- an Archive-It collection ID (keyword: archiveit)
- one or more TimeMap URIs (URI-T) (keyword: timemap)
- one or more WARCs (keyword: warc)
The output file is supplied by the -o option. Output types are specified by the -ot option. OTMT currently supports the following output types:
- JSON as shown above (the default) (keyword: json)
- a comma-separated file consisting of the same content found in the JSON file (keyword: csv)
To process multiple WARCs, list their filenames, separated by commas, with the warc option like so:
# detect-off-topic -i warc=mycrawl1.warc.gz,mycrawl2.warc.gz -o myoutputfile.json
Likewise, for multiple TimeMaps, list them with the timemap argument and separate their URI-Ts with commas, like so:
# detect-off-topic -i timemap=https://archive.example.org/urit/http://example.org,https://archive.example.org/urit/http://example2.org -o myoutputfile.json
To use the comma-separated file format instead of json, use the -ot option as follows:
# detect-off-topic -i archiveit=3936 -o myoutputfile.csv -ot csv
For better processing, we want to eliminate any interference from HTML and JavaScript associated with archive-specific branding. In the case of TimeMaps and Archive-It collections, raw mementos will be downloaded where available. While any TimeMap may be specified for processing, raw mementos are preferred as they do not contain the additional banner information and other augmentations supplied by many web archives. These augmentations may skew the off-topic results. Currently, only raw mementos from Archive-It are detected and processed. WARC files, of course, are "raw" by their nature, so removing web-archive augmentations like banners is not needed for WARC files.
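As an aside on what "raw" means in the Archive-It case: Archive-It's Wayback instance follows the common OpenWayback convention in which appending id_ to the fourteen-digit timestamp in a URI-M requests the unaugmented capture. The URIs below illustrate that convention; the convention itself is an assumption about the playback system rather than a statement about OTMT's internal requests:
the memento as normally replayed (with the archive's banner):
http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/
and its raw counterpart (without the banner or other archive markup):
http://wayback.archive-it.org/1068/20130307084848id_/http://www.badil.org/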
Measures
OTMT supports the following measures with the -tm (for "timemap measure") option:
- Cosine Similarity of document vectors informed by TF-IDF with scikit-learn (default) (keyword: cosine)
- Word Count (keyword: wordcount)
- Byte Count (keyword: bytecount)
- Simhash on the raw memento content (keyword: raw_simhash)
- Simhash on the term frequencies of the raw memento content (keyword: tf_simhash)
- Jaccard Distance (keyword: jaccard)
- Sørensen-Dice Distance (keyword: sorensen)
- Cosine similarity of document vectors informed by Latent Semantic Indexing with Gensim (keyword: gensim_lsi)
Each of these measures considers the first memento in a TimeMap to be on-topic and evaluates all other mementos in that TimeMap against that first memento.
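As an illustration of this "first memento as reference" approach, here is a small Python sketch of a Jaccard-style comparison against the first memento. It is a simplified stand-in, not otmt's implementation: it omits the boilerplate removal, tokenization, and stemming that the real measures apply, and the 0.15 threshold simply reuses the example value shown below.

# Score a TimeMap's mementos against its first memento with Jaccard distance.
# Illustrative sketch only; NOT otmt's actual implementation.
def jaccard_distance(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def score_timemap(memento_texts, threshold=0.15):
    """memento_texts: memento text in memento-datetime order."""
    first_tokens = memento_texts[0].split()  # the first memento is assumed on-topic
    results = []
    for text in memento_texts[1:]:
        distance = jaccard_distance(first_tokens, text.split())
        results.append({
            "comparison score": distance,
            "topic status": "off-topic" if distance > threshold else "on-topic",
        })
    return results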
Measures and thresholds can be supplied on the command line. For example, to use Jaccard with a threshold of 0.15, join the measure name and the threshold value with an equals sign, like so:
# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15
Multiple measures can also be used, separated by commas. For example, to use jaccard and cosine similarity, type the following:
# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15,cosine=0.10
The default thresholds for these measures were derived from testing against a gold standard dataset of on-topic and off-topic mementos originally generated by AlNoamany. This dataset is now available at https://github.com/oduwsdl/offtopic-goldstandard-data/. We used this dataset as ground truth and selected the threshold that produced the best F1 score for each measure. I will present the details of how we arrived at these thresholds at iPres 2018; our study is available as a preprint on arXiv.
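The threshold-selection idea can be sketched in a few lines of Python. This is a hedged illustration of the procedure described above, not the code we used: it assumes gold-standard labels (1 for off-topic, 0 for on-topic) and per-memento distance scores are already in hand, and the candidate grid of thresholds is arbitrary.

# Pick the threshold that maximizes F1 against a gold standard
# (illustrative sketch; variable names and the candidate grid are assumptions).
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(gold_labels, distances):
    best_t, best_f1 = None, -1.0
    for threshold in np.linspace(0.0, 1.0, 101):
        predictions = [1 if d > threshold else 0 for d in distances]
        score = f1_score(gold_labels, predictions)
        if score > best_f1:
            best_t, best_f1 = threshold, score
    return best_t, best_f1  # threshold with the highest F1, and that F1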
Other options
Optionally, one may also change the working directory (-d) and the logging file (-l). By default, the software uses the directory /tmp/otmt-working for its work and logs to the screen via stdout.

The Future
I am still researching several features that will make it into future releases. I have separated the capabilities into library modules for use with future Python applications, but the code is currently volatile and I expect changes to come in the following months as new features are added and defects are fixed.
The software does not currently offer an algorithm utilizing the Web-based kernel function specified in AlNoamany's paper. This algorithm, pioneered by Sahami and Heilman, augments terms from the memento with terms from search engine result pages (SERPs). Due to the sheer number of mementos to be evaluated by the OTMT and Google's policy on blocking requests to its SERPs, I will likely not implement this feature unless it is requested by the community.
I am also interested in the concept of "collection measures". I created the "timemap measures" key in the JSON output to differentiate one set of measure results from another eventual category of collection-wide measures that would test each memento against the topic of an entire collection. Preliminary work using the Jaccard Distance in this area was not fruitful, but I am considering other ideas.
The Off-Topic Memento Toolkit is available at https://github.com/oduwsdl/off-topic-memento-toolkit. Please give it a try and report any issues encountered and features desired. Although developed with an eye toward Archive-It collections, we hope to increase its suitability for all themed collections of archived web pages, such as personal collections created with webrecorder.io.
-- Shawn M. Jones