2018-07-03: Extracting Metadata from Archive-It Collections with Archive-It Utilities

At iPres 2018, I will be presenting "The Many Shapes of Archive-It", a paper that focuses on some structural features inherent in Archive-It collections. The paper is now available as a preprint on arXiv.

As part of the data gathering for "The Many Shapes of Archive-It", and also as part of the development of the Off-Topic Memento Toolkit, I had to write code that extracts metadata and seeds from public Archive-It collections. This capability will be useful to several aspects of our storytelling and summarization work, so I used the knowledge gained from those projects and produced a standalone Python library named Archive-It Utilities (AIU). This library is currently in alpha status, but is already being used with upcoming projects.

The metadata available from an Archive-It collection

Archive-It curators can use the predefined metadata fields of Dublin Core. They can also supply their own custom metadata fields.

A screenshot of Archive-It collection 4515 with metadata annotated.

Above is Archive-It collection 4515, named 2013 BART Strike and collected by the San Francisco Public Library. This collection's curators generated quite a bit of metadata. In this screenshot, we can see the following metadata fields for the collection:

Subject
Creator
Publisher
Source
Format
Rights
Language
Collector

In addition to collection-wide metadata, we see that the first seed has the following metadata applied:

Creator
Publisher
Language
Format
Date

For research purposes, there is quite a lot of data here to be analyzed, especially when comparing collections as we did in "The Many Shapes of Archive-It". I discovered that most collections used the controlled vocabulary from Dublin Core, shown as blue in the bar chart below, more often than freeform vocabulary, shown in green.

Distribution of the top 20 collection-wide metadata fields in public Archive-It collections.

Each collection can have one or more topics. As shown in the screenshot below, the curator can choose from the controlled vocabulary offered by the collection topics field. They can also add their own freeform topics in the subject field. The public-facing interface combines entries from both of these input fields into a single subject field.

Curators can add metadata using the metadata page of one of their Archive-It collections.

The bar chart below shows the distribution of the top 20 topics in public Archive-It collections. I discovered that most curators apply the controlled vocabulary topics to their collections.

Distribution of the top 20 collection-wide subjects (also called topics) of public Archive-It collections.

This creates a confusing nomenclature. When viewing an Archive-It collection from the outside, everything is displayed as part of the subject field. Because of this, both the rest of this post and Archive-It Utilities use the subject field to refer to these topics.

As work for "The Many Shapes of Archive-It" progressed, we focused more on collecting seed lists and then mementos for further analysis. We tried to predict the topics using machine learning, but were unsuccessful and chose a different path for predicting the semantic categories of a collection. Most of the metadata gathered did not make it into the study's results, but will be used in future work. I have included these results here to show the kinds of questions one can answer with Archive-It Utilities.

Installation

Archive-It Utilities requires Python 3.6. Once that requirement has been met, you can install it using:

pip install aiu

It provides several experimental executables. We will only cover fetch_ait_metadata in this post.

Running `fetch_ait_metadata`

The fetch_ait_metadata command produces a JSON file containing all of the information available about a public Archive-It collection.

To run it on collection 4515 and store the results in file output.json, type the following command:

fetch_ait_metadata -c 4515 -o output.json

The -c option allows one to specify an Archive-It collection and the -o option allows one to indicate where to store the JSON output.

The JSON output looks like the following, truncated for brevity:

{
    "id": "4515",
    "exists": true,
    "metadata_timestamp": "2018-07-01 16:49:39",
    "name": "2013 BART Strike",
    "uri": "https://archive-it.org/collections/4515",
    "collected_by": "San Francisco Public Library",
    "collected_by_uri": "https://archive-it.org/organizations/160",
    "description": "News articles and documents issued by BART regarding the strike of its workers that began on July 1, 2013",
    "subject": [
        "Government",
        "Government - Cities",
        "Government - Counties",
        "2013 BART strike"
    ],
    "archived_since": "Apr, 2014",
    "private": false,
    "optional": {
        "creator": [
            "Bay Area Rapid Transit"
        ],
        "publisher": [
            "Bay Area Rapid Transit"
        ],
        "source": [
            "http://www.bart.gov"
        ],
        "format": [
            "Web pages and PDFs"
        ],
        "rights": [
            "Public Domain"
        ],
        "language": [
            "English"
        ],
        "collector": [
            "San Francisco Public Library"
        ]
    },
    "seed_metadata": {
        "seeds": {
            "http://www.bart.gov/news/articles/2013/news20130617": {
                "collection_web_pages": [
                    {
                        "title": "Sign up for BART labor strike alerts",
                        "creator": [
                            "San Francisco Bay Area Rapid Transit District"
                        ],
                        "publisher": [
                            "San Francisco Bay Area Rapid Transit District"
                        ],
                        "language": [
                            "English"
                        ],
                        "format": [
                            "Web page"
                        ],
                        "date": [
                            "August 14, 2013"
                        ]
                    }
                ],
                "seed_report": {
                    "group": "",
                    "status": "True",
                    "frequency": "NONE",
                    "type": "normal",
                    "access": "True"
                }
            },
...

From this JSON we can see the name of the collection in the name field, the organization that created it in the collected_by field, the subjects the curator applied to the collection as a list in the subject field, and when the collection was created in the archived_since field.
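As a minimal sketch of reading these top-level fields back in Python, the snippet below parses a hand-copied fragment of the output above with the standard json module. The inline SAMPLE string is only a stand-in for output.json; in practice one would presumably call json.load on the file written by fetch_ait_metadata.

```python
import json

# Hand-copied fragment of the JSON shown above; a stand-in for output.json.
SAMPLE = """
{
    "id": "4515",
    "name": "2013 BART Strike",
    "collected_by": "San Francisco Public Library",
    "subject": [
        "Government",
        "Government - Cities",
        "Government - Counties",
        "2013 BART strike"
    ],
    "archived_since": "Apr, 2014"
}
"""

metadata = json.loads(SAMPLE)

# Scalar fields are plain strings; subject is a list of strings.
print("Name:", metadata["name"])
print("Collected by:", metadata["collected_by"])
print("Archived since:", metadata["archived_since"])
for subject in metadata["subject"]:
    print("Subject:", subject)
```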

Within the optional dictionary field, we see values for freeform metadata added by the curator. In this case we have creator, publisher, source, format, rights, language, and collector.
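In the output above, the optional dictionary maps each field name to a list of values, so a short loop is enough to list whatever the curator supplied. The sketch below works on a hand-copied fragment (OPTIONAL_SAMPLE and optional_fields are names invented for this illustration). It assumes that a collection without such metadata either omits the key or leaves it empty, which the output above does not confirm.

```python
import json

# Hand-copied fragment of the "optional" section shown above.
OPTIONAL_SAMPLE = """
{
    "optional": {
        "creator": ["Bay Area Rapid Transit"],
        "source": ["http://www.bart.gov"],
        "rights": ["Public Domain"]
    }
}
"""


def optional_fields(metadata):
    """Return {field name: list of values}, or {} if no optional metadata is present."""
    # Assumption: a missing or empty "optional" key means no such metadata.
    return dict(metadata.get("optional") or {})


fields = optional_fields(json.loads(OPTIONAL_SAMPLE))
for field_name, values in fields.items():
    print(f"{field_name}: {'; '.join(values)}")
```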

Also included is the seed_metadata section, which contains the seeds scraped from the HTML of the Archive-It collection's web pages, combined with the data gathered from the CSV report provided for each Archive-It collection. Above I've listed the seed http://www.bart.gov/news/articles/2013/news20130617 to demonstrate the type of metadata that can be gathered. As noted in "The Many Shapes of Archive-It", seed metadata is optional, but in this example the curator added a title, creator, publisher, language, format, and date to this seed.
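To illustrate how the nested seed structure might be traversed, here is a small sketch that maps each seed URI to the titles recorded for it. It relies only on the key names visible in the truncated output above (seed_metadata, seeds, collection_web_pages, title). SEED_SAMPLE and seed_titles are hypothetical names, and the sketch assumes that seeds without curator metadata have a missing or empty collection_web_pages list.

```python
import json

# Hand-copied, shortened fragment of the "seed_metadata" section shown above.
SEED_SAMPLE = """
{
    "seed_metadata": {
        "seeds": {
            "http://www.bart.gov/news/articles/2013/news20130617": {
                "collection_web_pages": [
                    {
                        "title": "Sign up for BART labor strike alerts",
                        "language": ["English"],
                        "date": ["August 14, 2013"]
                    }
                ],
                "seed_report": {
                    "frequency": "NONE",
                    "type": "normal"
                }
            }
        }
    }
}
"""


def seed_titles(metadata):
    """Map each seed URI to the titles found in its collection_web_pages entries."""
    seeds = metadata["seed_metadata"]["seeds"]
    titles = {}
    for uri, seed in seeds.items():
        # Assumption: the list may be absent or empty when no metadata was added.
        pages = seed.get("collection_web_pages", [])
        titles[uri] = [page["title"] for page in pages if "title" in page]
    return titles


titles = seed_titles(json.loads(SEED_SAMPLE))
for uri, title_list in titles.items():
    print(uri, "->", title_list)
```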

Using Archive-It Utilities In Python Code

This information can also be acquired programmatically using the ArchiveItCollection object. The script below demonstrates how one can acquire the collection name, collecting organization, and the list of seed URIs for Archive-It collection ID 4515.

which produces the following output, truncated for brevity:

The following methods of the ArchiveItCollection class are useful for analyzing the metadata of a collection:

get_collection_name - returns the name of the collection
get_collection_uri - returns the URI of the collection
get_collectedby - returns the name of the collecting organization
get_collectedby_uri - returns the URI of the collecting organization
get_description - returns the content of the "description" field
get_subject - returns a Python list containing the subjects applied to the collection
get_archived_since - returns the content of the "archived since" field
is_private - returns True if the collection is not public, False otherwise
does_exist - returns True if the collection identifier represents a real collection, False otherwise (not all collection identifiers are valid)
list_seed_uris - returns a Python list of seed URIs
get_seed_metadata(uri) - returns a Python dictionary containing metadata for a specific seed at uri
return_collection_metadata_dict - returns a Python dictionary containing all collection-wide metadata
return_seed_metadata_dict - returns a Python dictionary containing all seeds and their metadata
return_all_metadata_dict - returns a Python dictionary containing all collection-wide and seed metadata
save_all_metadata_to_file(filename) - writes all collection-wide and seed metadata out as JSON to a file named filename
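As a rough sketch of how a few of the methods listed above might be combined, the function below builds a small summary from any object that exposes them. It uses only method names from the list. FakeCollection is a hypothetical stand-in invented here so the example runs offline without contacting Archive-It, and summarize_collection is likewise not part of AIU.

```python
def summarize_collection(collection):
    """Build a summary dict from any object exposing the methods listed above."""
    if not collection.does_exist():
        return None
    return {
        "name": collection.get_collection_name(),
        "collected_by": collection.get_collectedby(),
        "subjects": collection.get_subject(),
        # Listing seed URIs is the expensive call for a real collection.
        "seed_count": len(collection.list_seed_uris()),
    }


class FakeCollection:
    """Hypothetical stand-in for an ArchiveItCollection; values copied from the JSON above."""

    def does_exist(self):
        return True

    def get_collection_name(self):
        return "2013 BART Strike"

    def get_collectedby(self):
        return "San Francisco Public Library"

    def get_subject(self):
        return ["Government", "Government - Cities"]

    def list_seed_uris(self):
        return ["http://www.bart.gov/news/articles/2013/news20130617"]


summary = summarize_collection(FakeCollection())
print(summary)
```

With AIU installed, passing an ArchiveItCollection instance in place of FakeCollection should produce a dictionary with the same keys. How that object is constructed is not shown here.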

The code does perform some measure of lazy loading to be nice to Archive-It. If you only need the general collection-wide metadata, it only acquires the first page of the collection. If you need all seed URIs, it must download all Archive-It pages belonging to the collection.
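The general idea of lazy loading can be sketched as follows. This illustrates the pattern only and is not AIU's actual implementation: LazyCollection, fetch_page, and the page dictionaries are all hypothetical, and the sketch assumes the number of pages is known in advance.

```python
class LazyCollection:
    """Illustrative lazy loader; fetch_page is a hypothetical callable, not part of AIU."""

    def __init__(self, fetch_page, page_count):
        self._fetch_page = fetch_page
        self._page_count = page_count
        self._pages = {}  # page number -> parsed page, filled only on demand

    def _page(self, number):
        # Fetch a page the first time it is needed, then reuse the cached copy.
        if number not in self._pages:
            self._pages[number] = self._fetch_page(number)
        return self._pages[number]

    def get_collection_name(self):
        # Collection-wide metadata needs only the first page.
        return self._page(1)["name"]

    def list_seed_uris(self):
        # The full seed list requires every page of the collection.
        uris = []
        for number in range(1, self._page_count + 1):
            uris.extend(self._page(number)["seeds"])
        return uris


requests_made = []


def fake_fetch(number):
    """Record each simulated request instead of touching the network."""
    requests_made.append(number)
    return {"name": "2013 BART Strike", "seeds": [f"http://example.com/seed{number}"]}


collection = LazyCollection(fake_fetch, page_count=3)
collection.get_collection_name()
print(requests_made)  # [1]
collection.list_seed_uris()
print(requests_made)  # [1, 2, 3]
```

Asking for the name costs one simulated request. Asking for the seed list fetches only the pages that are not already cached.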

Summary

Archive-It collections have metadata that can be used to answer many research questions. After working on "The Many Shapes of Archive-It", to be presented at iPres 2018, I used the lessons learned to create Archive-It Utilities, a Python library that can be used to acquire this metadata. Please try it out and log any issues at the GitHub repository https://github.com/oduwsdl/archiveit_utilities.

--Shawn M. Jones
