2018-07-03: Extracting Metadata from Archive-It Collections with Archive-It Utilities
At iPres 2018, I will be presenting "The Many Shapes of Archive-It", a paper that focuses on some structural features inherent in Archive-It collections. The paper is now available as a preprint on arXiv.
As part of the data gathering for "The Many Shapes of Archive-It", and also as part of the development of the Off-Topic Memento Toolkit, I had to write code that extracts metadata and seeds from public Archive-It collections. This capability will be useful to several aspects of our storytelling and summarization work, so I used the knowledge gained from those projects to produce a standalone Python library named Archive-It Utilities (AIU). This library is currently in alpha status, but is already being used in upcoming projects.
The metadata available from an Archive-It collection
Archive-It curators can use the predefined metadata fields of Dublin Core. They can also supply their own custom metadata fields.
[Figure: A screenshot of Archive-It collection 4515 with metadata annotated.]
Above is Archive-It collection 4515, named 2013 BART Strike and collected by the San Francisco Public Library. This collection's curators generated quite a bit of metadata. In this screenshot, we can see the following metadata fields for the collection:
- Subject
- Creator
- Publisher
- Source
- Format
- Rights
- Language
- Collector
- Creator
- Publisher
- Language
- Format
- Date
[Figure: Distribution of the top 20 collection-wide metadata fields in public Archive-It collections.]
Each collection can have one or more topics. As shown in the screenshot below, the curator can choose from the controlled vocabulary offered by the collection topics field. They can also add their own freeform topics in the subject field. The public-facing interface combines entries from both of these input fields into the public-facing subject field.
[Figure: Metadata can be added by curators by using the metadata page of one of their Archive-It collections.]
The bar chart below shows the distribution of the top 20 topics in public Archive-It collections. I discovered that most curators apply the controlled vocabulary topics to their collections.
[Figure: Distribution of the top 20 collection-wide subjects (also called topics) of public Archive-It collections.]
Having both a topics field and a subject field on the curator's side creates a confusing nomenclature. When viewing an Archive-It collection from the outside, everything is displayed as part of the subject field. Because of this, the rest of this post, and Archive-It Utilities, use the term subject to refer to these topics.
As work for "The Many Shapes of Archive-It" progressed, we focused more on collecting seed lists and then mementos for further analysis. We tried to predict the topics using machine learning, but were unsuccessful and chose a different path for predicting the semantic categories of a collection. Most of the metadata gathered did not make it into the study's results, but will be used in future work. I have included these results here to show the kinds of questions one can answer with Archive-It Utilities.
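As a rough illustration of how a subject distribution like the one charted here could be tallied, the sketch below counts subjects across a handful of collection records. The records are made-up placeholders rather than data from the study, and the function name is hypothetical; in practice each list would come from the subject field of a collection's metadata.

```python
from collections import Counter

# Hypothetical records: collection label -> list of subjects.
# These values are placeholders for illustration, not real Archive-It data.
collections = {
    "A": ["Government", "Society & Culture"],
    "B": ["Society & Culture"],
    "C": ["Arts & Humanities", "Society & Culture", "Government"],
}


def top_subjects(records, n=20):
    """Return the n most common subjects as (subject, count) pairs."""
    counts = Counter()
    for subjects in records.values():
        # Count each subject once per collection, even if it is listed twice.
        counts.update(set(subjects))
    return counts.most_common(n)


for subject, count in top_subjects(collections):
    print(f"{count:3d}  {subject}")
```

Counting each subject once per collection keeps a single heavily tagged collection from skewing the totals; whether that matches the counting used for the charts in this post is not something the post states.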
Installation
Archive-It Utilities requires Python 3.6. Once that requirement has been met, you can install it using:
pip install aiu
It provides several experimental executables. We will only cover fetch_ait_metadata in this post.
Running fetch_ait_metadata
The fetch_ait_metadata command produces a JSON file containing all of the information available about a public Archive-It collection.
To run it on collection 4515 and store the results in file output.json, type the following command:
fetch_ait_metadata -c 4515 -o output.json
The -c option allows one to specify an Archive-It collection and the -o option allows one to indicate where to store the JSON output.
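If several collections are of interest, the same command can be driven from a script. The following is a minimal sketch that relies only on the -c and -o options described above; the helper names and the output file naming scheme are assumptions made for illustration, and the download step is skipped when the command is not installed.

```python
import shutil
import subprocess


def build_command(collection_id, output_file):
    """Build the argument list for one fetch_ait_metadata run."""
    return ["fetch_ait_metadata", "-c", str(collection_id), "-o", output_file]


def fetch_many(collection_ids):
    """Run fetch_ait_metadata once per collection, if it is installed."""
    if shutil.which("fetch_ait_metadata") is None:
        print("fetch_ait_metadata not found on PATH; nothing was run")
        return []
    written = []
    for cid in collection_ids:
        output_file = f"collection-{cid}.json"  # hypothetical naming scheme
        subprocess.run(build_command(cid, output_file), check=True)
        written.append(output_file)
    return written


# Show the command that would be run for collection 4515.
print(build_command(4515, "output.json"))
```

Calling fetch_many([4515]) would then write collection-4515.json, one network-bound run per identifier.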
The JSON output looks like the following, truncated for brevity:
From this JSON we can see the name of the collection, which organization created it from the collected_by field, the subjects the curator applied to the collection as a list in the subject field, and when the collection was created in the archived_since field.
Within the optional dictionary field, we see values for freeform metadata added by the curator. In this case we have creator, publisher, source, format, rights, language, and collector.
Also included is the "seed metadata" section containing a list of seeds both scraped from the HTML of the Archive-It collection's web pages and also gathered from the CSV report provided for each Archive-It collection. Above I've listed the seed http://www.bart.gov/news/articles/2013/news20130617 to demonstrate the type of metadata that can be gathered. As noted in "The Many Shapes of Archive-It", seed metadata is optional, but in this example the curator added a title, creator, publisher, language, format, and date to this seed.
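Once the file exists, the collection-wide fields can be read with the standard json module. The sketch below is hedged: the field names collected_by, subject, archived_since, and optional come from the description above, while the "name" key and every placeholder value in the sample are assumptions standing in for a real output.json.

```python
import json

# Hypothetical stand-in for the contents of output.json. The field names
# collected_by, subject, archived_since, and optional come from the text;
# the "name" key and all placeholder values are assumptions.
sample = json.loads("""
{
  "name": "2013 BART Strike",
  "collected_by": "San Francisco Public Library",
  "subject": ["placeholder subject"],
  "archived_since": "placeholder date",
  "optional": {"language": ["placeholder language"]}
}
""")


def summarize(metadata):
    """Return a one-line summary built from the collection-wide fields."""
    subjects = ", ".join(metadata.get("subject", []))
    optional_fields = sorted(metadata.get("optional", {}))
    return (
        f"{metadata.get('name', 'unknown')} | "
        f"collected by {metadata.get('collected_by', 'unknown')} | "
        f"subjects: {subjects} | "
        f"archived since {metadata.get('archived_since', 'unknown')} | "
        f"optional fields: {', '.join(optional_fields)}"
    )


print(summarize(sample))

# With a real file, one would load it instead of using the sample:
# with open("output.json") as f:
#     print(summarize(json.load(f)))
```

Using .get with defaults keeps the summary from failing on collections whose curators left some fields empty.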
Using Archive-It Utilities In Python Code
This information can also be acquired programmatically using the ArchiveItCollection object. The script below demonstrates how one can acquire the collection name, collecting organization, and list of seed URIs for Archive-It collection ID 4515.
which produces the following output, truncated for brevity:
The following methods of the ArchiveItCollection class are useful for analyzing the metadata of a collection:
- get_collection_name: returns the name of the collection
- get_collection_uri: returns the URI of the collection
- get_collectedby: returns the name of the collecting organization
- get_collectedby_uri: returns the URI of the collecting organization
- get_description: returns the content of the "description" field
- get_subject: returns a Python list containing the subjects applied to the collection
- get_archived_since: returns the content of the "archived since" field
- is_private: returns True if the collection is not public, False otherwise
- does_exist: returns True if the collection identifier represents a real collection, False otherwise (not all collection identifiers are valid)
- list_seed_uris: returns a Python list of seed URIs
- get_seed_metadata(uri): returns a Python dictionary containing metadata for a specific seed at uri
- return_collection_metadata_dict: returns a Python dictionary containing all collection-wide metadata
- return_seed_metadata_dict: returns a Python dictionary containing all seeds and their metadata
- return_all_metadata_dict: returns a Python dictionary containing all collection-wide and seed metadata
- save_all_metadata_to_file(filename): writes all collection-wide and seed metadata out as JSON to a file named filename
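To show how a few of these methods might be combined, here is a minimal sketch of a reporting function that accepts any object exposing get_collection_name, get_collectedby, and list_seed_uris. So that it runs without network access, it is exercised with StubCollection, a hypothetical stand-in class that is not part of the library; the commented lines at the end assume an import path and constructor argument that this post does not spell out.

```python
class StubCollection:
    """Hypothetical stand-in exposing three of the methods listed above."""

    def __init__(self, name, collected_by, seed_uris):
        self._name = name
        self._collected_by = collected_by
        self._seed_uris = seed_uris

    def get_collection_name(self):
        return self._name

    def get_collectedby(self):
        return self._collected_by

    def list_seed_uris(self):
        return list(self._seed_uris)


def describe(collection, max_seeds=3):
    """Build a short text report from a collection-like object."""
    seeds = collection.list_seed_uris()
    lines = [
        f"Collection: {collection.get_collection_name()}",
        f"Collected by: {collection.get_collectedby()}",
        f"Seeds: {len(seeds)}",
    ]
    # List only the first few seed URIs to keep the report short.
    lines.extend(f"  {uri}" for uri in seeds[:max_seeds])
    return "\n".join(lines)


stub = StubCollection(
    "2013 BART Strike",
    "San Francisco Public Library",
    ["http://www.bart.gov/news/articles/2013/news20130617"],
)
print(describe(stub))

# With the library installed, the stub might be swapped for something like
# the following (import path and constructor argument are assumptions):
# from aiu import ArchiveItCollection
# print(describe(ArchiveItCollection(4515)))
```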
The code does perform some measure of lazy loading to be nice to Archive-It. If you only need the general collection-wide metadata, it only acquires the first page of the collection. If you need all seed URIs, it must download all Archive-It pages belonging to the collection.
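The general idea of lazy loading can be sketched as follows. This is not the library's actual implementation: LazyCollection, fetch_page, and the page layout are all hypothetical, and a real scraper would discover the page count while crawling rather than receive it up front. The point is only that a page is downloaded the first time something asks for it and reused afterwards.

```python
class LazyCollection:
    """Sketch of lazy loading; not the library's actual implementation."""

    def __init__(self, fetch_page, page_count):
        # fetch_page is a hypothetical callable: page number -> dict with
        # "seeds" (list of URIs) and, on page 1 only, "metadata".
        self._fetch_page = fetch_page
        self._page_count = page_count
        self._pages = {}

    def _page(self, number):
        # Download a page only the first time it is requested.
        if number not in self._pages:
            self._pages[number] = self._fetch_page(number)
        return self._pages[number]

    def get_collection_name(self):
        # Collection-wide metadata needs only the first page.
        return self._page(1)["metadata"]["name"]

    def list_seed_uris(self):
        # Seed URIs are spread over every page, so all must be fetched.
        seeds = []
        for number in range(1, self._page_count + 1):
            seeds.extend(self._page(number)["seeds"])
        return seeds


requested = []


def fake_fetch(number):
    """Pretend download that records which pages were requested."""
    requested.append(number)
    page = {"seeds": [f"http://example.com/seed{number}"]}
    if number == 1:
        page["metadata"] = {"name": "Example collection"}
    return page


collection = LazyCollection(fake_fetch, page_count=3)
collection.get_collection_name()
print(requested)  # only page 1 has been requested so far
collection.list_seed_uris()
print(requested)  # pages 2 and 3 fetched; page 1 reused from the cache
```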
Summary
Archive-It collections have metadata that can be used to answer many research questions. After working on "The Many Shapes of Archive-It", to be presented at iPres 2018, I used the lessons learned to create Archive-It Utilities as a Python library that can be used to acquire this metadata. Please try it out and log any issues at the GitHub repository https://github.com/oduwsdl/archiveit_utilities.
--Shawn M. Jones