2020-11-18: Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane

Figure 1: Creating collection growth curves with a web page text derivative

Recently, I have been learning about the Archives Unleashed Toolkit (AUT) and Hypercane, and how these tools can be used together. AUT is one of the tools from the Archives Unleashed Project and can be used to analyze web archive collections. When AUT is given the WARC or ARC files for a web archive collection, it can create network derivatives and text derivatives. In a network derivative, the nodes are the domains in a collection, and two nodes are linked when at least one web page in one domain contains a link to a web page in the other domain. The text derivatives include information about the web pages, images, PDFs, or other documents in the collection. Hypercane, a tool developed by WS-DL's Shawn Jones, can be used to create WARC files that are associated with a public Archive-It collection. The WARC files created by Hypercane can be used as input to AUT.
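
To make the idea of a network derivative concrete, here is a minimal sketch of how page-level links collapse into domain-level edges. It is not AUT's implementation: the function name `domain_edges`, the (source URL, target URL) input format, the use of the URL hostname as a stand-in for the domain, and the choice to skip links within the same host are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_edges(page_links):
    """Collapse page-to-page links into weighted host-to-host edges.

    page_links: iterable of (source_url, target_url) pairs (hypothetical input).
    Returns a Counter mapping (source_host, target_host) to the number of
    page-level links observed between the two hosts.
    """
    edges = Counter()
    for source_url, target_url in page_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Assumption: only links that cross from one host to another
        # become edges; unparsable URLs are ignored.
        if source_host and target_host and source_host != target_host:
            edges[(source_host, target_host)] += 1
    return edges


if __name__ == "__main__":
    links = [
        ("http://a.com/1", "http://b.org/x"),
        ("http://a.com/2", "http://b.org/y"),
        ("http://a.com/1", "http://a.com/2"),
    ]
    for (src, dst), weight in domain_edges(links).items():
        print(src, "->", dst, weight)
```

An edge exists as soon as one page links across, and the count records how many page-level links back it.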

Figure 2: Growth curve for Archive-It collection 366

While learning about these tools, I have created a set of slides and a collection growth curve notebook. The slides describe collection growth curves (Figure 2) and show how to create collection growth curves when using Hypercane and AUT. The collection growth curve notebook can create the collection growth curves for an Archive-It collection when it is given a web page text derivative (Figure 3) from Archives Unleashed Toolkit or Archives Unleashed Cloud. Hypercane’s hc report growth command can also be used to create collection growth curves. "The Many Shapes of Archive-It" (Jones et al., iPres 2018) first applied the concept of collection growth curves to Archive-It collections.

Figure 3: Web page text derivative

Collection growth curves can be used to gain a better understanding of a collection's seed curation and crawling behavior. A collection growth curve (Figure 2) has two lines: the seed line and the seed memento line. The seed line is associated with the seed curation process, and it shows when the seeds for the collection were added. The seed memento line is associated with the crawling behavior, because it shows when seed mementos were crawled during the collection's life. Figure 4 shows how the shape of the lines can change depending on when the seeds or seed mementos were added during the collection's life.
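
The two lines can be sketched in a few lines of Python. This is not the code from the collection growth curve notebook or from Hypercane; it is a minimal illustration under stated assumptions: each record is a (crawl date, URL) pair with the date starting in `YYYYMMDD` form, the records have already been restricted to seed URLs, and a seed counts as "added" at the time of its first memento. The function name `growth_curve` is hypothetical.

```python
from datetime import datetime


def growth_curve(records):
    """Compute the points of a collection growth curve.

    records: iterable of (crawl_date, url) pairs, where crawl_date is a
    string beginning with YYYYMMDD (assumed format).
    Returns a list of (pct_of_collection_life, pct_of_seeds, pct_of_mementos)
    tuples, one per memento, in chronological order.
    """
    rows = sorted((datetime.strptime(d[:8], "%Y%m%d"), u) for d, u in records)
    if not rows:
        return []

    start, end = rows[0][0], rows[-1][0]
    # Avoid dividing by zero when every memento shares one date.
    life = (end - start).total_seconds() or 1.0
    total_seeds = len({url for _, url in rows})
    total_mementos = len(rows)

    seen = set()
    points = []
    for count, (when, url) in enumerate(rows, start=1):
        seen.add(url)  # a seed is counted the first time it is observed
        points.append((
            100.0 * (when - start).total_seconds() / life,  # x axis
            100.0 * len(seen) / total_seeds,                # seed line
            100.0 * count / total_mementos,                 # seed memento line
        ))
    return points


if __name__ == "__main__":
    sample = [
        ("20200101", "http://example.com/"),
        ("20200101", "http://example.org/"),
        ("20200111", "http://example.com/"),
        ("20200121", "http://example.net/"),
    ]
    for point in growth_curve(sample):
        print("%.1f%% of life: %.1f%% of seeds, %.1f%% of mementos" % point)
```

Plotting the second and third values against the first gives the seed line and the seed memento line.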

Figure 4: The Anatomy of a collection growth curve

Figure 5 shows an overview of the process for creating collection growth curves, which includes creating a web page text derivative.

Figure 5: Steps for creating collection growth curves

Steps for creating collection growth curves:
  1. Use Hypercane to create WARCs associated with a public Archive-It collection
  2. Create a web page text derivative with Archives Unleashed Toolkit
  3. Upload the web page text derivative to Zenodo
  4. Use the collection growth curve notebook

The slides go over these steps for creating a web page text derivative with Archives Unleashed Toolkit and show how to use the collection growth curve notebook.


-Travis Reid (@TReid803)
