2020-07-29: Working With Archives Unleashed Cloud

Figure 1: Analyzing an Archive-It collection with Archives Unleashed Cloud to create derivatives

Archives Unleashed Cloud is one of the tools from the Archives Unleashed Project that I have been learning about recently. (The Archives Unleashed Project was recently awarded a $1M grant from the Andrew W. Mellon Foundation to continue their work and further integrate Archives Unleashed and Archive-It.) As Figure 1 illustrates, Archives Unleashed Cloud takes an Archive-It collection, performs domain-level and textual analysis, and produces derivatives that can be directly visualized or imported into other tools. The crawl report for an Archive-It collection is shown at the top of Figure 1. The collection, "South Louisiana Flood - 2016", was created by my advisor Dr. Michele Weigle to archive an unexpected event: the flooding that occurred in southern Louisiana in August 2016. I used Archives Unleashed Cloud to analyze this Archive-It collection so that I could get derivatives for the collection, such as hyperlink diagrams and text derivatives, along with other visualizations.

Archives Unleashed Cloud is a web-based tool for analyzing an Archive-It collection and creating derivatives from it. Currently, users must have Archive-It credentials in order to use Archives Unleashed Cloud, and only the collections associated with your Archive-It account can be analyzed.

Archives Unleashed Cloud is aimed at people who already know how to work with text derivatives and network derivatives, or who are willing to learn how to use them by going through the tutorials created by the Archives Unleashed Team. According to the team, this tool "does not require technical skills".

To use Archives Unleashed Cloud, go to http://cloud.archivesunleashed.org/ for the canonical version, or install and run a local version by visiting their GitHub repository at https://github.com/archivesunleashed/auk. When using Archives Unleashed Cloud for the first time, there are a few steps to take, shown in Figure 2. First, sign in with either Twitter or GitHub OAuth. After signing in, click the "Enter Credentials" button and enter an email address and your Archive-It credentials.

Figure 2: Sign in and enter credentials

After signing in, you will see the collections screen. If Archive-It credentials have been entered for your account, this screen will show information about each of the collections associated with that Archive-It account. An example collections screen is shown in Figure 3.

The collections screen shows the following information for each collection:
  • Title of the collection
  • Current status, which shows whether the collection has been downloaded, analyzed, or completed
  • The date when the collection was last analyzed
  • Availability of the collection, which can be public or private
  • Number of ARC/WARC files in the collection
  • Compressed size of the collection

Figure 3: Example collections screen

To analyze a collection, follow the steps below:
  1. Click on the title for the collection that you want to analyze (Figure 4). This will display the collection page.

    Figure 4: Select a collection

  2. Click on the "Analyze Collection" button (Figure 5) to start the analysis process.

    Figure 5: Analyze a collection

  3. Wait for the collection's ARC/WARC files to be downloaded from Archive-It's WASAPI endpoint and then analyzed by the Archives Unleashed Toolkit. Archives Unleashed Cloud will send an email (Figure 6) after the ARC/WARC files have been downloaded and another when the analysis is complete. (A sketch of what the WASAPI download step looks like appears after this list.)

    Figure 6: Completed analysis email

  4. After analysis is complete, the collection page (Figure 7) will have visualizations and will allow you to download derivatives.

    Figure 7: Completed collection page
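
As an illustration of the WASAPI step mentioned in step 3, the sketch below lists the ARC/WARC files that belong to an Archive-It collection through Archive-It's WASAPI data-transfer endpoint. Archives Unleashed Cloud performs this download for you, so this is only a peek behind the scenes; the endpoint URL reflects my reading of the WASAPI documentation, and the credentials and collection ID are placeholders.

    import requests

    # Assumption: Archive-It's WASAPI data-transfer endpoint (see the WASAPI documentation).
    WASAPI_ENDPOINT = "https://partner.archive-it.org/wasapi/v1/webdata"

    # Placeholders: replace with your own Archive-It credentials and collection ID.
    AIT_USERNAME = "your-archive-it-username"
    AIT_PASSWORD = "your-archive-it-password"
    COLLECTION_ID = 12345

    # Request the list of ARC/WARC files that belong to the collection.
    response = requests.get(
        WASAPI_ENDPOINT,
        params={"collection": COLLECTION_ID},
        auth=(AIT_USERNAME, AIT_PASSWORD),
    )
    response.raise_for_status()

    # Each file entry includes a filename, size, and one or more download locations.
    for webdata_file in response.json().get("files", []):
        print(webdata_file["filename"], webdata_file["size"], webdata_file["locations"][0])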

Once the analysis has been completed, three visualizations can be viewed on the collection page: a crawl frequency graph, a hyperlink diagram, and a domain visualization.

The crawl frequency graph (Figure 8) shows the number of webpages archived on each crawl date for a collection.

Figure 8: Crawl frequency graph
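
The same counts can be recomputed from the web page text derivative that is described later in this post. Here is a minimal pandas sketch, assuming that derivative is a headerless CSV whose first column is the crawl date (the file and column names are my assumptions):

    import pandas as pd

    # Placeholder file name; assumed column layout of the web page text derivative.
    columns = ["crawl_date", "domain", "url", "mime_type_web_server",
               "mime_type_tika", "language", "content"]
    pages = pd.read_csv("collection-fulltext.csv", names=columns)

    # Count the number of archived webpages per crawl date,
    # which mirrors the crawl frequency graph on the collection page.
    crawl_frequency = pages.groupby("crawl_date").size().sort_index()
    print(crawl_frequency)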

The hyperlink diagram (Figure 9) shows the domains in a collection and the domains that they link to. Archives Unleashed Cloud makes use of GraphPass and Sigma.js when creating this interactive hyperlink diagram. Each node in the diagram is a domain, node size is based on degree, and nodes from the same community have the same color. An edge between two domains means that at least one webpage in one domain contains a link to a webpage in the other domain.

Figure 9: Hyperlink diagram
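
The node sizing and coloring described above can also be explored programmatically from the downloadable network derivatives described later in this post. Here is a minimal NetworkX sketch, assuming the Gephi derivative is a GEXF file with domain names as node labels (the file name and format are my assumptions):

    import networkx as nx

    # Assumption: the Gephi derivative is a GEXF file; the file name is a placeholder.
    # relabel=True uses the node labels (domain names) as node identifiers.
    graph = nx.read_gexf("collection.gexf", relabel=True)

    # Node size in the hyperlink diagram is based on degree.
    degrees = dict(graph.degree())
    top_domains = sorted(degrees, key=degrees.get, reverse=True)[:10]
    print("Highest-degree domains:", top_domains)

    # Node color is based on community membership; greedy modularity detection
    # is one way to approximate those communities.
    communities = nx.algorithms.community.greedy_modularity_communities(
        graph.to_undirected()
    )
    print("Number of communities:", len(communities))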

The domain visualization (Figure 10) shows the top ten domains for a collection, ranked by the number of times each domain appears in the collection.

Figure 10: Domain visualization

After analysis is completed, there will also be five downloadable derivatives (Figure 11): Gephi, raw network, domains, web page text, and text by domains.

Figure 11: Downloadable derivatives

The Gephi derivative is a network diagram formatted by GraphPass that can be loaded with Gephi. Figure 12 shows an example of this file loaded into Gephi. The layout and formatting are similar to the hyperlink diagram.

Figure 12: Gephi derivative

The raw network derivative is a network diagram that is not formatted like the Gephi derivative, so the user must manually lay out and style the graph. Figure 13 shows an example of the raw network derivative loaded into Gephi.

Figure 13: Raw network derivative
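
If you would rather lay out the raw network programmatically instead of in Gephi, here is a minimal NetworkX and Matplotlib sketch, assuming the raw network derivative is a GraphML file (the file name and format are my assumptions):

    import matplotlib.pyplot as plt
    import networkx as nx

    # Assumption: the raw network derivative is GraphML; the file name is a placeholder.
    graph = nx.read_graphml("collection-raw-network.graphml")

    # The raw network has no layout applied, so compute one with a force-directed algorithm.
    positions = nx.spring_layout(graph, seed=42)

    # Size nodes by degree, similar to the formatted hyperlink diagram.
    node_sizes = [20 + 5 * degree for _, degree in graph.degree()]

    nx.draw_networkx(graph, pos=positions, node_size=node_sizes,
                     with_labels=False, edge_color="lightgray")
    plt.axis("off")
    plt.savefig("raw-network-layout.png", dpi=300)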

The domains derivative (Figure 14) is a CSV file that includes the domains in a collection and the number of times each domain is included in the collection.

Figure 14: Domains derivative
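
Because the domains derivative is a plain CSV, the top ten domains shown in the domain visualization (Figure 10) can be recreated with a few lines of pandas. Here is a minimal sketch, assuming a two-column file of domain and count (the file name and the presence of a header row are my assumptions):

    import matplotlib.pyplot as plt
    import pandas as pd

    # Assumption: a two-column CSV of domain and count; placeholder file name.
    domains = pd.read_csv("collection-domains.csv", names=["domain", "count"])

    # Take the ten most frequent domains, mirroring the domain visualization.
    top_ten = domains.sort_values("count", ascending=False).head(10)
    print(top_ten)

    # Plot the top ten domains as a horizontal bar chart.
    top_ten.plot.barh(x="domain", y="count", legend=False)
    plt.tight_layout()
    plt.savefig("top-domains.png", dpi=300)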

The web page text derivative (Figure 15) is a CSV file that includes information for each webpage in the collection. The information included for each webpage is the crawl date, domain, URL, MIME type reported by the web server, MIME type identified by Apache Tika, language, and the text content of the webpage.

Figure 15: Web page text derivative
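
Here is a minimal pandas sketch for exploring this derivative, assuming the columns appear in the order described above (the exact column names, the file name, and the example domain are placeholders):

    import pandas as pd

    # Assumed column layout of the web page text derivative; placeholder file name.
    columns = ["crawl_date", "domain", "url", "mime_type_web_server",
               "mime_type_tika", "language", "content"]
    pages = pd.read_csv("collection-fulltext.csv", names=columns)

    # Keep only English-language HTML pages from one domain of interest
    # ("example.com" is a placeholder).
    subset = pages[(pages["language"] == "en")
                   & (pages["mime_type_web_server"] == "text/html")
                   & (pages["domain"] == "example.com")]
    print(subset[["crawl_date", "url"]].head())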

The text by domains derivative (Figure 16) is a ZIP file that contains a filtered web page text derivative for each of the top ten most frequent domains included in the collection.

Figure 16: Text by domains derivative
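
Here is a minimal sketch that uses Python's standard library to inspect the contents of this ZIP file (the file name is a placeholder):

    import zipfile

    # Placeholder file name for the text by domains derivative.
    with zipfile.ZipFile("collection-text-by-domain.zip") as archive:
        # List the per-domain text files and their uncompressed sizes.
        for name in archive.namelist():
            print(name, archive.getinfo(name).file_size, "bytes")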

If you want to do further analysis with the downloadable derivatives, then you can use other tools like Gephi, Voyant Tools, Jupyter Notebook, Google Colab, and Archives Unleashed Notebooks.
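
As a small taste of that further analysis, here is a minimal sketch that computes word frequencies from the content column of the web page text derivative, using the same assumed file and column names as the earlier sketch; the same idea scales up nicely in Voyant Tools or the Archives Unleashed Notebooks:

    import re
    from collections import Counter

    import pandas as pd

    # Same assumed file and column layout as the earlier web page text sketch.
    columns = ["crawl_date", "domain", "url", "mime_type_web_server",
               "mime_type_tika", "language", "content"]
    pages = pd.read_csv("collection-fulltext.csv", names=columns)

    # Tokenize the page text and count the most common words as a starting point
    # for further text analysis.
    word_counts = Counter()
    for text in pages["content"].dropna():
        word_counts.update(re.findall(r"[a-z]+", str(text).lower()))

    print(word_counts.most_common(20))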

While learning about Archives Unleashed Cloud, I created some slides that include more details about this tool and the other things I have learned about the derivatives.


For more information about Archives Unleashed Cloud, you can go to archivesunleashed.org/cloud/, read the documentation for this tool, follow @unleasharchives on Twitter, and read about the next steps for the Archives Unleashed Project.

Acknowledgements
Thanks to Dr. Ian Milligan, Nick Ruest, Dr. Michael Nelson, and Dr. Michele Weigle for providing feedback on this blog post and the slides.

-Travis Reid (@TReid803)
