2012-08-10: MS Thesis - Visualizing Digital Collections at Archive-It

Archive-It is a subscription web archiving service, provided by the Internet Archive, that allows institutions and users to create, maintain, and view digital collections of web resources. The current interface of Archive-It is largely text-based, supporting drill-down navigation using lists of URIs. While this interface provides good searching capabilities, it is not very efficient for browsing. This was our motivation for thinking about new visualizations to make it easy for users to browse Archive-It collections.

This work, "Visualizing Digital Collections at Archive-It", was the subject of a recent MS thesis by Kalpesh Padia (who is continuing on to Ph.D. studies at NC State University) and a JCDL 2012 short paper by Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle.
To provide a better visual experience to users of Archive-It collections, we implemented six different visualizations (treemap, time cloud, bubble chart, image plot, timeline, and wordle). The work also provides a rule-based categorization of sites for collections that lack a curator-defined grouping.

Here we describe each visualization along with an example screenshot (click for a larger image).

Treemap
Treemaps are particularly useful for navigating hierarchical collections at Archive-It. The size of each node represents the number of webpages in a category at Level 1 and the number of mementos for a webpage at Level 2. The color of each node indicates how recently the webpage was added to the collection.
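
To make the two-level sizing concrete, here is a minimal Python sketch of how such a hierarchy could be assembled. The record layout, the example URIs, and the `build_treemap` helper are assumptions for illustration, not the code used in the thesis.

```python
from collections import defaultdict

# Hypothetical input: (category, webpage URI, number of mementos).
records = [
    ("News", "http://example.com/a", 12),
    ("News", "http://example.com/b", 3),
    ("Blogs", "http://example.org/c", 7),
]

def build_treemap(records):
    """Group webpages by category and attach the node sizes described above."""
    pages_by_category = defaultdict(dict)
    for category, uri, memento_count in records:
        pages_by_category[category][uri] = memento_count

    return {
        "name": "collection",
        "children": [
            {
                "name": category,
                # Level 1: size is the number of webpages in the category.
                "size": len(pages),
                "children": [
                    # Level 2: size is the number of mementos for the webpage.
                    {"name": uri, "size": count}
                    for uri, count in pages.items()
                ],
            }
            for category, pages in pages_by_category.items()
        ],
    }

tree = build_treemap(records)
```

A nested structure like this is the kind of input that common treemap layout libraries expect, although the exact field names vary by library.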

Time Cloud
The time cloud presents tag clouds of the terms appearing in a collection over a timespan. A time slider lets users select a timespan, and tag clouds built from term frequency (TF) and TF-IDF weights can be viewed simultaneously to identify thematic changes in the collection.
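
As a rough illustration of the two weightings, the following Python sketch selects the texts that fall inside a timespan and computes raw TF and a basic TF-IDF for each. The sample data, the whitespace tokenization, and the specific TF-IDF variant (raw count times log(N/df)) are assumptions; the thesis may use a different formulation.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical mementos: (capture date, extracted page text).
mementos = [
    (date(2010, 8, 1), "flood relief flood camps"),
    (date(2010, 9, 15), "flood damage report"),
    (date(2011, 2, 3), "election results report"),
]

def texts_in_span(mementos, start, end):
    """Keep the texts captured within the timespan chosen on the slider."""
    return [text for day, text in mementos if start <= day <= end]

def term_frequencies(texts):
    """Raw term counts (TF), one Counter per text."""
    return [Counter(text.lower().split()) for text in texts]

def tf_idf(texts):
    """TF-IDF per text, using tf * log(N / df) as the weight."""
    tfs = term_frequencies(texts)
    n = len(texts)
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each text counts once per term
    return [
        {term: count * math.log(n / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]

selected = texts_in_span(mementos, date(2010, 1, 1), date(2011, 12, 31))
tf = term_frequencies(selected)
weights = tf_idf(selected)
```

In this sample, "flood" has the highest raw count in the first text, but "relief" receives a higher TF-IDF weight because it appears in only one text. That difference is why viewing both clouds together can be informative.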

Image plot with histogram
The image plot with histogram is an implementation of an inverted stacked bar chart that represents all sites in a collection graphically. The chart is divided according to the collection’s defined groups. The image plot allows visual browsing and querying of the webpages in the collection.

Hovering over any image in the image plot reveals a wordle summarizing the content of the site, so the user can easily gain insight into its context.

Bubble chart
The bubble chart provides a quick summary of the collection by displaying each group in the collection as a bubble, where the size of each bubble represents the number of sites in that group.

Timeline
With the timeline, a curator can easily see the growth of a collection over time by looking at site density and analyzing the addition (or removal) of sites from the collection. The timeline visualization also uncovers the collection structure and patterns in URI seeding and crawl schedule.

Here's a video showing the bubble chart, image plot with histogram, and timeline interface.

Rule-based categorization
We also provide an option to explore a collection using a rule-based categorization when the curator didn't define a grouping. This suggested categorization is also useful for helping users understand which sources and which media types contribute the most to a collection. The following figure shows an image plot with histogram for the Pakistan floods (2011) collection before (the figure on the left) and after (the figure on the right) categorization.
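
The post does not list the actual rules, so the Python sketch below only illustrates the general idea of assigning a category from a URI's media type and source. The rule tables, the category names, and the `categorize` function are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical rules; the thesis's real rule set is not given in this post.
MEDIA_RULES = {".pdf": "Documents", ".jpg": "Images", ".png": "Images", ".mp4": "Video"}
SOURCE_RULES = {"gov": "Government", "edu": "Education", "org": "Organizations"}

def categorize(uri):
    """Return a suggested group for a URI, checking media type before source."""
    parsed = urlparse(uri)

    # Media-type rule: look at the file extension of the path.
    extension = posixpath.splitext(parsed.path)[1].lower()
    if extension in MEDIA_RULES:
        return MEDIA_RULES[extension]

    # Source rule: look at the top-level domain of the host.
    host = parsed.hostname or ""
    top_level_domain = host.rsplit(".", 1)[-1]
    return SOURCE_RULES.get(top_level_domain, "Other")

groups = [categorize(u) for u in (
    "http://example.gov/report.pdf",
    "http://www.example.org/",
    "http://example.com/news",
)]
```

Ordering the rules this way means a PDF hosted on a government site is grouped as a document rather than by its source. A real rule set would need to decide such conflicts deliberately.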