Monday, August 20, 2012

2012-08-20: MS Thesis: An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication

I am pleased to report the successful completion of my Master's thesis, "An Extensible Framework for Creating Personal Archives of Web Resources Requiring Authentication". The problem I set out to address is one that plagues software like Archive Facebook to this day: when the hierarchy of a social media website changes, tools created to preserve content on that site tend to break. By conforming these tools to a specification that is set up to represent the hierarchy of the target social media website, the tools become adaptive without the need for continuous maintenance on the part of the developer.

The study also explored and enumerated the aspects of personal web archiving that prevent the field from taking advantage of the tools, procedures, and media that are widely used in conventional web archiving. In addition to identifying the problem, I created a Google Chrome extension, WARCreate, that allows the user to preserve any viewable webpage in the Web ARChive (WARC) format.

The Internet Archive's Heritrix web crawler outputs preserved webpages in this format, and their replay system, the Wayback Machine, is set up to consume it. Allowing a user to preserve webpages in this format is therefore a step toward bridging the gap between conventional and personal web archiving.
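To make the format a little more concrete, here is a minimal sketch of how a single captured HTTP response might be serialized as a WARC "response" record. It is not WARCreate's code (WARCreate is a Chrome extension); the function name `build_warc_record` and the choice of header fields beyond the mandatory ones are assumptions made for illustration, following the general record layout of WARC 1.0.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one captured HTTP response as a WARC 'response' record.

    Hypothetical helper for illustration only; not taken from WARCreate.
    """
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    # Version line, then named header fields, then a blank line.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The record block (the raw HTTP response) is followed by two CRLFs.
    return head.encode("utf-8") + http_response + b"\r\n\r\n"
```

Several such records written one after another to a file form a WARC file, which is the kind of file a replay system can then index and serve.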

WARCreate's functionality was first presented at JCDL 2012 and further demonstrated at Digital Preservation 2012. Also at Digital Preservation 2012, I received an Innovation Award as Future Steward from the National Digital Stewardship Alliance (NDSA) and was subsequently interviewed on the Library of Congress / NDSA blog The Signal.

After a lengthy review process, I defended my thesis on August 3, 2012 and submitted the finalized version of the document to the registrar soon after.

I am extremely grateful to my advisor, Dr. Michele C. Weigle, for her patience in helping me to get my writing up to par; to Dr. Michael L. Nelson, for ensuring that my ideas were sound enough for public presentation; and to Dr. Yaohang Li, for his ideas on how to make my thesis research more theoretical in future work.

Starting in Fall 2012, I will continue my research at Old Dominion University as a PhD student.

— Mat Kelly

Wednesday, August 15, 2012

2012-08-10: MS Thesis - Visualizing Digital Collections at Archive-It

Archive-It is a subscription web archiving service, provided by the Internet Archive, that allows institutions and users to create, maintain, and view digital collections of web resources. The current interface of Archive-It is largely text-based, supporting drill-down navigation using lists of URIs. While this interface provides good searching capabilities, it is not very efficient for browsing. This was our motivation for thinking about new visualizations to make it easy for users to browse Archive-It collections.

This work, "Visualizing Digital Collections at Archive-It", was the subject of a recent MS thesis by Kalpesh Padia (who is continuing his Ph.D. studies at NC State University) and a JCDL 2012 short paper by Kalpesh Padia, Yasmin AlNoamany, and Michele C. Weigle.

In order to provide a better visual experience to users of Archive-It collections, we implemented six different visualizations (treemap, time cloud, bubble chart, image plot, timeline, and wordle). The work also provides a rule-based categorization of sites in collections that lack a curator-defined grouping.

Here we describe each visualization along with an example screenshot.

Treemap
Treemaps are particularly useful for navigating hierarchical collections at Archive-It. The size of each node represents the number of webpages in a category at Level 1 and the number of mementos for a webpage at Level 2. The color of each node indicates how recently the webpage was added to the collection.

Time Cloud
Time cloud presents tag clouds of the terms appearing in a collection over a timespan. A time slider lets users select the timespan, and tag clouds built from term frequency (TF) and TF-IDF weights can be viewed simultaneously to identify thematic changes in the collection.
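As a rough illustration of the two weightings behind the tag clouds, the sketch below computes TF and TF-IDF scores for the documents that fall inside a selected timespan. The exact formulas and tokenization used in the thesis are not stated here, so this assumes raw term counts, the common log(N/df) inverse document frequency, and whitespace tokenization; `term_scores` is a hypothetical name.

```python
import math
from collections import Counter
from datetime import date


def term_scores(docs, start, end):
    """Return (tf, tfidf) dicts for documents captured within [start, end].

    docs: list of (capture_date, text) pairs. Assumed input shape.
    """
    # Keep only documents inside the timespan chosen with the slider.
    selected = [text.lower().split() for d, text in docs if start <= d <= end]
    # TF: total occurrences of each term across the selected documents.
    tf = Counter(term for tokens in selected for term in tokens)
    # DF: number of selected documents in which each term appears.
    df = Counter(term for tokens in selected for term in set(tokens))
    n = len(selected)
    tfidf = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return dict(tf), tfidf
```

Under these assumptions, a term that occurs in every selected document gets a TF-IDF weight of zero, which is why the two clouds can look quite different for the same timespan.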

Image plot with histogram
Image plot with histogram is an implementation of an inverted stacked bar chart that represents all sites in a collection graphically. The chart is divided according to the collection’s defined groups. The image plot allows visual browsing and querying of the webpages in the collection.

Hovering over any image in the image plot reveals a wordle summarizing the content of the site, so the user can easily gain insight into its context.

Bubble chart
Bubble chart provides a quick summary of the collection by displaying each group in the collection as a bubble, where the size of the bubble represents the number of sites in each group.
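One way to map group sizes to bubbles is sketched below. The post only says that bubble size reflects the number of sites, so scaling the area (rather than the radius) with the count is an assumption, as are the function name `bubble_radii` and the `max_radius` parameter.

```python
import math


def bubble_radii(group_sizes, max_radius=50.0):
    """Map each group's site count to a radius whose area is proportional to the count.

    group_sizes: dict of group name -> number of sites (assumed input shape).
    """
    largest = max(group_sizes.values(), default=0)
    if largest == 0:
        return {g: 0.0 for g in group_sizes}
    # Area grows with r squared, so take the square root of the relative count.
    return {g: max_radius * math.sqrt(n / largest) for g, n in group_sizes.items()}
```

With area scaling, a group with a quarter as many sites as the largest group is drawn with half its radius.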

Timeline
A curator can easily see the growth of a collection over time by looking at site density and analyzing the addition (or removal) of sites from the collection. The timeline visualization also uncovers the collection structure and patterns in URI seeding and crawl schedule.

Here's a video showing the bubble chart, image plot with histogram, and timeline interface.

Rule-based categorization
We also provide an option to explore a collection using a rule-based categorization when the curator did not define a grouping. This suggested categorization is also useful for helping users understand which sources and what media types contribute the most to a collection. The following figure shows an image plot with histogram for the Pakistan floods (2011) collection before (left) and after (right) categorization.
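The general idea of rule-based categorization can be sketched as an ordered list of rules applied to each URI, with the first match deciding the category. The specific rules used in the thesis are not listed in this post, so every rule, category name, and function name below is hypothetical and chosen only to show grouping by source and media type.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical rules: (category, predicate on the parsed URI). First match wins.
RULES = [
    ("video", lambda u: u.path.lower().endswith((".mp4", ".flv"))),
    ("image", lambda u: u.path.lower().endswith((".jpg", ".png", ".gif"))),
    ("government", lambda u: (u.hostname or "").endswith(".gov")),
    ("social media", lambda u: (u.hostname or "") in {"twitter.com", "www.facebook.com"}),
]


def categorize(uri):
    """Return the first matching category for a URI, or 'other'."""
    parsed = urlparse(uri)
    for category, matches in RULES:
        if matches(parsed):
            return category
    return "other"


def category_counts(uris):
    """Count how many URIs fall into each suggested category."""
    return Counter(categorize(u) for u in uris)
```

Counting the categories over a collection's seed URIs gives the kind of per-group totals that a grouped visualization needs when no curator-defined grouping exists.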