2013-10-11: Archive What I See Now

Earlier this year, we were awarded an NEH Digital Humanities Start-Up Grant for our project "Archive What I See Now": Bringing Institutional Web Archiving Tools to the Individual Researcher.

We were invited to attend the NEH Office of Digital Humanities Project Directors' Meeting in early October, but due to the government shutdown, the meeting was cancelled.  Here I'll give the quick overview of the project that I'd planned for that meeting.  (Mat Kelly has already posted a nice description of the tools we've been developing, WARCreate and WAIL, at http://bit.ly/wc-wail.)

The slides I'd prepared are below:

Our project is focused on helping people archive web pages. Since much of our cultural heritage is now published on the web, we want to make sure that important pages are archived for the future.

Since 1996, the Internet Archive and other archiving services have done great work preserving web pages.  But the Internet Archive can only do so much. What if you have a website that the Internet Archive doesn’t or can’t crawl, or one that changes more frequently than it would be crawled? Until now, your only option was to archive the page yourself, either using ad-hoc methods like “Save Page As” or by attempting to install your own crawler and Wayback Machine instance.

Our partners in this project include church historians who want to allow individual churches to archive their own websites, artists who want to preserve their own sites, political scientists who want to archive conversations about elections in social media, and social scientists who want to archive conversations about disasters in social media.

There are a few problems here that we’re addressing. First, if you want an archive of a webpage in the standard format, called WARC, you have to install and configure some rather complex software. Second, if the webpages you want to archive are behind authentication, the crawler will not be able to access them. Third, you typically set the crawl frequency ahead of time, so if you find a page that you want to archive and it might change soon, it may be difficult to schedule a crawl in time.

So, we’ve built some tools that allow you to get around these problems. They let you “Archive”, “What I See”, “Now”.  Essentially, what you see in the browser is what gets archived.

The two tools that we've developed are WARCreate and WAIL. WARCreate is a browser extension (right now for Chrome, but Firefox is coming soon) that lets you create a WARC of whatever page you’re viewing. It can be on social media, it can be a dynamic page, or it can be behind authentication. The WARC is created locally and saved on your local machine.
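For readers who haven't seen a WARC before, here is a rough sketch in Python of what a single WARC/1.0 "response" record looks like on disk. This is not WARCreate's code (WARCreate runs as a browser extension); the function name, the example URI, and the sample HTTP response are all made up for illustration, and a real capture would also include request records and every embedded resource on the page.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in one WARC/1.0 record.

    Hypothetical helper for illustration only; not part of WARCreate or WAIL.
    """
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        # Capture time in UTC, in the timestamp form WARC expects
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        # Length in bytes of the block that follows the blank line
        f"Content-Length: {len(http_response)}",
    ]
    header_block = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block
    return header_block + http_response + b"\r\n\r\n"


# Made-up stand-in for what the browser received for a page
sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello, archive</body></html>"
)

record = build_warc_response_record("http://example.com/", sample_response)
print(record.decode("utf-8"))
```

A WARC file is essentially a sequence of records like this one, which is why replay software can index a file by URI and capture date.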

So, now that you have a WARC, what do you do with it? Our second tool, WAIL, addresses this issue. It is a package that contains Heritrix, the Internet Archive’s crawler, and wayback, the software behind the Wayback Machine. This package installs and configures the software in one click. Once WAIL is running, you can point it to a directory of WARCs that you created with WARCreate, and then you can access your archives locally using the Wayback Machine interface.

Right now, WARCreate can only archive a single page and just saves it locally. We are working on building in the ability to archive a set of pages, or a whole site, and the ability to upload the created WARC to a remote server, including a service like the Internet Archive’s Archive-It.

We hope that these two tools will be useful and can help non-IT experts archive important pages for the future.

If you try out these tools, please fill out our feedback form at http://bit.ly/wc-wail-feedback


2014-09-16 Edit:

We've received a follow-on Digital Humanities Implementation Grant for this project. See http://ws-dl.blogspot.com/2014/07/2014-07-22-archive-what-i-see-now.html for more information.

2020-01-23 Edit: Updated SlideShare embed code -- MLN