Tuesday, January 24, 2012

2012-01-23: Release of Warrick 2.0 Beta

After a long hiatus, the Warrick tool has been resurrected with some modifications. Warrick is a free utility for reconstructing (or recovering) a website. The original version of Warrick searched the Web Infrastructure (which includes search engine caches and the Internet Archive) for archived versions of web resources. It would automatically download and organize the best versions of the archived resources and package them into a copy of the deleted site.

As discussed by Warrick's creator, Frank McCown, the original version of Warrick was prone to breaking due to frequent changes to search engine APIs and archive URLs. Warrick 2.0, adapted from Dr. McCown's original code by Justin F. Brunelle, interfaces with the Memento framework via the mcurl program (developed by Ahmed AlSum). By incorporating Memento timemaps, Warrick no longer has the responsibility of directly searching and communicating with the caches and archives, or learning about new repositories. Instead, Memento handles the interface and communication with the archives, allowing Warrick to remain unaffected by API or URL changes. This makes Warrick more resistant to failures when repositories change, appear, or disappear. Memento allows Warrick to provide additional functionality, such as the ability to recover sites from a specific point in time by utilizing timemaps and the mcurl program.

Warrick 2.0 has already been helping individuals recover lost web sites. Dag Forssell reached out with the following message:

"I just Googled the idea of restoring a website from the Wayback Machine and discovered your work on Warrick. .... Perhaps you can use my project as one of your guinea pigs.

I am restoring a book by Professor of Law Hugh Gibbons and found when listing his references that he created a website on the principles of law in 2002, then abandoned it in 2006 or so, when he retired. I have downloaded 143 htm files from the Wayback Machine. The site looks complete. But of course, each file comes with its own folder (css, jpgs and such) and the links all point back to the wayback machine. Cleaning it all up will be a lot of work.

If it is in the cards for you ... to take this under your wing, I will be overjoyed."

Dag's site was successfully recovered, and the recovery helped us work out the last remaining bugs before our beta release. Dag mentioned that Warrick eliminated much of the effort on his part by deduplicating the resources downloaded from the Internet Archive and arranging them in the correct site structure. It also allowed him a deeper understanding of how the resources interact within the page. His recovered content will reportedly be available live at http://www.biologyoflaw.org/ (the website is not available at the time of this blog posting).

We are happy to announce the release of the beta source of the project, which can be downloaded from its Google Code site. Installation and usage instructions are also available there.

Warrick is run with a series of command line flags, or options. These are largely unchanged from the original Warrick, but some flags are new. For example, the user now has the -dr and -R options. The -dr option allows the user to specify the date at which the site should be recovered. For example:

warrick.pl -dr 2004-02-01 http://www.cs.odu.edu/

will recover the ODU Computer Science homepage as close as possible to February 1st, 2004.
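
To make "as close as possible" concrete, here is a minimal Python sketch of picking the archived copy nearest to a requested date. It is an illustration only, not Warrick's or Memento's actual code: the function name closest_memento, the archive.example URIs, and all capture dates other than 2004-02-06 are hypothetical.

```python
from datetime import datetime


def closest_memento(mementos, target):
    """Return the (datetime, uri) pair whose capture date is nearest to target."""
    return min(mementos, key=lambda m: abs(m[0] - target))


# Hypothetical list of archived copies of one page (capture date, archive URI).
mementos = [
    (datetime(2003, 11, 20), "http://archive.example/20031120/http://www.cs.odu.edu/"),
    (datetime(2004, 2, 6), "http://archive.example/20040206/http://www.cs.odu.edu/"),
    (datetime(2004, 6, 15), "http://archive.example/20040615/http://www.cs.odu.edu/"),
]

# The date passed to -dr in the command above.
target = datetime.strptime("2004-02-01", "%Y-%m-%d")
best = closest_memento(mementos, target)
print(best[0].date(), best[1])
```

With these made-up capture dates, the copy from 2004-02-06 is selected because it is only five days from the requested date.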

-R is the resume flag, which allows a user to resume a suspended reconstruction job from a saved file.

Let's say we run the following recovery job:

warrick.pl -D MyRecoveryDirectory -k -n 100 http://www.justinfbrunelle.com/

This will recover Justin Brunelle's homepage into the directory MyRecoveryDirectory (the -D flag), convert all absolute links into links relative to the local disk (the -k flag), and stop the recovery after 100 resources are recovered (the -n flag).
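
The idea behind converting absolute links to relative ones can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the code Warrick runs for -k: the helper to_relative is hypothetical, it handles only links on the same host, and it ignores query strings and fragments.

```python
import posixpath
from urllib.parse import urlparse


def to_relative(link, page_url):
    """Rewrite an absolute same-site link relative to the page that contains it."""
    link_parts = urlparse(link)
    page_parts = urlparse(page_url)
    if link_parts.netloc != page_parts.netloc:
        return link  # Off-site links are left untouched.
    # Directory of the page that holds the link; "/" for top-level pages.
    page_dir = posixpath.dirname(page_parts.path) or "/"
    return posixpath.relpath(link_parts.path or "/", start=page_dir)


site = "http://www.justinfbrunelle.com"
print(to_relative(site + "/images/logo.png", site + "/index.html"))
print(to_relative(site + "/index.html", site + "/papers/list.html"))
print(to_relative("http://www.cs.odu.edu/", site + "/index.html"))
```

The file names used here (images/logo.png, papers/list.html) are invented for the example. A link rewritten this way keeps working when the recovered files are opened from the local disk.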

When Warrick completes the recovery session of 100 files, it saves the recovery state in a save file. A user can resume the state by using the -R flag as follows:

warrick.pl -R MYSAVEFILE.save

This will resume the suspended job stored in MYSAVEFILE.save and recover an additional 100 files.
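
The suspend-and-resume behavior can be pictured as saving the list of not-yet-recovered URIs (the frontier) together with the session limit. The Python sketch below is only a model of that idea: the JSON save format, the function names run_job and resume_job, and the example.com URIs are all assumptions, and Warrick's real save file may look quite different.

```python
import json
import os
import tempfile
from collections import deque


def run_job(frontier, limit, save_path):
    """Recover up to `limit` URIs, then write the unfinished work to save_path."""
    queue = deque(frontier)
    recovered = []
    while queue and len(recovered) < limit:
        # Stand-in for the real work of fetching and storing one resource.
        recovered.append(queue.popleft())
    with open(save_path, "w") as fh:
        json.dump({"limit": limit, "frontier": list(queue)}, fh)
    return recovered


def resume_job(save_path):
    """Reload a saved job and continue it with the same per-session limit."""
    with open(save_path) as fh:
        state = json.load(fh)
    return run_job(state["frontier"], state["limit"], save_path)


# Hypothetical site with 250 resources and a limit of 100 per session.
uris = ["http://example.com/page%d.html" % i for i in range(250)]
save_path = os.path.join(tempfile.mkdtemp(), "MYSAVEFILE.save")

first = run_job(uris, 100, save_path)
second = resume_job(save_path)
print(len(first), len(second))
```

In this model each session handles at most 100 URIs, and the save file always holds whatever is left, so the job can be resumed repeatedly until the frontier is empty.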

For a visual example, let's look at one of the aforementioned commands and demonstrate how it can recover a page.

warrick.pl -dr 2004-02-01 http://www.cs.odu.edu/

We can visit the current (as of 2012-01-23) ODU CS website (http://www.cs.odu.edu/) to see the following representation:


To get an idea of what Warrick will recover, we can observe the ODU CS homepage archived at the Internet Archive on 2004-02-06.

After running Warrick, we can view the reconstructed page in my local directory.

This resource is nearly identical to the copy at the Internet Archive. The archive's branding at the top of the page has been removed to keep the representation as faithful as possible to the original resource at the time of archiving. The summary report below shows that all 10 of the recovered resources that make up the page came from the Internet Archive.

#############################################
Memento Timegate Accesses: 11
Internet Archive Contributions: 10
Bing Contributions: 0
Google Contributions: 0
WebCitation Contributions: 0
Diigo Contributions: 0
UK Archives Contributions: 0
URIs obtained from lister Queries: 0
####
Total recoveries completed: 10
Number of cache resources used: 0
Number of resources overwritten: 0
Number of avoided overwrites: 0
Total failed recoveries: 1
Images recovered: 8
HTML pages recovered: 1
Other resources recovered: 1
URIs left in the Frontier: 0
#############################################
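
If you want to check a report like this programmatically, a small Python sketch such as the following can turn it into a dictionary. The parser is our own illustration and not part of Warrick; it simply assumes every statistic appears on its own line as "name: number", as in the report above.

```python
REPORT = """\
#############################################
Memento Timegate Accesses: 11
Internet Archive Contributions: 10
Bing Contributions: 0
Google Contributions: 0
WebCitation Contributions: 0
Diigo Contributions: 0
UK Archives Contributions: 0
URIs obtained from lister Queries: 0
####
Total recoveries completed: 10
Number of cache resources used: 0
Number of resources overwritten: 0
Number of avoided overwrites: 0
Total failed recoveries: 1
Images recovered: 8
HTML pages recovered: 1
Other resources recovered: 1
URIs left in the Frontier: 0
#############################################
"""


def parse_report(text):
    """Collect every 'name: number' line into a dict; skip separator lines."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


stats = parse_report(REPORT)
by_type = (stats["Images recovered"]
           + stats["HTML pages recovered"]
           + stats["Other resources recovered"])
print(stats["Total recoveries completed"], by_type)
```

The per-type counts (8 images, 1 HTML page, 1 other resource) add up to the 10 completed recoveries, all of which are attributed to the Internet Archive.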



More examples can be viewed in the README file available in the source archive or at Warrick's Google Code Wiki.

Please note that this is only a beta version of the software, and it currently runs only via Perl on a Linux command line. A new version of warrick.cs.odu.edu is in development and will be released soon. This web interface will allow users to run Warrick from a browser, so that tech-savvy and non-tech-savvy users alike can benefit from Warrick.

If you utilize Warrick to recover a web site, we are very interested in learning about your experience; this will help us improve Warrick for future users. Please reach out to us via email by joining the WarrickRecovery Google Group (warrickrecovery@googlegroups.com) to learn how you may help.


--Justin F. Brunelle

3 comments:

  1. I'm having an issue running INSTALL. It's unable to install two Perl modules, and I can't tell whether I am omitting a step or failed to do something while installing Perl.

    "Can't locate object method "install" via package "URI" at -e line 1.
    Can't locate object method "install" via package "HTTP::Date" at -e line 1.
    "

  2. Brandon, please contact our Google Groups mailing list so that we can help you out:
    Warrick Google Group
    warrickrecovery@googlegroups.com

    The INSTALL script (see Installing Warrick) just installs some Perl modules. It can be run by typing

    sh ./INSTALL

    If the install file isn't working properly, there is a list of dependencies in the Google Code Wiki.
