Monday, October 26, 2009

2009-10-26: Communications of the ACM Article Published

The article "Why Websites Are Lost (and How They're Sometimes Found)" has finally been published in the November 2009 issue of Communications of the ACM. Co-written with Frank McCown and Cathy Marshall, it was accepted for publication in the fall of 2007. Although we've had a pre-print available since 2008, it just isn't the same until you see it in print.

Except we won't be seeing this in print; it is instead published in the "Virtual Extension" part of the CACM. So even though it has page numbers (pp. 141-145), this article won't be among those that arrive in your mailbox in a few weeks. As someone who has spent his entire career trying to transform the scholarly communication process with the web and digital libraries, I completely understand this move by the CACM, but I have to admit I'm disappointed that I won't see a printed, bound copy. Even though in the long term all discovery will come from the web (e.g., Google Scholar or personal publication lists), the short-term thrill of receiving the hard copy in the mail is hard to replace.

The article itself is a very nice summary of the problem area. The idea to write the paper came from our involvement in Warrick, a tool for reconstructing lost web sites. Warrick was very successful, and interest in it was so high that we eventually became distracted from the mechanics of reconstruction and our focus turned to the question "why are people losing all these sites?!" We learned quite a bit.

Interested readers might also like: our paper in Archiving 2007, Frank's dissertation, or any of the several papers by Cathy on personal (digital) archiving.


Thursday, October 15, 2009

2009-10-15: Seminars at Emory University

I recently traveled to Emory University to visit with Joan Smith (an alumna of our group -- PhD, 2008) and Rick Luce. While there, I gave two colloquiums: on October 1 at the Woodruff Library on OAI-ORE, and on October 2 at the Mathematics & Computer Science Department on web preservation (specifically, based on Martin Klein's PhD research).

I've uploaded both sets of slides. The first, "OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project", is based on slides from Herbert Van de Sompel:

The second, "(Re-) Discovering Lost Web Pages", is an extended version of slides presented at the NDIIPP Partners Meeting this summer:


Monday, October 5, 2009

2009-10-05: Web Page for the Memento Project Is Available

The Library of Congress-funded research project "Tools for a Preservation Ready Web" is coming to a close. The initial phase (2007-2008) of the project funded Joan Smith's PhD research into using the web server to inform web crawlers exactly how many valid URIs there are at a web site (the "counting problem") and to perform server-side generation of preservation metadata at dissemination time (the "representation problem"). Several interesting papers came out of that project (e.g., WIDM 2006, D-Lib 14(1/2)), as did the mod_oai Apache module. Joan graduated in 2008 and is now the Chief Technology Strategist for the Emory University Libraries and an adjunct faculty member in the CS department at Emory.

Since that time, Herbert and I (plus our respective teams) have been closing out this project by working on some further ideas regarding the preservation of web pages and how web archives can be integrated with the "live web". The result is the Memento Project, which has a few test pages that are collecting links from robots and interactive users; we will use those links in a description and analysis to be published shortly. In the meantime, the test pages feature some clever scripting from Rob to show Herbert and me standing next to BBC and CNN web pages, respectively. Check them out:

And here are their respective URIs (just for fun):

I'll post a further update on WS-DL when we publish the description of how Memento works. We'd like to again thank the National Digital Information Infrastructure and Preservation Program for their support of the "Tools for a Preservation Ready Web" project.

-- Michael