Tuesday, March 10, 2015

2015-03-10: Where in the Archive Is Michele Weigle?

(Title is an homage to a popular 1980s computer game "Where in the World Is Carmen Sandiego?")

I was recently working on a talk to present to the Southeast Women in Computing Conference about telling stories with web archives (slideshare). In addition to our Hurricane Katrina story, I wanted to include my academic story, as told through the archive.

I was a grad student at UNC from 1996-2003, and I found that my personal webpage there had been very well preserved.  It's been captured 162 times between June 1997 and October 2013 (https://web.archive.org/web/*/http://www.cs.unc.edu/~clark/), so I was able to come up with several great snapshots of my time in grad school.

https://web.archive.org/web/20070912025322/
http://www.cs.unc.edu/~clark/
Aside: My UNC page was archived 20 times in 2013, but the archived pages don't have the standard Wayback Machine banner, nor are their outgoing links re-written to point to the archive. For example, see https://web.archive.org/web/20130203101303/http://www.cs.unc.edu/~clark/
Before I joined ODU, I was an Assistant Professor at Clemson University (2004-2006). The Wayback Machine shows that my Clemson home page was only crawled 2 times, both in 2011 (https://web.archive.org/web/*/www.cs.clemson.edu/~mweigle/). Unfortunately, I no longer worked at Clemson in 2011, so those both return 404s:


Sadly, there is no record of my Clemson home page. But, I can use the archive to prove that I worked there. The CS department's faculty page was captured in April 2006 and lists my name.

https://web.archive.org/web/20060427162818/
http://www.cs.clemson.edu/People/faculty.shtml
Why does the 404 show up in the Wayback Machine's calendar view? Heritrix archives every response, no matter the status code. Everything that isn't 500-level (server error) is listed in the Wayback Machine. Redirects (300-level responses) and Not Founds (404s) do record the fact that the target webserver was up and running at the time of the crawl.

Wouldn't it be cool if when I request a page that 404s, like http://www.cs.clemson.edu/~mweigle/, the archive could figure out that there is a similar page (http://www.cs.unc.edu/~clark/) that links to the requested page?
https://web.archive.org/web/20060718131722/
http://www.cs.unc.edu/~clark/
It'd be even cooler if the archive could then figure out that the latest memento of that UNC page now links to my ODU page (http://www.cs.odu.edu/~mweigle/) instead of the Clemson page. Then, the archive could suggest http://www.cs.odu.edu/~mweigle/ to the user.

https://web.archive.org/web/20120501221108/
http://www.cs.unc.edu/~clark/
I joined ODU in August 2006.  Since then, my ODU home page has been saved 53 times (https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/).

The only memento from 2014 is on Aug 9, 2014, but it returns a 302 redirecting to an earlier memento from 2013.



It appears that Heritrix crawled http://www.cs.odu.edu/~mweigle (note the lack of a trailing /), which resulted in a 302, but http://www.cs.odu.edu/~mweigle/ was never crawled. The Wayback Machine's canonicalization is likely the reason that the redirect points to the most recent capture of http://www.cs.odu.edu/~mweigle/. (That is, the Wayback Machine knows that http://www.cs.odu.edu/~mweigle and http://www.cs.odu.edu/~mweigle/ are really the same page.)

My home page is managed by wiki software and the web server does some URL re-writing. Another way to get to my home page is through http://www.cs.odu.edu/~mweigle/Main/Home/, which has been saved 3 times between 2008 and 2010. (I switched to the wiki software sometime in May 2008.) See https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/Main/Home/

Since these two pages point to the same thing, should these two timemaps be merged? What happens if at some point in the future I decide to stop using this particular wiki software and end up with http://www.cs.odu.edu/~mweigle/ and http://www.cs.odu.edu/~mweigle/Main/Home/ being two totally separate pages?

Finally, although my main ODU webpage itself is fairly well-archived, several of the links are not.  For example, http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe is not archived.


Also, several of the links that are archived have not been recently captured.  For instance, the page with my list of students was last archived in 2010 (https://web.archive.org/web/20100621205039/http://www.cs.odu.edu/~mweigle/Main/Students), but none of these students are still at ODU.

Now, I'm off to submit my pages to the Internet Archive's "Save Page Now" service!

--Michele

No comments:

Post a Comment