2021-10-22: Rediscovering my Angelfire.com Pages with the Help of the Wayback Machine's CDX API
To liberally steal a line from Stephen King: "A 6 digit ICQ number fled through the ancient Internet, and Jim sporadically scrambled after it." This has been the case for well over a decade, probably closer to two decades at this point. I know there's a "7" involved. I think it started with "173." Or was it "178"? Human memory is fuzzy like that: we tend to store shape features in long-term memory, which is very problematic for recalling specific numbers twenty-something years later.
In the spirit of celebrating the Internet Archive's 25th birthday (everyone else is saying anniversary but I'm going with birthday because I like to anthropomorphize computer things), I thought it appropriate to share some very special data I was able to dig up from the strata of the ancient web during one of my fits to track down that ICQ number.
I was up late one dark and stormy night, most likely training a U-net model for a work project. If you've ever done any deep learning you know how painfully boring it can be to watch epochs and network losses scroll by on your terminal. These are usually the moments when I get motivated to start digging around and poking at stuff. I had recently had a conversation about ICQ which prompted a renewed interest in recovering my number. I was in Michael Nelson's Web Archiving Forensics course at the time and it dawned on me that I had a painfully late 90's/early 2000's personal website at Angelfire.com and knew how to look for it in the archives. Here's what Angelfire.com looked like when I was originally using it in 1999:
https://web.archive.org/web/19990418032043/http://www.angelfire.com/
curl -s "http://web.archive.org/cdx/search/cdx?url=angelfire.com/&matchType=prefix" | sort -k 2 | awk '{print $3};' | sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9]+)/.*$|\1|' | sort -u > directories.txt
There's a lot going on in that command so I'll break it down:
curl -s "http://web.archive.org/cdx/search/cdx?url=angelfire.com/&matchType=prefix"
I'm running a curl command to make an HTTP request to web.archive.org's CDX API at http://web.archive.org/cdx/search/cdx. The request asks for all mementos whose URL starts with the prefix "angelfire.com/". The CDX API's response has this structure:
["urlkey","timestamp","original","mimetype","statuscode","digest","length"]
Sample of a CDX response
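To make the field layout concrete, here is a minimal sketch that pairs each field of one response line with its name. It assumes the default plain-text output, which as far as I know is one capture per line with the fields separated by spaces. The sample line is made up for illustration (the digest and length are placeholders, not real archive data), and label_cdx_fields is a hypothetical helper name.

```shell
# Hypothetical CDX line for illustration; the digest and length are placeholders.
sample='com,angelfire)/ 19990418032043 http://www.angelfire.com:80/ text/html 200 EXAMPLEDIGEST 1234'

# Pair each space-separated field with its name from the structure above.
label_cdx_fields() {
  awk '{
    n = split("urlkey timestamp original mimetype statuscode digest length", names, " ")
    for (i = 1; i <= n; i++) printf "%s=%s\n", names[i], $i
  }'
}

printf '%s\n' "$sample" | label_cdx_fields
```

The second and third lines of the output are the timestamp and the original URI, which are the two fields the rest of the pipeline cares about.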
sort -k 2
This sorts the response from the CDX API on a key that starts at its second field (the memento's timestamp). Note that sort numbers fields from 1, not 0, so -k 2 refers to the second field.
awk '{print $3};'
This prints the third field (the original URI)
sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9]+)/.*$|\1|'
This uses sed to replace the entire URI printed by the previous awk command with (what we hope is) the directory following some form of www.angelfire.com, using a regular expression to capture what we expect is the directory. This is not perfect because the URIs aren't standardized. There are directories that differ only by letter case and directories with unexpected characters. Also I'm sure the regex could use some tweaking, but it worked for what I needed at the time.
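To see what the regex does and does not catch, here is a small sketch that runs the same sed expression over a handful of URLs. The URLs are hypothetical, chosen only to exercise the pattern, and extract_directory is just a wrapper name I made up.

```shell
# The sed step from the pipeline, wrapped in a function (reads URLs on stdin).
extract_directory() {
  sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9]+)/.*$|\1|'
}

# Hypothetical URLs for illustration only.
printf '%s\n' \
  'http://www.angelfire.com:80/ab/example/index.html' \
  'http://www.angelfire.com/ca2/example/' \
  'http://www.angelfire.com/' \
  'http://www.angelfire.com/my-dir/page.html' |
  extract_directory
```

The first two lines are reduced to ab and ca2. The last two come through unchanged: the bare home page has no directory to capture, and my-dir contains a hyphen that the character class does not allow. When the substitution doesn't match, sed prints the line as it was, which is where the extra info in the output comes from. Note also that [:80]* is a bracket expression, so it matches any run of the characters ":", "8" and "0" rather than the literal string ":80".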
sort -u
We call sort again on the output of the previous sed command; the -u flag keeps only one copy of each line, leaving a sorted list of unique values.
> directories.txt
Finally, we save that resulting output to a file called directories.txt
There's likely a much simpler way to parse all that info down but this was quick and dirty late-night CLI noodling. The resultant list was full of extra info but contained all the directories I needed to scroll through.
Example of the listed Angelfire.com directories
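For anyone who wants a tidier list, here is one possible simplification, offered as a sketch rather than what I actually ran. It assumes the CDX server's fl= parameter (choose which fields to return) and collapse= parameter (merge adjacent captures that share a field value), and it uses sed -n with the p flag so that only lines where the regex matched are printed. The function names and the tightened (:80)? host pattern are my own additions.

```shell
# Build a CDX query that returns only the original URL column and collapses
# adjacent captures of the same URL (fl= and collapse= are CDX server parameters).
cdx_query() {
  printf 'http://web.archive.org/cdx/search/cdx?url=%s&matchType=prefix&fl=original&collapse=urlkey\n' "$1"
}

# Print the directory only for lines where the regex matched, then dedupe.
directories_from_urls() {
  sed -E -n 's|^.*angelfire\.com(:80)?/([a-zA-Z0-9]+)/.*$|\2|p' | sort -u
}

# Usage (needs network access):
#   curl -s "$(cdx_query angelfire.com/)" | directories_from_urls > directories.txt
```

With fl=original there is nothing left for awk to do, and dropping unmatched lines removes most of the noise. Directories with characters outside [a-zA-Z0-9] are still skipped, so the original caveat about the regex applies here too.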
As I was scrolling through the directories I took note of those that were particularly familiar looking and ran a similar command to parse usernames out.
I saved the directory I wanted to search in a shell variable called DIRECTORYTOSEARCH:
DIRECTORYTOSEARCH=<familiar directory>
where <familiar directory> is the name of the directory to search from the list.
curl -s "http://web.archive.org/cdx/search/cdx?url=angelfire.com/$DIRECTORYTOSEARCH/&matchType=prefix" | sort -k 2 | awk '{print $3};' | sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9_-]+)/([a-zA-Z0-9]+)/.*$|\2|' | sort -uf > usernames.txt
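This time the regex has a second capture group for the username that follows the directory we want to search (held in the shell variable $DIRECTORYTOSEARCH), and the output is saved to a file named usernames.txt. The -f flag on the final sort ignores letter case when comparing, so names that differ only by case are listed once.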
Example list of Angelfire usernames for a given directory. Note the period-correct instances of Limp Bizkit derivations.
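Here is a sketch of the two-segment capture run against a few made-up URLs so it can be tried without hitting the archive. It is a variant of the command above, not the exact one: the host pattern is tightened to (:80)?, both path groups allow underscores and hyphens, and unmatched lines are dropped with sed -n. The usernames and the extract_usernames name are hypothetical.

```shell
# Variant of the username extraction: capture the path segment after the directory.
# In [a-zA-Z0-9_-] the hyphen comes last so it is a literal, not part of a range.
extract_usernames() {
  sed -E -n 's|^.*angelfire\.com(:80)?/([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)/.*$|\3|p' | sort -uf
}

# Hypothetical URLs for illustration only.
printf '%s\n' \
  'http://www.angelfire.com/ab/user_one/index.html' \
  'http://www.angelfire.com:80/ab/User_One/pics/' \
  'http://www.angelfire.com/ab/other-user/' \
  'http://www.angelfire.com/ab/' |
  extract_usernames
```

This prints two names: other-user, plus a single entry for user_one and User_One, because sort -uf treats them as the same line. The bare directory URL yields nothing since there is no username segment to capture.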
This is what we look like now that we've added a much better camera and a kid to the scene.
How do I make this work, since I am trying to do something similar? When I tried (by copy-pasting the code into my address bar) it did nothing. Do I need a special program to make this work?