2021-10-22: Rediscovering my Angelfire.com Pages with the Help of the Wayback Machine's CDX API

To liberally steal a line from Stephen King: "A 6-digit ICQ number fled through the ancient Internet, and Jim sporadically scrambled after it." This has been the case for well over a decade - probably closer to two decades at this point. I know there's a "7" involved. I think it started with "173." Or was it "178"? Human memory is fuzzy like that - we tend to commit the general shape of things to long-term memory, which is very problematic for recalling specific numbers twenty-something years later.

In the spirit of celebrating the Internet Archive's 25th birthday (everyone else is saying anniversary but I'm going with birthday because I like to anthropomorphize computer things), I thought it appropriate to share some very special data I was able to dig up from the strata of the ancient web during one of my fits to track down that ICQ number.

I was up late one dark and stormy night, most likely training a U-net model for a work project. If you've ever done any deep learning you know how painfully boring it can be to watch epochs and network losses scroll by on your terminal. These are usually the moments when I get motivated to start digging around and poking at stuff. I had recently had a conversation about ICQ which prompted a renewed interest in recovering my number. I was in Michael Nelson's Web Archiving Forensics course at the time and it dawned on me that I had a painfully late 90's/early 2000's personal website at Angelfire.com and knew how to look for it in the archives. Here's what Angelfire.com looked like when I was originally using it in 1999:

https://web.archive.org/web/19990418032043/http://www.angelfire.com/

The problem was that I had no idea what my URL was. I knew I needed my login name but couldn't remember it, and the email address (likely an AOL address with a screen name that a teenager in the 90's would have made up, so I'm not going to share it) is long gone. I remembered there were directories that acted as something like communities of interest, or maybe the state you lived in, etc. This was a conceit used by other site hosting services at the time, most notably Geocities, where one's site would be listed under a "neighborhood" matching the theme of the site. For example, computer/tech related sites would be listed under "SiliconValley" and entertainment oriented sites under "Hollywood." So the structure of someone's Angelfire URL would look something like

www.angelfire.com/<directory>/<username>

My first step was to figure out which directory I was under. I tried to search for a list of directories but came up empty-handed, so I decided to check the Wayback Machine and see if I could grab a list of all unique directories from the mementos a CDX query would return:

curl -s "http://web.archive.org/cdx/search/cdx?url=angelfire.com/&matchType=prefix" | sort -k 2 | awk '{print $3};' | sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9]+)/.*$|\1|' | sort -u > directories.txt

There's a lot going on in that command so I'll break it down:

curl -s "http://web.archive.org/cdx/search/cdx?url=angelfire.com/&matchType=prefix"

I'm running a curl command to make an HTTP request to web.archive.org's CDX API at http://web.archive.org/cdx/search/cdx. The request asks for all mementos whose URL begins with "angelfire.com/". The CDX API's response has this structure:

["urlkey","timestamp","original","mimetype","statuscode","digest","length"]

Sample of a CDX response
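To make the field layout concrete, here is a minimal sketch that pulls the timestamp and original URI out of plain-text CDX lines. It assumes the default space-separated output with fields in the order listed above. The two sample lines are made up for illustration, and cdx_fields is a hypothetical helper name.

```shell
# Made-up CDX lines in the seven-field layout:
# urlkey timestamp original mimetype statuscode digest length
sample_cdx='com,angelfire)/tx/exampleuser/index.html 20010304050607 http://www.angelfire.com/tx/exampleuser/index.html text/html 200 AAAAEXAMPLEDIGEST 1234
com,angelfire)/ab/otheruser/pic.jpg 19991231235959 http://www.angelfire.com:80/ab/otheruser/pic.jpg image/jpeg 200 BBBBEXAMPLEDIGEST 5678'

# Print the timestamp (field 2) and original URI (field 3) of each capture.
cdx_fields() {
  awk '{ print $2, $3 }'
}

printf '%s\n' "$sample_cdx" | cdx_fields
```

Working from a saved sample like this is also a cheap way to test the rest of the pipeline without hitting the API on every attempt.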

sort -k 2

This sorts the response from the CDX API by its second field (the memento's timestamp). Note that sort numbers its fields starting from 1, not 0.
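One detail that may be worth a quick sketch: a key given as -k 2 starts at field 2 and runs to the end of the line, while -k 2,2 restricts the key to the timestamp alone. The lines below are made up and deliberately out of order, and sort_by_timestamp is a hypothetical helper name.

```shell
# Made-up CDX lines, deliberately out of chronological order.
unsorted_cdx='com,angelfire)/b 20010101000000 http://www.angelfire.com/b/ text/html 200 DIGESTB 10
com,angelfire)/a 19990101000000 http://www.angelfire.com/a/ text/html 200 DIGESTA 20
com,angelfire)/c 20000101000000 http://www.angelfire.com/c/ text/html 200 DIGESTC 30'

# Sort on field 2 only (fields are counted from 1).
sort_by_timestamp() {
  sort -k 2,2
}

# Show just the timestamps to make the resulting order easy to see.
printf '%s\n' "$unsorted_cdx" | sort_by_timestamp | awk '{ print $2 }'
```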

awk '{print $3};'

This prints the third field (the original URI).

sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9]+)/.*$|\1|'

This uses sed to replace the entire URI printed by the previous awk command with (what we hope is) the directory following some form of www.angelfire.com, using a regular expression to capture what we expect is the directory. This is not perfect because the URIs aren't standardized: some directories appear with inconsistent capitalization, and others contain characters the pattern doesn't expect. Also I'm sure the regex could use some tweaking (for instance, [:80]* is a character class that matches any run of the characters ":", "8", and "0" rather than the literal port suffix ":80"), but it worked for what I needed at the time.
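For comparison, here is a sketch of one way the pattern could be tightened. This is not the command used in this post, just an illustration: it escapes the dots, treats ":80" as an optional literal group, allows hyphens and underscores in the directory name, and prints only lines that actually match. The sample URIs are made up and extract_directory is a hypothetical helper name.

```shell
# Made-up original URIs covering the cases discussed above.
sample_uris='http://www.angelfire.com/tx/exampleuser/index.html
http://www.angelfire.com:80/ab/otheruser/pic.jpg
http://angelfire.com/my-dir/someone/
http://www.angelfire.com/robots.txt'

# Print the first path segment of each URI that has one followed by "/".
# -n plus the trailing "p" flag suppresses lines that do not match,
# so entries like /robots.txt are dropped instead of passed through.
extract_directory() {
  sed -nE 's|^https?://(www\.)?angelfire\.com(:80)?/([A-Za-z0-9_-]+)/.*$|\3|p'
}

printf '%s\n' "$sample_uris" | extract_directory | sort -u
```

Dropping non-matching lines is a design choice: it keeps the list clean, but it also hides any URI shapes the pattern doesn't anticipate, so it may be worth inspecting the raw output at least once.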

sort -u

We call sort again on the output of the sed command, this time with -u so that only unique values are kept.

> directories.txt

Finally, we save the resulting output to a file called directories.txt.

There's likely a much simpler way to parse all that info down but this was quick and dirty late-night CLI noodling. The resultant list was full of extra info but contained all the directories I needed to scroll through.
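As one possible simplification, the sort, awk, and sed stages could be folded into a single awk pass that splits the original URI on "/" and collects the first path segment. The first sort by timestamp isn't needed for this purpose, since the final sort reorders everything anyway. This is a sketch under the same assumptions as before (space-separated CDX lines, original URI in field 3). The sample data is made up and list_directories is a hypothetical helper name.

```shell
# Made-up CDX lines; only field 3 (the original URI) matters here.
cdx_lines='com,angelfire)/tx/exampleuser/index.html 20010304050607 http://www.angelfire.com/tx/exampleuser/index.html text/html 200 DIGESTA 1234
com,angelfire)/tx/otheruser 20020101000000 http://www.angelfire.com:80/TX/otheruser/ text/html 200 DIGESTB 99
com,angelfire)/ab/someone/pic.jpg 19991231235959 http://www.angelfire.com/ab/someone/pic.jpg image/jpeg 200 DIGESTC 5678
com,angelfire)/robots.txt 20000101000000 http://www.angelfire.com/robots.txt text/plain 200 DIGESTD 42'

# Splitting "http://host/dir/user/..." on "/" yields:
#   part[1]="http:", part[2]="", part[3]=host, part[4]=dir, part[5]=user, ...
# Requiring at least 5 parts skips top-level files such as /robots.txt.
# tolower() folds case so "TX" and "tx" count as one directory.
list_directories() {
  awk '{
         n = split($3, part, "/")
         if (n >= 5 && part[4] != "") seen[tolower(part[4])] = 1
       }
       END { for (d in seen) print d }' | sort
}

printf '%s\n' "$cdx_lines" | list_directories
```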

Example of the listed Angelfire.com directories

As I was scrolling through the directories I took note of those that were particularly familiar looking and ran a similar command to parse usernames out.

I saved the directory I wanted to search in a shell variable called DIRECTORYTOSEARCH:

DIRECTORYTOSEARCH=<familiar directory>

where <familiar directory> is the name of the directory to search from the list.

curl -s "http://web.archive.org/cdx/search/cdx?url=angelfire.com/$DIRECTORYTOSEARCH/&matchType=prefix" | sort -k 2 | awk '{print $3};' | sed -E 's|^.*angelfire.com[:80]*/([a-zA-Z0-9_-]+)/([a-zA-Z0-9]+)/.*$|\2|' | sort -uf > usernames.txt

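This time I added a second capture group for the username that follows the directory we want to search (stored in the shell variable DIRECTORYTOSEARCH) and saved the output to a file named usernames.txt. The -f flag on the final sort folds case, so usernames that differ only in capitalization are listed once.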
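Here is a small offline sketch of the same idea, run against made-up URIs instead of a live CDX response. It uses a tightened variant of the pattern rather than the exact one above, and it assumes the directory name contains no regular expression metacharacters. The directory "tx", the usernames, and the helper name extract_usernames are all hypothetical.

```shell
# Hypothetical directory name and made-up URIs for illustration.
DIRECTORYTOSEARCH='tx'
dir_uris='http://www.angelfire.com/tx/exampleuser/index.html
http://www.angelfire.com:80/tx/ExampleUser/pics/a.jpg
http://www.angelfire.com/tx/otheruser/
http://www.angelfire.com/ab/someone/index.html'

# $1 is the directory to search; it is spliced into the pattern as-is,
# so it must not contain regex metacharacters.
# Group 3 captures the path segment right after the directory.
extract_usernames() {
  sed -nE 's|^https?://(www\.)?angelfire\.com(:80)?/'"$1"'/([A-Za-z0-9_-]+)/.*$|\3|p' | sort -uf
}

printf '%s\n' "$dir_uris" | extract_usernames "$DIRECTORYTOSEARCH"
```

With this sample, the two capitalizations of the same username collapse into a single entry and the URI under the other directory is ignored.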

Example list of Angelfire usernames for a given directory. Note the period-correct instances of Limp Bizkit derivations.

If I'd had some inkling as to what my username was, I'd simply have searched usernames.txt for it to see if it existed, and then I'd have known my Angelfire.com URL. Unfortunately I didn't, so I just kind of slogged through the lists until I saw one that looked right and checked it out. I eventually found my page but it wasn't particularly interesting. I seem to have taken all my HTML files down before any archives were made of it.
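For the case where you do remember a fragment of the name, a case-insensitive grep over the list is probably all that's needed. The sketch below uses a temporary file with made-up usernames standing in for the real usernames.txt, and find_candidates is a hypothetical helper name.

```shell
# Made-up usernames standing in for the real usernames.txt.
usernames_file=$(mktemp)
printf '%s\n' 'exampleuser' 'LimpBizkitFan99' 'otheruser' 'bizkit_kid' > "$usernames_file"

# $1 is the fragment to look for, $2 is the file to search.
# -i ignores case; "--" keeps a fragment starting with "-" from
# being read as an option.
find_candidates() {
  grep -i -- "$1" "$2"
}

find_candidates 'bizkit' "$usernames_file"
```

grep exits with a non-zero status when nothing matches, which makes it easy to script a loop over several remembered fragments.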

However, I had my username, and surprisingly Angelfire.com has been kept up and running - unlike Geocities, which was taken offline in 2009. So I thought maybe I'd see if I could get logged into my Angelfire.com account. To my surprise, this was pretty easy. The trick is that your username isn't simply your username. When you log in you have to log in as <directory>/<username>. All that work sussing out the directory wasn't wasted after all!

I tried passwords I might have used back then (memory is still weird, I was just really bad at infosec back then and used the same passwords for way too long) and after a mandatory account update and proof that I am not a killer robot, I was able to log in to my account.

When you log in there's a web shell that you can use to navigate files.

The Angelfire Web Shell

I had a habit of using my account as sort of an early cloud storage solution if you'll allow me that grossly modern comparison. But these pictures! I had completely forgotten about them. I thought them lost long ago, on some ancient camera memory card from a more civilized age.

As I was looking through them I stumbled across photos of me and the girl who would later become my wife, playing with my very first webcam. We hadn't been dating long at all at the time. I had no idea I even had these.

We were probably about to go to Blockbuster to get a movie or something.

This is what we look like now that we've added a much better camera and a kid to the scene.

And that's my story of how the Internet Archive helped me hack my way back into a 21-year-old free web hosting account and recover files that I didn't even remember having but am very glad to have found. And a very big thanks to the folks at Lycos for being so gracious as to keep those hosted files alive.

I'm still looking for my ICQ number...

~ Jim Ecker
