2021-10-07: Tracking Down LSU Women's Basketball Webpages

I grew up as a college sports fan.  I remember learning to count by twos from watching basketball games and learning my sevens times table from college football. Sometimes I'd even cheer for my team to miss a free throw just so the numbers of the score would stay even.  Since I lived near Baton Rouge, Louisiana, my favorite team was the LSU Tigers.  I was in high school when Shaq played college basketball there.  In high school, I played on the girls' basketball team and was a fan of the LSU women's basketball team.  LSU's coach when I was growing up (and until 2004) was the legendary Sue Gunter.  I went to UNC for graduate school, and while I was there I attended many, many great women's basketball games in Carmichael Arena, where Michael Jordan had played college ball (though several years before my time).

While I was in grad school, LSU's women's team was doing well and had a bit of a rivalry with Tennessee (though not quite as big as it would be a few years later with the arrival of LSU legends Seimone Augustus and Sylvia Fowles).  In February 1999, I even flew home for the weekend to attend the LSU - Tennessee game in Baton Rouge. We had seats right behind the goal where Katrina Hibbert made a 3-point play off a pass from Marie Ferdinand with 11 seconds left to knock off #1 Tennessee and break their 31-game SEC winning streak.  Tennessee's great Chamique Holdsclaw was a senior and graduated without ever having won in Baton Rouge.  It was amazing.

Here's an archived version, or memento, of my personal women's basketball webpage from Aug 2000 where I mentioned the game and linked to a local copy of the above video.


I had some fun clicking around on the links on that page and reliving some of the women's basketball pages from the early 2000s.  In one example of how things have changed for the better, here's the entirety of the 1999-2000 schedule of women's college basketball games televised on ESPN (with all on ESPN2):


For the Internet Archive's 25th anniversary, I wanted to take a look back at the official LSU women's basketball webpages over the past 25 years.  The Wayback Machine indexes mementos by URI, and I had two URIs as starting places for my investigation: today's live webpage (https://lsusports.net/sports/wbball/) and the URI linked from my personal page above (http://www.lsusports.net/womensbb.cfm).

When we look at today's URI in the Wayback Machine, we see that there are only mementos available starting in July 2021.  


And the older URI is only captured between Jan 2000 and June 2004.  


Here's the oldest memento of the LSU women's basketball webpage from Jan 2000.


But what about all the time in between (2004-2021)? The first thing to notice is that even though the URI for the women's basketball webpage changed over time, the URI for the main LSU sports webpage (http://www.lsusports.net) did not, and it has been fairly well-archived from 1999 to 2021 (except for the noticeable gap between the end of 2005 and mid-2008, which I'll come back to).


Upon closer inspection of the older URI (ending with womensbb.cfm), I found that all of the mementos starting in June 2001 were actually not available, either showing an error page or returning an HTTP 404 status (i.e., an archived 404).


I found that the latest working memento from that URI was from Feb 2001.


This made me suspect that the URL structure of the LSU sports pages had changed, so I found a memento of http://www.lsusports.net from July 2001.


As suspected, the layout of the main sports page had changed. The link to Women's Basketball in the sidebar in this memento is http://www.lsusports.net/sporthome.cfm?sport=WB&linkref=0008727C-CFD1-1AD5-8D47809F187E0000, which is messier than it had been. Often these changes are a result of changing the website's content management system (CMS), but it goes against the concept of "Cool URIs Don't Change" and makes it harder to track down mementos of archived webpages as we'll see.

To find all of the variations with this new structure, I used the Wayback Machine's CDX API to run a prefix match query on http://www.lsusports.net/sporthome.cfm and find entries that returned an HTTP 200 status: http://web.archive.org/cdx/search/cdx?url=http://www.lsusports.net/sporthome.cfm*&filter=statuscode:200. Processing this result for "sport=WB" revealed three different URI variations stored in the Wayback Machine between Jan 2002 and Dec 2004.
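For anyone who wants to repeat this kind of search, here is a minimal Python sketch of the query and the "sport=WB" post-processing. It is not the exact script I used. The function names are made up for illustration, and the sample CDX lines are placeholders: the two sport=WB timestamps and URIs come from mementos linked in this post, but the urlkeys, digests, lengths, and the sport=FB line are invented. It assumes the CDX server's default plain-text output, where each line is space-separated as urlkey, timestamp, original, mimetype, statuscode, digest, length.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(prefix, status="200"):
    """Build a CDX URL for a prefix match limited to one HTTP status code."""
    params = {"url": prefix + "*", "filter": "statuscode:" + status}
    # Keep ':', '/' and '*' unescaped so the URL matches the one in the post.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/*")


def originals_matching(cdx_text, needle):
    """Map each distinct original URI containing `needle` to its first timestamp."""
    found = {}
    for line in cdx_text.splitlines():
        fields = line.split(" ")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        timestamp, original = fields[1], fields[2]
        if needle in original:
            found.setdefault(original, timestamp)
    return found


# Placeholder CDX output (see the caveats above); not real API results.
SAMPLE_CDX = "\n".join([
    "net,lsusports)/sporthome.cfm?sport=wb&amp 20040216045430 "
    "http://www.lsusports.net:80/sporthome.cfm?sport=WB&amp text/html 200 PLACEHOLDER1 0",
    "net,lsusports)/sporthome.cfm?sport=fb 20040301000000 "
    "http://www.lsusports.net/sporthome.cfm?sport=FB text/html 200 PLACEHOLDER2 0",
    "net,lsusports)/sporthome.cfm?linkref=x&sport=wb 20040824225250 "
    "http://www.lsusports.net/sporthome.cfm?sport=WB&linkref=0008727C-CFD1-1AD5-8D47809F187E0000 "
    "text/html 200 PLACEHOLDER3 0",
])

query_url = build_cdx_query("http://www.lsusports.net/sporthome.cfm")
variations = originals_matching(SAMPLE_CDX, "sport=WB")
```

To run it against the live API, the text fetched from `query_url` (for example with `urllib.request.urlopen`) would be passed to `originals_matching` in place of `SAMPLE_CDX`.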

This time was the beginning of the greatest period in LSU women's basketball history. Seimone Augustus arrived in 2002 and by the 2003-2004 season had already led LSU to their first NCAA Women's Final Four. Sylvia Fowles came the next season and continued the streak of Final Fours until she graduated in 2008. In all, LSU went to 5 consecutive Final Fours.

Seimone and Sue from 2004 (https://web.archive.org/web/20040216045430/http://www.lsusports.net:80/sporthome.cfm?sport=WB&amp, https://web.archive.org/web/20040824225250/http://www.lsusports.net/sporthome.cfm?sport=WB&linkref=0008727C-CFD1-1AD5-8D47809F187E0000)

It was pretty heartbreaking to discover that during this same period LSU's sports webpages were not frequently archived. There are no usable mementos for the LSU women's basketball page from Dec 2004 to Dec 2008. I mentioned the dip in mementos between 2005 and 2008 for the main LSU sports webpage earlier, so I used the CDX API again to do some investigating on http://lsusports.net/robots.txt. (Robots.txt is a file that tells crawlers which parts of a website they are allowed to access.)

The Wayback Machine has mementos recorded for robots.txt starting in 2001, but until Aug 2005, these were archived 404s (crawler received an HTTP 404), redirects to a custom 404 page (crawler was redirected to lsusports.net/404error.htm), or soft 404s (server returned an HTTP 200 with a page reporting that the file was not found). The robots.txt file in Aug 2005 only allowed Googlebot and MSNBot. All other bots (including the Internet Archive's crawler) were disallowed. This matches up with the beginning of the drop in the lsusports.net webpages being archived.

Over time, there were variations in the robots.txt file that allowed other search engine crawlers, until it was replaced in Jul 2008 with an empty file. This is when we start to see the webpages being archived again. The empty robots.txt remained until Feb 2012, when just a minimal restriction was added, allowing the site to continue to be archived.
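To make the effect of those two robots.txt versions concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt text below is my hypothetical reconstruction of an "allow only Googlebot and MSNBot" policy, not the actual archived file, and `ia_archiver` is assumed here as the user-agent name standing in for the Internet Archive's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical reconstruction of the Aug 2005 policy (NOT the archived file):
# two named crawlers may fetch everything, everyone else is shut out.
ROBOTS_AUG_2005 = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

# The Jul 2008 replacement was an empty file.
ROBOTS_JUL_2008 = ""


def allowed(robots_txt, user_agent, url="http://lsusports.net/"):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


googlebot_2005 = allowed(ROBOTS_AUG_2005, "Googlebot")
archive_2005 = allowed(ROBOTS_AUG_2005, "ia_archiver")  # assumed crawler name
archive_2008 = allowed(ROBOTS_JUL_2008, "ia_archiver")
```

Under these assumptions, the named search engine crawlers are allowed in 2005 while the archive's crawler falls through to the catch-all `Disallow: /` rule, and the empty 2008 file places no restrictions on anyone.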

In Aug 2005, the LSU sports webpages underwent another overhaul.  This coincided with (and maybe was the cause of) the robots.txt change, so there aren't many mementos of http://lsusports.net to examine.  The ones that are available don't seem to have links to the individual sports pages, so I couldn't find URIs for the women's basketball page during that time.

Redesign announcement and new layout (https://web.archive.org/web/20050822111859/http://www.lsusports.net/, https://web.archive.org/web/20050918090303/http://www.lsusports.net/)

When the robots.txt file changed in Jul 2008, the top-level lsusports.net page was more frequently archived, and I was able to find a link to the women's basketball page in the new design.  The link structure had changed yet again and had become even worse. The path to the women's basketball main webpage went from /sporthome.cfm?sport=WB to /SportSelect.dbml?SPID=2167&SPSID=27830. What I found after looking around a bit was that there were several variations of URL query string parameters used, but that SPID=2167 referred to women's basketball and SPSID=27830 referred to their main page. (Others I found: SPID=2164 for football, SPID=2165 volleyball, and SPID=2166 men's basketball.)


I went back to the CDX API to see if I could find the different variations for the women's basketball main webpage. I ran a prefix query for http://www.lsusports.net/SportSelect to find all of the underlying sports pages and then filtered for SPID=2167&SPSID=27830. (I couldn't use the SPID in the prefix because other query parameters often appeared before it in the URI.) The Wayback Machine does some URL canonicalization to merge mementos for webpages that are likely the same (like www.lsusports.net and lsusports.net), but it does not canonicalize URL query parameters because many times those parameters determine what resource will be returned by the server (as with these sports pages). As a result, for every combination of URL query parameters, there is a separate URI-R indexed by the Wayback Machine. Between Dec 2008 and Oct 2019, I found over 11,000 different URI-Rs for the women's basketball main webpage. (Though I'm not convinced that they all actually point to the women's basketball page.)
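One way to do this kind of filtering without depending on where the parameters sit in the URI is to parse the query string instead of matching a substring. The sketch below is an illustration of that idea, not necessarily how I filtered the results for this post. The function name is made up, treating parameter names as case-insensitive is an assumption, and the extra parameter (`EXTRA=1`) in the examples is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def is_wbb_main_page(uri, spid="2167", spsid="27830"):
    """True if the URI's query has SPID=2167 and SPSID=27830, in any position."""
    query = parse_qs(urlsplit(uri).query)
    # Assumption: parameter names may vary in case, so normalize them.
    params = {key.upper(): values for key, values in query.items()}
    return spid in params.get("SPID", []) and spsid in params.get("SPSID", [])


# Example URIs; the EXTRA parameter is hypothetical.
examples = [
    "http://www.lsusports.net/SportSelect.dbml?SPID=2167&SPSID=27830",
    "http://www.lsusports.net/SportSelect.dbml?EXTRA=1&SPSID=27830&SPID=2167",
    "http://www.lsusports.net/SportSelect.dbml?SPID=2164&SPSID=27830",
]
matches = [uri for uri in examples if is_wbb_main_page(uri)]
```

Here the first two URIs match regardless of parameter order, while the third (SPID=2164, football) does not. Each matching URI is still a separate URI-R in the Wayback Machine's index, so this only groups them for analysis.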

Finally in Oct 2019, there was yet another website redesign.  This time, the URIs for the individual sports webpages became more human-readable, with the path to the women's basketball page becoming /sports/womens-basketball and then /sports/wbball/.


To summarize, here's the progression of URI-Rs for the LSU women's basketball main webpage (more details in a GitHub Gist):

  • Jan 2000 - Feb 2001: http://www.lsusports.net/womensbb.cfm
  • Jan 2002 - Dec 2004: http://www.lsusports.net/sporthome.cfm?sport=WB
  • Dec 2004 - Dec 2008: no mementos due to robots.txt exclusion 
  • Dec 2008 - Oct 2019: http://www.lsusports.net/SportSelect.dbml?SPID=2167&SPSID=27830
  • Oct 2019 - May 2021: https://lsusports.net/sports/womens-basketball
  • Jul 2021 - present: https://lsusports.net/sports/wbball
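
The progression above can also be written down as a small lookup table. This is only a sketch: the function names are made up, the month ranges are copied from the list as inclusive bounds (so a boundary month can return two candidates), and the robots.txt period is simply left out so it returns no candidates. The Wayback URL pattern is the same one used by the memento links in this post, and I'm relying on the Wayback Machine redirecting to the capture closest to the requested timestamp.

```python
from datetime import date

# (first month, last month or None for "present", URI-R), from the list above.
ERAS = [
    ((2000, 1), (2001, 2), "http://www.lsusports.net/womensbb.cfm"),
    ((2002, 1), (2004, 12), "http://www.lsusports.net/sporthome.cfm?sport=WB"),
    ((2008, 12), (2019, 10),
     "http://www.lsusports.net/SportSelect.dbml?SPID=2167&SPSID=27830"),
    ((2019, 10), (2021, 5), "https://lsusports.net/sports/womens-basketball"),
    ((2021, 7), None, "https://lsusports.net/sports/wbball"),
]


def candidate_uris(day):
    """Return the URI-Rs whose listed date range includes the given date."""
    month = (day.year, day.month)
    return [
        uri for start, end, uri in ERAS
        if start <= month and (end is None or month <= end)
    ]


def wayback_url(day, uri):
    """Build a Wayback Machine URL asking for a capture near midnight on `day`."""
    return "https://web.archive.org/web/" + day.strftime("%Y%m%d") + "000000/" + uri


aug_2000 = candidate_uris(date(2000, 8, 15))
mid_2006 = candidate_uris(date(2006, 6, 1))  # robots.txt gap: no candidates
```

For example, `wayback_url(date(2004, 2, 16), candidate_uris(date(2004, 2, 16))[0])` builds a URL for the sporthome.cfm era.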

I had fun investigating and pulling all of this together, though I was very disappointed by the archival hole caused by robots.txt. I'm glad the Internet Archive decided to start ignoring robots.txt in 2017.

Happy 25th Anniversary, Internet Archive!


PS: It was fitting that I wrote much of this post while watching former Tennessee great Candace Parker's Chicago Sky beat the Connecticut Sun in the WNBA playoffs.

PPS: While looking through the top-level LSU sports pages, I found a great one from Jan 2008.