2021-09-20: Digging Up a Gem Through the Web Archives

As we commemorate the Internet Archive turning 25, I decided to unearth some memories from the most precious days of my life.


I attended Devi Balika Vidyalaya, Colombo, Sri Lanka for my high school education (2004-2012). In 2004, I joined the Junior Western Band of our school, which paved the way for me to join “DBVSBB”, the Devi Balika Vidyalaya Senior Brass Band, the following year. As a senior brass band member for seven years (Figure 01), I attended many concerts, received many certificates, and won numerous competitions. Fast forward to 2021: as a Ph.D. student working on web archiving, I was keen to look through web archives for any online trace of our band’s achievements from that time.


Figure 01: A few pictures taken at the band practices and concerts over the years.


As step one in discovering mementos from the Internet Archive’s Wayback Machine, I tried to recall a time when our band was featured in a newspaper or on a website. Unfortunately, I was not prescient enough to bookmark or safekeep important links many years ago. So I turned to the next best option: Google. I googled the terms “himarsha” & “dbvsbb” (Figure 02) and the result was a Googlewhack (a two-term search that produces a single hit).

Figure 02: Google search result for “himarsha dbvsbb”, a Googlewhack (a two-term search that produces a single hit).

I archived the Google SERP using the “Save Page Now” feature of the Wayback Machine (memento) and Archive.is (memento) for future reference. My time at DBVSBB ended long ago, but if and when Google indexes this blog post, my query will no longer be a Googlewhack.

The article that popped up is from the time DBVSBB became the all-island (national level) champions at “Wind Ensemble 2008”, a western music and dance competition organized by the Ministry of Education, Sri Lanka. It appeared in the Sunday Observer (http://www.sundayobserver.lk/), a weekly English-language newspaper in Sri Lanka. The search result pointed to the article (Figure 03) in the Sunday Observer Archives (https://archives.sundayobserver.lk/2001/pix/archives.asp?id=1). The newspaper has preserved its own news articles at archives.sundayobserver.lk since 2006, and luckily this article featuring our band was from 2008. This newspaper archive version is what came up on the Google SERP.

Figure 03: The article featuring DBVSBB at Sunday Observer Newspaper.


My name, “himarsha”, is rare compared to most names in Sri Lanka. Although I never heard of another “himarsha” in our school during that time, there was a “himashi” and a “himasha” in my class. Even though it is not a common name, my search results are crowded out by a famous namesake, “Himarsha Venkatsamy”, an Indian model. If you google just “himarsha”, the SERP is filled with hits related to this model (Figure 04). On the other hand, the lengthy acronym for the band, “DBVSBB”, acts almost like a hash value. If the band’s acronym were short and common (say, “XYZ”), then the combination of the popular “himarsha” and “XYZ” would make it practically impossible to find this newspaper article featuring the band. Our band slogan is “Unique from the rest”, and this Google search made me realize that the band’s acronym indeed lives up to that phrase.


Figure 04: A screen capture of the Google SERP for “himarsha”.


I looked for any mementos in the Internet Archive for the Sunday Observer Archives URL of the news article (Figure 05). I was able to find a single memento from Nov 16, 2016 (Figure 06).



Figure 05: TimeMap for the URL http://archives.sundayobserver.lk/2008/11/23/mag14.asp.



Figure 06: The memento from 2016 for http://archives.sundayobserver.lk/2008/11/23/mag14.asp.


Just out of curiosity, I stripped the “archives.” prefix from the URL and tried accessing “http://sundayobserver.lk/2008/11/23/mag14.asp”. The request was first redirected (“301 Moved Permanently”) to the “www.” host, which then returned a “404 Not Found” HTTP status code (Figure 07).


$ curl -iLs "http://sundayobserver.lk/2008/11/23/mag14.asp"
HTTP/1.1 301 Moved Permanently
Date: Mon, 06 Sep 2021 04:48:34 GMT
Content-Type: text/html; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: keep-alive
location: http://www.sundayobserver.lk/2008/11/23/mag14.asp
...

HTTP/1.1 404 Not Found
Date: Mon, 06 Sep 2021 04:48:35 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
...

Figure 07: curl response for “http://sundayobserver.lk/2008/11/23/mag14.asp”.
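The redirect chain in the response above can also be read programmatically. The following is a minimal sketch, assuming the raw text comes from `curl -iL`; the function name `redirect_chain` and the shortened sample text are made up for this illustration and are not part of any tool used in this post.

```python
import re

# Matches an HTTP status line such as "HTTP/1.1 301 Moved Permanently".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})\b")


def redirect_chain(raw):
    """Return one (status_code, location) tuple per response in `curl -iL` output.

    `location` is None when a response carries no Location header.
    """
    hops = []
    for line in raw.splitlines():
        line = line.strip()
        match = STATUS_LINE.match(line)
        if match:
            # A status line starts a new hop; its Location is unknown so far.
            hops.append([int(match.group(1)), None])
        elif hops and line.lower().startswith("location:"):
            # Header names are case-insensitive; keep everything after the colon.
            hops[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(hop) for hop in hops]


# Shortened copy of the headers shown in Figure 07 (illustration only).
sample = """\
HTTP/1.1 301 Moved Permanently
Date: Mon, 06 Sep 2021 04:48:34 GMT
location: http://www.sundayobserver.lk/2008/11/23/mag14.asp

HTTP/1.1 404 Not Found
Date: Mon, 06 Sep 2021 04:48:35 GMT
"""

for status, location in redirect_chain(sample):
    print(status, location or "(no redirect)")
```

Reading the chain this way makes it clear that the 404 comes from the “www.” host after the redirect, not from the URL originally requested.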

I checked for any mementos of “http://sundayobserver.lk/2008/11/23/mag14.asp” and there were two: Dec 10, 2008 (Figure 08) and Jan 31, 2009 (Figure 09). This confirmed that “http://sundayobserver.lk/2008/11/23/mag14.asp” was a valid URL for the news article when it was previously available on the live web. This illustrates how content may have had multiple URLs over its lifetime; if we know only one of them, that is the only URL we will use to query the archives, and mementos captured under the other URLs go unnoticed. This is a well-known problem in archives, as described in “Archival HTTP Redirection Retrieval Policies” by Ahmed AlSum et al.

Figure 08: Memento for http://sundayobserver.lk/2008/11/23/mag14.asp from 2008.
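Because the same article lived under more than one hostname, one could try a handful of host variants when querying an archive. The sketch below is only an illustration: the prefix list (“www.” and “archives.”) and the function names `candidate_urls` and `timemap_url` are assumptions made for this example, and the TimeMap endpoint is the MemGator one used later in this post.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical host prefixes to try; chosen to match the URLs seen in this post.
PREFIXES = ("", "www.", "archives.")

# MemGator link-format TimeMap endpoint, as used in the curl commands in this post.
MEMGATOR_TIMEMAP = "https://memgator.cs.odu.edu/timemap/link/"


def candidate_urls(url, prefixes=PREFIXES):
    """Return the URL rewritten once for each host prefix."""
    parts = urlsplit(url)
    host = parts.netloc
    # Strip a known prefix first so every variant starts from the bare host.
    for prefix in prefixes:
        if prefix and host.startswith(prefix):
            host = host[len(prefix):]
            break
    return [urlunsplit(parts._replace(netloc=prefix + host)) for prefix in prefixes]


def timemap_url(url):
    """Build the TimeMap lookup URL for one candidate."""
    return MEMGATOR_TIMEMAP + url


for candidate in candidate_urls("http://archives.sundayobserver.lk/2008/11/23/mag14.asp"):
    print(timemap_url(candidate))
```

This only guesses at hostname changes. It would not help if the path of the article had changed as well, which is part of why the consistent URL structure of this newspaper mattered.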



Although the two mementos did not capture the title image and the group photo from the live web news article, I was impressed to see that an article published on Nov 23, 2008 in a Sri Lankan newspaper was first archived by the Wayback Machine within 20 days. I also checked several other public web archives for this article using MemGator (Figure 10), but none of them had captured it (from either the Sunday Observer or the Sunday Observer Archives).

$ curl -s https://memgator.cs.odu.edu/timemap/link/http://sundayobserver.lk/2008/11/23/mag14.asp | grep datetime | awk '{print $1}' | awk -v FS=/ '{print $3}' | sort | uniq -c | sort -n
      2 web.archive.org

$ curl -s https://memgator.cs.odu.edu/timemap/link/http://archives.sundayobserver.lk/2008/11/23/mag14.asp | grep datetime | awk '{print $1}' | awk -v FS=/ '{print $3}' | sort | uniq -c | sort -n
      1 web.archive.org

Figure 10: curl response from the MemGator service for both the newspaper URL and the newspaper archive URL. No other public web archive holds a copy of either URL.
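The shell pipeline above keeps the TimeMap lines that carry a datetime, extracts the host of each memento URL, and counts the hosts. A rough Python equivalent is sketched below. The function name `mementos_per_archive` is an assumption for this example, and the sample TimeMap uses invented hosts (“archive-a.example”, “archive-b.example”) and invented timestamps; it is not real MemGator output.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# The target URI of a link-format entry is enclosed in angle brackets.
LINK_TARGET = re.compile(r"\s*<([^>]+)>")


def mementos_per_archive(timemap_text):
    """Count mementos per archive host in a link-format TimeMap."""
    counts = Counter()
    for line in timemap_text.splitlines():
        # Like `grep datetime`: only memento entries carry a datetime attribute.
        if "datetime" not in line:
            continue
        match = LINK_TARGET.match(line)
        if match:
            # Like the awk steps: keep only the host of the memento URL.
            counts[urlsplit(match.group(1)).netloc] += 1
    return counts


# Invented sample data, for illustration only.
sample = """\
<http://example.com/page>; rel="original",
<http://archive-a.example/web/20081210000000/http://example.com/page>; rel="first memento"; datetime="Wed, 10 Dec 2008 00:00:00 GMT",
<http://archive-a.example/web/20090131000000/http://example.com/page>; rel="memento"; datetime="Sat, 31 Jan 2009 00:00:00 GMT",
<http://archive-b.example/20100101000000/http://example.com/page>; rel="last memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT"
"""

# Ascending by count, similar to `sort -n`.
for host, count in sorted(mementos_per_archive(sample).items(), key=lambda kv: kv[1]):
    print(count, host)
```

Feeding the function the body of a real MemGator response instead of the sample text should give the same per-archive counts as the shell pipeline.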

This illustrates an open problem with respect to discoverability. I used the live web (the Google search engine) to obtain the URL of the article and then used that URL as a lookup key in the archive to discover these mementos (similar to the technique that Kanhabua et al. proposed in “How to Search the Internet Archive Without Indexing It” in 2017). Imagine that this article was no longer on the live web. In that scenario, Google search would not have helped and I would not have been able to guess the URL of the article. The article would have been archived but undiscoverable with the current tools. It made me wonder whether there are other pages in the Internet Archive, related to this competition or to similar events, that we are unable to discover without knowing their URLs.

Additionally, this page belongs to an English-language newspaper but is hosted at a non-Western domain (“.lk”, the country-code TLD for Sri Lanka). We know that non-English and non-Western sites are archived less often than English and Western web pages (“Comparing the archival rate of Arabic, English, Danish, and Korean language web pages” by Alkwai et al.). I was glad to see that this article was archived, but could not stop wondering whether there were any other articles featuring our band from that time. Even if such articles existed, they would have been less likely to be archived, especially if they were not in English and were published that many years ago.

In summary, I was lucky to have found this article. Several things aligned for me to discover mementos of this page from 13 years ago in the Internet Archive:
  • Sunday Observer newspaper maintaining its own archive.
  • Consistent URL structure between the original news article and the newspaper archive.
  • Interaction between Google and web archives: Google indexed this article from the Sunday Observer newspaper archive, which allowed me to use that URL as the lookup key in web archives.
I am very excited to celebrate #InternetArchive25 with the Internet Archive. The fact that I was able to bring back my memories from the band and connect my passion for DBVSBB with my current research interests has brought a lot of enthusiasm into this blog post.

Acknowledgments: I would like to express my gratitude to my current advisors, Dr. Michele C. Weigle and Dr. Michael L. Nelson, for their continued support in every aspect and for sparking my interest in writing this blog post. I would also like to thank my husband, Skanda Siva, who provided stimulating discussions and helped me compile the pictures. Finally, I would like to acknowledge all the past and present members, instructors, and teachers in charge of DBVSBB.

Himarsha Jayanetti (@HimarshaJ)
