2021-10-20: Not Your Parents’ Web: Scope, Segmentation, Stability, Resilience, and Persistence

 

"Even though the documents on the Internet are the easy documents to collect and archive, the average lifetime of a document is 75 days and then it is gone." -- Brewster Kahle, November, 1996

Researchers from the Internet Archive, Protocol Labs, and Old Dominion University will revisit the question “how long does a web page last?”  The answer frequently given is 44, 75, or 100 days, all of which stem from research that dates back to 1996--2003.  It is well known that “44-100 days on average” does not capture the complexity of HTTP activity, but it is an easy-to-remember number that people can understand.  Much of the circulating knowledge regarding this question comes from the 1990s and early 2000s, before the large-scale adoption of JavaScript for dynamically producing content in web pages and before the proliferation of native mobile apps (which followed, e.g., the release of the iPhone in 2007).

The Filecoin Foundation has generously funded ($75k) this year-long project, in which we will conduct an in-depth study of HTTP activity, and provide multiple levels of answers to the question “how long does a web page last?”  In this phase of the project, we will investigate two main research questions:

RQ1: Answer the popular press version of the question: “what is the average lifespan of a web page?”  This will likely involve sampling URLs from each year of IA’s holdings (1996--2021), restricted to HTML-only pages of limited URL depth (0, 1, or 2), and sampled randomly across TLDs/eTLDs and page languages.
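To make the RQ1 sampling constraints concrete, here is a minimal sketch of per-year sampling with a URL depth limit. It is an illustration only, not the project's actual pipeline: the helper names `url_depth` and `sample_per_year`, the `(year, url)` input format, and the definition of depth as the number of non-empty path segments are all assumptions made for this example.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def url_depth(url):
    """Count non-empty path segments (assumed definition of URL depth)."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def sample_per_year(records, per_year, max_depth=2, seed=0):
    """Sample up to `per_year` URLs for each year from (year, url) pairs.

    URLs deeper than `max_depth` are dropped before sampling.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_year = defaultdict(list)
    for year, url in records:
        if url_depth(url) <= max_depth:
            by_year[year].append(url)
    return {
        year: rng.sample(urls, min(per_year, len(urls)))
        for year, urls in sorted(by_year.items())
    }
```

Sampling within each year separately keeps early years, which have far fewer archived URLs, from being swamped by recent ones. Stratifying further by TLD/eTLD or page language would follow the same pattern with a different grouping key.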

RQ2: A more detailed and nuanced version of RQ1, where we augment the URL sample to ensure desired coverage of TLDs/eTLDs and page language.  We will consider more than just "200 OK" → "404 Not Found", including: 

  • content drift, including multiple page reference points (e.g., initial page: today's page vs. the first archived copy; sliding page: today's page vs. yesterday's page; first of the year: today's page vs. the page on January 1)
  • page usability (e.g., the page maps.google.com is archived but not usable as a service)
  • archived paywalls and login pages (cf. Grant Atkins's 2018 study)
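The three content drift reference points listed above can be sketched as follows. This is a hedged illustration, not the study's method: the function names `drift` and `drift_report`, the `{date: page_text}` input format, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions chosen for brevity.

```python
from datetime import date, timedelta
from difflib import SequenceMatcher


def drift(old_text, new_text):
    """Return 0.0 for identical text, growing toward 1.0 as the texts diverge."""
    return 1.0 - SequenceMatcher(None, old_text, new_text).ratio()


def drift_report(snapshots, today):
    """Compare today's copy against three reference points.

    `snapshots` maps a date to the page text archived on that date.
    A reference point with no archived copy is reported as None.
    """
    current = snapshots[today]
    references = {
        "initial": min(snapshots),                # first archived copy
        "sliding": today - timedelta(days=1),     # yesterday's copy
        "first_of_year": date(today.year, 1, 1),  # copy on January 1
    }
    return {
        name: drift(snapshots[ref], current) if ref in snapshots else None
        for name, ref in references.items()
    }
```

A page can score low against the sliding reference yet high against the initial one, which is why a single "changed or not" flag would hide gradual drift.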

There have been numerous studies investigating URL lifetime and change rates; an incomplete list is included as an appendix.  But this will be the first large-scale study to sample broadly and directly from the Internet Archive's holdings, as opposed to externally generated, domain-specific samples (e.g., URLs linked from scholarly publications or NYT articles).  Of course, not all web pages are crawlable by search engines and web archives, but the Internet Archive offers the best and largest perspective on the past web.  Our goal is to sample at least 25M URLs, to coincide with the Internet Archive's 25th anniversary, thereby creating the most expansive data set of this kind in terms of URLs and time span.  Most data sets contain about 1M URLs or fewer; the Fetterly et al. 2003 study tracked 150M URLs, but only for a span of 11 weeks.  Dr. Sawood Alam, a Web and Data Scientist at the Internet Archive and an alumnus of the Web Science and Digital Libraries Research Group, will assist us in navigating the Internet Archive's holdings.

We thank the Filecoin Foundation and Dietrich Ayala for supporting this research.

--Michael


Appendix: An incomplete list of web page lifetime studies. 

Most of the studies construct their data sets from URLs drawn from domain-specific sources, for example, URLs extracted from tweets or pages from the .jp TLD.
