2021-10-20: Not Your Parents’ Web: Scope, Segmentation, Stability, Resilience, and Persistence

"Even though the documents on the Internet are the easy documents to collect and archive, the average lifetime of a document is 75 days and then it is gone." -- Brewster Kahle, November, 1996

Researchers from the Internet Archive, Protocol Labs, and Old Dominion University will revisit the question “how long does a web page last?” The answer frequently given is 44, 75, or 100 days, all of which stem from research that dates back to 1996--2003. It is well known that “44-100 days on average” does not capture the complexity of HTTP activity, but it is an easy-to-remember scalar that people can understand. Much of the circulating knowledge regarding this question comes from the 1990s and early 2000s, before the large-scale adoption of JavaScript for dynamically producing page content and before the proliferation of native mobile apps (e.g., after the release of the iPhone in 2007).

The Filecoin Foundation has generously funded ($75k) this year-long project, in which we will conduct an in-depth study of HTTP activity, and provide multiple levels of answers to the question “how long does a web page last?” In this phase of the project, we will investigate two main research questions:

RQ1: Answer the popular press version of the question “what is the average lifespan of a web page?” That will likely involve sampling URLs from each of the years of IA’s holdings (1996--2021), restricting the sample to HTML-only pages with limited URL depth (0, 1, or 2), and sampling randomly across TLDs/eTLDs and page languages.
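To make the RQ1 sampling criteria concrete, here is a minimal Python sketch of a depth filter plus a per-year random sample. It is an illustration only, not the project's actual pipeline: the post does not define "URL depth", so counting non-empty path segments is an assumption, and the function names `url_depth` and `sample_by_year` and the `(year, url)` record shape are hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count non-empty path segments (assumed definition of URL depth).

    http://example.com/ has depth 0; http://example.com/a/b has depth 2.
    """
    path = urlsplit(url).path
    return sum(1 for segment in path.split("/") if segment)


def sample_by_year(records, per_year, max_depth=2, seed=0):
    """Draw up to `per_year` URLs for each year from (year, url) records.

    Only URLs whose depth is at most `max_depth` are eligible.
    """
    by_year = defaultdict(list)
    for year, url in records:
        if url_depth(url) <= max_depth:
            by_year[year].append(url)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        year: rng.sample(urls, min(per_year, len(urls)))
        for year, urls in sorted(by_year.items())
    }
```

A real sample would also need to stratify by TLD/eTLD and page language, which this sketch leaves out.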

RQ2: A more detailed and nuanced version of RQ1, where we augment the URL sample to ensure desired coverage of TLDs/eTLDs and page language. We will consider more than just "200 OK" → "404 Not Found", including:

content drift, including multiple page reference points (e.g., initial page: today's page vs. the first archived copy, sliding page: today's page vs. yesterday's page, first of the year: today's page vs. the page on January 1)
page usability (e.g., the page maps.google.com is archived but not usable as a service)
archived paywalls and login pages (cf. Grant Atkins's 2018 study)
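The three content-drift reference points listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say how drift will be measured, so the `difflib` similarity ratio is a stand-in, and the names `similarity` and `drift_report` are hypothetical. The sketch also approximates "yesterday's page" with the most recent earlier snapshot and "the page on January 1" with the earliest snapshot of the current year, since an archive rarely has a capture for every day.

```python
from __future__ import annotations

from datetime import date
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical text."""
    return SequenceMatcher(None, a, b).ratio()


def drift_report(snapshots: dict[date, str], today: date) -> dict[str, float]:
    """Compare today's page text against three reference snapshots.

    `snapshots` maps capture date to page text and must contain `today`.
    Drift is reported as 1 - similarity, so 0.0 means no change.
    """
    days = sorted(d for d in snapshots if d <= today)
    current = snapshots[today]

    references = {"initial": days[0]}  # first archived copy
    earlier = [d for d in days if d < today]
    if earlier:
        references["sliding"] = earlier[-1]  # closest stand-in for yesterday
    in_year = [d for d in days if d.year == today.year]
    references["first_of_year"] = in_year[0]  # closest stand-in for January 1

    return {
        name: 1.0 - similarity(snapshots[ref], current)
        for name, ref in references.items()
    }
```

Reporting all three values side by side shows why a single number is misleading: a page can be unchanged since yesterday and still have drifted far from its first archived copy.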

There have been numerous studies of URL lifetime and change rates; an incomplete list is included as an appendix. But this will be the first large-scale study to sample broadly and directly from the Internet Archive's holdings, as opposed to externally generated, domain-specific samples (e.g., URLs linked from scholarly publications or NYT articles). Of course, not all web pages are crawlable by search engines and web archives, but the Internet Archive offers the best and largest perspective on the past web. Our goal is to sample at least 25M URLs, to coincide with the Internet Archive's 25th anniversary, thereby creating the most expansive data set of this kind in terms of both URLs and time span. Most data sets contain about 1M URLs or fewer; the Fetterly et al. 2003 study tracked 150M URLs, but only over a span of 11 weeks. Dr. Sawood Alam, a Web and Data Scientist at the Internet Archive and an alumnus of the Web Science and Digital Libraries Research Group, will assist us in navigating the Internet Archive's holdings.

We thank The Filecoin Foundation and Dietrich Ayala for supporting this research.

--Michael

Appendix: An incomplete list of web page lifetime studies.

Most of the studies construct their data sets from URLs drawn from domain-specific sources, for example URLs extracted from tweets or pages from the .jp TLD.

Wallace Koehler, An analysis of web page and web site constancy and permanence, JASIS, 50(2), 1999, pp. 162-180.
Junghoo Cho, Hector Garcia-Molina, The evolution of the web and implications for an incremental crawler, Proceedings of VLDB 2000.
Michael L. Nelson and B. Danette Allen, Object Persistence and Availability in Digital Libraries, D-Lib Magazine 8(1), January 2002.
Dennis Fetterly, Mark Manasse, Marc Najork, Janet Wiener, A large-scale study of the evolution of web pages, Proceedings of WWW 2003.
Alexandros Ntoulas, Junghoo Cho, Christopher Olston, What's new on the web?: the evolution of the web from a search engine perspective, Proceedings of WWW 2004.
Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012.
Teru Agata, Yosuke Miyata, Emi Ishita, Atsushi Ikeuchi, Shuichi Ueda, Life span of web pages: A survey of 10 million pages collected in 2001, Proceedings of JCDL 2014.
Jonathan Zittrain, Kendra Albert, Lawrence Lessig, Perma: Scoping and addressing the problem of link and reference rot in legal citations, Legal Information Management, 14(2), pp. 88-99, 2014.
Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, Lyudmila Balakireva, Ke Zhou, Richard Tobin, Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, PLOS ONE 9(12):e115253, 2014.
Helge Holzmann, Wolfgang Nejdl, Avishek Anand, The Dawn of Today's Popular Domains: A Study of the Archived German Web over 18 Years, Proceedings of JCDL 2016.
Shawn M. Jones, Herbert Van de Sompel, Harihar Shankar, Martin Klein, Richard Tobin, Claire Grover, Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content, PLOS ONE 12(1):e0171057, 2016.
Jonathan L. Zittrain, John Bowers, Clare Stanton, The Paper of Record Meets an Ephemeral Web: An Examination of Linkrot and Content Drift within The New York Times, SSRN 3833133, 2021.
