2023-10-10: In appreciation of the "ridiculous and unworkable" projects that make the Internet great and research possible



The Internet Archive is hosting their annual celebration this week (October 12, 2023), and I wanted to take this opportunity to both 1) encourage your attendance (virtual for most of us, but if you're in San Francisco, you can attend in person), and 2) express my appreciation and gratitude for the continued existence of the Internet Archive, their evolving products and services, and their support of the research community.

The ongoing devolvement of Twitter into 4chan has caused me to reflect on the platforms, services, and corpuses on which I have built a research program over the last 20+ years.  Discussing the Twitter situation will be the topic of a future post, but here I want to laud the Internet Archive, specifically the Wayback Machine, and by extension, the suite of other public web archives, such as Archive.Today, Arquivo.pt, and the many members of IIPC.  In the past I've referred to the Internet Archive as the "Walter Cronkite of web archives".*  While much of my research has involved archive interoperability, one must acknowledge and appreciate the Internet Archive as the preeminent and pioneering web archive – this portion of my research program simply would not exist without the Internet Archive.

For the Internet Archive's 20th and 25th anniversaries, WS-DL members wrote about their personal interest in and use of web archives, instead of our usual technical explorations.  Some of the entries included: exploring the history of the student-run newspaper at the University of Florida, archived (real) web sites mentioned in fictional works, the tangled history of the official page of LSU Women's Basketball program, and exploring the archived contemporary pages about 9/11 by someone too young to remember the events.  Our emphasis for those anniversary celebrations was about taking a break from our normal technical discussions, discussions that are often deep in the weeds and inaccessible to a general audience, and reminding everyone that our technical studies exist to support these more human interests: what can we learn about ourselves by looking at old web pages?  For us, this week's celebration is slightly different: we celebrate that the Internet Archive's Wayback Machine, as well as other public web archives, has provided the foundation of much of our group's research program.

The Web Science and Digital Libraries (WS-DL) Research Group performs research in web archiving, web science, digital libraries, neuro-information retrieval, social media, digital preservation, human-computer interaction, information visualization, natural language processing, accessibility, and mining scholarly data.  Much of what we do involves studying well-known web sites as corpuses: studying the behavior of the site, some characteristics of the site's holdings, or building additional services for the site. 

In the last two years, three of the main corpuses that we have studied have been the Wayback Machine (and other web archives), arXiv, and Twitter.  In previous years, Wikipedia and Google Search Engine Results Pages (SERPs) have also been studied, but the Wayback Machine and Twitter are probably the two most prominent sources for our studies.  In 2022 and so far in 2023, I have published about 25 papers.  The table below shows the primary corpus explored in 20 of the 25 publications: 14 involve the Wayback Machine, 4 involve Twitter, and 2 involve arXiv.**

The Wayback Machine is responsible for at least 56% (14/25) of our publications in the last two years.  The percentages are similar for externally funded grants for us as well: in the 11 externally funded grants we've received since 2020, 55% (6/11) involve the Wayback Machine and 27% (3/11) involve Twitter.***  Furthermore, 73% (11/15) of our PhD alumni have dissertations that prominently feature the Wayback Machine.  
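For readers who want to check the arithmetic, here is a minimal sketch that recomputes the shares quoted above from the raw counts. The counts come from this post; the variable names, the `shares` helper, and the "other" bucket for dissertations (derived as 15 minus 11) are my own illustrative assumptions, not part of any published dataset.

```python
# Counts as reported in this post; names are illustrative only.
publications = {"Wayback Machine": 14, "Twitter": 4, "arXiv": 2, "other": 5}
grants = {"Wayback Machine": 6, "Twitter": 3, "other": 2}
# "other" here is derived (15 alumni minus 11), not stated directly in the post.
dissertations = {"Wayback Machine": 11, "other": 4}


def shares(counts):
    """Return each category's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


for label, counts in [
    ("publications", publications),
    ("grants", grants),
    ("dissertations", dissertations),
]:
    print(label, shares(counts))
```

Rounding to whole percentages matches how the figures are quoted in the text; because each category is rounded separately, the rounded shares for a corpus need not sum to exactly 100.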

Those interested in the details are invited to explore the research publications and ask questions.  I've provided a publication-centric accounting of the research impact of the Wayback Machine because that's still the coin of the academic realm, but in reality publications are typically supported by web sites (e.g., DSA site), blog posts (e.g., "One in Five arXiv Articles Reference GitHub"), GitHub repos (e.g., TrendMachine), YouTube videos (e.g., "Web Archiving Livestream With NFL Challenge Gameplay 2022-05-17"), and more.

Hopefully the above conveys the research impact that the Wayback Machine has on our research program. This relationship is reinforced further by the Internet Archive hiring one of our PhDs, Dr. Sawood Alam, as the Research Lead for the Wayback Machine.  Certainly WS-DL is not the only research group to engage at this level with the Internet Archive – for example, there is the ARCH project (née Archives Unleashed), GLAM Workbench, CEDWARC, and others.  To be fair, the Internet Archive does not exist primarily to be a research platform – it exists to provide "Universal access to all knowledge."  But the pursuit of the latter enables the former, and that has been invaluable for WS-DL.    

Unfortunately, there are threats to the corpuses that enable research.  Wikipedia and search engines are under threat from generative AI.  arXiv's funding model still depends on the largess of its host institution, Cornell University.  Twitter is notoriously being dismantled by its owner.  And the Internet Archive is under constant legal threat, most recently UMG v. Internet Archive for the Great 78 Project, and Hachette v. Internet Archive for the National Emergency Library and controlled digital lending.  Innovation will always have its opponents, and the Internet Archive's innovations find frequent opponents among well-resourced conventional publishers.  I personally maintain a long-standing, small monthly donation to the Internet Archive, and I urge others to consider doing so as well.

As stated in the alt-text punchline of the opening XKCD comic, the Internet Archive is "an invaluable project which, if [it] didn't exist, we would dismiss as obviously ridiculous and unworkable."  This "ridiculous and unworkable" project has provided a foundation for our research program, and in recent years has accounted for over half of our publications and external grants, and nearly three quarters of our PhD dissertations.  We have no recourse for the loss of Twitter as a research enabler, but we are doing what we can to protect and preserve the Internet Archive.

Happy 27th!


* Cultural context: Walter Cronkite was a news anchor for the CBS Evening News from 1962–1981 and was considered the "most trusted man in America".  

** The other five publications were on other topics. Some publications straddle multiple corpuses, but I've classified them according to a single corpus.

*** The other two grants involved other topics.