2023-10-10: In appreciation of the "ridiculous and unworkable" projects that make the Internet great and research possible

https://xkcd.com/2085/

The Internet Archive is hosting their annual celebration this week (October 12, 2023), and I wanted to take this opportunity to both 1) encourage your attendance (virtual for most of us, but if you're in San Francisco, you can attend in person), and 2) express my appreciation and gratitude for the continued existence of the Internet Archive, their evolving products and services, and their support of the research community.

The ongoing devolvement of Twitter into 4chan has caused me to reflect on the platforms, services, and corpuses on which I have built a research program over the last 20+ years. Discussing the Twitter situation will be the topic of a future post, but here I want to laud the Internet Archive, specifically the Wayback Machine, and by extension, the suite of other public web archives, such as Archive.Today, Arquivo.pt, and the many members of IIPC. In the past I've referred to the Internet Archive as the "Walter Cronkite of web archives".* While much of my research has involved archive interoperability, one must acknowledge and appreciate the Internet Archive as the preeminent and pioneering web archive – this portion of my research program simply would not exist without the Internet Archive.

For the Internet Archive's 20th and 25th anniversaries, WS-DL members wrote about their personal interest in and use of web archives, instead of our usual technical explorations. Some of the entries included: exploring the history of the student-run newspaper at the University of Florida, archived (real) web sites mentioned in fictional works, the tangled history of the official page of the LSU Women's Basketball program, and exploring the archived contemporary pages about 9/11 by someone too young to remember the events. Our emphasis for those anniversary celebrations was on taking a break from our normal technical discussions, discussions that are often deep in the weeds and inaccessible to a general audience, and reminding everyone that our technical studies exist to support these more human interests: what can we learn about ourselves by looking at old web pages? For us, this week's celebration is slightly different: we celebrate that the Internet Archive's Wayback Machine, as well as other public web archives, has provided the foundation of much of our group's research program.

The Web Science and Digital Libraries (WS-DL) Research Group performs research in web archiving, web science, digital libraries, neuro-information retrieval, social media, digital preservation, human-computer interaction, information visualization, natural language processing, accessibility, and mining scholarly data. Much of what we do involves studying well-known web sites as corpuses: studying the behavior of the site, some characteristics of the site's holdings, or building additional services for the site.

In the last two years, three of the main corpuses that we have studied have been the Wayback Machine (and other web archives), arXiv, and Twitter. In previous years, Wikipedia and Google Search Engine Results Pages (SERPs) have also been studied, but the Wayback Machine and Twitter are probably the two most prominent sources for our studies. In 2022 and so far in 2023, I have published about 25 papers. The list below shows the primary corpus explored in 20 of the 25 publications: 14 involve the Wayback Machine, 4 involve Twitter, and 2 involve arXiv.**

The Wayback Machine (and other web archives)

Synthesizing Web Archive Collections into Big Data: Lessons from Mining Data from Web Archives (TPDL 2023)
TrendMachine: A Temporal Webpage Resilience Portal (JCDL 2023)
Hashes are not suitable to verify fixity of the public archived web (PLOS ONE 2023)
Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering (JCDL 2023)
Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives (JCDL 2023)
Summarizing web archive corpora via social media storytelling by automatically selecting and visualizing exemplars (ACM TWeb 2023)
Less than 4% of Archived Instagram Account Pages for the Disinformation Dozen are Replayable (JCDL 2023)
To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages (ACM TWeb 2023)
Robots still outnumber humans in web archives, but less than before (TPDL 2022)
Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests (ICADL 2022)
Web Archiving as Entertainment (ICADL 2022)
Creating structure in web archives with collections: different concepts from web archivists (TPDL 2022)
A chromium-based memento-aware web browser (TPDL 2022)
The DSA Toolkit Shines Light Into Dark and Stormy Archives (Code{4}Lib, 2022)
Memento Validator: A toolset for Memento compliance testing (JCDL 2022)

Twitter

Challenges in replaying archived Twitter pages (IJDL 2023)
Extracting Information from Twitter Screenshots (techreport 2023)
Twitter DM Videos Are Accessible to Unauthenticated Users (techreport 2023)
Did They Really Tweet That? Querying Fact-Checking Sites and Politwoops to Determine Tweet Misattribution (techreport 2022)

arXiv

It’s Not Just GitHub: Identifying Data and Software Sources Included in Publications (TPDL 2023)
The Rise of GitHub in Scholarly Publications (TPDL 2022)

The Wayback Machine is responsible for at least 56% (14/25) of our publications in the last two years. The percentages are similar for our externally funded grants: of the 11 externally funded grants we've received since 2020, 55% (6/11) involve the Wayback Machine and 27% (3/11) involve Twitter.*** Furthermore, 73% (11/15) of our PhD alumni have dissertations that prominently feature the Wayback Machine.

Those interested in the details are invited to explore the research publications and ask questions. I've provided a publication-centric accounting of the research impact of the Wayback Machine because that's still the coin of the academic realm, but in reality publications are typically supported by web sites (e.g., DSA site), blog posts (e.g., "One in Five arXiv Articles Reference GitHub"), GitHub repos (e.g., TrendMachine), YouTube videos (e.g., "Web Archiving Livestream With NFL Challenge Gameplay 2022-05-17"), and more.

Hopefully the above conveys the research impact that the Wayback Machine has on our research program. This relationship is reinforced further by the Internet Archive hiring one of our PhDs, Dr. Sawood Alam, as the Research Lead for the Wayback Machine. Certainly WS-DL is not the only research group to engage at this level with the Internet Archive – for example, there is the ARCH project (née Archives Unleashed), GLAM Workbench, CEDWARC, and others. To be fair, the Internet Archive does not exist primarily to be a research platform – it exists to provide "Universal access to all knowledge." But the pursuit of the latter enables the former, and that has been invaluable for WS-DL.

Unfortunately, there are threats to the corpuses that enable research. Wikipedia and search engines are under threat from generative AI. arXiv's funding model still depends on the largess of its host institution, Cornell University. Twitter is notoriously being dismantled by its owner. And the Internet Archive is under constant legal threat, most recently UMG v. Internet Archive for the Great 78 Project, and Hachette v. Internet Archive for the National Emergency Library and controlled digital lending. Innovation will always have its opponents, and the Internet Archive's innovations find frequent opponents among well-resourced conventional publishers. I personally maintain a long-standing, small monthly donation to the Internet Archive, and I urge others to consider doing so as well.

As stated in the alt-text punchline of the opening XKCD comic, the Internet Archive is "an invaluable project which, if [it] didn't exist, we would dismiss as obviously ridiculous and unworkable." This "ridiculous and unworkable" project has provided a foundation for our research program, and in recent years has accounted for over half of our publications and external grants, and nearly three quarters of our PhD dissertations. We have no recourse for the loss of Twitter as a research enabler, but we are doing what we can to protect and preserve the Internet Archive.

Happy 27th!

Michael

* Cultural context: Walter Cronkite was the news anchor for the CBS Evening News from 1962 to 1981 and was considered the "most trusted man in America".

** The other five publications were on other topics. Some publications straddle multiple corpuses, but I've classified them according to a single corpus.

*** The other two grants involved other topics.
