2026-03-17: The Disintegration Loops: Generational Loss in Web Archives
Michael L. Nelson
As part of the Internet Archive's Information Stewardship Forum (March 18–20, 2026), I decided to use my five-minute lightning talk to raise the issue of generational loss in web archives. Or more directly, making copies of copies (...of copies…) – something that web archives currently do not do well. My title is based on William Basinski's four-volume release "The Disintegration Loops", in which he played audio tapes of "found sounds", recorded decades earlier, in loops, with the whole process lasting over an hour. The effect is hauntingly beautiful, with each loop slightly degrading the magnetic tape, resulting in a generational loss. The degradation of each loop is right on the edge of the just-noticeable difference, until the entire track is reduced to just a shadow of its former self.
I first discussed this topic in my 2019 CNI closing keynote (slide 88), where I introduced the inability of web archives to archive other web archives as part of the larger issue of web archive interoperability. Let's begin by walking through the example of archiving a tweet (which we already know to be challenging!). The original tweet is still on the live web, even though the UI has undergone many revisions since it was originally tweeted in 2018.
https://twitter.com/phonedude_mln/status/990054945457147904
(screen shot from 2026-03-17)
I archived that tweet to the Internet Archive's Wayback Machine in 2018 (screen shot from 2019):
I then archived the Wayback Machine's copy of the tweet to archive.today in 2019 (screen shot from 2019):
Note that archive.today is aware that the page comes from the Wayback Machine but the original host is twitter.com, and it maintains both the original Memento-Datetime (20180501125952) as well as its own Memento-Datetime (20190407023141). I then archived archive.today's memento to perma.cc in 2019 (screen shot from 2019):
Finally, I archived the perma.cc memento back to the Wayback Machine in 2019 (screen shot from 2019):
https://web.archive.org/web/20190407024654/https://perma.cc/3HMS-TB59
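The URL above nests one archive inside another: the Wayback Machine's own capture time and the perma.cc URL it captured are both readable from the URI-M itself. As a minimal sketch (the function names and the regular expression are my own illustration, not part of any archive's API, and Wayback URL modifiers such as `id_` are deliberately not handled), the 14-digit timestamp can be split out and rendered in the HTTP-date form used by the Memento-Datetime header:

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed shape of a Wayback-style URI-M: <archive>/web/<14-digit timestamp>/<URI-R>
URI_M = re.compile(r"^(?P<archive>https?://[^/]+)/web/(?P<ts>\d{14})/(?P<uri_r>.+)$")


def parse_uri_m(uri_m):
    """Split a Wayback-style URI-M into (archive, capture datetime, original URI).

    Returns None when the URL does not match the assumed shape.
    """
    match = URI_M.match(uri_m)
    if match is None:
        return None
    # Wayback timestamps are expressed in UTC.
    captured = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    captured = captured.replace(tzinfo=timezone.utc)
    return match.group("archive"), captured, match.group("uri_r")


def memento_datetime(captured):
    """Format a UTC datetime as an HTTP-date, the form Memento-Datetime uses."""
    return format_datetime(captured, usegmt=True)


if __name__ == "__main__":
    parsed = parse_uri_m(
        "https://web.archive.org/web/20190407024654/https://perma.cc/3HMS-TB59"
    )
    if parsed is not None:
        archive, captured, uri_r = parsed
        print(archive, memento_datetime(captured), uri_r)
```

Note what this does not recover: the perma.cc URL carries no timestamp in its path, so the earlier links in the chain (archive.today, then the 2018 Wayback Machine capture) are invisible from this URI-M alone. That is the interoperability gap in miniature.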
Although the loss occurs in discrete chunks, it is reminiscent of Basinski's Disintegration Loops, with information lost at each step, and the final version being a mere shadow of the original. In 2019, this was not universally recognized as a problem, since archiving the playback interface of other web archives was not considered a problem in its own right. The "right" solution, of course, is to share the WARC files (or WAC, or HAR, or…) out-of-band and let the other web archives replay from the same source files. But this is rarely possible: for a variety of reasons web archives typically do not share the original WARC files, and in the case of archive.today, might not even store the original source files (and instead, likely only store the radically transformed pages).
More importantly, it is sometimes useful to archive a particular web archive's replay of a page, which itself must be archived, because it changes through time. For example, memento #3 (the perma.cc memento of archive.today's memento) is now different; this is a screen shot from 2026:
2026 replay of https://perma.cc/3HMS-TB59
Surely the source files themselves have not changed, and the difference is due to improvements in pywb, which is under constant development. perma.cc's replay of the 2019 page in 2019 is different from the replay from 2026, which implies that it could be different still in the future. But we cannot currently archive perma.cc's replay of that page to, say, the Wayback Machine without generational loss. The fact that screen shots – which are rife with their own potential for abuse (cf. HT 2025, arXiv 2022) – are the only mechanism to document these replay differences underscores the web archive interoperability problem.
I chose the topic of generational loss for my slot at the Information Stewardship Forum because recent events have introduced a new use case for archiving the replay of web archives. Wikipedia recently announced it was blacklisting archive.today because its editors discovered that webmaster at archive.today was using its captcha to direct a DDoS attack against a blog owned by someone with whom webmaster had a dispute (the blogger had posted a lengthy investigation of the identity of webmaster). More disturbingly for our discussion, webmaster had also edited the content of an archived page to include the name of the blogger where it would not otherwise be. The Wikipedia discussion page is hard to follow, in part because the editors are discussing how to archive the replay of an archived page. For one example, they show how the archive.today replay now has been changed back to have "Comment as: Nora Puchreiner" (middle of the image):
But the replay alteration from archive.today in question is archived at megalodon.jp to show that the name "Nora Puchreiner" was replaced with the name of the blogger that had earned webmaster's ire, "Jani Patokallio". And yes, megalodon.jp's replay of archive.today's memento is that bad (at least in my browser, it is shrunk down impossibly small), so I used the dev tools to find the string in question.
Another Wikipedian archived (using yet another archive, ghostarchive.org) a google.com SERP to show that archive.today has reverted from "Jani Patokallio" back to "Nora Puchreiner".
https://ghostarchive.org/archive/c0ZP0
What does changing "Nora" to "Jani" (and then changing it back again) accomplish? I'm not sure; this appears to be just a petty response to an ongoing dispute. But the implication is profound: this is the first known example of a major web archive purposefully and maliciously altering its contents, something that we knew was possible but had not yet experienced.
We have long known that replay can change through time (cf. PLOS One 2023) due to the replay engine (the Wayback Machine, Open Wayback, pywb, etc.) evolving, but these changes were engineering results and the replay mostly improved over time. But now we have seen web archives maliciously alter (and then revert) the replay, and we need a more standard and interoperable way to archive archival replay. Not just to prove that a web archive did alter its replay, but also to prove that an archive did not alter its replay. Out-of-band sharing of WARC files is the gold standard, but for a variety of reasons this is unlikely to happen. We must be able to use web archives to verify and validate web archives. We explored a heavyweight design for this a few years ago (JCDL 2019), but it should be revisited in light of developments like WACZ.
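To make the "prove it did, or did not, alter its replay" requirement concrete, here is a minimal sketch of comparing two captures of the same replay. It is not the design from the JCDL 2019 paper; the function names and the sample HTML are hypothetical, and it assumes that some third-party archive has already stored the replay bytes faithfully, which is exactly what we cannot yet rely on. A digest answers "did anything change?", and a line diff answers "what changed?":

```python
import difflib
import hashlib


def digest(body: bytes) -> str:
    """Return a labeled SHA-256 digest of a captured replay body."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


def replay_changes(before: str, after: str):
    """Return (removed, added) lines between two captures of one replay."""
    removed, added = [], []
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " are unchanged; "? " lines are ndiff hints.
    return removed, added


if __name__ == "__main__":
    # Hypothetical replay bodies, invented for illustration only.
    capture_1 = "<h1>Post</h1>\n<p>Comment as: Alice</p>\n<p>Body text</p>"
    capture_2 = "<h1>Post</h1>\n<p>Comment as: Bob</p>\n<p>Body text</p>"

    if digest(capture_1.encode()) != digest(capture_2.encode()):
        removed, added = replay_changes(capture_1, capture_2)
        print("removed:", removed)
        print("added:", added)
```

A digest mismatch alone is weak evidence: replay engines rewrite links and inject banners, so two honest replays of the same source files can differ byte for byte. Separating that benign churn from a deliberate edit is why the diff, and ideally access to the underlying source files, still matters.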
–Michael