2025-06-10: The Wayback Machine is now much larger than the sum of all other web archives

Percentage of Retrievable Archived URIs from All Archives with and without The IA

In this post, we summarize our study on the archiving rate of news stories published in Arabic and English by four major news outlets: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. We found that 45% of news stories' URIs published between 1999 and 2022 were not archived at all. Furthermore, 65% of the stories published between 1999 and 2013 were not archived, while only 21% of the stories published between 2013 and 2022 were not archived. Our findings line up with Ainsworth et al. (2011), who found that between 35% and 90% of the web is archived. Our results indicate a notable improvement in web archiving within the last decade; however, we found that improvement to be limited to the Internet Archive.

An earlier study by Alsum et al. (2014), on a different dataset, found that it was possible to retrieve full TimeMaps for 55% of their dataset using the top three web archives excluding the Internet Archive (IA). Ten years later, for our dataset, we found those numbers to be 6.74% and 4.76% for URIs from 1999 to 2013 and from 1999 to 2022, respectively. The percentage does not change as the number of web archives queried for mementos increases because, other than the IA and Archive-It, only four web archives returned copies of the archived web pages we queried: Archive Today, Arquivo.pt, Perma.cc, and the Stanford Web Archive. Alsum et al. were able to get mementos from more archives than we did because, unlike our dataset, theirs was sampled from DMOZ, a web directory that web archives used as a source for seed lists, and because more web archives were in operation at the time of their study (2013). They were able to retrieve complete TimeMaps for 93% of their dataset using the top nine web archives excluding the Internet Archive. Our results show that 95.24% of the URIs in our dataset that are archived by the IA would be lost forever if the Internet Archive were, for example, crippled or killed by legal threats, or became unavailable or partially unavailable as a result of multiple cyber attacks. Although recent attacks were fended off, interruptions will continue to occur as long as such attacks keep happening.
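
For readers unfamiliar with Memento terminology, a TimeMap (RFC 7089) is a machine-readable list of all known mementos (archived copies) of an original URI. A minimal link-format example, with a hypothetical URI and dates, looks like this:

```
<https://example.com/news/story>; rel="original",
<http://web.archive.org/web/timemap/link/https://example.com/news/story>
  ; rel="self"; type="application/link-format",
<http://web.archive.org/web/20140101000000/https://example.com/news/story>
  ; rel="first memento"; datetime="Wed, 01 Jan 2014 00:00:00 GMT",
<http://web.archive.org/web/20220901000000/https://example.com/news/story>
  ; rel="last memento"; datetime="Thu, 01 Sep 2022 00:00:00 GMT"
```

Retrieving a "full TimeMap" for a URI means aggregating entries like these across every queried archive.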

The large difference between our results and those of Alsum et al. is due to the use of a different dataset, the ten-year gap between the two studies, the absence of some public web archives that were operating in 2013 but had stopped by 2022 (Library and Archives Canada, The European Archive, The National Library of Ireland, the Public Record Office of Northern Ireland, and WebCite), their inclusion of more archives via Memento proxies that are no longer operational, and the dramatic improvement the IA has made within the last decade; other archives did not expand or improve as much. Our study shows that the percentage of retrievable web pages varies with the dataset and with the emergence and disappearance of web archives over time.

In 2024, the results are very different from 2014: over 95% of the archived Arabic and English news stories in our dataset would be lost forever if the IA were shuttered. We found that the ability to retrieve full TimeMaps for our dataset went from 40% for URIs from 1999 to 2013 to 55% for URIs from 1999 to 2022, indicating a significant improvement in web archives' performance in the last 10 years. But because only a few mementos were retrieved from archives other than the IA, we conclude that the improvement in performance is limited to the IA and that losing it would be a major catastrophe for web archiving.

The Dataset

In September 2022, I collected 1.5 million multilingual news stories' URLs from the sitemaps of four leading news outlets: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. I examined multiple methods of fetching URLs, including RSS, Twitter, GDELT, web crawling, and sitemaps; using sitemaps yielded the largest number of stories' URLs.

The dataset is available on GitHub.

The URIs in the dataset are for news stories published on the web between 1999 and September 2022. Only a few stories were collected from 1999 because some of these websites did not exist back then, like Alarabiya (2004) and Aljazeera English (2006), or their online presence was not yet notable, like Arab News and Aljazeera Arabic. The dataset is grouped by the day the news stories were published; other groupings, such as by news source and by year, are also available. For the purpose of my study, I grouped the stories by year, then by day, to be able to extract a sample (a day's worth of news stories' URLs for each year) that represents the output of these news outlets for that year. The following table shows, for each year, the minimum and maximum number of stories published in a single day by all four news outlets, along with the median and mean number of stories published per day. Excluding stories published in 1999 due to the tiny number collected, the minimum number of stories published in a single day across all years, 2000 to 2022, is indicated in red in the table, and the maximum in green.
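
To illustrate the sampling step, here is a minimal Python sketch, assuming the dataset's one-file-per-day layout with files named YYYY-MM-DD.txt in a hypothetical dataset/ directory, that computes each year's statistics and picks the day representing the median:

```python
import statistics
from collections import defaultdict
from pathlib import Path

# year -> {day file name: number of story URLs collected that day}
counts = defaultdict(dict)
for day_file in Path("dataset").glob("*.txt"):  # hypothetical directory
    year = day_file.name[:4]
    counts[year][day_file.name] = sum(1 for _ in day_file.open(encoding="utf-8"))

for year in sorted(counts):
    per_day = counts[year]
    values = sorted(per_day.values())
    median = statistics.median(values)
    # The year's sample is the day whose story count sits at the median
    median_day = min(per_day, key=lambda name: abs(per_day[name] - median))
    print(year, min(values), max(values), median,
          round(statistics.mean(values), 2), median_day)
```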



The minimum and the maximum number of stories published each day by year
Year Min Max Median Mean
1999 1 2 1 1.33
2000 2 48 26 27.28
2001 14 174 82 81.26
2002 23 147 93 93.86
2003 44 198 127 126.53
2004 54 573 140 132.39
2005 105 553 165 167.46
2006 101 238 154 153.61
2007 112 231 164 162.92
2008 109 272 157 158.43
2009 82 312 149 151.15
2010 137 317 193 195.81
2011 135 664 236 237.53
2012 98 389 242 243.63
2013 128 332 259 245.69
2014 117 260 178 177.75
2015 117 217 171 168.77
2016 106 464 181 186.9
2017 144 356 220 219.58
2018 118 266 182 184.96
2019 107 240 177 176.63
2020 215 437 345 338.94
2021 86 444 340 323
2022 69 275 133 130.75

Plotting the min, max, median, and mean of the number of news stories collected each day by year shows that the dataset has a symmetrical distribution: the mean and the median for each year are the same or very close. Therefore, the day that represents the median number of news stories published by all four news sources is a good representative of the number of stories published each day that year.

The min, max, median, and mean for the number of collected stories' URIs each day by year

We noticed that, with some exceptions, the number of stories published each year is increasing over time. The drop in the number of stories published in 2022 is because we constructed our dataset in September of 2022, so not all stories published in 2022 were captured.

The median number of collected URIs per day by year

The following table is the same as the one above, with the addition of the corresponding file names in the dataset so researchers can reproduce the results.

The minimum and the maximum number of stories published each day by year with the files' names
Year  Min  Max  Median    Mean  Median file name  Max file name   Min file name
1999    1    2       1    1.33  1999-01-09.txt    1999-01-09.txt  1901-12-13.txt
2000    2   48      26   27.28  2000-11-21.txt    2000-12-24.txt  2000-11-12.txt
2001   14  174      82   81.26  2001-04-21.txt    2001-10-01.txt  2001-01-02.txt
2002   23  147      93   93.86  2002-01-01.txt    2002-05-16.txt  2002-12-31.txt
2003   44  198     127  126.53  2003-05-19.txt    2003-12-08.txt  2003-05-31.txt
2004   54  573     140  132.39  2004-01-03.txt    2004-10-03.txt  2004-08-14.txt
2005  105  553     165  167.46  2005-01-15.txt    2005-01-10.txt  2005-11-04.txt
2006  101  238     154  153.61  2006-05-02.txt    2006-03-02.txt  2006-10-24.txt
2007  112  231     164  162.92  2007-02-17.txt    2007-04-19.txt  2007-10-13.txt
2008  109  272     157  158.43  2008-01-04.txt    2008-02-11.txt  2008-07-23.txt
2009   82  312     149  151.15  2009-06-22.txt    2009-10-01.txt  2009-03-14.txt
2010  137  317     193  195.81  2010-04-05.txt    2010-04-21.txt  2010-04-17.txt
2011  135  664     236  237.53  2011-04-01.txt    2011-12-12.txt  2011-01-08.txt
2012   98  389     242  243.63  2012-01-18.txt    2012-05-09.txt  2012-02-16.txt
2013  128  332     259  245.69  2013-01-02.txt    2013-04-30.txt  2013-10-17.txt
2014  117  260     178  177.75  2014-02-09.txt    2014-11-19.txt  2014-10-24.txt
2015  117  217     171  168.77  2015-02-10.txt    2015-03-16.txt  2015-09-26.txt
2016  106  464     181   186.9  2016-03-29.txt    2016-06-06.txt  2016-07-08.txt
2017  144  356     220  219.58  2017-01-11.txt    2017-04-24.txt  2017-01-20.txt
2018  118  266     182  184.96  2018-01-13.txt    2018-04-10.txt  2018-08-17.txt
2019  107  240     177  176.63  2019-01-29.txt    2019-10-23.txt  2019-08-11.txt
2020  215  437     345  338.94  2020-02-28.txt    2020-06-17.txt  2020-08-01.txt
2021   86  444     340     323  2021-03-07.txt    2021-03-24.txt  2021-12-25.txt
2022   69  275     133  130.75  2022-01-28.txt    2022-09-08.txt  2022-08-06.txt

The Method

I used MemGator to check whether the collected news stories were archived by public web archives. The list of default archives checked by MemGator is rather long; however, most of these archives never returned a response with any archived copies, so I eliminated them. I also eliminated the Library of Congress archive at their request, because the archive does not have the capacity to support the volume of requests we made in the course of this study. Below is an alphabetically ordered list of the archives that returned at least one copy of at least one URL out of the 4116 URLs we queried, followed by a sketch of how each URL was checked:

1. archive.today: Archive Today
2. arquivo.pt: The Portuguese Web Archive
3. perma.cc: Perma.cc Archive
4. swap.stanford.edu: Stanford Web Archive
5. wayback.archive-it.org: Archive-It (powered by the Internet Archive)
6. web.archive.org: the Internet Archive
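
A minimal sketch of that check, assuming a local MemGator instance running on its default port (1208) and serving TimeMaps at /timemap/link/<URI-R>; the input file name comes from the dataset's median-day files:

```python
import requests  # third-party; pip install requests

MEMGATOR = "http://localhost:1208"  # hypothetical local MemGator instance

def is_archived(uri: str) -> bool:
    """Return True if any aggregated web archive reports a memento for uri.

    MemGator responds 200 with a TimeMap when at least one memento
    exists, and 404 when no archive holds a copy.
    """
    resp = requests.get(f"{MEMGATOR}/timemap/link/{uri}", timeout=120)
    return resp.status_code == 200

with open("2013-01-02.txt", encoding="utf-8") as f:  # a median-day file
    urls = [line.strip() for line in f if line.strip()]

archived = [u for u in urls if is_archived(u)]
print(f"{len(archived)} of {len(urls)} URIs have at least one memento")
```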

We found that 2269 URIs were archived, each having at least one memento retrievable from at least one web archive. The other 1847 URIs were not archived at all by any web archive.

This is a list of archives that did not return mementos for any of the URIs we queried:

1. waext.banq.qc.ca: Libraries and National Archives of Quebec
2. warp.da.ndl.go.jp: National Diet Library, Japan
3. wayback.nli.org.il: National Library of Israel
4. wayback.vefsafn.is: Icelandic Web Archive
5. web.archive.bibalex.org: Bibliotheca Alexandrina Web Archive
6. web.archive.org.au: Australian Web Archive
7. webarchive.bac-lac.gc.ca: Library and Archives Canada
8. webarchive.loc.gov: Library of Congress
9. webarchive.nationalarchives.gov.uk: UK National Archives Web Archive
10. webarchive.nrscotland.gov.uk: National Records of Scotland
11. webarchive.org.uk: UK Web Archive
12. webarchive.parliament.uk: UK Parliament Web Archive

The sample I checked for each year consists of the stories published by the four news outlets on the day that represents the median for that year.

Results and Discussion

I found that 45% of the stories I collected between 1999 and 2022 are not archived, which is in agreement with Ainsworth et al. (2011), who found that between 35% and 90% of URIs have at least one archived copy. These stories are on the live web; unlike news stories that are behind paywalls (Atkins et al. 2018), these stories can be crawled, and they are part of the set of publicly archivable web pages. Their URIs can easily be pushed to the IA in batches using archive.org services, which is what I am currently working on and will report on in the future. They can also be pushed to other public web archives using ArchiveNow, as sketched below.
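
For example, a minimal sketch using ArchiveNow's Python interface (the story URI below is hypothetical) to push a URI to the Internet Archive:

```python
from archivenow import archivenow  # pip install archivenow

# Hypothetical unarchived story URI from the dataset
uri = "https://www.arabnews.com/node/example-story"

# "ia" is ArchiveNow's identifier for the Internet Archive; other handlers
# include "is" (archive.today) and "cc" (Perma.cc, which requires an API key)
result = archivenow.push(uri, "ia")
print(result)  # on success, a list containing the new memento's URI
```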

The percentage of archived URIs by year


I found that the Internet Archive has 99.74% (2263 URIs) of the archived stories. The union of all other web archives had only 4.76% (108 URIs) of the archived stories. Remarkably, only six web pages out of 2269 were archived by other archives but missing from the IA! That is about one news story missing from the IA every four years. In other words, using our dataset as a sample, if we lose all public web archives except the IA, we will lose only 0.26% of archived pages, all of them from before 2013. On the other hand, using the same sample, if we lose the IA while all other archives maintain their status, we will lose 95.24% of archived web pages.
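
These percentages follow directly from set arithmetic over the 2269 archived URIs; here is a minimal sanity check with the counts hard-coded from our results:

```python
total_archived = 2269   # URIs with at least one memento anywhere
in_ia = 2263            # URIs with at least one memento in the IA
in_others = 108         # URIs with at least one memento elsewhere
only_in_others = 6      # archived somewhere, but missing from the IA

assert in_ia + only_in_others == total_archived
print(f"IA coverage:          {in_ia / total_archived:.2%}")           # 99.74%
print(f"Others' coverage:     {in_others / total_archived:.2%}")       # 4.76%
print(f"Lost without others:  {only_in_others / total_archived:.2%}")  # 0.26%
print(f"Lost without the IA:  {1 - in_others / total_archived:.2%}")   # 95.24%
```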

The percentage of archived URIs in the IA and in the union of other archives

From 2013 until 2022, the IA did not miss a single page that other archives captured. That is almost a decade of the IA not only being the largest web archive, but also archiving a much larger set of web pages than exists in the union of all other web archives.

The percentage of URIs exclusively archived in the IA and the union of other archives

The results show that the union of all archives besides the IA has archived only 4.76% of the total archived URLs. In other words, if the IA were shut down tomorrow, we could retrieve, at most, 4.76% of the lost copies. These results are very different from what we were able to retrieve 10 years ago (Alsum et al. 2014). In that study, the authors show that despite the IA being the largest archive on the internet, it was possible to retrieve complete TimeMaps for 93% of the dataset from the top 9 web archives excluding the IA.

The percentage of retrievable archived URIs from all archives with and without the IA


The key differences between my experiment and the experiment conducted in 2013 are the dataset and the way the sample was selected.

In my experiment, the dataset consists of 1.5 million URLs collected from the sitemaps of four major news outlets publishing in Arabic and English and owned by Arab news agencies. We split the dataset into smaller datasets based on the day on which each news story was published. We selected the day that represents the median number of news stories published to represent the year, and we queried the web archives with the actual stories' URLs, without trimming or reducing the URLs to hostnames. For example, if the URL for a news story in the sample is https://example.com/foo/bar, we queried web archives with https://example.com/foo/bar.

On the other hand, Alsum et al. (2014) sampled data from DMOZ, archives' holdings, and archive access logs, but they only used hostnames. For example, https://example.com/foo/bar is queried as https://example.com. Their samples may have overlapping hostnames, but within each sample, hostnames are unique. So the experiment in 2013 was performed on hostnames, which typically correspond to websites' home pages, not on URLs whose content relates to a specific event.

Another difference is the way hostnames were sampled from DMOZ. Although one of the DMOZ samples was randomized, the other two were language controlled and TLD (top-level domain) controlled. Their experiment categorized DMOZ by TLD, limited the study to a specific set of TLDs distributed around the world, and randomly selected a sample for each selected TLD. Their language-controlled DMOZ sample consisted of hostnames for a limited set of languages representing the world, after categorizing DMOZ by language.

Regardless of the sources they used to build their sample, I believe that the main reason behind the significant difference between our results and theirs is the reduction of the collected URLs to hostnames in their sample. While home pages are very important to archive, because a home page represents what the website considers most important to show users when they first open it, the user has to navigate away from the home page to get information about a certain topic and find the details they are looking for (a news story in our case).

Furthermore, utilizing archive holdings to find hostnames eliminates bias towards archives that used DMOZ as a URI source, like the IA, but the method lets archives much smaller than the IA contribute the same number of hostnames to the sample, and since they already archive these hostnames, they are guaranteed to have copies of them. For example, if the sample has hostname sets a, b, c, d, and e, where a is the set of hostnames sampled from archive A, b is the set sampled from archive B, and so on, then the complete set of archived hostnames can be almost completely retrieved from archives B, C, D, and E alone, because if the experiment included 10 archives, the union of any 9 of them would contain at least 90% of the archived hostnames, as the toy calculation below illustrates.
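
A minimal sketch of that worst-case bound, under the assumption that each of 10 hypothetical archives contributes the same number of hostnames from its own holdings:

```python
# Toy model: a sample pooled from the holdings of 10 hypothetical archives
num_archives = 10
k = 100                          # hostnames sampled from each archive
sample_size = num_archives * k   # 1000 hostnames in the sample

# Each archive is guaranteed to hold its own k contributions, so even in
# the worst case (no archive holds anything sampled from its peers), the
# union of any 9 archives still covers the other 9 contributions.
worst_case_coverage = (num_archives - 1) * k / sample_size
print(f"{worst_case_coverage:.0%}")  # 90%
```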

What Alsum et al. (2014) were able to prove is that complete TimeMaps can be retrieved from the top 9 archives excluding the IA. Our study, on the other hand, shows that excluding the IA would render 95.24% of archived pages irretrievable once they are no longer on the live web or their content has drifted since they were last archived.

I decided to provide another graph plotting my results in the same way the authors plotted theirs, to make the two results visually comparable. Of course, my plot only includes archives in which I was able to find at least one memento for at least one of the 4116 URIs that I queried.
Percentage of Retrievable Archived URIs from All Archives with and without The IA

The percentage of the retrieved timemaps for URIs in the dataset (Alsum et al. 2014)

The following table shows the results from checking the median-day news stories against all archives in MemGator's list that returned at least one copy of at least one URL. The table headers are shortcuts, as follows (a sketch showing how the set-based columns are computed appears after the list):

Collected: Number of URLs that were published during the median day for the year and collected at the time of dataset creation
Archived: Number of URLs that were published during the median day for the year and archived by any web archive
Not Archived: Number of URLs that were published during the median day for the year and not archived by any web archive
Archive.org: Number of URLs that were published during the median day for the year and found on archive.org
Archive-it.org: Number of URLs that were published during the median day for the year and found on archive-it.org
U IA: The union of URIs archived in archive.org and archive-it.org
& IA: The intersection of URIs archived in archive.org and archive-it.org
Archive.today: Number of URLs that were published during the median day for the year and found on archive.today
Stanford: Number of URLs that were published during the median day for the year and found on swap.stanford.edu
Arquivo.pt: Number of URLs that were published during the median day for the year and found on arquivo.pt
Perma.cc: Number of URLs that were published during the median day for the year and found on perma.cc
U others: The union of all web archives excluding archive.org and archive-it.org
U IA - U others: The number of URLs archived only by the Internet Archive
U others - U IA: The number of URLs archived by any archive other than the Internet Archive and missing from the Internet Archive
U IA & U others: The intersection between the union of all other archives and the Internet Archive
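
A minimal sketch of those set operations, assuming each archive's results have been gathered into a Python set of URIs (the empty sets below are placeholders for the collected data):

```python
# Hypothetical per-archive result sets: archive host -> set of URIs for
# which that archive returned mementos (populate from MemGator responses)
by_archive: dict[str, set[str]] = {
    "web.archive.org": set(), "wayback.archive-it.org": set(),
    "archive.today": set(), "arquivo.pt": set(),
    "perma.cc": set(), "swap.stanford.edu": set(),
}

ia = ("web.archive.org", "wayback.archive-it.org")
u_ia = by_archive[ia[0]] | by_archive[ia[1]]    # "U IA"
and_ia = by_archive[ia[0]] & by_archive[ia[1]]  # "& IA"
u_others = set().union(                         # "U others"
    *(uris for host, uris in by_archive.items() if host not in ia))

print(len(u_ia - u_others))   # "U IA - U others": archived only by the IA
print(len(u_others - u_ia))   # "U others - U IA": missing from the IA
print(len(u_ia & u_others))   # "U IA & U others": archived by both
```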


Also, one main difference between the two experiments is time. We collected data from 1999 to 2022, almost double the number of years studied in 2013. When we trimmed our data to include only URIs from 1999 to 2013 in order to compare the two studies, the percentage of archived pages that can be retrieved excluding the IA went from 4.76% to 6.74%, reducing the loss of archived pages from 95.24% to 93.26%. Therefore, the change in our results is minimal if we reduce our dataset to match the time frame of their dataset.

The following table is the trimmed version of the table above.

Plotting the graph for the trimmed dataset shows very little change from the previous graph. Placing our graph next to the authors' graph from the 2013 study shows the difference between our results. We queried all public web archives in 2022, while they queried their sample 10 years before that. As long as the URIs are on the live web, it is possible to archive pages from 2013 (and earlier) after 2013. Trimming our dataset in this manner, without considering each URI's crawl/archive dates and without eliminating mementos captured after 2013, poses a legitimate argument against our comparison. However, we are studying the current status of public web archives, and the main goal of our study is to expose the consequences of losing the IA. Their experiment showed that it was possible to retrieve all TimeMaps for home pages using the top 9 archives in 2013; our study shows the complete opposite, even when we trim our sample and examine only URIs that existed from 1999 to 2013.
Percentage of Retrievable Archived URIs from All Archives with and without The IA (1999-2013)

The percentage of the retrieved timemaps for URIs in the dataset (Alsum et al. 2014)


Comparing our two graphs, from the complete dataset and the trimmed dataset, shows that the percentage of our dataset for which we were able to retrieve full TimeMaps increased from around 40% for URIs published between 1999 and 2013 (trimmed dataset) to around 55% for URIs published between 1999 and 2022 (complete dataset). This increase indicates that although the number of published news stories grows over time, the performance of web archives, mainly the IA, has improved significantly in the last decade: they archived many more URIs each year than they did prior to 2013.
 
Our finding is not only important for URIs that are already archived. In fact, it is much more important for archiving, as soon as possible, the URIs that have not been archived. According to Nwala et al. (2018), news stories' URIs are difficult to "refind" on Google using the same query. That study shows that the probability of finding the same news story in the default Google search results using the same query drops to between 1% and 11% one week after the story is published, suggesting that Google search results replace old stories with new ones fairly quickly. It is, therefore, important to push the URIs that we collected but were unable to find in web archives, because they will probably never be found using Google search after this many years. We have a set of URIs for news stories on the live web; 45% of them are not archived, and they are not discoverable using search engines, so their chance of being archived is almost nonexistent. They will be lost forever, due to link rot, if they are not manually pushed to a web archive.

Conclusions

In this study, we established multiple facts using a sample from our dataset: the IA has, by far, the highest number of archived news stories among all archives; the IA has greatly increased its capacity and archiving capabilities since 2013, while other web archives have at best remained the same, and in some cases entire web archives have disappeared; and the IA has archived more news stories than all other archives combined. Losing the IA would mean losing 95.24% of the archived pages in our sample. Losing all other public web archives would cause the loss of only 0.26% of the archived web pages in our sample.
