2025-06-10: The Wayback Machine is now much larger than the sum of all other web archives

Percentage of Retrievable Archived URIs from All Archives with and without The IA

In this post, we summarize our study on the archiving rate of news stories published in Arabic and English by four major news outlets: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. We found that 45% of news stories' URIs published between 1999 and 2022 were not archived at all. Furthermore, 65% of the stories published between 1999 and 2013 were not archived, while only 21% of the stories published between 2013 and 2022 were not archived. Our findings line up with Ainsworth et al. (2011), who found that between 35% and 90% of the web is archived. Our results indicate a notable improvement in web archiving within the last decade; however, we found that improvement to be limited to the Internet Archive.

An earlier study by Alsum et al. (2014), on a different dataset, found that it was possible to retrieve full TimeMaps for 55% of their dataset using the top three web archives excluding the Internet Archive (IA). Ten years later, for our dataset, we found those numbers to be 6.74% and 4.76% for URIs from 1999 to 2013 and from 1999 to 2022, respectively. The percentage does not change as the number of web archives queried for mementos increases because, other than the IA and Archive-It, only four web archives returned copies of the archived web pages we queried: Archive Today, Arquivo.pt, Perma.cc, and the Stanford Web Archive. Alsum et al. were able to get mementos from more archives than we did because, unlike our dataset, theirs was sampled from DMOZ, a web directory that web archives used as a source for seed lists, and because more web archives were in operation at the time of their study (2013). They were able to retrieve complete TimeMaps for 93% of their dataset using the top nine web archives excluding the Internet Archive. Our results show that 95.24% of the URIs in our dataset that are archived by the IA would be lost forever if the Internet Archive were, for example, crippled or killed by legal threats, or became unavailable or partially unavailable as a result of multiple cyber attacks. Although recent attacks were fended off, interruptions will continue to occur as long as such attacks keep happening.
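
For readers unfamiliar with Memento terminology, a TimeMap (RFC 7089) is a machine-readable list of all known mementos (archived copies) of an original URI. A minimal link-format example, with a hypothetical URI and dates, looks like this:

```
<https://example.com/news/story>; rel="original",
<http://web.archive.org/web/timemap/link/https://example.com/news/story>
  ; rel="self"; type="application/link-format",
<http://web.archive.org/web/20140101000000/https://example.com/news/story>
  ; rel="first memento"; datetime="Wed, 01 Jan 2014 00:00:00 GMT",
<http://web.archive.org/web/20220901000000/https://example.com/news/story>
  ; rel="last memento"; datetime="Thu, 01 Sep 2022 00:00:00 GMT"
```

Retrieving a "full TimeMap" for a URI means aggregating entries like these across every queried archive.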

The large difference between our results and those of Alsum et al. is due to the use of a different dataset, the ten-year gap between the two studies, the absence of some public web archives that were operating in 2013 but had stopped by 2022 (Library and Archives Canada, The European Archive, The National Library of Ireland, the Public Record Office of Northern Ireland, and WebCite), their inclusion of more archives via Memento proxies that are no longer operational, and the dramatic improvement the IA has made within the last decade; other archives did not expand or improve as much. Our study shows that the percentage of retrievable web pages varies with the dataset and with the emergence and disappearance of web archives over time.

In 2024, the results are very different from 2014: over 95% of the archived Arabic and English news stories in our dataset would be lost forever if the IA were shuttered. We found that the ability to retrieve full TimeMaps for our dataset went from 40% for URIs from 1999 to 2013 to 55% for URIs from 1999 to 2022, indicating a significant improvement in web archives' performance in the last 10 years. But because only a few mementos were retrieved from archives other than the IA, we conclude that the improvement in performance is limited to the IA and that losing it would be a major catastrophe for web archiving.

The Dataset

In September 2022, I collected 1.5 million multilingual news stories' URLs from the sitemaps of four leading news outlets: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. I examined multiple methods of fetching URLs, including RSS, Twitter, GDELT, web crawling, and sitemaps; using sitemaps yielded the largest number of stories' URLs.

The dataset is available on GitHub.

The URIs in the dataset are for news stories published on the web between 1999 and September 2022. Only a few stories were collected from 1999 because some of these websites did not exist back then, like Alarabiya (2004) and Aljazeera English (2006), or their online presence was not yet notable, like Arab News and Aljazeera Arabic. The dataset is grouped by the day the news stories were published; other groupings, such as by news source and by year, are also available. For the purpose of my study, I grouped the stories by year, then by day, to be able to extract a sample (a day's worth of news stories' URLs for each year) that represents the output of these news outlets for that year. The following table shows, for each year, the minimum and maximum number of stories published in a single day by all four news outlets, along with the median and mean number of stories published per day. Excluding stories published in 1999 due to the tiny number collected, the minimum number of stories published in a single day across all years, 2000 to 2022, is indicated in red in the table, and the maximum in green.
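
To illustrate the sampling step, here is a minimal Python sketch, assuming the dataset's one-file-per-day layout with files named YYYY-MM-DD.txt in a hypothetical dataset/ directory, that computes each year's statistics and picks the day representing the median:

```python
import statistics
from collections import defaultdict
from pathlib import Path

# year -> {day file name: number of story URLs collected that day}
counts = defaultdict(dict)
for day_file in Path("dataset").glob("*.txt"):  # hypothetical directory
    year = day_file.name[:4]
    counts[year][day_file.name] = sum(1 for _ in day_file.open(encoding="utf-8"))

for year in sorted(counts):
    per_day = counts[year]
    values = sorted(per_day.values())
    median = statistics.median(values)
    # The year's sample is the day whose story count sits at the median
    median_day = min(per_day, key=lambda name: abs(per_day[name] - median))
    print(year, min(values), max(values), median,
          round(statistics.mean(values), 2), median_day)
```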



The minimum and the maximum number of stories published each day by year
Year Min Max Median Mean
1999 1 2 1 1.33
2000 2 48 26 27.28
2001 14 174 82 81.26
2002 23 147 93 93.86
2003 44 198 127 126.53
2004 54 573 140 132.39
2005 105 553 165 167.46
2006 101 238 154 153.61
2007 112 231 164 162.92
2008 109 272 157 158.43
2009 82 312 149 151.15
2010 137 317 193 195.81
2011 135 664 236 237.53
2012 98 389 242 243.63
2013 128 332 259 245.69
2014 117 260 178 177.75
2015 117 217 171 168.77
2016 106 464 181 186.9
2017 144 356 220 219.58
2018 118 266 182 184.96
2019 107 240 177 176.63
2020 215 437 345 338.94
2021 86 444 340 323
2022 69 275 133 130.75

Plotting the min, max, median, and mean of the number of news stories collected each day by year shows that the dataset has a symmetrical distribution: the mean and the median for each year are the same or very close. Therefore, the day that represents the median number of news stories published by all four news sources is a good representative of the number of stories published each day that year.

The min, max, median, and mean for the number of collected stories' URIs each day by year

We noticed that, with some exceptions, the number of stories published each year is increasing over time. The drop in the number of stories published in 2022 is because we constructed our dataset in September of 2022, so not all stories published in 2022 were captured.

The median number of collected URIs per day by year

The following table is the same as the one above, with the addition of the corresponding file names in the dataset so researchers can reproduce the results.

The minimum and the maximum number of stories published each day by year with the files' names
Year  Min  Max  Median    Mean  Median file name  Max file name   Min file name
1999    1    2       1    1.33  1999-01-09.txt    1999-01-09.txt  1901-12-13.txt
2000    2   48      26   27.28  2000-11-21.txt    2000-12-24.txt  2000-11-12.txt
2001   14  174      82   81.26  2001-04-21.txt    2001-10-01.txt  2001-01-02.txt
2002   23  147      93   93.86  2002-01-01.txt    2002-05-16.txt  2002-12-31.txt
2003   44  198     127  126.53  2003-05-19.txt    2003-12-08.txt  2003-05-31.txt
2004   54  573     140  132.39  2004-01-03.txt    2004-10-03.txt  2004-08-14.txt
2005  105  553     165  167.46  2005-01-15.txt    2005-01-10.txt  2005-11-04.txt
2006  101  238     154  153.61  2006-05-02.txt    2006-03-02.txt  2006-10-24.txt
2007  112  231     164  162.92  2007-02-17.txt    2007-04-19.txt  2007-10-13.txt
2008  109  272     157  158.43  2008-01-04.txt    2008-02-11.txt  2008-07-23.txt
2009   82  312     149  151.15  2009-06-22.txt    2009-10-01.txt  2009-03-14.txt
2010  137  317     193  195.81  2010-04-05.txt    2010-04-21.txt  2010-04-17.txt
2011  135  664     236  237.53  2011-04-01.txt    2011-12-12.txt  2011-01-08.txt
2012   98  389     242  243.63  2012-01-18.txt    2012-05-09.txt  2012-02-16.txt
2013  128  332     259  245.69  2013-01-02.txt    2013-04-30.txt  2013-10-17.txt
2014  117  260     178  177.75  2014-02-09.txt    2014-11-19.txt  2014-10-24.txt
2015  117  217     171  168.77  2015-02-10.txt    2015-03-16.txt  2015-09-26.txt
2016  106  464     181   186.9  2016-03-29.txt    2016-06-06.txt  2016-07-08.txt
2017  144  356     220  219.58  2017-01-11.txt    2017-04-24.txt  2017-01-20.txt
2018  118  266     182  184.96  2018-01-13.txt    2018-04-10.txt  2018-08-17.txt
2019  107  240     177  176.63  2019-01-29.txt    2019-10-23.txt  2019-08-11.txt
2020  215  437     345  338.94  2020-02-28.txt    2020-06-17.txt  2020-08-01.txt
2021   86  444     340     323  2021-03-07.txt    2021-03-24.txt  2021-12-25.txt
2022   69  275     133  130.75  2022-01-28.txt    2022-09-08.txt  2022-08-06.txt

The Method

I used MemGator to check whether the collected news stories were archived by public web archives. The list of default archives checked by MemGator is rather long; however, most of these archives never returned a response with any archived copies, so I eliminated them. I also eliminated the Library of Congress archive at their request, because the archive does not have the capacity to support the volume of requests we made in the course of this study. Below is an alphabetically ordered list of the archives that returned at least one copy of at least one URL out of the 4116 URLs we queried, followed by a sketch of how each URL was checked:

1. archive.today: Archive Today
2. arquivo.pt: The Portuguese Web Archive
3. perma.cc: Perma.cc Archive
4. swap.stanford.edu: Stanford Web Archive
5. wayback.archive-it.org: Archive-It (powered by the Internet Archive)
6. web.archive.org: the Internet Archive
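
A minimal sketch of that check, assuming a local MemGator instance running on its default port (1208) and serving TimeMaps at /timemap/link/<URI-R>; the input file name comes from the dataset's median-day files:

```python
import requests  # third-party; pip install requests

MEMGATOR = "http://localhost:1208"  # hypothetical local MemGator instance

def is_archived(uri: str) -> bool:
    """Return True if any aggregated web archive reports a memento for uri.

    MemGator responds 200 with a TimeMap when at least one memento
    exists, and 404 when no archive holds a copy.
    """
    resp = requests.get(f"{MEMGATOR}/timemap/link/{uri}", timeout=120)
    return resp.status_code == 200

with open("2013-01-02.txt", encoding="utf-8") as f:  # a median-day file
    urls = [line.strip() for line in f if line.strip()]

archived = [u for u in urls if is_archived(u)]
print(f"{len(archived)} of {len(urls)} URIs have at least one memento")
```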

We found that 2269 URIs were archived, each having at least one memento retrievable from at least one web archive. The other 1847 URIs were not archived at all by any web archive.

This is a list of archives that did not return mementos for any of the URIs we queried:

1. waext.banq.qc.ca: Libraries and National Archives of Quebec
2. warp.da.ndl.go.jp: National Diet Library, Japan
3. wayback.nli.org.il: National Library of Israel
4. wayback.vefsafn.is: Icelandic Web Archive
5. web.archive.bibalex.org: Bibliotheca Alexandrina Web Archive
6. web.archive.org.au: Australian Web Archive
7. webarchive.bac-lac.gc.ca: Library and Archives Canada
8. webarchive.loc.gov: Library of Congress
9. webarchive.nationalarchives.gov.uk: UK National Archives Web Archive
10. webarchive.nrscotland.gov.uk: National Records of Scotland
11. webarchive.org.uk: UK Web Archive
12. webarchive.parliament.uk: UK Parliament Web Archive

The sample I checked for each year consists of the stories published by the four news outlets on the day that represents the median for that year.

Results and Discussion

I found that 45% of the stories I collected between 1999 and 2022 are not archived, which is in agreement with Ainsworth et al. (2011), who found that between 35% and 90% of URIs have at least one archived copy. These stories are on the live web; unlike news stories that are behind paywalls (Atkins et al. 2018), these stories can be crawled, and they are part of the set of publicly archivable web pages. Their URIs can easily be pushed to the IA in batches using archive.org services, which is what I am currently working on and will report on in the future. They can also be pushed to other public web archives using ArchiveNow, as sketched below.
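
For example, a minimal sketch using ArchiveNow's Python interface (the story URI below is hypothetical) to push a URI to the Internet Archive:

```python
from archivenow import archivenow  # pip install archivenow

# Hypothetical unarchived story URI from the dataset
uri = "https://www.arabnews.com/node/example-story"

# "ia" is ArchiveNow's identifier for the Internet Archive; other handlers
# include "is" (archive.today) and "cc" (Perma.cc, which requires an API key)
result = archivenow.push(uri, "ia")
print(result)  # on success, a list containing the new memento's URI
```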

The percentage of archived URIs by year


I found that the Internet Archive has 99.74% (2263 URIs) of the archived stories. The union of all other web archives had only 4.76% (108 URIs) of the archived stories. Remarkably, only six web pages out of 2269 were archived by other archives but missing from the IA! That is about one news story missing from the IA every four years. In other words, using our dataset as a sample, if we lose all public web archives except the IA, we will lose only 0.26% of archived pages, all of them from before 2013. On the other hand, using the same sample, if we lose the IA while all other archives maintain their status, we will lose 95.24% of archived web pages.
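
These percentages follow directly from set arithmetic over the 2269 archived URIs; here is a minimal sanity check with the counts hard-coded from our results:

```python
total_archived = 2269   # URIs with at least one memento anywhere
in_ia = 2263            # URIs with at least one memento in the IA
in_others = 108         # URIs with at least one memento elsewhere
only_in_others = 6      # archived somewhere, but missing from the IA

assert in_ia + only_in_others == total_archived
print(f"IA coverage:          {in_ia / total_archived:.2%}")           # 99.74%
print(f"Others' coverage:     {in_others / total_archived:.2%}")       # 4.76%
print(f"Lost without others:  {only_in_others / total_archived:.2%}")  # 0.26%
print(f"Lost without the IA:  {1 - in_others / total_archived:.2%}")   # 95.24%
```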

The percentage of archived URIs in the IA and in the union of other archives

From 2013 until 2022, the IA did not miss a single page that other archives captured. That is almost a decade of the IA not only being the largest web archive, but also archiving a much larger set of web pages than exists in the union of all other web archives.

The percentage of URIs exclusively archived in the IA and the union of other archives

The results show that the union of all archives besides the IA has archived only 4.76% of the total archived URLs. In other words, if the IA were shut down tomorrow, we could retrieve, at most, 4.76% of the lost copies. These results are very different from what we were able to retrieve 10 years ago (Alsum et al. 2014). In that study, the authors show that despite the IA being the largest archive on the internet, it was possible to retrieve complete TimeMaps for 93% of the dataset from the top 9 web archives excluding the IA.

The percentage of retrievable archived URIs from all archives with and without the IA


The key differences between my experiment and the experiment conducted in 2013 are the dataset and the way the sample was selected.

In my experiment, the dataset consists of 1.5 million URLs collected from the sitemaps of four major news outlets publishing in Arabic and English and owned by Arab news agencies. We split the dataset into smaller datasets based on the day on which each news story was published. We selected the day that represents the median number of news stories published to represent the year, and we queried the web archives with the actual stories' URLs, without trimming or reducing the URLs to hostnames. For example, if the URL for a news story in the sample is https://example.com/foo/bar, we queried web archives with https://example.com/foo/bar.

On the other hand, Alsum et al. (2014) sampled data from DMOZ, archives' holdings, and archive access logs, but they only used hostnames. For example, https://example.com/foo/bar is queried as https://example.com. Their samples may have overlapping hostnames, but within each sample, hostnames are unique. So the experiment in 2013 was performed on hostnames, which typically correspond to websites' home pages, not on URLs whose content relates to a specific event.

Another difference is the way hostnames were sampled from DMOZ. Although one of the DMOZ samples was randomized, the other two were language controlled and TLD (top-level domain) controlled. Their experiment categorized DMOZ by TLD, limited the study to a specific set of TLDs distributed around the world, and randomly selected a sample for each selected TLD. Their language-controlled DMOZ sample consisted of hostnames for a limited set of languages representing the world, after categorizing DMOZ by language.

Regardless of the sources they used to build their sample, I believe that the main reason behind the significant difference between our results and theirs is the reduction of the collected URLs to hostnames in their sample. While home pages are very important to archive, because a home page represents what the website considers most important to show users when they first open it, the user has to navigate away from the home page to get information about a certain topic and find the details they are looking for (a news story in our case).

Furthermore, utilizing archive holdings to find hostnames eliminates bias towards archives that used DMOZ as a URI source, like the IA, but the method lets archives much smaller than the IA contribute the same number of hostnames to the sample, and since they already archive these hostnames, they are guaranteed to have copies of them. For example, if the sample has hostname sets a, b, c, d, and e, where a is the set of hostnames sampled from archive A, b is the set sampled from archive B, and so on, then the complete set of archived hostnames can be almost completely retrieved from archives B, C, D, and E alone, because if the experiment included 10 archives, the union of any 9 of them would contain at least 90% of the archived hostnames, as the toy calculation below illustrates.
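
A minimal sketch of that worst-case bound, under the assumption that each of 10 hypothetical archives contributes the same number of hostnames from its own holdings:

```python
# Toy model: a sample pooled from the holdings of 10 hypothetical archives
num_archives = 10
k = 100                          # hostnames sampled from each archive
sample_size = num_archives * k   # 1000 hostnames in the sample

# Each archive is guaranteed to hold its own k contributions, so even in
# the worst case (no archive holds anything sampled from its peers), the
# union of any 9 archives still covers the other 9 contributions.
worst_case_coverage = (num_archives - 1) * k / sample_size
print(f"{worst_case_coverage:.0%}")  # 90%
```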

What Alsum et al. (2014) were able to prove is that complete TimeMaps can be retrieved from the top 9 archives excluding the IA. Our study, on the other hand, shows that excluding the IA would render 95.24% of archived pages irretrievable once they are no longer on the live web or their content has drifted since they were last archived.

I decided to provide another graph plotting my results in the same way the authors plotted theirs, to make the two results visually comparable. Of course, my plot only includes archives in which I was able to find at least one memento for at least one of the 4116 URIs that I queried.
Percentage of Retrievable Archived URIs from All Archives with and without The IA

The percentage of the retrieved timemaps for URIs in the dataset (Alsum et al. 2014)

The following table shows the results from checking the median-day news stories against all archives in MemGator's list that returned at least one copy of at least one URL. The table headers are shortcuts, as follows (a sketch showing how the set-based columns are computed appears after the list):

Collected: Number of URLs that were published during the median day for the year and collected at the time of dataset creation
Archived: Number of URLs that were published during the median day for the year and archived by any web archive
Not Archived: Number of URLs that were published during the median day for the year and not archived by any web archive
Archive.org: Number of URLs that were published during the median day for the year and found on archive.org
Archive-it.org: Number of URLs that were published during the median day for the year and found on archive-it.org
U IA: The union of URIs archived in archive.org and archive-it.org
& IA: The intersection of URIs archived in archive.org and archive-it.org
Archive.today: Number of URLs that were published during the median day for the year and found on archive.today
Stanford: Number of URLs that were published during the median day for the year and found on swap.stanford.edu
Arquivo.pt: Number of URLs that were published during the median day for the year and found on arquivo.pt
Perma.cc: Number of URLs that were published during the median day for the year and found on perma.cc
U others: The union of all web archives excluding archive.org and archive-it.org
U IA - U others: The number of URLs archived only by the Internet Archive
U others - U IA: The number of URLs archived by any archive other than the Internet Archive and missing from the Internet Archive
U IA & U others: The intersection between the union of all other archives and the Internet Archive
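
A minimal sketch of those set operations, assuming each archive's results have been gathered into a Python set of URIs (the empty sets below are placeholders for the collected data):

```python
# Hypothetical per-archive result sets: archive host -> set of URIs for
# which that archive returned mementos (populate from MemGator responses)
by_archive: dict[str, set[str]] = {
    "web.archive.org": set(), "wayback.archive-it.org": set(),
    "archive.today": set(), "arquivo.pt": set(),
    "perma.cc": set(), "swap.stanford.edu": set(),
}

ia = ("web.archive.org", "wayback.archive-it.org")
u_ia = by_archive[ia[0]] | by_archive[ia[1]]    # "U IA"
and_ia = by_archive[ia[0]] & by_archive[ia[1]]  # "& IA"
u_others = set().union(                         # "U others"
    *(uris for host, uris in by_archive.items() if host not in ia))

print(len(u_ia - u_others))   # "U IA - U others": archived only by the IA
print(len(u_others - u_ia))   # "U others - U IA": missing from the IA
print(len(u_ia & u_others))   # "U IA & U others": archived by both
```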


Also, one main difference between the two experiments is time. We collected data from 1999 to 2022, almost double the number of years studied in 2013. When we trimmed our data to include only URIs from 1999 to 2013 in order to compare the two studies, the percentage of archived pages that can be retrieved excluding the IA went from 4.76% to 6.74%, reducing the loss of archived pages from 95.24% to 93.26%. Therefore, the change in our results is minimal if we reduce our dataset to match the time frame of their dataset.

The following table is the trimmed version of the table above.

Plotting the graph for the trimmed dataset shows very little change from the previous graph. Placing our graph next to the authors' graph from the 2013 study shows the difference between our results. We queried all public web archives in 2022, while they queried their sample 10 years before that. As long as the URIs are on the live web, it is possible to archive pages from 2013 (and earlier) after 2013. Trimming our dataset in this manner, without considering each URI's crawl/archive dates and without eliminating mementos captured after 2013, poses a legitimate argument against our comparison. However, we are studying the current status of public web archives, and the main goal of our study is to expose the consequences of losing the IA. Their experiment showed that it was possible to retrieve all TimeMaps for home pages using the top 9 archives in 2013; our study shows the complete opposite, even when we trim our sample and examine only URIs that existed from 1999 to 2013.
Percentage of Retrievable Archived URIs from All Archives with and without The IA (1999-2013)

The percentage of the retrieved timemaps for URIs in the dataset (Alsum et al. 2014)


Comparing our two graphs, from the complete dataset and the trimmed dataset, shows that the percentage of our dataset for which we were able to retrieve full TimeMaps increased from around 40% for URIs published between 1999 and 2013 (trimmed dataset) to around 55% for URIs published between 1999 and 2022 (complete dataset). This increase indicates that although the number of published news stories grows over time, the performance of web archives, mainly the IA, has improved significantly in the last decade: they archived many more URIs each year than they did prior to 2013.
 
Our finding is not only important for URIs that are already archived. In fact, it is much more important for archiving, as soon as possible, the URIs that have not been archived. According to Nwala et al. (2018), news stories' URIs are difficult to "refind" on Google using the same query. That study shows that the probability of finding the same news story in the default Google search results using the same query drops to between 1% and 11% one week after the story is published, suggesting that Google search results replace old stories with new ones fairly quickly. It is, therefore, important to push the URIs that we collected but were unable to find in web archives, because they will probably never be found using Google search after this many years. We have a set of URIs for news stories on the live web; 45% of them are not archived, and they are not discoverable using search engines, so their chance of being archived is almost nonexistent. They will be lost forever, due to link rot, if they are not manually pushed to a web archive.

Conclusions

In this study, we established multiple facts using a sample from our dataset: the IA has, by far, the highest number of archived news stories among all archives; the IA has greatly increased its capacity and archiving capabilities since 2013, while other web archives have at best remained the same, and in some cases entire web archives have disappeared; and the IA has archived more news stories than all other archives combined. Losing the IA would mean losing 95.24% of the archived pages in our sample. Losing all other public web archives would cause the loss of only 0.26% of the archived web pages in our sample.
