2025-03-27: Establishing a Baseline by Administration for the Takedown of US Government Webpages using Web Archives

Figure 1: A samhsa.gov page identified by the New York Times as removed under the second Trump administration

On February 2, 2025, Ethan Singer (@ethanpsinger) published an article in the New York Times titled "Thousands of U.S. Government Web Pages Have Been Taken Down Since Friday." Singer showed that over 8,000 webpages from at least a dozen federal agencies had been removed. The article exposed individual pages as well as entire sections of websites that had been taken down. In the full version of the article available to subscribers, Singer identified four limitations:

1. This study only identified removed pages, not pages that stayed but changed.
2. This study only covered the current administration on the specified day because it relied on live sitemaps. The live web can't be used to extend the study backwards to previous administrations, and continuing the analysis for this administration as pages are restored or further removed would require constant monitoring of live sitemaps.
3. The sitemap methodology can't directly distinguish between redirected and deleted pages.
4. This study was unable to establish a benchmark for prior administrations to conclusively determine that more pages are being deleted than normal.

Web archives have infrastructure, tools, and APIs that can address some of these limitations. We will use the Substance Abuse and Mental Health Services Administration domain (samhsa.gov), which has been online since at least 1996 according to the Wayback Machine, to explore how web archives can address the limitations in the article. Using the web archive methodology described below, we found that both Trump administrations deleted 6-10 times as many items from samhsa.gov in their first month as the Obama or Biden administrations.

Can archived sitemaps be used to extend this study to prior administrations? (Limitations 2 and 4)

Websites can choose to offer sitemaps in a variety of formats, including XML and HTML. Most XML sitemaps are a file named sitemap.xml in the root folder, but the location and name can be verified by examining the robots.txt file of the domain. samhsa.gov's current robots.txt file confirms that its XML sitemap is called sitemap.xml and is located in the root. In order to repeat the study for previous administrations, the sitemap must have been archived before the relevant inauguration and then again about a month later, in mid-February. However, Figure 2 shows that the sitemap was not captured in either of these timeframes for 2021, and was never captured before 2019. This means that the archived XML sitemap cannot be used to replicate the study for previous administrations.
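As a minimal sketch of the robots.txt check described above, the following Python function pulls the declared sitemap locations out of a robots.txt body. The function name and the sample robots.txt text are assumptions made for illustration; the sample is not a copy of samhsa.gov's actual file.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the field name from its value.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Sitemap: https://www.samhsa.gov/sitemap.xml
"""

print(sitemap_urls_from_robots(EXAMPLE_ROBOTS))
```

The same function could be applied to an archived copy of robots.txt to see where the sitemap lived at an earlier point in time.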

Figure 2: samhsa.gov/sitemap.xml captures at the Wayback Machine show that it was not captured in January or February of 2021, preventing us from conducting an analysis in the same way as Singer's New York Times article.

SAMHSA also has an HTML sitemap, which is currently located at /sitemap but was located at /sitemap.aspx prior to 2014. These HTML "sitemaps" are meant for humans and not machines, which means they're not standardized in any way as far as location or format. We attempted to identify additional domains with HTML sitemaps to conduct the analysis, but found that some domains have a sitemap for every folder or subdomain instead of one comprehensive list (nist.gov), only list important pages (va.gov), or are not frequently updated (nrc.gov).

We ran into problems regarding limitation #3 from above when attempting to use the SAMHSA sitemap to identify deleted pages, because we identified 787 pages that were present in January 2025's sitemap but gone in February 2025's sitemap. Many of these 787 pages appear to be redirects rather than deletions, such as Craig Obey's biography moving from /about-us/who-we-are/leadership/biographies/craig-obey to /about/leadership/craig-obey. The sitemap method can't natively distinguish between redirects and deletions, which is important for this work because a redirected page implies a different intent than a deleted page. In the New York Times article, they manually followed the redirected pages on the live web to determine if they had been deleted or moved. Other researchers trying to use sitemaps, such as Ed Summers, also needed to request status codes individually by page.
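The comparison of two sitemap snapshots can be sketched as a set difference over their URL entries. This is an illustrative Python sketch, not the code used for the analysis: the function names are hypothetical, the two XML snippets are abridged stand-ins for the January and February 2025 sitemaps, and the www.samhsa.gov host prefix is assumed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org XML sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text: str) -> set[str]:
    """Collect every <loc> URL listed in an XML sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text}


def removed_urls(before_xml: str, after_xml: str) -> set[str]:
    """URLs present in the earlier sitemap but absent from the later one."""
    return sitemap_locs(before_xml) - sitemap_locs(after_xml)


# Abridged, hypothetical sitemap snapshots built around the example in the text.
JANUARY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.samhsa.gov/about-us/who-we-are/leadership/biographies/craig-obey</loc></url>
  <url><loc>https://www.samhsa.gov/find-help</loc></url>
</urlset>"""

FEBRUARY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.samhsa.gov/about/leadership/craig-obey</loc></url>
  <url><loc>https://www.samhsa.gov/find-help</loc></url>
</urlset>"""

print(removed_urls(JANUARY, FEBRUARY))
```

The old biography URL is reported as removed even though the page only moved, which illustrates why a sitemap diff alone cannot separate redirects from deletions.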

Using the CDX API to extend this study to prior administrations (Limitations 2, 3, and 4)

The Wayback Machine maintains an index of all of its captures, and provides access to this index through the CDX API. This index includes many fields, including the datetimes and HTTP status codes of the captures. We can use the CDX API to request a list of all captures under a specified domain. Pages with successful requests have a 200 HTTP status code, while pages that have been redirected have a 3xx HTTP status code, and pages that have client or server errors have 4xx or 5xx status codes. When requesting a large amount of the index with the CDX API, typically the results are paginated. We queried the index containing each capture under all of samhsa.gov for all available years, with a polite pause of 8-11 seconds between each paginated request.
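A paginated download of the index might look like the following Python sketch. It is not the script used for this post. The query parameters (url, matchType, output, fl, page, showNumPages) are those documented for the Wayback Machine CDX server; the function names, the choice of fields, and the assumption that a header row may appear on each JSON page are ours for illustration.

```python
import json
import random
import time
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
FIELDS = ("urlkey", "timestamp", "original", "statuscode", "length")


def cdx_page_url(domain: str, page: int, fields=FIELDS) -> str:
    """Build the URL for one page of captures under a domain and its subdomains."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "fl": ",".join(fields),
        "page": page,
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def cdx_num_pages(domain: str) -> int:
    """Ask the CDX server how many pages the query spans."""
    params = {"url": domain, "matchType": "domain", "showNumPages": "true"}
    with urllib.request.urlopen(CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)) as resp:
        return int(resp.read().decode().strip())


def fetch_all_captures(domain: str, fields=FIELDS):
    """Yield one list of field values per capture, pausing between pages."""
    header = list(fields)
    for page in range(cdx_num_pages(domain)):
        with urllib.request.urlopen(cdx_page_url(domain, page, fields)) as resp:
            body = resp.read().decode()
        rows = json.loads(body) if body.strip() else []
        for row in rows:
            if row != header:  # skip the field-name row of the JSON output
                yield row
        time.sleep(random.uniform(8, 11))  # polite pause between paginated requests


# Building a URL makes no network request.
print(cdx_page_url("samhsa.gov", 0))
```

Calling fetch_all_captures("samhsa.gov") would walk every page, so for a large domain it is worth writing rows to disk as they arrive rather than holding them in memory.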

After downloading the index, we processed it looking for items that had a capture with a 200 HTTP status code before each administration’s inauguration date (relying on a capture from the year before when necessary, due to varying crawl rates for high-level and deep pages) and whose most recent capture had a non-200 HTTP status code. To filter out redirects, we kept only items whose capture within the window had a status code that was neither 200 nor 3xx. Using this methodology, we can show that both Trump administrations deleted 6-10 times as many items on samhsa.gov in the first month as Biden and Obama, as shown in Table 1. The items include webpages as well as resources such as PDFs and videos.
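The filtering step can be sketched in Python as follows. This is one interpretation of the description above, not the actual processing code: the function name, the 365-day lookback, the 30-day window, the decision to skip captures without a numeric status, and the sample captures are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def deleted_items(captures, inauguration, lookback_days=365, window_days=30):
    """Return (live_before, deleted) sets of URL keys.

    captures: iterable of (urlkey, timestamp 'YYYYMMDDhhmmss', statuscode) strings.
    live_before: keys with a 200 capture in the lookback period before inauguration.
    deleted: those keys whose latest capture in the window is neither 200 nor 3xx.
    """
    start = inauguration - timedelta(days=lookback_days)
    end = inauguration + timedelta(days=window_days)
    live_before = set()
    latest_in_window = {}
    for urlkey, ts, status in captures:
        if not status.isdigit():
            continue  # sketch choice: ignore captures with no numeric status
        when = datetime.strptime(ts, "%Y%m%d%H%M%S")
        if start <= when < inauguration:
            if status == "200":
                live_before.add(urlkey)
        elif inauguration <= when <= end:
            previous = latest_in_window.get(urlkey)
            if previous is None or when > previous[0]:
                latest_in_window[urlkey] = (when, status)
    deleted = set()
    for urlkey in live_before:
        if urlkey in latest_in_window:
            status = latest_in_window[urlkey][1]
            if status != "200" and not status.startswith("3"):
                deleted.add(urlkey)
    return live_before, deleted


# Hypothetical captures: /a is deleted, /b is redirected, /c is still live.
SAMPLE = [
    ("gov,samhsa)/a", "20241201000000", "200"),
    ("gov,samhsa)/a", "20250210000000", "404"),
    ("gov,samhsa)/b", "20241115000000", "200"),
    ("gov,samhsa)/b", "20250205000000", "301"),
    ("gov,samhsa)/c", "20250101000000", "200"),
    ("gov,samhsa)/c", "20250201000000", "200"),
]

live, gone = deleted_items(SAMPLE, datetime(2025, 1, 20))
print(len(live), sorted(gone))
```

Keying on the CDX urlkey rather than the original URL groups scheme and host variants of the same resource, which keeps each item from being counted more than once.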

Administration (Inauguration)	Unique items captured with 200 status code	Pages that changed to a non-200 status code (redirects filtered)	Percent
Obama (2009)	10,781	0	0.000%
Trump 1 (2017)	23,145	25	0.108%
Biden (2021)	173,088	26	0.015%
Trump 2 (2025)	559,091	365	0.065%

Table 1: The rate of deletions on samhsa.gov across four administrations shows that the Trump administrations deleted 6-10 times as many items in the first month as Obama or Biden.
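The Percent column of Table 1 is the number of items that changed to a non-200 status divided by the number of unique items captured with a 200 status. A short Python check of that arithmetic, using only the counts from the table (the variable and function names are ours):

```python
# (unique items captured with 200, items that changed to non-200) from Table 1.
TABLE_1 = {
    "Obama (2009)": (10_781, 0),
    "Trump 1 (2017)": (23_145, 25),
    "Biden (2021)": (173_088, 26),
    "Trump 2 (2025)": (559_091, 365),
}


def percent_changed(live: int, changed: int) -> float:
    """Share of previously live items that changed to a non-200 status, in percent."""
    return 100 * changed / live


for name, (live, changed) in TABLE_1.items():
    print(f"{name}: {percent_changed(live, changed):.3f}%")
```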

Outlook

By using the CDX API, we have addressed limitations 2, 3, and 4 of the New York Times study. The CDX API also provides file size information that could be used to identify candidates to analyze for change (limitation 1).
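One way the file size information might be used is to flag URLs whose consecutive successful captures differ noticeably in size. The Python sketch below is only an illustration of that idea: the function name, the 10% threshold, and the sample rows are hypothetical, and because the CDX length field is understood to describe the stored record rather than the rendered page, small differences may be noise.

```python
def size_change_candidates(captures, min_relative_change=0.10):
    """Flag consecutive 200 captures of a URL whose sizes differ by a given fraction.

    captures: iterable of (urlkey, timestamp, statuscode, length) strings.
    Returns a list of (urlkey, earlier_ts, later_ts, earlier_len, later_len).
    """
    by_url = {}
    for urlkey, ts, status, length in captures:
        if status == "200" and length.isdigit():
            by_url.setdefault(urlkey, []).append((ts, int(length)))
    candidates = []
    for urlkey, caps in by_url.items():
        caps.sort()  # 14-digit timestamps sort chronologically as strings
        for (ts_a, len_a), (ts_b, len_b) in zip(caps, caps[1:]):
            if len_a and abs(len_b - len_a) / len_a >= min_relative_change:
                candidates.append((urlkey, ts_a, ts_b, len_a, len_b))
    return candidates


# Hypothetical index rows: /a barely changes in size, /b shrinks by 40%.
SAMPLE_ROWS = [
    ("gov,samhsa)/a", "20250110000000", "200", "5000"),
    ("gov,samhsa)/a", "20250210000000", "200", "5050"),
    ("gov,samhsa)/b", "20250110000000", "200", "5000"),
    ("gov,samhsa)/b", "20250210000000", "200", "3000"),
]

print(size_change_candidates(SAMPLE_ROWS))
```

Flagged pairs are only candidates; confirming that the content changed would still require comparing the archived captures themselves.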

We would also like to collect and analyze pages that have "come back differently" as mentioned in the article "A Look at Federal Health Data Taken Offline." One example of a page that has come back differently is the FAQs About National Healthcare Safety Network (NHSN) Security page. Figure 3a shows the page in January 2025 with the term "gender." Figure 3b shows when the page went offline, as it was redirected to a page explaining that the entire NHSN website was down to bring it into compliance with the executive order about only two genders. Figure 3c shows the banner on the website when it was restored in mid-February. Figure 3d shows the same capture scrolled down, with the term "gender" from Figure 3a replaced with the term "sex."

Figure 3a: The January 31, 2025 capture of the NHSN FAQ page using the term "gender"

Figure 3b: The February 1, 2025 redirect to a maintenance page due to Executive Order 14168

Figure 3c: The February 14, 2025 capture of the maintenance banner

Figure 3d: The same capture as Figure 3c scrolled down, showing the term "gender" from Figure 3a has been replaced with the term "sex."

We plan to collect and analyze additional pages that have "come back differently."

-Lesley Frew, Michael L. Nelson, and Michele C. Weigle

Web Science and Digital Libraries Research Group