2026-02-13: Paper Summary: "High Fidelity Web Archiving of News Sites and New Media with Browsertrix"
Figure 1: The Browsertrix Tool Suite
For the Saving Ads project and the Game Walkthroughs and Web Archiving project, we have used Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page, which are Webrecorder tools that were discussed in Walsh et al.’s paper “High Fidelity Web Archiving of News Sites and New Media with Browsertrix.” In this paper, Walsh et al. describe the tools that are integrated with Browsertrix and the features that differentiate these tools from other web archive crawlers and replay systems. Browsertrix is a free and open-source web archiving platform that can be run locally, self-hosted, or used through Webrecorder's hosted service. Browsertrix uses Browsertrix Crawler to archive web pages, ArchiveWeb.page to patch archived web pages, and ReplayWeb.page to replay archived web pages (Figure 1).
Walsh, Tessa, Henry Wilkinson, and Ilya Kreymer. “High Fidelity Web Archiving of News Sites and New Media with Browsertrix.” In Proceedings of the 2024 International Federation of Library Associations and Institutions (IFLA) International News Media Conference, 2024. https://repository.ifla.org/handle/20.500.14598/3399.
Browsertrix
Walsh et al. described some of the features that differentiate Browsertrix from other crawlers, such as browser profiles for archiving content behind logins and personalized social media feeds, ad blocking, and automated interactions (behaviors) that can be executed during the crawling session.
Several Browsertrix features were described in this paper:
- Features associated with collaboration, sharing, and improving transparency:
- Multiple users can work together to create a web archive collection.
- Browsertrix collections can be shared with others through an unlisted link or by making the collection public.
- A Browsertrix collection can be embedded into a web page by downloading its WACZ file and then using ReplayWeb.page to load the WACZ.
- When archiving web pages, the user can watch the crawler archive pages (Figure 2) and modify the crawl workflow’s exclusion settings in real time.
- This feature makes it easier to avoid crawler traps.
- Browsertrix uses Browsertrix Crawler’s feature for displaying the live web page during the crawling session. My web archiving livestream tool uses this same Browsertrix Crawler feature, and it helps improve the transparency of the web archiving process by letting viewers of our livestream see the live web page during the crawling session.
- Browsertrix can also digitally sign archived web pages to create a chain of custody. There is also an archival receipt mode that allows anyone to view the signing information.
- Exclusion settings: Regular expressions can be used to exclude URLs from the crawl (see the sketch after this list).
- Patching a crawl: When an archived web page has missing resources, ArchiveWeb.page can be used to patch the crawl by manually archiving the page and then uploading the resulting WACZ file from ArchiveWeb.page to the Browsertrix collection.
- Browser profiles and settings: Browsertrix allows users to customize cookies, log in to accounts, and save browser settings like ad blocking, cookie blocking, and other privacy settings for the Brave browser.
- API: Browsertrix has an API that can notify applications when a crawl has started or stopped, and it can be used to start a new crawl without manually starting one from Browsertrix’s UI (a minimal notification-receiver sketch follows Figure 2).
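To make the exclusion settings concrete, here is a minimal Python sketch of how regular-expression exclusions can filter discovered URLs. The patterns and URLs are made up for illustration; this is not Browsertrix Crawler’s actual implementation.

```python
import re

# Hypothetical exclusion patterns, similar in spirit to what a user
# might enter in a Browsertrix crawl workflow (illustrative only).
EXCLUSIONS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),  # skip a date-based crawler trap
    re.compile(r"[?&]session_id="),        # skip per-session URL variants
]

def should_crawl(url: str) -> bool:
    """Return False if the URL matches any exclusion pattern."""
    return not any(pattern.search(url) for pattern in EXCLUSIONS)

discovered = [
    "https://example.com/article/1",
    "https://example.com/calendar/2026/02",
    "https://example.com/article/1?session_id=abc123",
]

for url in discovered:
    print(url, "->", "crawl" if should_crawl(url) else "skip")
```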
Figure 2: Browsertrix’s Watch Crawl feature. This figure is a screenshot from a video that is embedded in Webrecorder’s blog post “Preserving Government Websites with Browsertrix.”
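To illustrate the notification side of the API item above, the sketch below runs a small HTTP server that accepts JSON event notifications. The payload field names (event, crawlId) are assumptions made for illustration, not Browsertrix’s documented schema; consult the Browsertrix API documentation for the actual event names and payload shape.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlEventHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON notifications about crawl events."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # "event" and "crawlId" are hypothetical field names used only
        # to illustrate crawl start/stop notifications.
        print(f"received event={payload.get('event')} "
              f"crawl={payload.get('crawlId')}")
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    # Listen on localhost:8000 for notifications.
    HTTPServer(("127.0.0.1", 8000), CrawlEventHandler).serve_forever()
```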
Embedding ReplayWeb.page
Walsh et al. also highlighted the benefits of using ReplayWeb.page to embed a collection (stored in a WACZ file) or the replay of an archived web resource (like a social media post) into a static HTML web page. This approach differs from MementoEmbed, which embeds a social card into a web page, providing information about the archived resource, like the memento-datetime and links to the live and archived versions of the resource. The advantages of ReplayWeb.page’s embed feature are that it reduces the cost and complexity of hosting the archived website, since a database server is not required, and it removes the need to update dependencies like JavaScript libraries.
For the Game Walkthroughs and Web Archiving project, we used this feature during Replay mode to embed the replay of an archived web page into an HTML page that was dynamically generated by my web archiving livestream tool (Figure 3). I also used this feature during the Saving Ads project when I created a web page to display all the archived web ads from our dataset (Figure 4).
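As an example of generating this kind of embed page, the Python sketch below writes a static HTML file that loads ReplayWeb.page’s embed script and points it at a WACZ file. The <replay-web-page> element with source and url attributes follows Webrecorder’s embedding documentation, but the file name and starting URL here are placeholders, and a real deployment also needs ReplayWeb.page’s service worker served from the same origin (by default under ./replay/).

```python
from pathlib import Path

# Placeholder values: the WACZ file name and starting URL are examples.
WACZ_FILE = "archived-page.wacz"
START_URL = "https://example.com/"

# The <replay-web-page> element with "source" and "url" attributes
# follows Webrecorder's embedding documentation. ReplayWeb.page also
# expects its service worker (sw.js) to be served from the same origin,
# by default under ./replay/ -- see the embedding docs for details.
HTML_TEMPLATE = """<!doctype html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
</head>
<body>
  <replay-web-page source="{source}" url="{url}"></replay-web-page>
</body>
</html>
"""

Path("replay-embed.html").write_text(
    HTML_TEMPLATE.format(source=WACZ_FILE, url=START_URL)
)
print("wrote replay-embed.html")
```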
Quality Assurance
Browsertrix's automated quality assurance metrics were created to help identify problems with a crawl without requiring manual inspection of the replay of each archived web page.
Browsertrix computes three metrics (a sketch of the first two follows this list):
- Extracted text comparison: The Levenshtein distance is applied to the text extracted during the crawl and during replay (Figure 5).
Figure 5: Extracted text comparison. This figure is a screenshot from Webrecorder’s video “High Fidelity Web Archiving with Browsertrix (SAA).”
- Screenshot comparison: The Pixelmatch tool is used to compare screenshots of the live and archived web pages by measuring the difference between color samples and the intensity difference between pixels (Figure 6).
Figure 6: Screenshot comparison. This figure is a screenshot from Webrecorder’s video “High Fidelity Web Archiving with Browsertrix (SAA).”
- Resource count comparison: A table displays the resource types along with the counts of successful (2xx or 3xx) and unsuccessful (4xx or 5xx) HTTP status codes that occurred during crawl and replay.
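To make the first two metrics concrete, here is a minimal Python sketch that computes a normalized Levenshtein similarity for extracted text and a simplified pixel-difference ratio for screenshots using Pillow. This is an illustration of the underlying ideas, not Browsertrix’s implementation: Browsertrix uses Pixelmatch for screenshots, while the sketch below does a plain per-pixel comparison and assumes both screenshots are the same size.

```python
from PIL import Image, ImageChops  # pip install Pillow

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def text_similarity(crawl_text: str, replay_text: str) -> float:
    """1.0 means identical extracted text; 0.0 means entirely different."""
    if not crawl_text and not replay_text:
        return 1.0
    distance = levenshtein(crawl_text, replay_text)
    return 1.0 - distance / max(len(crawl_text), len(replay_text))

def screenshot_match(crawl_png: str, replay_png: str) -> float:
    """Fraction of unchanged pixels (simplified stand-in for Pixelmatch)."""
    im1 = Image.open(crawl_png).convert("RGB")
    im2 = Image.open(replay_png).convert("RGB")
    diff = ImageChops.difference(im1, im2).convert("L")
    pixels = list(diff.getdata())
    changed = sum(1 for value in pixels if value > 0)
    return 1.0 - changed / len(pixels)

print(text_similarity("Breaking news story", "Breaking news"))
```

Normalizing by the length of the longer string keeps the text score in [0, 1] regardless of how much text a page contains.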
Browsertrix also has features that let the user leave comments on each page, approve or reject pages, and rate the quality of a crawl with a review score (Figure 7).
Figure 7: Browsertrix’s feature for reviewing a crawl. This figure is from Webrecorder’s blog post “Browsertrix 1.10: Now with Assistive QA!”
Kiesel et al. (“Reproducible web corpora: Interactive archiving with automatic quality assessment”) and Reyes Ayala (“Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons”) have also compared screenshots of the live and archived web pages when determining the quality of an archived web page. Kiesel et al.’s approach, which I summarized in a previous blog post, used machine learning with VGGNet, a deep convolutional neural network designed for image classification tasks. Reyes Ayala’s approach, which I also summarized in a previous blog post, used popular image similarity metrics like the Structural Similarity Index (SSIM) and the Normalized Root Mean Square Error (NRMSE) to compare screenshots of the live web page and archived web page.
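For reference, the image similarity metrics used in Reyes Ayala’s approach are available in scikit-image. Below is a minimal sketch, assuming two same-sized screenshots saved as PNG files (the file names are placeholders).

```python
# pip install scikit-image imageio
import imageio.v3 as iio
from skimage.color import rgb2gray
from skimage.metrics import normalized_root_mse, structural_similarity

# Placeholder file names; [..., :3] drops an alpha channel if present.
live = rgb2gray(iio.imread("live.png")[..., :3])
archived = rgb2gray(iio.imread("archived.png")[..., :3])

# SSIM: 1.0 means structurally identical (data_range=1.0 for float images).
ssim = structural_similarity(live, archived, data_range=1.0)
# NRMSE: 0.0 means identical images.
nrmse = normalized_root_mse(live, archived)
print(f"SSIM={ssim:.3f}  NRMSE={nrmse:.3f}")
```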
Summary
In their paper, Walsh et al. discussed the features of multiple Webrecorder tools (Browsertrix, Browsertrix Crawler, ArchiveWeb.page, and ReplayWeb.page). Browsertrix is their web archiving platform that uses Browsertrix Crawler to archive web pages, ArchiveWeb.page to patch web pages, and ReplayWeb.page to replay archived web resources. Browsertrix Crawler is a browser-based web archive crawler that can use browser profiles to archive web pages while logged into an account. Browser profiles also allow for the customization of cookies and browser settings like ad blocking, cookie blocking, and other privacy settings. ArchiveWeb.page is a browser extension that can be used to manually archive web pages and can also patch crawls in a Browsertrix collection. ReplayWeb.page is a replay system that can embed the replay of an archived website into a static HTML page, which reduces the complexity of hosting an archived website.
Walsh et al. also described how Browsertrix can assist with quality assurance for web archive collections. They added features that allow a user to leave comments on each archived web page, approve or reject pages, and rate the quality of the crawl with a review score. Browsertrix can also compute three quality assurance metrics, which involve extracted text, screenshots, and HTTP status codes for resources. For extracted text, the Levenshtein distance is applied to the text extracted during crawl and replay. For the screenshot comparison of the live and archived web pages, Pixelmatch is used to measure the difference between color samples and the intensity difference between pixels. For the resource count comparison, Browsertrix displays a table with the resource types and the counts of successful (2xx or 3xx) and unsuccessful (4xx or 5xx) HTTP status codes that occurred during crawl and replay.
For the Saving Ads project and the Game Walkthroughs and Web Archiving project, we have used most of the Webrecorder tools discussed in this paper. In the Saving Ads project, we used ArchiveWeb.page and Browsertrix Crawler to archive web pages that dynamically loaded ads. After we created a dataset of archived web ads, we then used ReplayWeb.page to embed the replay of the ads from our dataset into a static HTML page. For the Game Walkthroughs and Web Archiving project, Browsertrix Crawler was used to archive web pages during some web archiving livestreams. ReplayWeb.page was then used during Replay mode to embed the replay of the archived web pages during these livestreams.
--Travis Reid (@TReid803)