2026-01-22: Paper Summary: "Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons"
Figure 1: Example of an image pair from Reyes Ayala’s dataset of a live web page (left screenshot) and an archived web page (right screenshot).
For the Game Walkthroughs and Web Archiving project, we have created web archiving livestreams (example livestream) that use tools to measure the performance of a web archive crawler in real time. In a previous blog post, I compared six different approaches for measuring web archiving and replay performance metrics so that I could identify tools to use during our future web archiving livestreams. In this post, I will summarize a related work (“Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons”) by Brenda Reyes Ayala, which determines the quality of an archived web page by comparing screenshots of the live web page and the archived web page.
Reyes Ayala, B. (2025). Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons. In M. Cornia et al. (Eds.), Proceedings of the 21st Conference on Information and Research Science Connecting to Digital and Library Science (IRCDL 2025), CEUR Workshop Proceedings, Vol. 3937. Udine, Italy. https://ceur-ws.org/Vol-3937/
Measuring Visual Correspondence With the Web Archiving Screenshot Compare Tool
Reyes Ayala created the Web Archiving Screenshot Compare tool, which helps automate quality assurance by comparing screenshots of the live web page and the archived web page and computing their visual correspondence. The process for generating screenshots involves several steps. First, the tool reads a settings file that contains the seed list. For each seed, it checks if the web page exists and, if so, takes a screenshot. Next, the tool creates a CSV file listing the Archive-It URI-Ms associated with the archived versions of each seed. Then, the tool takes a screenshot of each archived web page. Finally, the URI-Ms and their screenshot file names are written to a CSV file.
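The sketch below illustrates these steps in Python. It is not the Web Archiving Screenshot Compare tool's actual code: the use of Playwright, the helper names, and the file naming scheme are all assumptions made for illustration.

```python
# Minimal sketch of the screenshot-generation steps described above.
# NOT the Web Archiving Screenshot Compare tool's actual code; the
# seed handling, URI-M list, and file names are assumed for illustration.
import csv
import requests
from playwright.sync_api import sync_playwright

def page_exists(uri):
    """Check that a seed URI still resolves before screenshotting it."""
    try:
        return requests.head(uri, allow_redirects=True, timeout=30).status_code < 400
    except requests.RequestException:
        return False

def capture(seeds, urims, out_csv="screenshots.csv"):
    rows = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Screenshot each live seed that still exists.
        for uri in seeds:
            if page_exists(uri):
                page.goto(uri)
                page.screenshot(path=f"live_{hash(uri)}.png", full_page=True)
        # Screenshot each archived version (URI-M) of the seeds.
        for urim in urims:
            fname = f"archive_{hash(urim)}.png"
            page.goto(urim)
            page.screenshot(path=fname, full_page=True)
            rows.append([urim, fname])
        browser.close()
    # Record each URI-M and its screenshot file name, as the tool does.
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows([["urim", "screenshot"]] + rows)
```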
After the screenshots are taken, the Web Archiving Screenshot Compare tool can use an image similarity metric to compute a score. Before computing a score, the tool verifies that the live web page screenshot is not blank and, if the live and archived screenshots differ in size, crops one of the screenshots so the pair can be compared at the same size. After the score is computed, it is written to a CSV file.
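A minimal sketch of these pre-scoring checks, assuming Pillow and NumPy; the blank-detection heuristic and crop strategy below are my assumptions, not necessarily the tool's exact logic.

```python
# Sketch of the pre-scoring checks described above (assumed logic,
# not the tool's actual code), using Pillow and NumPy.
import numpy as np
from PIL import Image

def is_blank(path, tolerance=1.0):
    """Treat a screenshot as blank if its pixel values barely vary."""
    pixels = np.asarray(Image.open(path).convert("L"), dtype=float)
    return pixels.std() < tolerance

def crop_to_match(live_path, archive_path):
    """Crop both screenshots to their shared region so sizes match."""
    live = Image.open(live_path)
    archive = Image.open(archive_path)
    if live.size != archive.size:
        width = min(live.width, archive.width)
        height = min(live.height, archive.height)
        live = live.crop((0, 0, width, height))
        archive = archive.crop((0, 0, width, height))
    return live, archive
```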
Determining the Effectiveness of the Image Similarity Metrics
The image similarity metrics supported by the Web Archiving Screenshot Compare tool are Structural Similarity Index (SSIM), Mean Squared Error (MSE), Normalized Root Mean Square Error (NRMSE), Perceptual Hash (P-Hash), Peak Signal to Noise Ratio (PSNR), and a percentage similarity metric that Reyes Ayala created. She discarded three of these metrics (MSE, P-Hash, and PSNR) from her evaluation because they do not have an upper bound. She also discarded the percentage similarity metric because it had a strong negative correlation with NRMSE.
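The two retained metrics can be computed with scikit-image. The function names below are real skimage.metrics APIs, but treating them as the tool's exact calls (and the grayscale conversion) is my assumption.

```python
# Computing SSIM and NRMSE for a screenshot pair with scikit-image.
import numpy as np
from skimage.metrics import structural_similarity, normalized_root_mse

def score_pair(live, archive):
    """Return SSIM (1.0 = identical) and NRMSE (0.0 = identical)."""
    live_arr = np.asarray(live.convert("L"), dtype=float)
    archive_arr = np.asarray(archive.convert("L"), dtype=float)
    ssim = structural_similarity(live_arr, archive_arr, data_range=255.0)
    nrmse = normalized_root_mse(live_arr, archive_arr)
    return ssim, nrmse
```

For example, the output of the cropping step above could be scored with score_pair(*crop_to_match("live.png", "archive.png")).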
The dataset used for the evaluation included 221 pairs of screenshots of live and archived web pages. The archived web pages came from four Archive-It collections (Idle No More, Fort McMurray Wildfire 2016, Western Canadian Arts, and Government of Canada). After calculating the similarity scores on her dataset, Reyes Ayala sent the screenshots to Amazon Mechanical Turk (AMT) so that she could compare the computed scores to human reviewers' scores. An example of an image pair shown to AMT participants appears in Figure 1.
Reyes Ayala found that SSIM and NRMSE were able to detect high and low visual correspondence after performing statistical analysis using tests of significance. The tests she used were a one-way multivariate analysis of variance (MANOVA) and univariate analyses of variance (ANOVAs). The MANOVA results, using a combined dependent variable, were: F(2, 222) = 44.95, p < .001; Wilks' Λ = 0.71; Pillai's trace = 0.29; partial η² = 0.29. The results (with a Bonferroni-adjusted α level of .025) for the univariate ANOVAs were: SSIM: F(1, 223) = 10.53, p = .001, partial η² = 0.05; and NRMSE: F(1, 223) = 89.52, p < .001, partial η² = 0.29.
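These kinds of tests can be reproduced with statsmodels, as in the hedged sketch below; the DataFrame column names (ssim, nrmse, quality_group) are hypothetical, not the paper's actual variable names.

```python
# Sketch of a one-way MANOVA plus follow-up univariate ANOVAs using
# statsmodels; column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA

def run_tests(df: pd.DataFrame):
    # One-way MANOVA: both similarity scores as a combined dependent
    # variable, grouped by human-judged quality.
    manova = MANOVA.from_formula("ssim + nrmse ~ quality_group", data=df)
    print(manova.mv_test())
    # Follow-up univariate ANOVAs for each metric separately.
    for metric in ("ssim", "nrmse"):
        model = ols(f"{metric} ~ quality_group", data=df).fit()
        print(sm.stats.anova_lm(model, typ=2))
```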
Kiesel et al. (“Reproducible web corpora: Interactive archiving with automatic quality assessment”) and Walsh et al. (“High Fidelity Web Archiving of News Sites and New Media with Browsertrix”) also determined the quality of an archived web page by performing screenshot comparisons. Kiesel et al.’s approach, which I summarized in a previous blog post, involves machine learning and uses VGGNet, a deep convolutional neural network designed for image classification tasks. Walsh et al.’s screenshot comparison uses the Pixelmatch tool to measure the difference between color samples and the intensity difference between pixels.
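Pixelmatch itself is a JavaScript library; the snippet below is only a rough Python analogue of a thresholded per-pixel color comparison, for intuition, and does not reproduce Pixelmatch's actual algorithm.

```python
# Rough Python analogue of a thresholded per-pixel color difference
# count, in the spirit of Pixelmatch (illustrative only).
import numpy as np
from PIL import Image

def pixel_diff_ratio(path_a, path_b, threshold=30):
    """Fraction of pixels whose color differs beyond a threshold."""
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=int)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=int)
    # Per-pixel maximum channel difference, compared to a tolerance.
    diff = np.abs(a - b).max(axis=2)
    return (diff > threshold).mean()
```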
Summary
Reyes Ayala created the Web Archiving Screenshot Compare tool, which determines the quality of an archived web page by comparing screenshots of the live web page and the archived web page. After creating a dataset of 221 pairs of screenshots and collecting human review scores, she performed statistical analysis using tests of significance. She found that the Structural Similarity Index (SSIM) and Normalized Root Mean Square Error (NRMSE) were able to distinguish between high-quality and low-quality archived web pages.
For our web archiving livestreams, we currently measure the performance of the web archive crawler during the livestream and plan to measure replay performance during future livestreams. Writing this paper summary helped me learn about another approach that could be used to measure visual correspondence during a web archiving livestream.
--Travis Reid (@TReid803)
