2024-03-21: Surveying Recent Work on Measuring Web Archiving and Replay Performance

 

Figure 1: Six articles and the type of web archiving and replay performance metrics that they included in their work.

For the Game Walkthroughs and Web Archiving project, we have created web archiving livestreams (example livestream) that use tools to measure the performance of web archiving and replay systems, so that we can see which crawler performed better during the livestream. In this post, we categorize and survey six related works that include performance metrics related to web archiving and replay quality (Figure 1). The four categories shown in Figure 1 are archivability, completeness, visual correspondence, and interactional correspondence. Three of these categories (completeness, visual correspondence, and interactional correspondence) are associated with Reyes Ayala's grounded theory (Paper: "Correspondence as the primary measure of information quality for web archives: A human-centered grounded theory study").

The other works in Figure 1 measured performance metrics either to predict how difficult it will be to archive a web page (website archivability) or to assess the quality of replaying an already archived web page. One difference between these approaches is that the categories that measure replay quality usually do not have access to the live web page as it existed at capture time (e.g., the page on the live web and the page archived 10 years ago are out of sync), which can make it difficult to determine the importance of the missing resources.

Website Archivability

Website archivability is associated with the difficulty of archiving a website. Banos and Manolopoulos created the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure the archivability of a website ("A quantitative approach to evaluate Website Archivability using the CLEAR+ method"). 


Banos and Manolopoulos used four facets to measure website archivability: accessibility, standards compliance, cohesion, and metadata. For accessibility, they checked whether the web content could be retrieved by a web archive crawler and how quickly the crawler could archive the website's resources. For standards compliance, they applied known standards for file and code validation, checked for files that did not use an open standard, and looked for certain HTTP headers. The file and code validation tools were used to detect issues with the resources on the web page, and the HTTP headers were checked to get more information about the web resources that would be archived. For cohesion, they checked how many resources were associated with an external service (a service not hosted on the same website), since a website that depends on many external services is harder to archive completely. For the metadata facet, they checked HTTP headers, markup, and HTML source code, because these can provide more information about the resources being archived.

ArchiveReady.com is the reference implementation of the CLEAR+ method. Figure 2 shows ArchiveReady.com being used to get the Website Archivability rating for https://www.odu.edu/. The overall rating for the website is 72%, and details are provided for each of the four facets. The accessibility rating is 59%, with warnings for 15 invalid links, six inline <script> elements, a network response time larger than 200 ms, and a robots.txt file that includes Disallow rules. The cohesion rating is 49%, with warnings for five remote CSS files, six remote scripts, and one remote HTML5 video. The metadata rating is 100%, with no warning messages. Finally, the standards compliance rating is 78%, with warnings for seven CSS files with errors (checked with the W3C CSS validator), one HTML document (https://www.odu.edu/) with errors (checked with the W3C HTML validator), and five images that could not be checked with JHOVE.

Figure 2: ArchiveReady.com showing the Website Archivability rating for https://www.odu.edu/
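
To make the facet scoring concrete, here is a minimal sketch of how per-facet ratings could be combined into an overall archivability score. It uses the example ratings from Figure 2 and assumes equal facet weights; CLEAR+ actually weights its individual evaluations by significance, so this is an illustration rather than the CLEAR+ formula.

```python
# Minimal sketch of combining per-facet ratings into an overall
# Website Archivability score. This is an illustration, not the
# CLEAR+ formula: CLEAR+ weights its individual evaluations by
# significance, while this sketch averages the facet scores equally.

# Example facet ratings taken from the ArchiveReady run in Figure 2.
FACET_SCORES = {
    "accessibility": 59,         # invalid links, slow responses, robots.txt
    "cohesion": 49,              # remote CSS, scripts, and media
    "metadata": 100,             # HTTP headers, markup metadata
    "standards_compliance": 78,  # CSS/HTML validation errors
}

def website_archivability(facet_scores: dict[str, float]) -> float:
    """Return the overall rating as the mean of the facet scores."""
    return sum(facet_scores.values()) / len(facet_scores)

print(f"Website Archivability: {website_archivability(FACET_SCORES):.0f}%")
# Output: Website Archivability: 72%
```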
Visual Correspondence

Visual correspondence is the similarity in appearance between the archived web page and the live web page. Figure 3 is an example of an archived web page with high visual correspondence, where the archived web page looks almost the same as the live web page. An example of low visual correspondence is shown in Figure 4, where none of the visual content is loaded during replay.

Figure 3: Example of an archived web page with high visual correspondence. Screenshots from https://doi.org/10.1007/s00799-021-00314-x


Figure 4: Example of an archived web page with low visual correspondence. Screenshots from https://doi.org/10.1007/s00799-021-00314-x

Three of the related works in Figure 1 measured visual correspondence. Gray and Martin ("Choosing a Sustainable Web Archiving Method: A Comparison of Capture Quality") compared two web archiving methods, and one of their evaluation categories (the substantial category) focused on how the visual appearance of the archived web page was impacted. They checked for missing resources that affected the visual appearance of the web page, like missing images and CSS files.

Figure 5: An example archived web page with three embedded images that have different relative importance values. Image from https://doi.org/10.1007/s00799-015-0150-6

Brunelle et al. ("Not all mementos are created equal: measuring the impact of missing resources") created a memento damage algorithm that is related to visual correspondence, because the algorithm checks the position of elements on the web page and detects whether the formatting of the web page is incorrect. Their algorithm measures the relative importance of each embedded resource on a web page. For image and multimedia resources, the importance is determined by the size of the resource (based on its height and width) relative to the size of the web page and by how close the resource is to the center of the web page (see the sketch after Figure 7). The second image in Figure 5 is an example of an image with high relative importance, since the image is large and close to the center of the web page. The importance of missing style sheets is determined by checking whether the content on the web page is mostly on the left side of the viewport and whether the non-background color is evenly distributed across the viewport. Figure 6 shows an example of content being shifted to the left when style sheets are missing, and Figure 7 shows how the same web page would look with no missing style sheets.

Figure 6: Archived web page with missing stylesheets. Image from https://doi.org/10.1007/s00799-015-0150-6

Figure 7: Same web page as Figure 6, but with no missing stylesheets. Image from https://doi.org/10.1007/s00799-015-0150-6
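
To make the size-and-centrality idea concrete, below is a rough sketch that scores an image's relative importance from the fraction of the page it covers and its distance from the page center. The specific formula (a simple product of the two factors) and the example geometry are assumptions for illustration; Brunelle et al.'s published algorithm combines these factors differently.

```python
import math

# Rough sketch of scoring an embedded image's relative importance from
# its size and its distance to the center of the page, the two factors
# Brunelle et al. use for image and multimedia resources. The product
# below is an illustrative formula, not their published weighting.

def image_importance(img_x, img_y, img_w, img_h, page_w, page_h):
    # Size factor: fraction of the page area that the image covers.
    size = (img_w * img_h) / (page_w * page_h)
    # Centrality factor: 1.0 at the page center, 0.0 at the farthest corner.
    center_dist = math.hypot(img_x + img_w / 2 - page_w / 2,
                             img_y + img_h / 2 - page_h / 2)
    max_dist = math.hypot(page_w / 2, page_h / 2)
    centrality = 1 - center_dist / max_dist
    return size * centrality

# A large, centered image (like the second image in Figure 5) scores
# much higher than a small image tucked into a corner.
print(image_importance(300, 200, 600, 400, 1200, 800))  # large, central
print(image_importance(0, 0, 100, 50, 1200, 800))       # small, corner
```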

Kiesel et al. ("Reproducible web corpora: Interactive archiving with automatic quality assessment") measured replay quality by using screenshot comparisons of the live web page and the archived web page. Their approach involves machine learning and uses VGGNet, a deep convolutional neural network originally developed for image classification tasks. Figure 8 shows example screenshots used to determine the replay quality of an archived web page with their tool. Brunelle et al.'s approach differs from Kiesel et al.'s in that Brunelle et al. did not use machine learning or image similarity to determine the quality of the archived web page. Also, when computing memento damage, Brunelle et al.'s algorithm does not have access to the live web page, unlike Kiesel et al.'s tool. In 2018, Brunelle compared his memento damage algorithm with Kiesel et al.'s approach for measuring replay quality.

Figure 8: Example screenshots used when determining replay quality of an archived web page. Image from https://doi.org/10.1145/3239574
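
Kiesel et al. assess the screenshot pairs with a trained neural network; as a much simpler illustration of screenshot-based comparison, the sketch below computes a normalized pixel-level difference between a live screenshot and a replay screenshot using Pillow and NumPy. The file names are hypothetical.

```python
import numpy as np
from PIL import Image

# Baseline illustration of screenshot-based quality assessment: a
# normalized pixel-level difference between the live and replayed
# page screenshots. Kiesel et al. instead feed the screenshot pairs
# to a trained neural network. The file names are hypothetical.

def screenshot_similarity(live_path: str, replay_path: str) -> float:
    """Return a similarity score in [0, 1]; 1.0 means identical pixels."""
    live = np.asarray(Image.open(live_path).convert("RGB"), dtype=np.float64)
    replay = Image.open(replay_path).convert("RGB")
    # Resize the replay screenshot so the two arrays are comparable.
    replay = replay.resize((live.shape[1], live.shape[0]))
    replay = np.asarray(replay, dtype=np.float64)
    # Mean absolute pixel difference, normalized to [0, 1].
    return 1.0 - np.abs(live - replay).mean() / 255.0

print(screenshot_similarity("live.png", "replay.png"))
```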

Completeness

An archived web page is complete when it contains all of the resources that were on the live web page. An example of an incomplete web page is shown in Figure 9, where the missing web resources impact the visual appearance of the archived web page. Some archived web pages can be visually similar to the live web page, like the example in Figure 3, but still have missing resources (Figure 10).

Figure 9: Example of an incomplete web page with missing images

Figure 10: The web resources that were missing from the archived web page shown in Figure 3. Image from https://doi.org/10.1007/s00799-021-00314-x

Three of the related works included in Figure 1 measured completeness. Berlin et al. ("To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages") checked for blocked requests when replaying an archived web page, which is related to completeness because a blocked request prevents an archived resource from being loaded (Figure 11) and this negatively impacts the completeness of the archived web page. Gray and Martin measured completeness by determining how many resources from the original website were reachable from the archived website's homepage. Brunelle et al.'s memento damage algorithm determines how missing resources will impact the quality of an archived web page.

Figure 11: Blocked requests for web resources impact the completeness of the replay of an archived web page. Screenshots from https://doi.org/10.1145/3589206
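
One simple way to express completeness is as the fraction of the live page's embedded resources that loaded successfully during replay. The sketch below compares two sets of resource URIs; the sets are hypothetical inputs that could come from crawl and replay logs.

```python
# Sketch of a completeness metric: the fraction of the live page's
# embedded resources that loaded successfully during replay. Both
# sets are hypothetical inputs (e.g., extracted from crawl and
# replay logs).

live_resources = {
    "https://example.com/style.css",
    "https://example.com/logo.png",
    "https://example.com/app.js",
    "https://cdn.example.net/map.js",
}
replayed_resources = {
    "https://example.com/style.css",
    "https://example.com/logo.png",
    "https://example.com/app.js",
}

def completeness(live: set[str], replayed: set[str]) -> float:
    """Fraction of live resources present in the replay (1.0 = complete)."""
    return len(live & replayed) / len(live) if live else 1.0

print(f"Completeness: {completeness(live_resources, replayed_resources):.0%}")
print("Missing:", sorted(live_resources - replayed_resources))
```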

Interactional Correspondence

Interactional correspondence is associated with being able to perform the same interactions on the archived web page as on the live web page. An example of low interactional correspondence is shown in Figure 12, where the map in the archived web page does not let the user interact with it, like panning the map to the left to view the buildings or streets that are not currently visible. None of the papers that I have read so far have measured interactional correspondence when determining replay quality.

Figure 12: Example of an archived web page with low interactional correspondence. Screenshots from https://doi.org/10.1007/s00799-021-00314-x

Web archiving tools that perform user interactions, like Browsertrix Crawler, Memento Tracer, and Webis Web Archiver, can help improve interactional correspondence, since the scripted interactions can be adjusted when the user notices that some interactions fail during replay of the archived web page. It is also possible that full interactional correspondence will be impossible for some sites, especially those that are essentially applications (e.g., Google Maps or Google Docs; see also: "Game Walkthroughs As A Metaphor for Web Preservation").

Measuring Web Archiving and Replay Performance During Our Web Archiving Livestream

During our web archiving livestreams, web archiving performance is measured and shown to the viewers during results mode. In the future, the web archiving livestreams will also measure replay performance metrics. The performance results are shown to viewers so they can see how well the web archive crawlers performed during the livestream. The performance metrics that are currently measured (example results summary) include the time it takes a web crawler to archive all of the seed URIs, the number of successfully archived resources, and the number of missing web resources. Most of the currently measured performance metrics are temporary and will be replaced with metrics measured by other tools, like the Memento Damage service.
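
As a sketch of how these per-crawler metrics could be computed, the code below works over hypothetical crawl records; the record format and values are assumptions, not our livestream's actual log format.

```python
# Sketch of the livestream's crawl metrics over hypothetical crawl
# records of (seed URI, HTTP status, seconds to archive). The record
# format is an assumption, not our livestream's actual log format,
# and treating any non-200 response as missing is a simplification.

crawl_records = [
    ("https://example.com/", 200, 4.2),
    ("https://example.com/about", 200, 1.8),
    ("https://example.com/missing.png", 404, 0.3),
]

total_time = sum(seconds for _, _, seconds in crawl_records)
archived = sum(1 for _, status, _ in crawl_records if status == 200)
missing = len(crawl_records) - archived

print(f"Total crawl time: {total_time:.1f}s")
print(f"Archived resources: {archived}, missing resources: {missing}")
```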
 
The six related works in Figure 1 were surveyed to learn more about different approaches for measuring web archiving and replay performance metrics and to identify tools that could be used during our web archiving livestream. Our web archiving livestream will use tools to measure visual correspondence, completeness, and interactional correspondence. For visual correspondence, we could use tools like the Memento Damage service and the Webis Web Archiver to measure replay quality. For completeness, an approach similar to Berlin et al.'s could be used to determine the missing web resources. Since our web archiving livestream has access to the live web page, we could check whether the resources missing during replay are still available on the live web page, which will help with measuring the completeness of the archived web page. For interactional correspondence, an approach similar to the Webis Web Archiver could be used, where a user specifies the required user interactions to perform during the crawling session and the replay session. When the user interactions are performed during the replay session, the number of HTTP responses with a 400- or 500-level status code could be checked, and this may assist with measuring interactional correspondence for some archived web pages.
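
As a sketch of that last idea, the code below uses Playwright to load a replayed page, perform a scripted interaction, and count responses with 400- or 500-level status codes. The replay URL and the interaction selector are hypothetical placeholders.

```python
from playwright.sync_api import sync_playwright

# Sketch: count 400/500-level responses while performing a scripted
# interaction on a replayed page. The replay URL and the selector for
# the interaction are hypothetical placeholders.

failed_requests = []

def on_response(response):
    if response.status >= 400:
        failed_requests.append((response.status, response.url))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", on_response)
    page.goto("https://replay.example.org/20240321/https://example.com/")
    page.click("#map-pan-left")    # hypothetical scripted interaction
    page.wait_for_timeout(2000)    # let interaction-triggered requests finish
    browser.close()

print(f"{len(failed_requests)} failed requests during the replay interaction")
for status, url in failed_requests:
    print(status, url)
```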

Summary

The four categories for measuring web archiving and replay performance covered in this survey are archivability, visual correspondence, completeness, and interactional correspondence. Website archivability is different from the other categories because it predicts how difficult it will be to archive a website without needing to archive the website.

For our web archiving livestreams, we currently measure web archiving performance during the livestream and plan to measure replay performance during future livestreams. This survey helped us learn about different approaches for measuring web archiving and replay performance and identify some tools that could help with measuring visual correspondence, completeness, and interactional correspondence.

--Travis Reid (@TReid803)
