2018-09-03: Let's compare memento damage measures!

It is always nice getting a Google Scholar alert that one of my papers has been cited. In this case, I learned that the paper "Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment" (to appear in the ACM Journal of Data and Information Quality) cited a paper that I wrote during my doctoral studies with fellow PhD students Mat Kelly and Hany SalahEldeen and our advisors Michael Nelson and Michele Weigle. More specifically, the Reproducible Web Corpora paper (by Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen, and Martin Potthast) is a very important and well-executed follow-on to our paper "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources" (a best student paper from JCDL 2014 and subsequently published in the International Journal on Digital Libraries).

In this blog post, I will provide a quick recap and analysis of the Kiesel paper from the perspective of an author of the work that supplies the Brunelle15 metric, which serves as the benchmark measure in the Kiesel 2018 evaluation.

(I suppose this should be referred to as a "guest post" since I have since graduated from the WS-DL research lab and am currently working as a Principal Researcher at the MITRE Corporation.)

Despite missing more embedded resources, the screenshot of the web comic (XKCD) on the left rates as higher quality than the screenshot on the right, since the screenshot on the right is missing the most important embedded resource: the image of the comic.
To begin, it is worthwhile to reflect on our 2014/2015 paper. We set out with the goal of improving upon the naive metric of "percentage of missing resources" used by archivists to assess the quality of a memento. Intuitively, larger, more central embedded resources (e.g., images, videos) are likely more important to a human's interpretation of quality than smaller images on the periphery of a web page. Similarly, a CSS resource that is responsible for formatting the look-and-feel of a page is more important than a CSS resource that does not have as great an impact on the visual layout of the page content. Using these qualitative notions of quality, we created an algorithm that measures -- quantitatively -- the quality of a memento based on the measured importance of its missing embedded resources. For example (and -- not coincidentally -- the one used in both the Brunelle and Kiesel papers), a web comic that is missing its large, centrally-located comic is much more "damaged" than a news article missing the social media share icons at the bottom of the page despite the percentage of missing embedded resources being greater for the news article than the web comic.
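
To make that intuition concrete, here is a minimal sketch contrasting the naive missing-resource ratio with an importance-weighted damage score. This is not the Brunelle15 algorithm itself (which derives importance from size, centrality, and CSS impact); the resource lists and importance weights below are hypothetical.

```python
# Minimal sketch: naive missing-resource ratio vs. an importance-weighted
# damage score. The resources and weights are hypothetical; the actual
# Brunelle15 metric measures importance from size, centrality, and CSS impact.

def naive_damage(resources):
    """Fraction of embedded resources that failed to archive."""
    return sum(1 for r in resources if r["missing"]) / len(resources)

def weighted_damage(resources):
    """Damage as the share of total importance that is missing."""
    total = sum(r["importance"] for r in resources)
    missing = sum(r["importance"] for r in resources if r["missing"])
    return missing / total

# XKCD-style page: the one large, central comic image is missing.
comic_page = [
    {"uri": "comic.png", "importance": 0.80, "missing": True},
    {"uri": "style.css", "importance": 0.15, "missing": False},
    {"uri": "logo.png",  "importance": 0.05, "missing": False},
]

# News article: several small share icons at the bottom are missing.
news_page = [
    {"uri": "article.css",   "importance": 0.50, "missing": False},
    {"uri": "headline.jpg",  "importance": 0.35, "missing": False},
    {"uri": "share-fb.png",  "importance": 0.05, "missing": True},
    {"uri": "share-tw.png",  "importance": 0.05, "missing": True},
    {"uri": "share-rss.png", "importance": 0.05, "missing": True},
]

# Naive: comic 0.33 vs. news 0.60 -- the news page looks "worse".
# Weighted: comic 0.80 vs. news 0.15 -- the comic page is the damaged one.
print(naive_damage(comic_page), weighted_damage(comic_page))
print(naive_damage(news_page), weighted_damage(news_page))
```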

From this baseline, we used Amazon's Mechanical Turk to identify whether our new measure of memento quality aligns more closely with a human's interpretation of quality than the naive measure of the proportion of missing embedded resources. I will leave the details and specifics of our approach, algorithm, and results to the reader, but the punch line is that our algorithm outperformed (albeit slightly) the proportion of missing embedded resources. The take-away from this paper is that there is merit to evaluating the nuances of how we interpret the quality of a memento during and after archiving its live-web counterpart. Erika Siregar has since turned the algorithm from our paper into a service for measuring memento damage.
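
If you want to experiment with that service, a request along the following lines should work. Note that the endpoint path and the response field name here are my assumptions; consult the Memento Damage service's own documentation for the actual REST API.

```python
# Hedged sketch of querying the Memento Damage service with a URI-M.
# The endpoint path and response field are assumptions; check the service's
# documentation for the actual REST API and output format.
import requests

URIM = "https://web.archive.org/web/20180101000000/http://example.com/"
resp = requests.get(
    "http://memento-damage.cs.odu.edu/api/damage/" + URIM, timeout=300)
resp.raise_for_status()
result = resp.json()
# Assumed field name; the service reports an overall damage score per memento.
print(result.get("total_damage"))
```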

With this context, we can appropriately analyze the 2018 Kiesel paper. Kiesel and his coauthors are developing the Webis Web Archiver and wanted a method of immediately and automatically assessing the quality of its mementos. The Webis Web Archiver uses an archiving approach that exercises the JavaScript-enabled aspects of a representation to ensure that deferred representations are appropriately archived, replayable, and retain their original behavior. (This can best be described as a combination of the two-tiered crawling approach we proposed in 2017 and the approach used by Webrecorder.io.)

To accomplish the quality assessment of the Webis Web Archiver mementos, Kiesel et al. sampled 10,000 URI-Rs (and if you want to analyze their data, reuse it, or extend this work, they have made the dataset available!) and identified 6,348 URI-Rs with mementos that had "reproduction errors" (which leads me to believe that the remaining 3,652 mementos were pixel-wise and embedded-resource equivalents of their live-web counterparts). With a methodology similar to ours, the authors assigned 9 Turkers to rate the quality of a screenshot of each memento against a screenshot of its live-web counterpart using a Likert scale (1-5), with 1 being "minimal impact" and 5 being "completely unusable".
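
For illustration, here is a small sketch of turning 9 Turker ratings into a single quality score for one memento/live-web comparison. Using the median as the aggregation rule is my assumption; see the Kiesel paper for how the authors actually combine (and occasionally override) the ratings.

```python
# Sketch of aggregating 9 Turker ratings (1 = minimal impact, 5 = completely
# unusable) into one quality score per comparison. The median is an assumed
# aggregation rule, not necessarily the one used in the Kiesel paper.
from statistics import median

def aggregate_rating(ratings):
    assert len(ratings) == 9 and all(1 <= r <= 5 for r in ratings)
    return median(ratings)

print(aggregate_rating([1, 1, 2, 1, 3, 1, 2, 1, 1]))  # -> 1 (minimal impact)
print(aggregate_rating([5, 4, 5, 5, 3, 5, 4, 5, 5]))  # -> 5 (unusable)
```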

One topic left out of the Kiesel paper is that a Turker's evaluation of what makes a "well-preserved web page" is likely to differ from the evaluation of an archivist. This was a notional finding of our 2014/2015 memento damage work and -- while the nuances of this difference are alluded to -- it is not directly addressed in the Kiesel paper. For example, the edge case of a video "still loading" in the screenshot (among other examples cited in the Kiesel paper) is considered a low-quality memento by the paper's authors but may not be considered completely unusable by the Turkers. Reinforcing the difference between archivists and Turkers, the authors noted that they changed the aggregate quality score of the Turker assessments for 11% of the comparisons (717 of the 6,348). To test the null hypothesis in this experiment, the authors could have presented Turkers with perfect mementos from the set of 3,652 mementos without reproduction errors. Ideally, the Turkers would have rated the comparisons in this set as 1s on the authors' Likert scale.
The image on the right shows the still-loading multimedia embedded resource (part of Figure 5 in the Kiesel paper).

Using their evaluation approach, Kiesel et al. compared the Brunelle15 approach, a pixel-wise comparison using RMSE, and a neural network-driven classifier according to their respective alignment with the Turker assessments. As I would have assumed, the Brunelle15 measure is uncorrelated with the Turker assessments. The authors' interpretation of this result matches mine: the Brunelle15 measure is computed in the absence of the live-web representation, meaning it has to make assumptions about things like image size and placement when they are unavailable. Further, Brunelle15 puts an increased emphasis on images and multimedia as opposed to CSS. We assumed that Turkers focus on the potentially more visible CSS damage in a memento, whereas archivists focus on the absence of prominent embedded resources regardless of formatting and positioning. I was surprised at how closely RMSE correlated with the Turker assessments. This could be a low-computational-cost method (as compared to training a neural network) of assessing quality. Of course, the neural network approach performed best and demonstrates the promise of learned quality assessment.
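
For readers unfamiliar with the pixel-wise baseline, here is a minimal sketch of an RMSE comparison between a memento screenshot and a screenshot of its live-web counterpart. This is my simplification; the Kiesel paper describes the exact screenshot dimensions and preprocessing used.

```python
# Minimal sketch of a pixel-wise RMSE comparison between two screenshots.
# A simplification for illustration, not the Kiesel et al. implementation.
import numpy as np
from PIL import Image

def screenshot_rmse(memento_png, liveweb_png):
    a = np.asarray(Image.open(memento_png).convert("RGB"), dtype=np.float64)
    b = np.asarray(Image.open(liveweb_png).convert("RGB"), dtype=np.float64)
    if a.shape != b.shape:
        # Crude alignment: crop both screenshots to their overlapping region.
        h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
        a, b = a[:h, :w], b[:h, :w]
    return float(np.sqrt(np.mean((a - b) ** 2)))  # 0 = identical, 255 = max

# e.g. screenshot_rmse("memento.png", "live.png")
```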

Figure 6 of the Kiesel paper shows the correlation of the models with the Turker ratings.
An interesting extension to the Kiesel work would be to identify the aspects of a page that lead to lower quality scores. In alignment with my assumption that higher quality ratings from Turkers are driven by well-preserved CSS while archivists rate well-preserved "important" embedded resources more highly, it would be interesting to see a more granular feature vector for a memento that could be used to tune an archival service -- for example, favoring small images over CSS if the research indicates that Turkers (or archivists) do not assign much value to formatting when judging quality. (This is unlikely, in my opinion, but is a valid possible result.) The tuning could also be performed based on the wall-clock time of the crawl (a topic we discussed in our JCDL 2017 paper on what it "costs" to archive JavaScript). Another interesting extension would be comparing the quality of mementos resulting from different archival approaches such as archive.is and Webrecorder.io, particularly with respect to resources with deferred representations.
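
As a hypothetical illustration of what such a tunable feature vector might look like, the sketch below weights per-type damage fractions by audience-specific weights. The feature names and weights are mine, not from either paper.

```python
# Hypothetical per-feature damage vector that an archival service could tune.
# Feature names and weights are illustrative assumptions: the idea is that
# audience-specific weights (Turkers vs. archivists, or a crawl-time budget)
# change which missing resources "cost" the most.
damage_features = {"css": 0.6, "large_images": 0.2, "small_images": 0.1,
                   "multimedia": 0.1}           # fraction of each type missing

turker_weights = {"css": 0.5, "large_images": 0.3, "small_images": 0.1,
                  "multimedia": 0.1}            # assumed: formatting-sensitive
archivist_weights = {"css": 0.2, "large_images": 0.5, "small_images": 0.1,
                     "multimedia": 0.2}         # assumed: content-sensitive

def score(features, weights):
    return sum(features[k] * weights[k] for k in features)

print(score(damage_features, turker_weights))     # higher: CSS loss dominates
print(score(damage_features, archivist_weights))  # lower: content mostly intact
```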

The Kiesel et al. paper uses an optimal approach for immediate and automatic memento quality assessment -- comparison to the live web combined with human-assessed interpretations of quality. They also use a neural network to learn to assess quality from the human evaluations. I view this work as a natural and necessary next step toward understanding how to measure memento quality. I look forward to their future work!

--Justin F. Brunelle

The authors' affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions or viewpoints expressed by the authors. ©2018 The MITRE Corporation. ALL RIGHTS RESERVED. Approved for Public Release; Distribution Unlimited. Case Number 18-2725-2.
