2017-03-24: The Impact of URI Canonicalization on Memento Count
We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report.
Memento represents a set of captures for a URI (e.g., http://google.com) with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible when dereferencing the URI-M (resolving the URI-M to an archived representation of a resource).
Variations of the "original URI" are canonicalized, so that https://google.com and http://www.google.com:80/, for instance, are coalesced into a single TimeMap. The original URI (URI-R in Memento terminology) is also included in the TimeMap with a literal "original" relationship value.
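To make the coalescing concrete, here is a minimal sketch of a canonicalizer. The rules it applies (ignore the scheme, drop a leading "www.", drop a default port, treat an empty path as "/") are simplifying assumptions chosen for illustration; they are not the actual rules any particular archive uses, and the function name canonical_key is hypothetical.

```python
from urllib.parse import urlsplit

# Default ports per scheme; an explicit default port carries no information.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_key(uri):
    """Reduce a URI-R to a simplified lookup key (illustrative rules only)."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    # Assumption: "www." variants are treated as the same site.
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    # The scheme is deliberately left out, so http and https coalesce.
    return host + path + query


variants = ["http://google.com", "https://google.com", "http://www.google.com:80/"]
keys = {canonical_key(u) for u in variants}
```

Under these assumed rules, all three variants reduce to the same key, which is why captures of each can end up listed in one TimeMap.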
<http://ws-dl.blogspot.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/>; rel="self"; type="application/link-format"; from="Wed, 29 Sep 2010 00:03:40 GMT"; until="Mon, 20 Mar 2017 19:09:10 GMT",
<http://web.archive.org/web/http://ws-dl.blogspot.com/>; rel="timegate",
<http://web.archive.org/web/20100929000340/http://ws-dl.blogspot.com/>; rel="first memento"; datetime="Wed, 29 Sep 2010 00:03:40 GMT",
<http://web.archive.org/web/20110202180231/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 02 Feb 2011 18:02:31 GMT",
<http://web.archive.org/web/20110902171049/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:10:49 GMT",
<http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 02 Sep 2011 17:12:56 GMT",
...
<http://web.archive.org/web/20151205080546/http://www.ws-dl.blogspot.com/>; rel="memento"; datetime="Sat, 05 Dec 2015 08:05:46 GMT",
<http://web.archive.org/web/20161104143102/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Fri, 04 Nov 2016 14:31:02 GMT",
<http://web.archive.org/web/20161109005749/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Wed, 09 Nov 2016 00:57:49 GMT",
<http://web.archive.org/web/20170119233646/http://ws-dl.blogspot.com/>; rel="memento"; datetime="Thu, 19 Jan 2017 23:36:46 GMT",
<http://web.archive.org/web/20170320190910/http://ws-dl.blogspot.com/>; rel="last memento"; datetime="Mon, 20 Mar 2017 19:09:10 GMT"
For instance, to view the TimeMap for this very blog from Internet Archive, a user may request http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/ (Figure 1). Each URI-M (e.g., http://web.archive.org/web/20110902171256/http://ws-dl.blogspot.com/) is listed with a corresponding relationship (rel) and datetime value. Note that the www.ws-dl.blogspot.com and ws-dl.blogspot.com subdomain variants are both included in the same TimeMap, a product of the canonicalization procedure. The TimeMap for this URI-R currently contains 60 URI-Ms, whereas Internet Archive's Web interface reports 58 captures: a subtle difference in "count". This difference gets much more extreme with other URI-Rs.
The quality of each memento (e.g., in terms of completeness of capture of embedded resources) cannot be determined using the TimeMap alone. Assessing quality requires dereferencing the URI-M and requesting each embedded resource upon rendering the base URI-M. Comprehensively evaluating the quality over time is something we have already covered (see our TPDL2013, JCDL2014, and IJDL2015 papers/article).
In performing some studies and developing web archiving tools, we needed to know how many captures existed for a particular URI using both a Memento aggregator and the TimeMap from an archive's Memento endpoint. For http://google.com, counting the number of URIs in a TimeMap with a rel value of "memento" produces a count of 695,525 (as of May 2017). The numbers obtained from Internet Archive's calendar interface and CDX endpoint are currently much smaller (e.g., the calendar interface currently states 62,339 captures for google.com).
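The naïve count described above can be sketched in a few lines. This is a minimal illustration, not the tooling used in the study: it assumes the TimeMap is in link format with rel as the first parameter after each URI (as in Figure 1), and it treats "first memento" and "last memento" as mementos too. The names LINK_RE and memento_uris are hypothetical.

```python
import re

# Matches `<URI>; rel="..."`. Splitting on commas instead would break on
# the commas inside datetime values, so a pattern match is used.
# Assumption: rel is the first parameter after the URI, as in Figure 1.
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URI-Ms whose rel value includes the token 'memento'."""
    return [
        uri
        for uri, rel in LINK_RE.findall(timemap_text)
        if "memento" in rel.split()
    ]


# A shortened copy of the TimeMap in Figure 1.
sample = (
    '<http://ws-dl.blogspot.com/>; rel="original", '
    '<http://web.archive.org/web/timemap/link/http://ws-dl.blogspot.com/>; '
    'rel="self"; type="application/link-format", '
    '<http://web.archive.org/web/http://ws-dl.blogspot.com/>; rel="timegate", '
    '<http://web.archive.org/web/20100929000340/http://ws-dl.blogspot.com/>; '
    'rel="first memento"; datetime="Wed, 29 Sep 2010 00:03:40 GMT", '
    '<http://web.archive.org/web/20110202180231/http://ws-dl.blogspot.com/>; '
    'rel="memento"; datetime="Wed, 02 Feb 2011 18:02:31 GMT", '
    '<http://web.archive.org/web/20170320190910/http://ws-dl.blogspot.com/>; '
    'rel="last memento"; datetime="Mon, 20 Mar 2017 19:09:10 GMT"'
)

count = len(memento_uris(sample))  # 3: the original, self, and timegate links are skipped
```

A count obtained this way says nothing about what each URI-M returns when dereferenced.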
Dereferencing these URI-Ms would take a very long time due to network latency in accessing the archive as well as limits on pipelining (though the latter can be mitigated by distributing the task). We did exactly this for google.com and found that the large majority of the URI-Ms produced a redirect to another URI-M in the TimeMap. This led us to conclude that counting the URI-Ms in a TimeMap is not sufficient for determining the number of mementos in an archive's holdings.
For google.com we found that nearly 85% of the URI-Ms resulted in a redirect when dereferenced. We repeated this procedure for seven other TimeMaps for large web sites (e.g., yahoo.com, instagram.com, wikipedia.org) and found that the results of this naïve counting method vary widely (88.2%, 67.3%, and 44.6% are redirects, respectively). We also repeated this procedure with thirteen academic institutions' URI-Rs to observe whether this trend persisted.
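The dereferencing procedure can be sketched as follows. This is a simplified, sequential illustration rather than the distributed setup used in the study; it assumes a plain GET with redirects disabled is enough to classify each URI-M, and the names fetch_status and redirect_ratio are hypothetical.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx responses so they can be counted."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def fetch_status(uri_m, timeout=30):
    """Dereference one URI-M and return its HTTP status code."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(uri_m, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx, and 5xx responses arrive here


def redirect_ratio(uri_ms, status_of=fetch_status):
    """Fraction of URI-Ms that answer with a 3xx status when dereferenced."""
    statuses = [status_of(uri_m) for uri_m in uri_ms]
    if not statuses:
        return 0.0
    redirects = sum(1 for status in statuses if 300 <= status < 400)
    return redirects / len(statuses)
```

The status_of parameter lets the classification be exercised against recorded status codes without touching the network, for example redirect_ratio(["a", "b"], status_of={"a": 302, "b": 200}.get). For a TimeMap with hundreds of thousands of URI-Ms, the sequential loop would need to be parallelized or distributed.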
We have posted an extensive report of our findings as a tech report available on arXiv (linked below).
— Mat (@machawk1)
Mat Kelly, Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel. "Impact of URI Canonicalization on Memento Count," Technical Report arXiv:1703.03302, 2017.
I think this provides some interesting insight about redirects over time, but the question of how many mementos for a given site are 3xx vs 200 is trivial to answer by using the right API. The CDX Server API clearly includes this information in its TimeMap results and requires no expensive dereferencing.
Of course, the contents of a TimeMap do not imply that all the mementos are accessible at all times; a TimeMap with status codes included, by definition, provides the "best" answer on what the archive claims to contain.
It seems the real takeaway here is that the Memento protocol is limited by the fact it does not include status codes and requires expensive dereferencing to guess this information.
An extension to Memento could simply include the status code in the TimeMap, or users can be advised to use the CDX Server API, which is informally standardized and widely available for web archives.
The paper alludes to using the CDX Server for comparison; it would be interesting to see how the results from the CDX Server compare with what was discovered via dereferencing.
Hi Ilya,
Yes, we realize that including the status code in the TimeMap is the way to go in the future. Historically, this was less of an issue -- I'm pretty sure WMs used to export only 200s (excluding date-based redirections) -- or at least export far fewer non-200s. I think some of the foundations and assumptions underneath the protocol changed, and that was exacerbated by other changes (e.g., http-->https).
Also, other systems (e.g., wikis, archive.is) 1) don't have CDXs, and 2) still essentially only expose 200s.
regards,
Michael