2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives. These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work or Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one.

We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies among web archives made these operations much more complex. We document our findings in a technical report entitled "Rules of Acquisition for Mementos and Their Content".

Our technical report briefly covers the following key points:
Special techniques for acquiring mem…