2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected
Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives. These web pages span much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work and Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one.
We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies between web archives made these operations much more complex. We document our findings in a technical report entitled "Rules of Acquisition for Mementos and Their Content".
Our technical report briefly covers the following key points:
- Special techniques for acquiring mementos from the WebCite on-demand archive (http://www.webcitation.org)
- Special techniques for dealing with JavaScript Redirects created by the Internet Archive
- An alternative to BeautifulSoup for removing elements and extracting text from mementos
- Stripping away archive-specific additions to memento content
- An algorithm for dealing with inaccurate character encoding
- Differences in whitespace treatment between archives for the same archived page
- Control characters in HTML and their effect on DOM parsers
- DOM-corruption in various HTML pages exacerbated by how the archives present the text stored within <noscript> elements
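To illustrate the kind of approach one might take for inaccurate character encodings, here is a minimal sketch in Python: try the encoding the archive declared, then fall back to common candidates. The candidate order, function name, and error handling here are our assumptions for illustration, not the exact algorithm from the technical report.

```python
def decode_with_fallback(raw_bytes, declared_encoding=None):
    """Decode memento content, trying the declared encoding first.

    Falls back to common encodings when the declaration is inaccurate;
    as a last resort, replaces undecodable bytes rather than failing.
    """
    candidates = [declared_encoding, "utf-8", "iso-8859-1"]
    for enc in candidates:
        if not enc:
            continue
        try:
            return raw_bytes.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # declared encoding was wrong; try the next candidate
    return raw_bytes.decode("utf-8", errors="replace")

# A page declared as UTF-8 but actually Latin-1 encoded:
print(decode_with_fallback(b"caf\xe9", declared_encoding="utf-8"))
```

Because ISO-8859-1 maps every byte to a character, the fallback chain always terminates with some decoded string, which is often preferable to discarding a memento outright.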
Rather than repeating the entire technical report here, we want to focus on the two issues that may have the greatest impact on others acquiring and experimenting with mementos: acquiring mementos from WebCite and inaccurate character encoding.
Acquisition of Content from WebCite
WebCite is an on-demand archive specializing in archiving web pages used as citations in scholarly work. An example WebCite page is shown below.
For acquiring most memento content, we utilized the cURL data transfer tool. With this tool, one merely types the following command to save the contents of the URI http://www.example.com:
curl -o outputfile.html http://www.example.com
For WebCite, the output from cURL for a given URI-M results in the same HTML frameset content, regardless of which URI-M is used. We sought to acquire the actual content of a given page for text extraction, so merely utilizing cURL was insufficient. An example of this HTML is shown below.
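One way to work around the frameset wrapper, sketched below, is to parse the frameset returned by cURL, extract the src attribute of each frame, and then issue a second request for the frame that holds the archived content. The example frameset markup and the helper names here are our assumptions; WebCite's actual frame URIs may differ.

```python
from html.parser import HTMLParser

class FrameSrcParser(HTMLParser):
    """Collect the src attributes of <frame> elements in a frameset page."""
    def __init__(self):
        super().__init__()
        self.frame_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.frame_srcs.append(src)

def frame_sources(frameset_html):
    """Return the list of frame src URIs found in frameset_html."""
    parser = FrameSrcParser()
    parser.feed(frameset_html)
    return parser.frame_srcs

# Hypothetical frameset of the kind an on-demand archive might return:
html = '''<frameset rows="100,*">
  <frame name="top" src="/topframe.php" scrolling="no">
  <frame name="main" src="/mainframe.php">
</frameset>'''
print(frame_sources(html))  # ['/topframe.php', '/mainframe.php']
```

The relative URIs recovered this way would then be resolved against the original URI-M and fetched with a second cURL request to obtain the actual page content for text extraction.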