Posts

Showing posts from March, 2017

2017-03-24: The Impact of URI Canonicalization on Memento Count

Mat reports that relying solely on a Memento TimeMap to evaluate how well a URI is archived is not sufficient. We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report.

Memento represents a set of captures for a URI (e.g., http://google.com) with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible by dereferencing the URI-M (resolving the URI-M to an archived representation of a resource). Variations in the "original URI" are canonicalized (coalescing https://google.com and http://www.google.com:80/, for instance), with the original URI (URI-R in Memento terminology) also included with a literal "original" relationship value. <http://ws-dl.b…
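The coalescing described above can be illustrated with a minimal sketch. The rules below (lowercase host, drop the default port and a leading "www.") are illustrative assumptions only; real archives apply richer canonicalization schemes such as SURT.

```python
from urllib.parse import urlsplit

def canonicalize(uri):
    """Coalesce trivial URI variants into one canonical form.
    Illustrative only -- not any archive's actual rule set."""
    parts = urlsplit(uri)
    host = parts.hostname or ""        # .hostname lowercases and drops the port
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    return f"http://{host}{path}"

# All three variants coalesce to the same canonical URI:
variants = ["https://google.com", "http://www.google.com:80/", "http://GOOGLE.com/"]
print({canonicalize(u) for u in variants})  # {'http://google.com/'}
```

Under such a scheme, a TimeMap keyed on the canonical form groups captures of all these variants together, while the URI-R is still reported with the "original" relation.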

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (junk text), HTML, JavaScript, comments, and CSS text.

Fig. 3: Boilerplate removal result for the Justext method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text() and NLTK's (old) clean_html() method, but the page title is absent.

Boilerplate removal refers to the task of extracting the main text content of webpages by removing content such as navigation links, header and footer sections, etc. Even though this task is a common prerequisite for most text processing tasks, I have not found an authoritative, versatile solution. In order to better understand how some common options for boilerplate removal perform against one another, I developed a simple experiment to measure how well the methods perform when compared to a gold standard text extraction method (myself). Pyt…
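The junk text in Fig. 1 comes from treating every text node as content, including the bodies of script and style tags. A minimal stdlib sketch of the tag-stripping idea (the surveyed tools, such as BeautifulSoup and Justext, are far more sophisticated):

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Naive boilerplate stripper: keep text nodes, but drop anything
    inside <script>, <style>, or <nav> -- the kind of junk a bare
    text extraction would otherwise include."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting level inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """<html><head><style>p {color: red;}</style></head>
<body><script>var x = 1;</script><p>Main article text.</p>
<nav>Home | About</nav></body></html>"""

p = MainTextExtractor()
p.feed(html)
print(" ".join(p.chunks))  # Main article text.
```

This tag-based filtering catches CSS and JavaScript text but not in-content boilerplate (related-links blocks, bylines), which is why methods like Justext add text-density heuristics on top.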

2017-03-09: A State Of Replay or Location, Location, Location

We have written blog posts about the time traveling zombie apocalypse in web archives and how the lack of client-side JavaScript execution at preservation time prevented the SOPA protest of certain websites from being seen in the archive. A more recent post described how CNN's use of JavaScript to load and render the contents of its homepage has made it unarchivable since November 1st, 2016. The CNN post detailed how the "tricks" used to circumvent CORS restrictions on HTTP requests made by JavaScript to its CDN were the root cause of why the page is unarchivable/unreplayable. I will now present a variation of this which is more insidious and less obvious than what was occurring in the CNN archives. TL;DR: In this blog post, I will show in detail what caused a particular web page to fail on replay. In particular, the replay failure occurred due to the lack of the necessary authentication and HTTP methods made for the custom resources this …

2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive in San Francisco, CA. The workshop was two days long, February 23-24, 2017, and was held in conjunction with a National Web Symposium hosted at the Internet Archive over the same dates. Four members of the Web Science and Digital Library (WSDL) group from Old Dominion University had the opportunity to attend: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and Web Archive Hackathon 2.0.

@WebSciDL at @internetarchive after Archives Unleashed 3.0 wrap up. We have a winner of #HackArchives pic.twitter.com/vYLi89yap0 — Sawood Alam (@ibnesayeed) February 25, 2017
This workshop was supported by the Internet Archive, Rutgers University, and the University of Waterloo. It brought together a small group of around 20 researchers who worked together to develop new open source tools for web archives. The three orga…