Showing posts from March, 2017

2017-03-24: The Impact of URI Canonicalization on Memento Count

Mat reports that relying solely on a Memento TimeMap to evaluate how well a URI is archived is not a sufficient method. We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report. Memento represents the set of captures for a URI with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible by dereferencing the URI-M (resolving the URI-M to an archived representation of a resource). Variations of the "original URI" are canonicalized (coalescing equivalent URI forms, for instance), with the original URI (URI-R in Memento terminology) also included with a literal "original" relationship value.
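The study above counts URI-Ms by walking a TimeMap. As a rough illustration, the sketch below parses a small, made-up link-format TimeMap (the URIs and dates are illustrative, not from the study) and separates the URI-R from the URI-Ms; a real TimeMap would be fetched from an archive's Memento endpoint and can contain additional relation types.

```python
import re

# Illustrative link-format TimeMap; real TimeMaps come from an
# archive's Memento endpoint and are usually much larger.
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://web.archive.org/web/20010101000000/http://example.com/>; rel="first memento"; datetime="Mon, 01 Jan 2001 00:00:00 GMT",
<http://web.archive.org/web/20170301000000/http://example.com/>; rel="last memento"; datetime="Wed, 01 Mar 2017 00:00:00 GMT"'''

def parse_timemap(timemap_text):
    """Return (URI-R, list of URI-Ms) from a link-format TimeMap.

    Assumes rel is the first attribute after each <URI>, which holds
    for typical archive output but is not guaranteed by the spec.
    """
    entries = re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', timemap_text)
    uri_r = None
    uri_ms = []
    for uri, rel in entries:
        rels = rel.split()  # rel can carry multiple values, e.g. "first memento"
        if "original" in rels:
            uri_r = uri
        if "memento" in rels:
            uri_ms.append(uri)
    return uri_r, uri_ms

uri_r, uri_ms = parse_timemap(SAMPLE_TIMEMAP)
# uri_r  -> "http://example.com/"
# uri_ms -> the two web.archive.org capture URIs
```

Counting `len(uri_ms)` gives the naive "how well archived" number; the study's point is that each URI-M must still be dereferenced to see whether it yields a representation or a redirect.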

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (junk text), HTML, JavaScript, comments, and CSS text. Fig. 2: Boilerplate removal result for NLTK's (old) clean_html() method for a news website. Extracted text includes extraneous text, but does not include JavaScript, HTML, comments, or CSS text. Fig. 3: Boilerplate removal result for the Justext method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text() and NLTK's (old) clean_html() method, but the page title is absent. Fig. 4: Boilerplate removal result for the Python-goose method for this news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (old) clean_html(), and Justext, but the page title and first paragraph are absent. Fig. 5: Boilerplate removal result for Python-boilerpipe (ArticleExtra
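The figures contrast naive extraction (which keeps script and style text) with tag-aware extraction (which drops it). The stdlib sketch below illustrates that contrast with a toy HTML snippet; it is not the code of any of the five surveyed tools, just a minimal script/style-skipping extractor in the spirit of NLTK's old clean_html().

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive text extractor that skips <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

# Illustrative page: a title, junk script/style, and body text.
html_doc = ("<html><head><style>p{color:red}</style></head>"
            "<body><h1>Title</h1><script>var x=1;</script>"
            "<p>Body text.</p></body></html>")
parser = TextExtractor()
parser.feed(html_doc)
text = " ".join(parser.parts)
# text -> "Title Body text."
```

Grabbing all text nodes without the skip check reproduces the Fig. 1 failure mode (CSS and JavaScript leaking into the extracted text); the boilerplate-removal tools in the survey go further and also try to drop navigation, ads, and other non-article content.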

2017-03-09: A State Of Replay or Location, Location, Location

We have written blog posts about the time traveling zombie apocalypse in web archives and how the lack of client-side JavaScript execution at preservation time prevented the SOPA protest of certain websites from being seen in the archive . A more recent post about CNN's utilization of JavaScript to load and render the contents of its homepage showed that it has been unarchivable since November 1st, 2016 . The CNN post detailed how some "tricks" used to circumvent CORS restrictions on the HTTP requests made by JavaScript to talk to their CDN were the root cause of why the page is unarchivable / unreplayable. I will now present to you a variation of this which is more insidious and less obvious than what was occurring in the CNN archives. TL;DR In this blog post, I will be showing in detail what caused a particular web page to fail on replay. In particular, the replay failure occurred due to the lack of necessary authentication and HTTP methods made for the c

2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive , San Francisco, CA. The workshop was two days long, February 23-24, 2017, and was held in conjunction with a National Web Symposium , hosted at the Internet Archive over the same dates. Four members of the Web Science and Digital Library group ( WSDL ) from Old Dominion University had the opportunity to attend: Sawood Alam , Mohamed Aturban , Erika Siregar , and myself . This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and Web Archive Hackathon 2.0 . @WebSciDL at @internetarchive after Archives Unleashed 3.0 wrap up. We have a winner of #HackArchives — Sawood Alam (@ibnesayeed) February 25, 2017 This workshop was supported by the Internet Archive , Rutgers University , and the University of Waterloo . The workshop brought together a small group of around 20 researchers who worked together to develop new open source tools t