Posts

2017-04-17: CNI Spring 2017 Trip Report

The Coalition for Networked Information (CNI) Spring 2017 Membership Meeting was held April 3-4, 2017 in Albuquerque, NM. As before, the presentations were of very high quality, but the eight-way (!) split of sessions means that you're going to miss some good ones. The full schedule is available, but this trip report will focus on the sessions that I was able to attend. Fortunately, the attendees covered the meeting well on Twitter (#cni17s), and the tweets are collected by both CNI (Day 1, Day 2) and Michael Collins (Day 1, Day 2). The presentation slides are being collected at OSF. The first day began with a plenary by Alison J. Head, representing Project Information Literacy (PIL). Alison's talk was entitled "What today's university students have taught us", and these slides from not quite a year ago were similar to what she presented at CNI. Alison has done extensive research about how undergraduates use Wikipedia, the Web

2017-03-24: The Impact of URI Canonicalization on Memento Count

Mat reports that relying solely on a Memento TimeMap to evaluate how well a URI is archived is not a sufficient method. We performed a study of very large Memento TimeMaps to evaluate the ratio of representations versus redirects obtained when dereferencing each archived capture. Read along below or check out the full report. Memento represents a set of captures for a URI (e.g., http://google.com) with a TimeMap. Web archives may provide a Memento endpoint that allows users to obtain this list of URIs for the captures, called URI-Ms. Each URI-M represents a single capture (memento), accessible when dereferencing the URI-M (resolving the URI-M to an archived representation of a resource). Variations in the "original URI" are canonicalized (coalescing https://google.com and http://www.google.com:80/, for instance) with the original URI (URI-R in Memento terminology) also included with a literal "original" relationship value.
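To make the coalescing concrete, here is a minimal Python sketch of how variants of an original URI might collapse to one lookup key. The rules (ignore the scheme, drop a leading "www.", drop the default port, treat an empty path as "/") and the function names canonicalize and group_by_key are assumptions for illustration only; they are not the rules any particular web archive applies, and real archives use more elaborate canonicalization.

```python
from urllib.parse import urlsplit

# Default ports per scheme; a port equal to the default is dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(uri: str) -> str:
    """Reduce a URI to a simplified lookup key (hypothetical rules)."""
    parts = urlsplit(uri.strip())
    host = (parts.hostname or "").lower()
    # Assumption: "www." variants refer to the same original resource.
    if host.startswith("www."):
        host = host[4:]
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme.lower()):
        host = f"{host}:{port}"
    # An empty path and "/" are treated as the same resource.
    key = host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def group_by_key(uris):
    """Group URI variants that share the same canonical key."""
    groups = {}
    for uri in uris:
        groups.setdefault(canonicalize(uri), []).append(uri)
    return groups


variants = [
    "http://google.com",
    "https://google.com",
    "http://www.google.com:80/",
]
groups = group_by_key(variants)
```

Under these assumed rules, all three variants fall into a single group, which is why a TimeMap requested for one form of the URI can list captures that were made of another.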

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (junk text), HTML, JavaScript, comments, and CSS text. Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include JavaScript, HTML, comments, or CSS text. Fig. 3: Boilerplate removal result for the Justext method for a news website. Extracted text includes less extraneous text than BeautifulSoup's get_text() and NLTK's (OLD) clean_html() methods, but the page title is absent. Fig. 4: Boilerplate removal result for the Python-goose method for this news website. No extraneous text, unlike BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but the page title and first paragraph are absent. Fig. 5: Boilerplate removal result for Python-boilerpipe (ArticleExtra

2017-03-09: A State Of Replay or Location, Location, Location

We have written blog posts about the time traveling zombie apocalypse in web archives and how the lack of client-side JavaScript execution at preservation time prevented the SOPA protest of certain websites from being seen in the archive. A more recent post described how CNN's use of JavaScript to load and render the contents of its homepage has made the page unarchivable since November 1st, 2016. The CNN post detailed how the "tricks" used to circumvent CORS restrictions on HTTP requests that JavaScript makes to CNN's CDN were the root cause of why the page is unarchivable / unreplayable. I will now present a variation of this which is more insidious and less obvious than what was occurring in the CNN archives. TL;DR: In this blog post, I will show in detail what caused a particular web page to fail on replay. In particular, the replay failure occurred due to the lack of necessary authentication and HTTP methods made for the c