Thursday, December 19, 2013

2013-12-19: 404 - Your interview has been depublished

In early November 2013 I gave an invited presentation at the EcoCom conference (picture left) and at the Spreeforum, an informal gathering of researchers to facilitate knowledge exchange and foster collaborations. EcoCom was organized by Prof. Dr. Michael Herzog and his SPiRIT team and the Spreeforum was hosted by Prof. Dr. Jürgen Sieck, who leads the INKA research group. Both events were supported by the Alcatel-Lucent Stiftung for Communications research. In my talks I gave a high-level overview of the state of the art in web archiving, outlined the benefits of the Memento protocol, pointed at issues and challenges web archives face today, and gave a demonstration of the Memento for Chrome extension.

Following the talk at the Spreeforum I was asked to give an interview for the German radio station Inforadio (you may think of it as Germany's NPR). The piece was aired on Monday, November 18th at 7.30am CET. As I had already left Germany I was not able to listen to it live, but I was happy to find the corresponding article online. It basically contained the transcript of the aired report, with an audio file embedded in the document. I immediately bookmarked the page.

A couple of weeks later I revisited the article at its original URI only to find it was no longer available (screenshot left). Now, we all know that the web is dynamic and hence links break, and we have even seen odd dynamics at other media companies before, but in this case, as I was about to find out, it was higher powers that caused the detrimental effect. Inforadio is a public radio station and therefore, like many others in Germany and throughout Europe, to a large extent financed by the public (as of 2013 the broadcast receiving license is 17.98 Euros (almost USD 25) per month per household). As such they are subject to the "Rundfunkstaatsvertrag", which is a contract between the German states to regulate broadcasting rights. The 12th amendment to this contract from 2009 mandates that most online content must be removed after 7 days of publication. Huh? Yeah, I know, it sounds like a very bad joke but it is not. It even led to the coining of the term "depublish", a paradox in itself. I had considered public radio stations to be "memory organizations", in league with libraries, museums, etc. How wrong I was, and how ironic is this, given my talk's topic!? For what it's worth though, the content does not have to be deleted from the repository but it has to be taken offline.

I can only speculate about the reasons for this mandate, but opinions that seem believable to me suggest that private broadcasters and news publishers complained about unfair competition. In this sense, the claim was made that "eternal" availability of broadcast content on the web is unfair competition as the private sector is not given the appropriate funds to match that competitive advantage. Another point that supposedly was made is that this online service goes beyond the mandate of public radio stations and hence would constitute a misguided use of public money. To me personally, none of this makes any sense. Broadcasters of all sorts have realized that content (text, audio, and video) is increasingly consumed online and hence are adjusting their offerings. How this can be seen as unfair competition is unclear to me.

But back to my interview. Clearly, one can argue (or not) whether the document is worth preserving but my point here is a different one:
Not only did I bookmark the page when I saw it, I also immediately tried to push it into as many web archives as I could. I tried the Internet Archive's new "save page now" service but, to add insult to injury, Inforadio also has a robots.txt file in place that prohibits the IA from crawling the page. To the best of my knowledge this is not part of the 12th amendment to the "Rundfunkstaatsvertrag" so the broadcaster could actually take action to preserve their online content. Other web sites of public radio and TV stations such as Deutschlandfunk or ZDF do not prohibit archives from crawling their pages.
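
For readers who want to check how a robots.txt file treats an archive's crawler, here is a minimal sketch using Python's standard library. The robots.txt content below is hypothetical (I am not reproducing Inforadio's actual file), and the user agent name "ia_archiver" is an assumption about the token the Internet Archive's crawler matches on.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real Inforadio file.
# "ia_archiver" is assumed to be the user agent token the archive honors.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

PAGE = ("http://www.inforadio.de/programm/schema/sendungen/"
        "netzfischeer/201311/vergisst_das_internet.html")


def may_crawl(robots_txt: str, user_agent: str, uri: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch uri."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, uri)


if __name__ == "__main__":
    print(may_crawl(ROBOTS_TXT, "ia_archiver", PAGE))   # archive crawler
    print(may_crawl(ROBOTS_TXT, "SomeOtherBot", PAGE))  # any other robot
```

With rules like these, the archive's crawler is turned away while every other robot is let in, which is the situation described above.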



Fortunately, the archiving service Archive.is was able to grab the page (screenshot left) but the audio feed is lost.



Just one more thing (Peter Falk style):
Note that the original URI of the page:

http://www.inforadio.de/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

when requested in a web browser redirects (200-style) to:

http://www.inforadio.de/error/404.html?/rbb/inf/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

The good news here: it is not a soft 404 so the error is somewhat robot friendly. The bad news is that the original URI is thrown away. As the original URI is the only key for a search in web archives, we cannot retrieve any archived copies (such as the one I created in Archive.is) without it. Unfortunately, this is not only true for manual searches but it also undermines automatic retrieval of archived copies by clients such as the browser extension Memento for Chrome. As stressed in our recent talk at CNI, this is very bad practice and unnecessarily makes life harder for those interested in obtaining archived copies of web pages at large, not only my radio interview.

--
Martin

Wednesday, December 18, 2013

2013-12-18: Avoiding Spoilers with the Memento Mediawiki Extension

From Modern Family to The Girl with the Dragon Tattoo, fans have created a flood of wikis based on their favorite television, book, and movie series. This dedication to fiction has allowed fans to settle disputes and encourage discussion using these resources.
These resources, coupled with the rise in experiencing fiction long after it is initially released, have given rise to another cultural phenomenon: spoilers. Using a fan-based resource is wonderful for those who are current with their reading/watching, but is fraught with disaster for those who want to experience the great reveals and have not caught up yet.
Memento can help here.
Above is a video showing how the Memento Chrome Extension from Los Alamos National Laboratory (LANL) can be used to avoid spoilers while browsing for information on Downton Abbey. This wiki is of particular interest because the TV show is released in the United Kingdom long before it is released in other countries. The wiki has a nice sign warning all visitors about impending spoilers should they read the pages within, but the warning is redundant, seeing as fans who have not caught up will know that spoilers are implied.
A screenshot of the page with this notice is shown below.
We can use Memento to view pre-spoiler versions.
To avoid spoilers for Downton Abbey Series 4, we choose a date prior to its release: August 30, 2012. Then we use LANL's Memento Chrome Extension to browse to that date. The HTTP conversation for this exchange is captured using Google Chrome's Live HTTP Headers Extension and detailed in the steps below.
1. The Chrome Memento Extension sends a HEAD request to the site using Memento's Accept-Datetime header*.
2. Because there are no Memento headers in the response, it connects to LANL's Memento aggregator using a GET request with the same Accept-Datetime header and gets back a 302 redirection response.
3. Then it follows the URI from the Location response header to a TimeGate specifically set up for Wikia, making another GET request using the Accept-Datetime request header on that URI. The TimeGate uses the date given by Accept-Datetime to determine which revision of a page to retrieve. The URI for this revision is sent back in the Location response header as part of the 302 redirection response.
4. From here it performs a final GET request on the URI specified in the Location response header, which is the revision of the article closest to the date requested. A screenshot of that revision is shown below, without the spoiler warning.
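
The first two steps above can be sketched in a few lines of Python. This is only an illustration and not the code of the Chrome extension: the function names are made up for this post, and the check for Memento headers is deliberately crude (it only looks for a Memento-Datetime header or a "timegate" relation in the Link header).

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(when: datetime) -> str:
    """Format a datetime as an HTTP-date (always GMT) for Accept-Datetime."""
    if when.tzinfo is None:
        # Assumption for this sketch: naive datetimes are already in UTC.
        when = when.replace(tzinfo=timezone.utc)
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def has_memento_headers(headers: dict) -> bool:
    """Crude check whether a response carries Memento information."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return ("memento-datetime" in lowered
            or "timegate" in lowered.get("link", "").lower())


if __name__ == "__main__":
    # The date chosen in the walk-through above.
    value = accept_datetime(datetime(2012, 8, 30))
    print({"Accept-Datetime": value})
    # → {'Accept-Datetime': 'Thu, 30 Aug 2012 00:00:00 GMT'}

    # A response without Memento headers means: fall back to an aggregator.
    print(has_memento_headers({"Content-Type": "text/html"}))  # → False
```

A client would send this header with its HEAD request and, if `has_memento_headers` comes back false, repeat the request against an aggregator as in step 2.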
Even though this method works, it is not optimal.
The external Memento aggregator must know about the site and provide a site-specific TimeGate. In this case, the aggregator is merely looking for the presence of "wikia.com" in the URI and redirecting to the appropriate TimeGate in step 3. Behind the scenes, the Mediawiki API is used to acquire the list of past revisions and the TimeGate selects the best one in step 4. This requires LANL, or another Memento participant like the UK National Archives, to provide a TimeGate for all possible wiki sites on the Internet, which is not possible.
To see where this is relevant, let's look at the fan site A Wiki of Ice and Fire, detailing information on the series A Song of Ice and Fire (aka Game of Thrones). LANL has no Memento TimeGate specifically for this real fan wiki, unlike what we saw with the Downton Abbey site.
Here's a screenshot of the starting page for this demonstration. Let's assume we want to avoid spoilers from the book A Dance With Dragons, which was released in July 2011, so we choose the date of June 30, 2011.
1. The Chrome Memento Extension connects with an Accept-Datetime request header, hoping for a response with Memento headers.
2. Because there were no Memento headers in the response, it turns to the Memento Aggregator at LANL, which serves as the TimeGate, using the datetime given by the Accept-Datetime request header to find the closest version of the page to the requested date. The TimeGate then provides a Location response header containing the archived version of the page at the Internet Archive.
3. Using the URI from that Location response header, the page is then retrieved directly from the Internet Archive.

But this page is dated 27 Apr 2011 and is missing information we want, like who played this character in the TV series, which was added in the 7 June 2011 revision of the page. This is because the Internet Archive only contains two revisions around our requested datetime: 27 Apr 2011 and 1 Aug 2011. Even though the fan wiki contains the 7 June 2011 revision, the Internet Archive does not.

Fortunately, there is the native Memento Mediawiki Extension, supported by the Andrew W. Mellon Foundation, which addresses these issues. It has been developed jointly by Old Dominion University and LANL. Mediawiki was chosen because it is the most widely used wiki software, used in sites such as Wikipedia and Wikia.

This native extension allows direct access to all revisions of a given page, avoiding spoilers. It can also return the data directly, requiring no Memento aggregators or other additional external infrastructure.
We set up a demonstration wiki using data from the same Game of Thrones fan wiki above. The video above shows this extension in action. Because our demonstration wiki has the native extension installed, it allows for access to all revisions of each article.
We will try the same scenario using this Memento-enabled wiki.
Here is a screenshot of the starting page for this demonstration.
In this case, because the Memento Mediawiki Extension has full Memento support, the HTTP messages sent are different. We again use the date June 30, 2011 to show that we can acquire information about a given article without revealing any spoilers from the book A Dance With Dragons, which was released in July 2011.
1. The Memento Chrome Extension sends an Accept-Datetime request header, but this time Mediawiki itself is serving as the TimeGate, deciding on the page closest to, but not over, the date requested. Mediawiki then issues its own 302 redirection response.
2. That response gives a Location response header pointing to the correct revision of the page, which was published on June 7, 2011, prior to the release of A Dance With Dragons. From here the Memento Chrome Extension can issue a GET request on that URI to retrieve the correct representation of the page.
As this demonstrates, running the Memento Mediawiki Extension on a fan wiki will ensure that site visitors can not only browse the site spoiler free, but also will get the revision closest to, but not over, their requested date. This way they avoid spoilers and don't miss any information.
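
To make the "closest to, but not over" rule concrete, here is a small sketch of the selection logic, using the revision dates mentioned in this post. It is an illustration only and not the extension's actual code; the function name is made up.

```python
from bisect import bisect_right
from datetime import datetime


def closest_not_after(revisions, requested):
    """Return the latest revision datetime that is <= requested, or None."""
    ordered = sorted(revisions)
    idx = bisect_right(ordered, requested)
    return ordered[idx - 1] if idx else None


if __name__ == "__main__":
    requested = datetime(2011, 6, 30)

    # Revisions known to the wiki itself (dates taken from this post).
    wiki = [datetime(2011, 4, 27), datetime(2011, 6, 7), datetime(2011, 8, 1)]
    # Copies held by the Internet Archive around the requested date.
    archive = [datetime(2011, 4, 27), datetime(2011, 8, 1)]

    print(closest_not_after(wiki, requested))     # → 2011-06-07 00:00:00
    print(closest_not_after(archive, requested))  # → 2011-04-27 00:00:00
```

The same rule gives a better answer on the wiki simply because the wiki knows about the 7 June 2011 revision and the archive does not.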

To recap, the native extension addresses the following problems:
  1. The Memento infrastructure cannot know about all possible wikis and provide TimeGates for each one, so the chances of a given wiki having one are low. With the native extension, the wiki serves as its own TimeGate.
  2. The Internet Archive does not have all revisions of each fan wiki page, meaning that visitors to a fan wiki may miss out on information. The native extension gives access to all revisions of a page.
  3. Visitors to the fan wiki site who are trying to avoid spoilers don't need to worry about any issues with the external wiki TimeGate infrastructure. Changes to a wiki's API can threaten that whole process, and APIs change frequently, while Memento is specified in a more stable RFC.
If you are running a fan wiki and want to help your visitors avoid spoilers, the Memento Mediawiki Extension is what you need. Please contact us and we'll help you customize it to your needs, if necessary.

--Shawn

* = Memento for Chrome version 0.1.11 actually performs two HEAD requests on the resource, but this will be fixed in the next release.

Friday, December 13, 2013

2013-12-13: Hiberlink Presentation at CNI Fall 2013

Herbert and Martin attended the recent Fall 2013 CNI meeting in Washington DC, where they gave an update about the Hiberlink Project (joint with the University of Edinburgh), which is about preserving the referential integrity of the scholarly record. In other words, we link to the general web in our technical publications (and not just other scholarly material) and of course the links rot over time.  But the scholarly publication environment does give us several hooks to help us access web archives to uncover the correct material. 

As always, there are many slides, but they are worth the time to study.  Of particular importance are slides 8-18, which help differentiate Hiberlink from other projects, and slides 66-99, which walk through a demonstration of how the "Missing Link" concepts (along with the Memento for Chrome extension) can be used to address the problem of link rot.  In particular, absent specific versiondate attributes on a link, such as:

<a versiondate="some-date-value" href="...">

A temporal context can be inferred from the "datePublished" META value defined by schema.org:

<META itemprop="datePublished" content="some-ISO-8601-date-value">
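
The fallback described above can be sketched as follows: use a link's own versiondate if it has one, otherwise fall back to the page's datePublished value. This is only an illustration of the idea and not code from the Hiberlink project or the Memento for Chrome extension; the class and function names, the example URIs and the date values are all made up.

```python
from html.parser import HTMLParser


class TemporalContextParser(HTMLParser):
    """Collect links plus the page-level datePublished META value."""

    def __init__(self):
        super().__init__()
        self.date_published = None
        self.links = []  # list of (href, versiondate or None)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr = dict(attrs)
        if tag == "meta" and attr.get("itemprop") == "datePublished":
            self.date_published = attr.get("content")
        elif tag == "a" and "href" in attr:
            self.links.append((attr["href"], attr.get("versiondate")))


def link_dates(html: str) -> dict:
    """Map each href to its versiondate, else to the page's datePublished."""
    parser = TemporalContextParser()
    parser.feed(html)
    parser.close()
    return {href: date or parser.date_published
            for href, date in parser.links}


if __name__ == "__main__":
    # Hypothetical page: one link with a versiondate, one without.
    page = ('<META itemprop="datePublished" content="2013-12-13">'
            '<a versiondate="2013-11-01" href="http://example.org/a">a</a>'
            '<a href="http://example.org/b">b</a>')
    print(link_dates(page))
```

The resulting date per link is what a client could then pass along as the datetime when asking a web archive for a copy of the linked page.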



Again, the slides are well worth your time.

--Michael