Posts

Showing posts with the label HTTP

2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives. These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work or Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one. We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies between web archives made these operations much more complex. We document our findings in a technical report entitled "Rules of Acquisition for Mementos and Their Content". Our technical report briefly covers the following key points: Special techniques for

2015-08-28: Original Header Replay Considered Coherent

Introduction

As web archives have advanced over time, their ability to capture and play back web content has grown. The Memento Protocol, defined in RFC 7089, is an HTTP protocol extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have analyzed the temporal coherence of web archives. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives play back original headers.

Archive Headers and Original Headers

Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.

HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT

Figure 1. No Original Header Playback

Try to answer the
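To make the time-based content negotiation mentioned in this excerpt concrete, here is a minimal sketch of how a client might build the Accept-Datetime request header that RFC 7089 defines. The function name `accept_datetime_header` is hypothetical, and no request is actually sent; the sketch only formats the header value.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the request header a client sends to ask for a past state.

    Hypothetical helper for illustration; it is not part of any library.
    Pass a timezone-aware datetime: a naive one is treated as local time
    by astimezone().
    """
    # The header value is an RFC 1123 style date expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


if __name__ == "__main__":
    # The archival datetime quoted in the excerpt above.
    wanted = datetime(2015, 4, 29, 15, 15, 23, tzinfo=timezone.utc)
    print(accept_datetime_header(wanted))
```

A client would attach this header to a GET request sent to a Memento TimeGate, which then selects a memento close to the requested datetime.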

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes. Jon has a series of columns for InfoWorld, and whenever InfoWorld changes their CMS the old links break and Jon has to manually refind all the new links and update his page. For example, the old URI: http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html is currently: http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html The same content had at least one other URI as well, from at least 2009-2012: http://www.infoworld.com/d/developer-world/xquery-and-power-learning-example-924 The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles. In fairness, InfoWorld is still redirecting the second URI to the c

2014-05-08: Support for Various HTTP Methods on the Web

While clearly not all URIs will support all HTTP methods, we wanted to know which methods are widely supported and how well that support is advertised in HTTP responses. A full range of HTTP method support is crucial for RESTful Web services. Please read our previous blog post for definitions and pointers about REST and HATEOAS. Earlier, we did a brief analysis of HTTP method support in the HTTP Mailbox paper. We have extended that study to carry out a deeper analysis and look at various aspects of it. We initially sampled 100,000 URIs from DMOZ and found that only 40,870 URIs were live. Our further analysis was based on the response code, "Allow" header, and "Server" header returned for OPTIONS requests to those live URIs. We found that out of those 40,870 URIs:

55.31% do not advertise which methods they support
4.38% refuse the OPTIONS method, either with a 405 or 501 response code
15.33% support only HEAD, GET, and OPTIONS
38.53% support
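As a rough illustration of the kind of analysis this excerpt describes, the sketch below sorts an OPTIONS response into buckets using only its status code and "Allow" header. The function name `classify_options_response` and the bucket labels are hypothetical. The percentages quoted above appear to overlap, so the first-match ordering used here is an assumption rather than the study's actual method.

```python
def classify_options_response(status: int, headers: dict) -> str:
    """Place one OPTIONS response into a single illustrative bucket.

    Hypothetical helper: the bucket names and the first-match ordering
    are assumptions, not the categories used in the original study.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    allow = next(
        (value for name, value in headers.items() if name.lower() == "allow"),
        None,
    )
    if status in (405, 501):
        return "refuses OPTIONS"
    if allow is None or not allow.strip():
        return "does not advertise methods"
    # The Allow header is a comma-separated list of method names.
    methods = {m.strip().upper() for m in allow.split(",") if m.strip()}
    if methods == {"HEAD", "GET", "OPTIONS"}:
        return "HEAD, GET, and OPTIONS only"
    return "other: " + ", ".join(sorted(methods))


if __name__ == "__main__":
    print(classify_options_response(200, {"Allow": "GET, HEAD, OPTIONS"}))
    print(classify_options_response(200, {"Server": "Apache"}))
    print(classify_options_response(405, {"Allow": "GET, HEAD"}))
```

The status code and headers would come from whatever HTTP client issues the OPTIONS request; the function itself needs no network access.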

2013-11-19: REST, HATEOAS, and Follow Your Nose

This post is hardly timely, but I wanted to gather together some resources that I have been using for REST (Representational State Transfer) and HATEOAS (Hypermedia as the Engine of Application State). It seems like everyone claims to be RESTful, but mentioning HATEOAS is frequently met with silence. Of course, these terms come from Roy Fielding's PhD dissertation, but I won't claim that it is very readable (it is not the nature of dissertations to be readable...). Fortunately he's provided more readable blog posts about REST and HATEOAS. At the risk of aggressively over-simplifying things, REST = "URIs are nouns, not verbs" and HATEOAS = "follow your nose". "Follow your nose" simply means that when a client dereferences a URI, the entity that is returned is responsible for providing a set of links that allows the user agent to transition to the next state. This is standard procedure in HTML: you follow links to guide you through an o

2013-09-09: MS Thesis: HTTP Mailbox - Asynchronous RESTful Communication

It is my pleasure to report the successful completion of my Master's degree thesis entitled "HTTP Mailbox - Asynchronous RESTful Communication". I defended my thesis on July 11th, and my written thesis was accepted on August 23rd, 2013. In this blog post I will briefly describe the problem that the thesis targets, followed by the proposed and implemented solution. I will walk through an example that illustrates the usage of the HTTP Mailbox, and then I will provide various links and resources to further explore the HTTP Mailbox. Traditionally, general web services used only the GET and POST methods of HTTP, while several other HTTP methods like PUT, PATCH, and DELETE were rarely utilized. Additionally, the Web was mainly navigated by humans using web browsers and clicking on hyperlinks or submitting HTML forms. Clicking on a link is always a GET request, while HTML forms only allow the GET and POST methods. Recently, several web frameworks/libraries hav

2013-05-09: HTTP Mailbox - Asynchronous RESTful Communication

We often encounter web services that take a very long time to respond to our HTTP requests. In the case of an eventual network failure, we are forced to issue the same HTTP request again. We frequently consume web services that do not support REST. If they did, we could utilize the full range of HTTP methods while retaining the functionality of our application, even when the external API we utilize in our application changes. We sometimes wish to set up a web service that takes job requests, processes long-running job queues, and notifies the clients individually or in groups. HTTP does not allow multicast or broadcast messaging. HTTP also requires the client to stay connected to the server while the request is being processed. Introducing HTTP Mailbox - An Asynchronous RESTful HTTP Communication System. In a nutshell, HTTP Mailbox is a mailbox for HTTP messages. Using its RESTful API, anyone can send an HTTP message (request or response) to anyone else independent of the availabi

2011-04-13: Implementing Time Travel for the Web

Recent trends in digital libraries are towards integration with the architecture of the World Wide Web. The award-winning Memento Project proposes extending HTTP to provide protocol-level access to mementos (archived previous states) of web resources. Using content negotiation and other protocol operations, rather than archive-specific methods, Memento provides the digital library and preservation community with a standardized method to navigate between the original resource and its mementos.

Memento Client State Chart

The ODU Web Sciences and Digital Libraries Research Group has partnered with the LANL Research Library to create Memento and develop prototype Memento-compliant client and server implementations. A variety of Memento clients have been created, tested, and co-evolved along with the Memento protocol. There is now a Firefox extension, an Internet Explorer browser helper object, and a WebKit-based Android browser. The design and technical solutions identified during th

2010-11-15: Memento Presentation at UNC; Memento ID

I recently had a chance to return to the School of Information and Library Science, UNC Chapel Hill, where I had a most enjoyable post-doc during the academic year 2000-2001. Jane Greenberg was nice enough to invite me to speak about Memento in her INLS 520 "Organization of Information" class on Tuesday, November 9th as well as give an invited lecture about Memento to the UNC Scholarly Communications Working Group on Wednesday, November 10th. When I first went to UNC I had the office next to Jane and she was just an assistant professor; now she's a full professor and director of the Metadata Research Center. I enjoyed catching up with her and my many other friends and colleagues at SILS. My slides are available on slideshare.net; they are mostly a combination of slides I've posted before, but with some updates in the HTTP headers. Although the changes are very slight, the recently submitted (11/12/10) Memento Internet Draft takes precedence over all of

2010-11-05: Memento-Datetime is not Last-Modified

One of the key contributions of the Memento Framework is the HTTP response header "Memento-Datetime" (previously called "Content-Datetime" in our earlier publications & slides). Memento-Datetime is the sticky, intended datetime* for the representation returned when a URI is dereferenced. The presence of the Memento-Datetime HTTP response header is how the client realizes it has reached a Memento. Rather than formally explain what we mean by "sticky, intended datetime", it is easier to explain how it is neither the value in the HTTP response header Last-Modified nor the creation date of the resource (which has no corresponding HTTP header, for reasons that will become clear). For the examples below, we'll define the following abbreviations:

CD (Creation-Datetime) = the datetime the resource was created
MD (Memento-Datetime) = the datetime the representation was observed on the web
LM (Last-Modified) = the datetime the resource last c
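Since this excerpt says that a client recognizes a Memento by the presence of the Memento-Datetime response header, a small sketch of that check may help. The helper name `memento_datetime` and the sample header values are hypothetical. The sketch assumes the header carries an RFC 1123 style date, and it deliberately ignores Last-Modified, which the post distinguishes from Memento-Datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def memento_datetime(headers: dict) -> Optional[datetime]:
    """Return the parsed Memento-Datetime, or None if this is not a Memento.

    Hypothetical helper for illustration. Last-Modified is intentionally
    not used as a fallback, because it is a different datetime.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "memento-datetime":
            return parsedate_to_datetime(value)
    return None


if __name__ == "__main__":
    # Made-up header values, chosen only to show that the two can differ.
    archived = {
        "Memento-Datetime": "Fri, 05 Nov 2010 12:00:00 GMT",
        "Last-Modified": "Mon, 01 Nov 2010 08:30:00 GMT",
    }
    live = {"Last-Modified": "Mon, 01 Nov 2010 08:30:00 GMT"}
    print(memento_datetime(archived))
    print(memento_datetime(live))
    print(memento_datetime(archived) == datetime(2010, 11, 5, 12, tzinfo=timezone.utc))
```

Returning None for a response that only carries Last-Modified keeps the two datetimes from being conflated.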