Wednesday, February 17, 2010

2010-02-17: Using Web Page Titles to Rediscover Lost Web Pages

The goal of my project was to determine whether a web page's title could be used to find the resource in the Yahoo search engine's cache.

Lost pages for this project are pages that return a 404. A 404 response code is an error message indicating that the client was able to communicate with the server, but the server could not find what was requested. There are many reasons why a page or an entire web site may disappear. Such pages may survive only in the caches of search engines or in web archives, or they may simply have moved from one URI to another. In the context of this experiment, titles are denoted by the TITLE element within a web page. There can be only one title in a web page. The title may not contain anchors, highlighting, or paragraph marks.

Ideally, this experiment would take all URIs as its collection set. Regrettably, using the entire web as our test set is unrealistic, and capturing a representative sample of web sites for the entire web is not a trivial task. Therefore, we selected a random collection of web pages from the Open Directory Project, the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors. From this sample we obtained 7314 web sites as our initial set. After filtering out non-English pages, we were left with 7157.

Each of these pages' titles was fed through Yahoo's API. A page that appeared within the first ten results was considered found. A page that did not appear within the first ten results was considered not found. Intuitively, most results not on the first page of a result set are not viewed or visited.
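The found / not found rule can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the ranked list of result URIs is assumed to have been fetched already, and the trailing-slash normalization is my assumption rather than something the report specifies.

```python
def normalize(uri):
    """Assumed normalization: ignore surrounding whitespace and a trailing slash."""
    return uri.strip().rstrip("/")


def is_found(target_uri, ranked_results, cutoff=10):
    """Return True if target_uri is among the first `cutoff` ranked results."""
    top = [normalize(u) for u in ranked_results[:cutoff]]
    return normalize(target_uri) in top
```

A page ranked eleventh or lower is treated exactly like a page that is not returned at all.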

The data set comprised page titles with a mean character count of 44.7 and a standard deviation of 27.4, giving a range of roughly 17 to 72 characters (one standard deviation on either side of the mean). Furthermore, if titles are considered as words, that is, anything separated by white space, the mean is 6.7 terms with a standard deviation of 3.3, giving a range of roughly 3 to 10 terms. Lastly, our data set broke down into 66% found and 34% not found.
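For readers who want to reproduce this kind of summary on their own list of titles, here is a minimal sketch. The function names are hypothetical, and the use of the population standard deviation is an assumption; the report does not say which variant was used.

```python
import statistics


def length_summary(titles):
    """Mean and standard deviation of title length in characters and in terms."""
    char_counts = [len(t) for t in titles]
    term_counts = [len(t.split()) for t in titles]  # terms = whitespace-separated
    return {
        "chars": (statistics.mean(char_counts), statistics.pstdev(char_counts)),
        "terms": (statistics.mean(term_counts), statistics.pstdev(term_counts)),
    }


def one_sigma_band(mean, std):
    """Range spanning one standard deviation on either side of the mean."""
    return (mean - std, mean + std)


# Using the figures reported above:
low, high = one_sigma_band(44.7, 27.4)
print(round(low, 1), round(high, 1))  # 17.3 72.1
```

The same calculation on the term counts (6.7 and 3.3) gives 3.4 to 10.0.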

The goal of the experiment is to discern an element or series of components within a title that would allow us to predict whether the title will find its web page. If we simply declared every title a good title, we would be correct 66% of the time. That is our baseline: a test merits consideration only if it can discern good titles from bad titles more than 66% of the time.
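The baseline is simply the accuracy of always guessing the more common class. A small sketch, with hypothetical function names and made-up labels that mirror the 66/34 split:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common = max(set(labels), key=labels.count)
    return accuracy([most_common] * len(labels), labels)


# Illustrative labels only: 66 found and 34 not found out of 100.
labels = ["found"] * 66 + ["not found"] * 34
print(majority_baseline(labels))  # 0.66
```

Any candidate test has to beat this number to be worth keeping.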

Our tests were broken down into four types:

Grammar based:

A title has a sentence structure, so it seemed reasonable to use the number of nouns, adverbs, adjectives, etc. as a determinant for choosing which titles would produce good or bad results.

Search based:

Using queries with Boolean OR, Boolean AND, quoted phrases, or some combination.

Stop word:

Stop words are words that are considered too ubiquitous and are filtered out prior to submission to a search. This test used the percentage of a title made up of words from predefined stop word sets.

Stop title:

After looking at the sample, we found certain titles that would always produce a not found classification. This test used the percentage of a title made up of these stop titles.
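The two percentage-based tests above might be computed along these lines. This is a sketch under my own assumptions, not the method from the report: the function names are hypothetical, the stop word set and stop title in the usage example are placeholders, and measuring stop title coverage by character length is just one plausible reading of "percentage of a title".

```python
def stop_word_fraction(title, stop_words):
    """Share of whitespace-separated terms in the title that are stop words."""
    terms = title.lower().split()
    if not terms:
        return 0.0
    return sum(t in stop_words for t in terms) / len(terms)


def stop_title_fraction(title, stop_titles):
    """Share of the title's characters covered by the longest matching stop title."""
    normalized = " ".join(title.lower().split())
    if not normalized:
        return 0.0
    longest = max((len(s) for s in stop_titles if s in normalized), default=0)
    return longest / len(normalized)


# Placeholder inputs for illustration only.
print(stop_word_fraction("The History of the Web", {"the", "of"}))  # 0.6
print(stop_title_fraction("Home", {"home"}))                        # 1.0
```

A title would then be predicted as not found when its fraction exceeds some threshold chosen from the data.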

Regrettably, the grammar based, search based and stop word tests did not provide any better results than assuming all titles would find the resource. On the other hand, the use of stop titles increased our ability to find titles that would produce a not found status by 6%.

The conclusion we may draw from our tests is that the usefulness of a title's content for predicting whether it is a good title is limited. A more useful finding is that excluding stop titles increases the accuracy of discerning a good title from a bad title. A larger data set may lead to more stop titles or prove the usefulness of the currently unproductive tests.

The full report can be found at:

Jeffery L. Shipman, Martin Klein, Michael L. Nelson, Using Web Page Titles to Rediscover Lost Web Pages, Technical Report arXiv:1002.2439, February 2010.

Thursday, February 11, 2010

2010-02-11: Memento and OAC at the CNI Fall 2009 Membership Meeting

Herbert, Rob and I were at the Coalition for Networked Information Fall 2009 Membership Meeting in Washington DC, December 14-15, 2009. The CNI meetings are always good and this one was no exception. We gave a presentation about Memento (direct link on vimeo):

Memento: Time Travel for the Web from CNI Video Editor on Vimeo.

Note that this presentation was based on the initial version of Memento first presented in November 2009, not the slightly updated version from February 2010.

While we were there, we were also interviewed by Gerry Bayne of EDUCAUSE. Here's an embedded version of the interview:

Also at CNI Fall 2009, Rob gave a presentation about the Open Annotation Collaboration (OAC), of which I am on the technical committee. Rob's presentation is also available:

Interoperable Annotation: Perspectives from the Open Annotation Collaboration from CNI Video Editor on Vimeo.
We also did a short interview about OAC with EDUCAUSE:

Rob has also uploaded the demo from that presentation to YouTube, although this version does not have the necessary narration (listen to the CNI video for that). He's also posted the slides.


Monday, February 8, 2010

2010-02-08: Memento Meeting, San Francisco, Feb 2-3 2010

The entire Memento team went to San Francisco, CA February 2-3, 2010 to meet with representatives from the Internet Archive, California Digital Library, Microsoft Research, Library of Congress, LOCKSS and WebBase. The full attendee list and agenda are available at the Memento site, including six detailed presentations.

Based on the excellent feedback from the representatives, we ended up with two significant changes in our approach. The first change is simply moving the URI of the original resource (URI-R) from the Alternates: response header to a separate Link: header. The information returned from the TimeGate (URI-G) and Memento (URI-M) is the same; it has just moved from one header to another.

The second change is a larger departure from the previous model. Instead of URI-R redirecting (302 response code) to URI-G when it sees an Accept-Datetime header, URI-R always returns one or more Link: response headers pointing to one or more TimeGates (whether or not the client sent an Accept-Datetime header). It is then up to the requesting client either to use the URI-G value(s) returned by URI-R or to use its own value of URI-G. This should be easier for servers to implement, since they just have to send a Link header, and it should also work better with existing HTTP caches.
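On the client side, the new flow could look roughly like the sketch below. It is an illustration only: the function names and example URIs are hypothetical, the relation name "timegate" is an assumption (it matches the later Memento specification but is not stated in this post), and the parser is deliberately simplified and does not handle commas inside quoted parameter values.

```python
import re

# Matches one link-value: <URI> followed by zero or more ;param pairs.
LINK_PATTERN = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def timegates_from_link_header(header_value, rel="timegate"):
    """Return the URIs in a Link header whose rel contains `rel` (assumed name)."""
    uris = []
    for match in LINK_PATTERN.finditer(header_value):
        uri, params = match.group(1), match.group(2)
        rel_match = re.search(r'rel\s*=\s*"?([^";,]+)"?', params)
        if rel_match and rel in rel_match.group(1).split():
            uris.append(uri)
    return uris


def choose_timegate(header_value, preferred=None):
    """Use the client's own URI-G if it has one, else the first advertised one."""
    if preferred is not None:
        return preferred
    advertised = timegates_from_link_header(header_value)
    return advertised[0] if advertised else None
```

The client would then send its Accept-Datetime header to whichever URI-G `choose_timegate` returns, rather than relying on a redirect from URI-R.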

We're in the process of changing over our demo systems. Herbert has updated the slides to reflect these changes and integrated them all into a single presentation:

The original slides from November 2009 are still available. Thanks to Kris Carpenter (IA) for inviting us out and setting up the meeting.


Saturday, February 6, 2010

2010-02-06: Superbowl XLIV

Regardless of which team you are rooting for, this is going to be a good football game. Both teams have explosive offenses captained by quarterbacks who are destined to be inducted into the Hall of Fame.

Peyton Manning is one cool character, and if he can figure out the Saints' defense, the Colts are going to pull away and not look back. The Colts have been consistently good all season and they have a good chance of continuing that trend on Sunday. If the Colts have a weakness, it is their running game. Both offensively and defensively, the Colts' run game has performed below the league average.

The Saints, with Drew Brees, have the league's best offense without question. They have more yards per attempt and fewer interceptions than the Colts. They can pass and run the ball very well, and if they want to win they had better use that to their advantage. The Saints' handicap is their defense. It is below the league average, and the thought of the Saints' secondary against Peyton makes me shudder. That being said, the Saints' defense has a magical ability to force a turnover and make a play that wins the game.

Most of the models we use take the home team into consideration. For the Superbowl neither team is really the home team, so I ran the models once with each team as the home team to see if it made a difference and, surprisingly, it did not. Looking at the numbers, both teams are closely matched, and I thought that something like home field would tilt the balance, but all of the models stayed the same no matter which team was considered at home.

Now for the Superbowl predictions.


-- Greg Szalkowski