Showing posts from 2009

2009-12-05: NFL Playoff Outlook

Week 12 of the regular NFL season is over, and with five more weeks to go the playoff picture is starting to take shape. The predictive algorithms we chose are coming along fine. Some are doing better than others, which is to be expected. One of the algorithms we are using leverages Google's PageRank algorithm. We formed a directed graph where each vertex is one of the 32 NFL teams. A directed edge is placed for each game played, with the loser pointing to the winner. The edge is then weighted with the margin of victory (MOV), which is the winner's score minus the loser's score. The PageRank is then calculated on this graph, using the igraph library in R, and the teams are ranked in order of PageRank. This is similar to another group that has used PageRank to rank NFL teams. One of the concepts of PageRank is that good pages are pointed to by other good pages. In our case, good teams will be pointed to by good teams. An interesting observ
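The ranking procedure described in this excerpt can be sketched in a few lines. The post's own implementation uses the igraph library in R; what follows is a stand-in written in plain Python with power iteration. The team names, scores, damping factor of 0.85, the handling of ties and undefeated teams, and the function name `pagerank` are all illustrative assumptions rather than details from the project.

```python
from collections import defaultdict

def pagerank(games, damping=0.85, iterations=100):
    """Weighted PageRank over a loser -> winner graph.

    games: list of (winner, winner_score, loser, loser_score) tuples.
    """
    weights = defaultdict(float)
    teams = set()
    for winner, w_score, loser, l_score in games:
        teams.update((winner, loser))
        margin = w_score - l_score
        # Assumption: a tie has no loser, so it adds no edge.
        if margin > 0:
            weights[(loser, winner)] += margin

    # Total outgoing weight per team, used to normalize edge weights.
    out_total = defaultdict(float)
    for (loser, _), w in weights.items():
        out_total[loser] += w

    n = len(teams)
    rank = {t: 1.0 / n for t in teams}
    for _ in range(iterations):
        # Undefeated teams have no outgoing edges; spread their rank evenly.
        dangling = sum(rank[t] for t in teams if out_total[t] == 0)
        new_rank = {t: (1 - damping) / n + damping * dangling / n for t in teams}
        for (loser, winner), w in weights.items():
            new_rank[winner] += damping * rank[loser] * w / out_total[loser]
        rank = new_rank
    return rank

# Hypothetical results, not real 2009 scores.
games = [
    ("A", 27, "B", 10),
    ("A", 21, "C", 20),
    ("B", 24, "C", 3),
    ("C", 17, "D", 14),
]
ranks = pagerank(games)
ordering = sorted(ranks, key=ranks.get, reverse=True)
for team in ordering:
    print(team, round(ranks[team], 3))
```

In this toy schedule, team A collects rank both from B (a large margin) and from C (a one-point margin), so the margin weighting decides how much of each loser's rank flows to each team that beat it.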

2009-11-19: Memento Presentation and Movie; Press Coverage

On Monday, November 16, 2009, Herbert and I went to the Library of Congress and presented slides from our Memento eprint (see the previous post for a short description of Memento). On Thursday, November 19, 2009, Herbert gave the same presentation at OCLC. Below are the slides that were presented as well as the supporting movie. Fortunately, the slides & movie were finished in between ODU sporadically losing power over the weekend due to the Nor'easter, and on Tuesday when and were brought down by a disk failure. Thanks to Scott Ainsworth and the ODU systems staff for their yeoman's work on getting everything back up and running. Slides & movie from the Library of Congress Brown Bag Seminar: Memento: Time Travel for the Web from Herbert Van de Sompel 2010-02-12 Edit: The recorded presentation has just been uploaded to the Library of Congress web site. Also, Memento has enjoyed considerable press &

2009-11-09: Eprint released for "Memento: Time Travel for the Web"

This is a follow-up to my post on October 5, where I mentioned the availability of the Memento project web site. Herbert's team and my team, working under an NDIIPP grant, have introduced a framework where you can browse the past web (i.e., old versions of web pages) in the same manner that you browse the current web. The framework uses HTTP content negotiation as a method for requesting the version of the page you want. Most people know little about content negotiation, and the little they think they know is often wrong (see [1-3] for more information about CN). In a nutshell, CN allows you to link to a URI "foo" without specifying, for example, its format (e.g., "foo.html" vs. "foo.pdf") or language ("foo.html.en" vs. ""). Your browser automatically passes preferences to the server (e.g., "I slightly prefer HTML over PDF, and I greatly prefer English to Spanish") and the server tries to find its be
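The server-side half of the exchange described in this excerpt can be illustrated with a small sketch: parse the client's weighted preferences and pick the best variant the server actually has. The function names `parse_accept` and `choose_variant` and the example variants are hypothetical, and the sketch deliberately omits wildcards, media-type parameters, and server-side quality values that the HTTP specification also covers. It shows format and language negotiation only, not Memento's negotiation over time.

```python
def parse_accept(header):
    """Parse an Accept-style header into {value: q} preferences."""
    prefs = {}
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        value = pieces[0]
        if not value:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in pieces[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs[value] = q
    return prefs

def choose_variant(header, available):
    """Return the available variant the client prefers most, or None."""
    prefs = parse_accept(header)
    best = max(available, key=lambda v: prefs.get(v, 0.0), default=None)
    # A variant the client never listed (or listed with q=0) is unacceptable.
    if best is None or prefs.get(best, 0.0) == 0.0:
        return None
    return best

# "I slightly prefer HTML over PDF"
format_choice = choose_variant(
    "text/html;q=1.0, application/pdf;q=0.9",
    ["application/pdf", "text/html"],
)
# "I greatly prefer English to Spanish"
language_choice = choose_variant("en;q=1.0, es;q=0.2", ["es", "en"])
print(format_choice, language_choice)
```

If only the Spanish variant exists, the same call returns `"es"`: a low q-value still marks the variant as acceptable, just less preferred.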

2009-11-08: Back From Keynotes at WCI and RIBDA.

October was a busy travel month. On October 11-13, I attended a technical meeting for the Open Annotation Collaboration project at Berkeley, CA. From there, I traveled to Berlin, Germany to give a keynote about OAI-ORE at the Wireless Communication and Information Conference (WCI 2009). Michael Herzog was kind enough to invite me to speak there again; I also gave an invited talk at Media Production 2007, also in Berlin. After a short week back in the US, it was off to Lima, Peru to give another keynote about OAI-ORE, this time at Reunión Interamericana de Bibliotecarios, Documentalistas y Especialistas en Información Agrícola, or RIBDA 2009. This was also another repeat performance -- I had given an invited talk about OAI-PMH in Lima in 2004, and my colleague there, Libio Huaroto, invited me back. Slides from the keynotes are probably available on the conference web sites; however, they were both edited versions of the more detailed ORE seminar I recently gave at Emory

2009-10-26: Communications of the ACM Article Published

The article "Why Websites Are Lost (and How They're Sometimes Found)" has finally been published in the November 2009 issue of Communications of the ACM. Co-written with Frank McCown and Cathy Marshall, it was accepted for publication in the fall of 2007. Although we've had a pre-print available since 2008, it just isn't the same until you see it in print. Except we won't be seeing this in print; it is instead published in the "Virtual Extension" part of the CACM. So even though it has page numbers (pp. 141-145), this article won't be among those that arrive in your mailbox in a few weeks. As someone who has spent his entire career trying to transform the scholarly communication process with the web and digital libraries, I completely understand this move by the CACM, but I have to admit I'm disappointed that I won't see a printed, bound copy. Even though in the long term, all discovery will come from the web (e.g., Google Scho

2009-10-15: Seminars at Emory University

I recently traveled to Emory University to visit with Joan Smith (an alumna of our group -- PhD, 2008) and Rick Luce. While there, I gave two colloquiums: on October 1 at the Woodruff Library on OAI-ORE, and on October 2 at the Mathematics & Computer Science Department on web preservation (specifically, based on Martin Klein's PhD research). I've uploaded both sets of slides. The first, "OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project", is based on slides from Herbert Van de Sompel. The second, "(Re-) Discovering Lost Web Pages", is an extended version of slides presented at the NDIIPP Partners Meeting this summer. --Michael 2020-01-23 Edit: updated embed code for SlideShare.

2009-10-05: Web Page for the Memento Project Is Available

The Library of Congress-funded research project "Tools for a Preservation Ready Web" is coming to a close. The initial phase (2007-2008) of the project funded Joan Smith's PhD research into using the web server to inform web crawlers exactly how many valid URIs there are at a web site (the "counting problem") as well as perform server-side generation of preservation metadata at dissemination time (the "representation problem"). Several interesting papers came out of that project (e.g., WIDM 2006, D-Lib 14(1/2)) as well as the mod_oai Apache module. Joan graduated in 2008 and is now the Chief Technology Strategist for the Emory University Libraries and an adjunct faculty member in the CS department at Emory. Since that time, Herbert and I (plus our respective teams) have been closing out this project working on some further ideas regarding the preservation of web pages and how web archives can be integrated with the "live web".

2009-09-28: OAI-ORE In 10 Minutes

A significant part of my research time in 2007-2008 was spent working on the Open Archives Initiative Object Reuse & Exchange project (OAI-ORE, or simply just ORE). Producing the ORE suite of eight documents was difficult and took longer than I anticipated, but we had an excellent team and I'm extremely proud of the results. In the process, I also learned a great deal about the building blocks of ORE: the Web Architecture, Linked Data and RDF. I'm often asked "What is ORE?" and I don't always have a good, short answer. The simplest way I like to describe ORE is "machine-readable splash-pages". More formally, ORE addresses the problem of identifying Aggregations of Resources on the Web. For example, we often use the URI of an HTML page as the identifier of an entire collection of Resources. Consider this YouTube URI: Technically, it identifies just the HTML page that is returned when that URI is d

2009-09-19: Football Intelligence and Beyond

Football Intelligence (FI) is a system for gathering, storing, analyzing, and providing access to data to help football enthusiasts discover more about the performance of their favorite pastime. While taking Dr. Nelson's Collective Intelligence class I became fascinated with techniques for mining useful data from the "collective intelligence" of readily available data on the Internet. We decided to apply some of the data mining techniques covered in class in an attempt to predict the 2009 NFL football season. There is a plethora of data out there that could be mined, from injury reports to betting lines, but we decided to limit the scope to box score data for training and predictions. Using box scores from 2003 to present, we trained a number of different models, from Support Vector Machines to Multilayer Perceptron Networks. The implementations of the models we are using are based on the Weka Data Mining Software. Weka contains a number of tools for experimenting

2009-09-16: Announcing ArchiveFacebook - A Firefox Add-on for Archiving Facebook Accounts

ArchiveFacebook is a Firefox extension that helps you save web pages from Facebook and easily manage them. Save content from Facebook directly to your hard drive and view it exactly the same way you currently view it on Facebook. Why would you want to do this? Facebook has become a very important part of our lives. Information about our friends, family, business contacts and acquaintances is stored in Facebook with no easy way to get it out. ArchiveFacebook allows you to do just that. What guarantee do you have that Facebook won't accidentally, or in some cases intentionally, delete your account? Don't trust your data to one web site alone. Take matters into your own hands and preserve this information. Show it to your kids one day! Currently ArchiveFacebook can save: Photos, Messages, Activity Stream, Friends List, Notes, Events, Groups, and Info. Installation: You can download the extension from . Once

2009-08-21: CS 751/851 "Introduction to Digital Libraries" Postponed Until Spring 2010

CS 751/851 "Introduction to Digital Libraries" has been postponed from Fall 2009 to Spring 2010. I apologize to those who had planned to take the class this Fall. --Michael

2009-07-30: Position Paper Published in Educause Review

The July/August 2009 issue of Educause Review has a position paper of mine entitled "Data Driven Science: A New Paradigm?" This invited paper is essentially a cleaned-up version of my position paper at the 2007 NSF/JISC Workshop on Data-Driven Science and Scholarship held in Arizona, April 17-19, 2007. Prior to the workshop, we were all assigned topics on which we were to write a short position paper. My topic was to address the question of whether "data-driven science is becoming a new scientific paradigm – ranking with theory, experimentation, and computational science?" You can judge my response by the original paper's more cheeky title of "I Don't Know and I Don't Care". My argument can be summed up as "we've always had data-driven science at whatever was the largest feasible scale; it just happens that the scale is now very large." Scale is important; in fact, some days I might argue that scale is all there is. But part

2009-07-17: Technical Report "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure"

This week I uploaded the technical report, which is co-authored by Michael L. Nelson, to the e-print service. The underlying idea of this research is to utilize the web infrastructure (search engines, their caches, the Internet Archive, etc.) to rediscover missing web pages - pages that return the 404 "Page not Found" error. We apply various methods to generate search engine queries based on the content of the web page and user-created annotations about the page. We then compare the retrieval performance of all methods and introduce a framework to combine such methods to achieve the optimal retrieval performance. The applied methods are: 5- and 7-term lexical signatures of the page; the title of the page; tags with which users annotated the page; and 5- and 7-term lexical signatures of the page neighborhood (up to 50 pages linking to the missing page). We query the big three search engines (Google, Yahoo and MSN Live) with the outcome of all methods and analyze t
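To make the idea of a k-term lexical signature concrete, here is a minimal sketch. It assumes the common definition of a lexical signature as the top-k terms of a page ranked by TF-IDF; the tiny corpus, the tokenizer, the add-one smoothing, and the function names are illustrative assumptions and do not reproduce how the report estimates term frequencies.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF score."""
    tf = Counter(tokenize(page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for terms in doc_sets if term in terms)
        # Add-one smoothing keeps the IDF finite for terms unseen in the corpus.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = count * idf
    # Sort by score, breaking ties alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Hypothetical three-document corpus used only to estimate document frequency.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the web archive stores the old web pages",
]
page = "the web archive stores old web pages"
signature = lexical_signature(page, corpus, k=5)
query = " ".join(signature)  # what would be sent to a search engine
print(query)
```

A term such as "the" appears in every document, so its IDF is zero and it never makes it into the signature, while a term that is frequent on the page but rare elsewhere rises to the top.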

2009-07-16: The July issue of D-Lib Magazine has JCDL and InDP reports.

The July/August 2009 issue of D-Lib Magazine has just published reports for the 2009 ACM/IEEE JCDL (written by me) and InDP (written by Frank and his co-organizers), as well as several other reports for JCDL workshops and other conferences (such as Open Repositories 2009). Whereas my previous entry about JCDL & InDP was focused on our group's experiences, these reports give a broader summary of the events. --Michael

2009-07-07: Hypertext 2009

From June 30th through July 1st I attended Hypertext 2009 (HT 2009) in Torino, Italy. The conference saw a 70% increase in submissions (117 total) compared to last year but, thanks to a correspondingly larger number of accepted papers (26 long and 11 short) and posters, maintained last year's acceptance rate of roughly 32%. HT 2009 also had a record 150 registered attendees. I presented our paper titled "Comparing the Performance of US College Football Teams in the Web and on the Field" (DOI), which was joint work with Olena Hunsicker under the supervision of Michael L. Nelson. The paper describes an extensive study on the correlation of expert rankings of real-world entities and search engine rankings of their representative resources on the web. We published a poster, "Correlation of Music Charts and Search Engine Rankings" (DOI), with the resu
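One way to quantify how well an expert ranking agrees with a search engine ranking is a rank correlation coefficient. This excerpt does not say which measure the paper uses, so the choice of Kendall's tau below is an assumption, as are the team names, the two rankings, and the function name `kendall_tau`.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical rankings: an expert poll vs. the order of search results.
expert = ["Team A", "Team B", "Team C", "Team D"]
search = ["Team B", "Team A", "Team C", "Team D"]
tau = kendall_tau(expert, search)
print(round(tau, 3))
```

A value of 1 means the two rankings are identical and -1 means one is the exact reverse of the other; here a single swapped pair out of six leaves the rankings in strong but imperfect agreement.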

2009-06-29: NDIIPP Partners Meeting

On June 24-26 I attended the 2009 NDIIPP Partners Meeting in Washington, DC. Although it has grown from the early years, I believe this year's attendance of 150 people is similar to last year's. Clay Shirky, author of "Here Comes Everybody", gave the keynote on Wednesday morning. Hopefully the Library of Congress will post a video of the keynote soon. If not, take a look at some of his other presentations -- you will find them enjoyable and informative. On Thursday morning I presented a summary of Martin's PhD research, the tangible product of which will be a Firefox extension called "Synchronicity". The presentation was very well received and there is a lot of interest in the extension. There were several interesting breakout sessions, but the real news was on Friday when Martha Anderson (LC) introduced the upcoming National Digital Stewardship Alliance

2009-06-22: Back From JCDL 2009

We had a good showing at the 2009 ACM/IEEE Joint Conference on Digital Libraries (JCDL) in Austin, TX last week. In total, we had 1 full paper, 3 short papers, 2 posters, 1 workshop paper and 1 doctoral consortium paper. JCDL is the flagship conference in our field and we always make a point to send as many people as possible. Chuck Cartledge presented "A Framework for Digital Object Self-Preservation" at the doctoral consortium. He also presented the related short paper "Unsupervised Creation of Small World Networks for the Preservation of Digital Objects". Chuck is planning to have his doctoral candidacy exam sometime in the early fall. Michael presented the full paper "Using Timed-Release Cryptography to Mitigate The Preservation Risk of Embargo Periods". This paper was based on Rabia Haq's MS Thesis, which she defended in the fall of 2008. Michael also co-organized the doctoral consortium and convinced WS-DL alumna Joan Smith