Posts

2010-01-15: Divisional Playoffs

Dallas has really taken off and seems to be peaking at just the right time. The Cowboys decimated the Eagles for a second time, and they almost made it look easy. While our research seems to indicate that teams with a good passing offense generally do better, the recent Dallas surge lends credence to the old saw about the run game. According to our numbers, Dallas has the best run offense this year and close to the best run defense. The Dallas pass statistics are not bad, but they are overshadowed by the likes of New Orleans and the Colts. The Colts bring up an issue, alluded to last week, that I had to deal with before running this week's algorithms. Because Indianapolis sat their starters for the past few weeks, I needed to decide whether or not to exclude those weeks from the calculations. If I include weeks 16 and 17 for Indy, then Baltimore comes out as the predicted winner. However, if I ignore those weeks, then the Colts are the pick. We felt that those two weeks are not

2010-01-13: JCDL 2009 Doctoral Consortium Abstracts Published

The Winter 2009 Bulletin of the IEEE Technical Committee on Digital Libraries (TCDL) has the extended abstracts from the Doctoral Consortium of the 2009 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2009). Thanks to Lisa Spiro for such a great job putting together the latest IEEE TCDL Bulletin. Megan Winget (UT Austin) and I organized the doctoral consortium this year. We were fortunate enough to have 24 submissions -- a record number -- of which we selected 14 for presentation. Those 14 students had their extended abstracts reviewed by a 10-person committee prior to the consortium, and then they presented their research to the committee. More information about the participants and the process is available in our opening editorial and on the JCDL 2009 Doctoral Consortium page. Neither Megan nor I will be involved in the 2010 JCDL Doctoral Consortium, to be held in Gold Coast, Australia. Those wishing to participate should check the JCDL 2010 web site for

2010-01-12: Classes for Spring 2010

The Web Science and Digital Libraries Research Group is offering two classes this spring:

CS 751/851 Introduction to Digital Libraries
CS 495/595 Web Server Design

Please note that there has been a change in instructor for 495/595: Martin Klein is now the instructor for Spring 2010. Although the instructor has changed, the class will be similar to previous offerings of the course. Michael Nelson is still the instructor for 751/851. --Michael

2010-01-08: NFL Playoffs Wildcard Weekend

The NFL playoffs have arrived, so grab some snacks and find a good spot to watch the games. We are going to have our own playoffs here by pitting the best-performing algorithms from the regular season against each other to see which one does the best job predicting the playoffs and the Super Bowl. While I was running the algorithms, I contemplated how New England was going to perform without Welker, especially after reading that Brady has been playing with broken fingers on his throwing hand. How can my algorithms take things like this into account to improve their accuracy? Having the Colts sandbag the last few games of the regular season by benching their starters already throws a monkey wrench into the works. Then there was the Dallas-Philly game: Philadelphia already had its playoff spot, so did they really take it easy and let Dallas stomp all over them like that last week, or are they in for another beating? I am sure Frank would say that Dallas really is that good. If their defense

2009-12-05: NFL Playoff Outlook

Week 12 of the regular NFL season is over, and with five more weeks to go the playoff picture is starting to take shape. The predictive algorithms we chose are coming along fine. Some are doing better than others, which is to be expected. One of the algorithms we are using leverages Google's PageRank algorithm. We formed a directed graph where each vertex is one of the 32 NFL teams. A directed edge is placed for each game played, with the loser pointing to the winner. The edge is then weighted with the margin of victory (MOV), which is the winner's score minus the loser's score. PageRank is then calculated over this graph; we are using the igraph library in R for the calculation. The teams are then ranked in order of PageRank. This is similar to another group that has used PageRank to rank NFL teams. One of the concepts of PageRank is that good pages are pointed to by other good pages; in our case, good teams will be pointed to by good teams. An interesting ob
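The graph construction described above (losers point to winners, edges weighted by margin of victory) can be sketched in a few lines. This is a hedged, self-contained Python approximation of the idea, not the group's actual R/igraph code, and the games below are made up for illustration:

```python
# A minimal weighted-PageRank sketch: each game adds a loser -> winner
# edge weighted by margin of victory. Teams and scores are hypothetical.

def pagerank(edges, damping=0.85, iters=100):
    """edges: list of (loser, winner, weight) tuples; returns teams ranked best-first."""
    nodes = {n for e in edges for n in e[:2]}
    # Total outgoing weight per node (losers "vote" for the teams that beat them).
    out_weight = {n: 0.0 for n in nodes}
    for loser, winner, w in edges:
        out_weight[loser] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for loser, winner, w in edges:
            new[winner] += damping * rank[loser] * w / out_weight[loser]
        # Undefeated teams have no outgoing edges; spread their rank evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return sorted(rank, key=rank.get, reverse=True)

# Hypothetical games: (loser, winner, margin of victory)
games = [("PHI", "DAL", 24), ("NYG", "DAL", 7), ("DAL", "NO", 3), ("NYG", "PHI", 10)]
print(pagerank(games))
```

Note how the "good teams are pointed to by good teams" intuition plays out: a team that beats a highly ranked team inherits a large share of that team's rank, even off a single win.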

2009-11-19: Memento Presentation and Movie; Press Coverage

On Monday, November 16, 2009, Herbert and I went to the Library of Congress and presented slides from our Memento eprint (see the previous post for a short description of Memento). On Thursday, November 19, 2009, Herbert gave the same presentation at OCLC. Below are the slides that were presented as well as the supporting movie. Fortunately, the slides & movie were finished in between ODU sporadically losing power over the weekend due to the Nor'easter, and Tuesday, when odusource.cs.odu.edu and mementoarchive.cs.odu.edu were brought down by a disk failure. Thanks to Scott Ainsworth and the ODU systems staff for their yeoman's work on getting everything back up and running. Slides & movie from the Library of Congress Brown Bag Seminar: Memento: Time Travel for the Web from Herbert Van de Sompel 2010-02-12 Edit: The recorded presentation has just been uploaded to the Library of Congress web site. Also, Memento has enjoyed considerable press &

2009-11-09: Eprint released for "Memento: Time Travel for the Web"

This is a follow-up to my post on October 5, where I mentioned the availability of the Memento project web site. Herbert's team and my team, working under an NDIIPP grant, have introduced a framework where you can browse the past web (i.e., old versions of web pages) in the same manner that you browse the current web. The framework uses HTTP content negotiation (CN) as a method for requesting the version of the page you want. Most people know little about content negotiation, and the little they think they know is often wrong (see [1-3] for more information about CN). In a nutshell, CN allows you to link to a URI "foo" without specifying, for example, its format (e.g., "foo.html" vs. "foo.pdf") or language ("foo.html.en" vs. "foo.html.es"). Your browser automatically passes preferences to the server (e.g., "I slightly prefer HTML over PDF, and I greatly prefer English to Spanish") and the server tries to find its b
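The preference-passing mechanism described above works through q-values in headers like Accept. As a rough illustration of the server's side of the bargain (a simplified sketch, not the Memento framework itself, and ignoring wildcards like text/*), here is how a server might pick the best available variant:

```python
# A hedged sketch of server-side content negotiation: parse the client's
# Accept header into q-value preferences, then pick the best available
# variant. The media types below are illustrative.

def parse_accept(header):
    """Return {media_type: q} from an Accept-style header."""
    prefs = {}
    for item in header.split(","):
        parts = item.strip().split(";")
        media_type = parts[0].strip()
        q = 1.0  # per HTTP, a missing q-value defaults to 1.0
        for param in parts[1:]:
            param = param.strip()
            if param.startswith("q="):
                q = float(param[2:])
        prefs[media_type] = q
    return prefs

def negotiate(accept_header, available):
    """Pick the available variant with the highest q; None if all are refused."""
    prefs = parse_accept(accept_header)
    best = max(available, key=lambda v: prefs.get(v, prefs.get("*/*", 0.0)))
    return best if prefs.get(best, prefs.get("*/*", 0.0)) > 0 else None

# "I slightly prefer HTML over PDF":
print(negotiate("text/html;q=1.0, application/pdf;q=0.9",
                ["application/pdf", "text/html"]))
```

Memento's insight is to apply this same negotiate-over-dimensions idea to a new dimension: time, so a client can ask for the version of a page as it existed on a given date.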

2009-11-08: Back From Keynotes at WCI and RIBDA

October was a busy travel month. On October 11-13, I attended a technical meeting for the Open Annotation Collaboration project in Berkeley, CA. From there, I traveled to Berlin, Germany to give a keynote about OAI-ORE at the Wireless Communication and Information Conference (WCI 2009). Michael Herzog was kind enough to invite me to speak there again; I had also given an invited talk at Media Production 2007, also in Berlin. After a short week back in the US, it was off to Lima, Peru to give another keynote about OAI-ORE, this time at the Reunión Interamericana de Bibliotecarios, Documentalistas y Especialistas en Información Agrícola (RIBDA 2009). This was another repeat performance -- I had given an invited talk about OAI-PMH in Lima in 2004, and my colleague there, Libio Huaroto, invited me back. Slides from the keynotes are probably available on the conference web sites; however, they were both edited versions of the more detailed ORE seminar I recently gave at Emory

2009-10-26: Communications of the ACM Article Published

The article "Why Websites Are Lost (and How They're Sometimes Found)" has finally been published in the November 2009 issue of Communications of the ACM. Co-written with Frank McCown and Cathy Marshall, it was accepted for publication in the fall of 2007. Although we've had a pre-print available since 2008, it just isn't the same until you see it in print. Except we won't be seeing this one in print; it is instead published in the "Virtual Extension" part of the CACM. So even though it has page numbers (pp. 141-145), this article won't be among those that arrive in your mailbox in a few weeks. As someone who has spent his entire career trying to transform the scholarly communication process with the web and digital libraries, I completely understand this move by the CACM, but I have to admit I'm disappointed that I won't see a printed, bound copy. Even though in the long-term, all discovery will come from the web (e.g., Google Sch

2009-10-15: Seminars at Emory University

I recently traveled to Emory University to visit with Joan Smith (an alumna of our group -- PhD, 2008) and Rick Luce. While there, I gave two colloquia: on October 1 at the Woodruff Library on OAI-ORE, and on October 2 at the Mathematics & Computer Science Department on web preservation (specifically, based on Martin Klein's PhD research). I've uploaded both sets of slides. The first, "OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project", is based on slides from Herbert Van de Sompel: OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project from Michael Nelson The second, "(Re-) Discovering Lost Web Pages", is an extended version of slides presented at the NDIIPP Partners Meeting this summer: (Re-) Discovering Lost Web Pages from Michael Nelson --Michael 2020-01-23 Edit: updated embed code for SlideShare.

2009-10-05: Web Page for the Memento Project Is Available

The Library of Congress funded research project "Tools for a Preservation Ready Web" is coming to a close. The initial phase (2007-2008) of the project funded Joan Smith's PhD research into using the web server to inform web crawlers exactly how many valid URIs there are at a web site (the "counting problem") as well as to perform server-side generation of preservation metadata at dissemination time (the "representation problem"). Several interesting papers came out of that project (e.g., WIDM 2006, D-Lib 14(1/2)) as well as the mod_oai Apache module. Joan graduated in 2008 and is now the Chief Technology Strategist for the Emory University Libraries and an adjunct faculty member in the CS department at Emory. Since then, Herbert and I (plus our respective teams) have been closing out this project, working on some further ideas regarding the preservation of web pages and how web archives can be integrated with the "live web".

2009-09-28: OAI-ORE In 10 Minutes

A significant part of my research time in 2007-2008 was spent working on the Open Archives Initiative Object Reuse & Exchange project (OAI-ORE, or simply ORE). Producing the ORE suite of eight documents was difficult and took longer than I anticipated, but we had an excellent team and I'm extremely proud of the results. In the process, I also learned a great deal about the building blocks of ORE: the Web Architecture, Linked Data, and RDF. I'm often asked "What is ORE?" and I don't always have a good, short answer. The simplest way I like to describe ORE is "machine-readable splash pages". More formally, ORE addresses the problem of identifying Aggregations of Resources on the Web. For example, we often use the URI of an HTML page as the identifier of an entire collection of Resources. Consider this YouTube URI: http://www.youtube.com/watch?v=SkJDKdOlUGQ Technically, it identifies just the html page that is returned when that URI
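To make the "machine-readable splash page" idea concrete, an ORE description essentially asserts that one URI names an Aggregation and lists the Resources it aggregates. The sketch below emits such triples as plain strings with no RDF library; the aggregation URI and the video-file URI are hypothetical stand-ins, not actual ORE-published resources (a real Resource Map would also carry metadata such as authorship and modification time):

```python
# A minimal sketch of an ORE Aggregation, serialized as N-Triples-style
# strings. ore:Aggregation and ore:aggregates are from the ORE vocabulary;
# the example.org URIs are made up for illustration.

ORE = "http://www.openarchives.org/ore/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def describe_aggregation(agg_uri, aggregated_uris):
    """Yield (subject, predicate, object) triples for a simple Aggregation."""
    yield (agg_uri, RDF + "type", ORE + "Aggregation")
    for uri in aggregated_uris:
        yield (agg_uri, ORE + "aggregates", uri)

agg = "http://example.org/aggregation/video-1"   # hypothetical aggregation URI
parts = [
    "http://www.youtube.com/watch?v=SkJDKdOlUGQ",  # the HTML splash page
    "http://example.org/video-1.flv",              # hypothetical video file
]
for s, p, o in describe_aggregation(agg, parts):
    print(f"<{s}> <{p}> <{o}> .")
```

The point is that the Aggregation gets its own URI, distinct from the splash page's URI, so machines can distinguish "the collection of Resources" from "the one HTML page that happens to describe it".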

2009-09-19: Football Intelligence and Beyond

Football Intelligence (FI) is a system for gathering, storing, analyzing, and providing access to data to help football enthusiasts discover more about the performance of their favorite pastime. While taking Dr. Nelson's Collective Intelligence class, I became fascinated with techniques for mining useful data from the "collective intelligence" of readily available data on the Internet. We decided to apply some of the data mining techniques covered in class in an attempt to predict the 2009 NFL football season. There is a plethora of data out there that could be mined, from injury reports to betting lines, but we decided to limit the scope to box score data for training and predictions. Using box scores from 2003 to the present, we trained a number of different models, from Support Vector Machines to Multilayer Perceptron Networks. The implementations of the models we are using are based on the Weka Data Mining Software. Weka contains a number of tools for experiment
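The box-score pipeline can be illustrated with a deliberately tiny baseline: derive one feature per team (average point differential) from past games and predict that the team with the higher value wins. This is a hedged Python sketch with made-up scores, far simpler than the Weka-trained SVMs and multilayer perceptrons the post describes, but it shows the shape of the data flow from box scores to features to a prediction:

```python
# A toy baseline predictor from box-score data. Games are hypothetical;
# the real system trained richer models (Weka SVMs, MLPs) on many more
# box-score features than this single point-differential statistic.

from collections import defaultdict

def point_diff_features(games):
    """games: list of (home, away, home_pts, away_pts).
    Returns {team: average point differential per game}."""
    totals, counts = defaultdict(int), defaultdict(int)
    for home, away, home_pts, away_pts in games:
        totals[home] += home_pts - away_pts
        totals[away] += away_pts - home_pts
        counts[home] += 1
        counts[away] += 1
    return {team: totals[team] / counts[team] for team in totals}

def predict_winner(features, team_a, team_b):
    """Pick the team with the stronger feature value (ties go to team_a)."""
    return team_a if features.get(team_a, 0) >= features.get(team_b, 0) else team_b

# Hypothetical box scores: (home, away, home_pts, away_pts)
past = [("DAL", "PHI", 24, 0), ("NO", "DAL", 17, 24), ("PHI", "NYG", 10, 3)]
feats = point_diff_features(past)
print(predict_winner(feats, "DAL", "PHI"))
```

A trained model replaces the final comparison with a learned decision function, but the feature-extraction step looks much the same.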

2009-09-16: Announcing ArchiveFacebook - A Firefox Add-on for Archiving Facebook Accounts

ArchiveFacebook is a Firefox extension that helps you save web pages from Facebook and easily manage them. Save content from Facebook directly to your hard drive and view it exactly the same way you currently view it on Facebook. Why would you want to do this? Facebook has become a very important part of our lives. Information about our friends, family, business contacts, and acquaintances is stored in Facebook with no easy way to get it out. ArchiveFacebook allows you to do just that. What guarantee do you have that Facebook won't accidentally, or in some cases intentionally, delete your account? Don't trust your data to one web site alone. Take matters into your own hands and preserve this information. Show it to your kids one day! Currently ArchiveFacebook can save:

Photos
Messages
Activity Stream
Friends List
Notes
Events
Groups
Info

Installation: You can download the extension from https://addons.mozilla.org/en-US/firefox/addon/13993/ . Once

2009-08-21: CS 751/851 "Introduction to Digital Libraries" Postponed Until Spring 2010

CS 751/851 "Introduction to Digital Libraries" has been postponed from Fall 2009 to Spring 2010. I apologize to those who had planned to take the class this Fall. --Michael