Posts

Showing posts from 2012

2012-12-21: The Performance of Betting Lines for Predicting the Outcome of NFL Games

It was the first week of the 2007 National Football League (NFL) season. After waiting all summer for the NFL season to begin, the fans were rabid with anticipation. The airwaves were filled with sportscasters debating the prospects of teams from both conferences and how they would perform.

Of particular interest were the New England Patriots. They had two starters out with injuries and their star receiver, Randy Moss, was questionable for the game. New England was playing against the NY Jets and their simmering rivalry added heat to the fire. Many of the sportscasters were lining up with the Jets and Vegas was favoring the Jets with a 6-point line at home.

When betting opened for the game the action on the Patriots was heavy. The sheer volume of bets placed on New England to win forced the sportsbooks to move the spread in an attempt to equalize betting on both sides. Eventually the line moved all of the way to New England being a seven-point favorite by game time. New England went on …

2012-12-20: NFL Power Rankings Week 16

The NFL Playoffs are only a few weeks away. With the end of the regular season in sight there are a few trends that subtly change the game. One of the trends is the weather. Tennessee is playing at Green Bay and snow is in the forecast. Another end-of-season trend is displayed by teams that have clinched playoff positions. They rest their starting lineup and play backup players. That is more of a week 17 phenomenon but with Atlanta and Houston both at 12-2 for the season they may play some non-starters during the game. This ranking system is based on team performance and does not take trends like the weather into account.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played a directed edge is created from the loser pointing to the winner and it is weighted by the Margin of Victory. …

2012-12-17: Archive-It Partners Meeting

I attended the 2012 Archive-It Partners Meeting in Annapolis, MD on December 3.

I decided to attend at the last minute, and Kristine and Lori graciously let me have 5 minutes to talk about our project and upcoming NEH proposal.  We're looking for humanities-types and Archive-It partners to work with in evaluating our visualizations. After my presentation, I was able to make contacts with several potential partners.


Visualizing Digital Collections at Archive-It from Michele Weigle

There were several nice talks in the half-day session.  The full schedule and slides from all of the presentations are available.

Related to what we're working on, Alex Thurman from Columbia University Libraries talked about their local portal to their Human Rights collection (collection at Archive-It).  They offer a rotated list of screenshots for featured sites and have tabs to show the collection pages by title, URL, subject, place, and language. One nice feature they've added is the ability to …

2012-12-14: InfoVis at Grace Hopper

I was selected to give a 5-minute faculty lightning talk at the Grace Hopper Celebration of Women in Computing in October in Baltimore.  Short talks are among the most difficult to prepare, especially short talks for a general audience. I decided to increase my level of difficulty for the talk by combining two topics in my 5-minute talk, information visualization (infovis) and web archiving.

I ended up presenting a snapshot of the work that Kalpesh Padia and Yasmin AlNoamany did for their JCDL 2012 paper, Visualizing Digital Collections at Archive-It (see related blog post).


Information Visualization - Visualizing Digital Collections at Archive-It from Michele Weigle

The faculty lightning talks session was new at Grace Hopper, but went very well.  We had a 45-minute session and got to hear about 8 totally different research projects.  Info and slides from all of the presentations are available on the GHC wiki.  Especially for work-in-progress, this format was a great way for the speakers …

2012-11-10: Site Transitions, Cool URIs, URI Slugs, Topsy

Recently I was emailing a friend and wanted to update her about the recent buzz we have enjoyed with Hany SalahEldeen's TPDL 2012 paper about the loss rate of resources shared over Twitter.  I remembered that an article in the MIT Technology Review from the Physics arXiv blog started the whole wave of popular press (e.g., MIT Technology Review, BBC, The Atlantic, Spiegel).  To help convey the amount of social media sharing of these stories, I was sending links to the sites using the social media search engine Topsy.  Although I discovered it only recently, Topsy has quickly become one of my favorite sites.  It does many things, but the part I enjoy most is the ability to prepend "http://topsy.com/" to a URI to discover how many times that URI has been shared and who is sharing it.  For example:

http://www.bbc.com/future/story/20120927-the-decaying-web

becomes:

http://topsy.com/http://www.bbc.com/future/story/20120927-the-decaying-web

and you can see all the tweets that have linked to the…
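
The prepending step is plain string concatenation. A minimal sketch, assuming nothing about Topsy beyond the URI pattern shown above; the function name `topsy_lookup_uri` is hypothetical and not part of any Topsy API:

```python
# Prefix taken from the example above.
TOPSY_PREFIX = "http://topsy.com/"


def topsy_lookup_uri(uri: str) -> str:
    """Build the Topsy lookup URI by prepending the prefix to the full URI."""
    return TOPSY_PREFIX + uri


print(topsy_lookup_uri("http://www.bbc.com/future/story/20120927-the-decaying-web"))
```

Note that the original URI is appended whole, scheme included, rather than being percent-encoded.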

2012-11-06: TPDL 2012 Conference

It all started last April, on the 9th, when I received an email from Dr. George Buchanan delivering the good news: my paper had been accepted at the annual international conference on Theory and Practice of Digital Libraries (TPDL 2012). As Program Chair, Dr. Buchanan sent me the reviews and feedback for my paper, entitled “Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?”, which paved the way in the following months for preparing to present the paper.

In addition to the paper submission, Dr. Nelson gave me permission to submit my PhD proposal to be considered for the Doctoral Consortium at the conference. I scored my second goal when Dr. Birger Larsen and Dr. Stefan Gradmann sent me a delightful email announcing the committee's acceptance of my proposal, and I was invited to present my work at the consortium a day before the conference.

The Hat-trick came a few weeks before the conference in the for…

2012-10-24: NFL Power Rankings Week 8

After running the R script for the week 8 rankings, the first thing that struck me was the disparity in the size of the nodes between the AFC on the left side of our graph and the NFC on the right side.

Two weeks ago we wrote that the NFC West has been dominant so far this year. The NFC West has the best combined record and their aggregate point differential puts others to shame.  However, it is not just the West division: the entire NFC has dominated and outperformed the AFC at every turn. CBS Sports rates the NFC as head and shoulders above the AFC this year.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played a directed edge is created from the loser pointing to the winner and it is weighted by the Margin of Victory.

In the Pagerank model each link fro…

2012-10-11: NFL Power Rankings Week 6

It is now five weeks into the 2012 season and the season is starting to come into focus. The topic of many online discussions is this year's performance of the NFC West division compared to last year. The NFC West is one of the best performing divisions so far this year, which is a far cry from last year. They are certainly doing well in our ranking system.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played a directed edge is created from the loser pointing to the winner and it is weighted by the Margin of Victory.
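
For readers who want to experiment, here is a minimal Python sketch of the loser-to-winner graph and a PageRank-style power iteration over it. It is not the R script used for these rankings: the function name `rank_teams`, the damping factor of 0.85, the even redistribution of rank from undefeated teams, and the three made-up teams and scores are all assumptions for illustration.

```python
from collections import defaultdict


def rank_teams(games, damping=0.85, iterations=100):
    """Rank teams from a list of (winner, loser, margin_of_victory) tuples.

    Each game adds a directed edge loser -> winner weighted by the margin.
    The damping value and the dangling-node handling are assumptions.
    """
    teams = sorted({team for winner, loser, _ in games for team in (winner, loser)})
    # out_edges[loser][winner] accumulates the margins the loser "votes" with.
    out_edges = {team: defaultdict(float) for team in teams}
    for winner, loser, margin in games:
        out_edges[loser][winner] += margin

    n = len(teams)
    rank = {team: 1.0 / n for team in teams}
    for _ in range(iterations):
        new_rank = {team: (1.0 - damping) / n for team in teams}
        for loser in teams:
            total = sum(out_edges[loser].values())
            if total == 0:
                # A team with no losses has no outgoing edges; spread its rank evenly.
                for team in teams:
                    new_rank[team] += damping * rank[loser] / n
            else:
                for winner, weight in out_edges[loser].items():
                    new_rank[winner] += damping * rank[loser] * weight / total
        rank = new_rank
    return rank


# Hypothetical results: A beat B by 7, B beat C by 3, A beat C by 10.
games = [("A", "B", 7), ("B", "C", 3), ("A", "C", 10)]
ranks = rank_teams(games)
for team in sorted(ranks, key=ranks.get, reverse=True):
    print(team, round(ranks[team], 3))
```

In this toy season the undefeated team A collects votes from both B and C and ends up ranked first, and the scores always sum to 1.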

In the PageRank model each link from a webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j.  This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of vot…

2012-10-10: Zombies in the Archives

In our current research, the WS-DL group has observed leakage in archived sites. Leakage occurs when archived resources include current content. I enjoy referring to such occurrences as "zombie" resources (which is appropriate given the upcoming Halloween holiday). That is to say, these resources are expected to be archived ("dead") but still reach into the current Web.

In the examples below, this reach into the live Web is caused by URIs contained in JavaScript not being rewritten to be relative to the Web archive; the page in the archive is not pulling from the past archived content but is "reaching out" (zombie-style) from the archive to the live Web. 

We provide two examples with a humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past will include advertisements from the present Web.
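
One rough way to spot candidate "zombies" is to scan an archived page's source for absolute URIs that do not point back into the archive. The sketch below is only a heuristic under stated assumptions, not the method used in our research: the function name `find_live_web_uris`, the simple regular expression, and the sample markup (including its timestamp and hostnames) are all hypothetical.

```python
import re

# Assumed archive prefix; Wayback Machine URIs begin this way.
ARCHIVE_PREFIX = "http://web.archive.org/"

# Deliberately simple pattern for absolute http(s) URIs in page source.
URI_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def find_live_web_uris(archived_source, archive_prefix=ARCHIVE_PREFIX):
    """Return absolute URIs in the source that were not rewritten into the archive."""
    return [
        uri
        for uri in URI_PATTERN.findall(archived_source)
        if not uri.startswith(archive_prefix)
    ]


# Hypothetical archived page: the image URI was rewritten, the URI inside
# the JavaScript string was not.
sample = """
<img src="http://web.archive.org/web/20080101000000/http://example.com/logo.png">
<script>var ad = "http://ads.example.com/banner.js";</script>
"""
print(find_live_web_uris(sample))
```

A URI built at run time by JavaScript (for example, by string concatenation) would not be caught by a static scan like this one.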


First, we look at cnn.com. We can observe an archived resource from the Wayback Machine at http://web.archive.org/web…

2012-09-29: Data Curation, Data Citation, ResourceSync

During September 10-11, 2012 I attended the UNC/NSF Workshop Curating for Quality: Ensuring Data Quality to Enable New Science in Arlington.  The structure of the workshop was to invite about 20 researchers involved with all aspects of data curation and solicit position papers in one of four broad topics:
- data quality criteria and contexts
- human and institutional factors
- tools for effective and painless curation
- metrics

Although the majority of the discussion was about science data, my position paper was about the importance of archiving the web.  In short, treating the web as the corpus that should be retained for future research.  The pending workshop report will have a full list of participants and their papers, but in the meantime I've uploaded to arXiv my paper, "A Plan for Curating `Obsolete Data or Resources'", which is a summary version of the slides I presented at the Web Archiving Cooperative meeting this summer.

To be included in the workshop report are the…

2012-09-27: NFL Referee Kerfuffle

For the first three weeks of the 2012 NFL season, replacement officials have refereed the games due to an ongoing labor dispute between the referees and the NFL. Every fan of a team that has been on the losing side of a call has voiced their opinion on the abilities of the replacement referees. Even Jon Stewart had something to say about the labor dispute.

This past Monday night during the Seahawks - Packers game, a controversial call essentially determined the winner of the game. This call was the powder keg that blew open the dam of angry recriminations and complaints directed at the replacement referees and the NFL. This was somewhat amusing to me as the people complaining seem to forget about all of the mistakes the regular referees appeared to make in all of the previous years. In 2008 one of the best referees in the NFL, Ed Hochuli, made a rather horrendous call. I have to give him respect for owning up to it and apologizing. NFL fans have always complained about the officiating,…

2012-08-31: Benchmarking LANL's SiteStory

On August 17th, 2012, Los Alamos National Laboratory's Herbert Van de Sompel announced the release of the anticipated transactional web archiver called SiteStory.
Very excited to announce the release of our SiteStory transactional archive solution #memento mementoweb.github.com/SiteStory/
— Herbert (@hvdsomp) August 17, 2012

The ODU WS-DL research group (in conjunction with The MITRE Corporation) performed a series of studies to measure the effect of SiteStory on web server performance. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.
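
To put the quoted timings on a relative scale, the short sketch below computes the percentage increase in seconds per access for each pair. It uses only the four numbers stated above; the helper name `relative_slowdown` is hypothetical.

```python
def relative_slowdown(before_s, after_s):
    """Percentage increase in seconds per access, going from before_s to after_s."""
    return (after_s - before_s) / before_s * 100.0


# Timings quoted above, in seconds per Web page access.
under_load = relative_slowdown(0.076, 0.086)
many_embedded = relative_slowdown(0.15, 0.21)

print(f"under load: {under_load:.1f}%")
print(f"many embedded, changing resources: {many_embedded:.1f}%")
```

This works out to roughly a 13% increase under load and a 40% increase for the resource with many embedded and changing resources.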
A sneak-peek at how SiteStory affects server performance is provided below. Please see the technical report for a full description of these resul…