Thursday, October 25, 2012

2012-10-24: NFL Power Rankings Week 8


After running the R script for the week 8 rankings, the first thing that struck me was the disparity in node size between the AFC on the left side of our graph and the NFC on the right side.

Two weeks ago we wrote that the NFC West has been dominant so far this year. The NFC West has the best combined record, and its aggregate point differential puts the other divisions to shame. However, it is not just the West division: the entire NFC has dominated and outperformed the AFC at every turn. CBS Sports rates the NFC as head and shoulders above the AFC this year.

Our ranking system is based on Google's PageRank algorithm and is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser to the winner, weighted by the margin of victory.

In the PageRank model, each link from webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j. This is often characterized as webpage i voting for webpage j. In our system, the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Giants beat the Redskins 27 to 23, so a directed edge from the Redskins to the Giants with a weight of 4 was added to the graph.
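The edge construction described above can be sketched in a few lines of Python. This is only an illustration, not the R script used for the rankings: the function name `add_game` is made up, and both the accumulation of weights for repeated matchups and the rejection of ties are assumptions, since the post does not say how either case is handled.

```python
from collections import defaultdict

def add_game(edges, winner, winner_pts, loser, loser_pts):
    """Record one game as a loser -> winner edge weighted by margin of victory."""
    margin = winner_pts - loser_pts
    if margin <= 0:
        # Assumption: ties are not described in the post, so they are rejected here.
        raise ValueError("winner must outscore loser")
    # Assumption: if two teams meet twice, the margins are added together.
    edges[(loser, winner)] += margin

edges = defaultdict(int)
add_game(edges, "Giants", 27, "Redskins", 23)  # the example from the post
print(dict(edges))
```

Each key is a (loser, winner) pair, so the Giants game produces a single edge from the Redskins to the Giants with weight 4.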

The season so far can be visualized in the following graph.


The PageRank algorithm is then run, and all of the votes from losing teams are tallied. Each node receives a final rank, which is represented by the size of the node in the graph. This algorithm does a much better job of taking strength of schedule into account than many other ranking systems, which are essentially based on win-loss ratios. Barring injuries or other problems, it is a good guess that Houston will be representing the AFC once the playoffs are complete. The real question is which team from the NFC will rise to the surface to take them on in the Super Bowl.
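For readers who want to see how the votes turn into a ranking, here is a minimal sketch of weighted PageRank by power iteration in Python. It is not the R script behind these rankings. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from undefeated teams (nodes with no outgoing edges) are all assumptions, and `weighted_pagerank` is an illustrative name. The two edges are the games mentioned in these posts: the first from this week's post, the second from the week 6 post.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted directed graph.

    edges maps (source, target) -> weight; here the source is the losing team.
    """
    nodes = sorted({team for pair in edges for team in pair})
    n = len(nodes)

    # Total margin each team has "voted away" through its losses.
    out_weight = {team: 0.0 for team in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {team: 1.0 / n for team in nodes}
    for _ in range(iterations):
        # Assumption: rank held by teams with no losses is spread evenly.
        dangling = sum(rank[t] for t in nodes if out_weight[t] == 0)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {team: base for team in nodes}
        # Each loser passes rank to its winners in proportion to the margin.
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank

edges = {
    ("Redskins", "Giants"): 4,   # Giants 27, Redskins 23
    ("Redskins", "Falcons"): 7,  # Falcons 24, Redskins 17
}
ranks = weighted_pagerank(edges)
for team, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{team:10s} {score:.3f}")
```

With only these two games, the Falcons finish ahead of the Giants because they received the larger share of the Redskins' votes. In a full season graph, a vote from a highly ranked loser is worth more than a vote from a weak one, which is how strength of schedule enters the ranking.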

The numerical rankings are as follows:

Rank  Team
1     San Francisco
2     Green Bay
3     NY Giants
4     Dallas
5     Chicago
6     Seattle
7     Minnesota
8     St Louis
9     Washington
10    Houston
11    Arizona
12    Atlanta
13    Baltimore
14    Philadelphia
15    Cincinnati
16    Denver
17    New England
18    NY Jets
19    Indianapolis
20    Pittsburgh
21    Buffalo
22    Miami
23    Detroit
24    San Diego
25    New Orleans
26    Cleveland
27    Tennessee
28    Carolina
29    Tampa Bay
30    Oakland
31    Jacksonville
32    Kansas City

-- Greg Szalkowski

Thursday, October 11, 2012

2012-10-11: NFL Power Rankings Week 6

Five weeks into the 2012 season, the picture is starting to come into focus. The topic of many online discussions is this year's performance of the NFC West division compared to last year. The NFC West is one of the best performing divisions so far this year, which is a far cry from last year. They are certainly doing well in our ranking system.

Our ranking system is based on Google's PageRank algorithm and is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser to the winner, weighted by the margin of victory.

In the PageRank model, each link from webpage i to webpage j causes webpage i to give some of its own PageRank to webpage j. This is often characterized as webpage i voting for webpage j. In our system, the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Falcons beat the Redskins 24 to 17, so a directed edge from the Redskins to the Falcons with a weight of 7 was added to the graph.

The season so far can be visualized in the following graph.

The PageRank algorithm is then run, and all of the votes from losing teams are tallied. Each node receives a final rank, which is represented by the size of the node in the graph. Used in this fashion, PageRank has the nice effect of reflecting strength of schedule. This should be of interest to the many Houston Texans fans out there. The majority of NFL power ranking sites currently have Houston ranked number one. A simple glance at the schedule for the past five weeks shows that Houston has had a pretty easy season so far. They have played well, and this week's game against Green Bay should be a good one.

The numerical rankings are as follows:

Rank  Team
1     Chicago
2     Green Bay
3     San Francisco
4     Minnesota
5     Indianapolis
6     St Louis
7     Arizona
8     Seattle
9     Philadelphia
10    Houston
11    Baltimore
12    Atlanta
13    Jacksonville
14    Dallas
15    Detroit
16    Denver
17    New England
18    NY Giants
19    Cincinnati
20    San Diego
21    Pittsburgh
22    Miami
23    Washington
24    Buffalo
25    NY Jets
26    Carolina
27    Tennessee
28    New Orleans
29    Oakland
30    Tampa Bay
31    Kansas City
32    Cleveland

-- Greg Szalkowski

Wednesday, October 10, 2012

2012-10-10: Zombies in the Archives

Image provided from http://www.taxhelpattorney.com/
In our current research, the WS-DL group has observed leakage in archived sites. Leakage occurs when archived resources include current content. I enjoy referring to such occurrences as "zombie" resources (which is appropriate given the upcoming Halloween holiday). That is to say, these resources are expected to be archived ("dead") but still reach into the current Web.

In the examples below, this reach into the live Web is caused by URIs contained in JavaScript not being rewritten to be relative to the Web archive; the page in the archive is not pulling from the past archived content but is "reaching out" (zombie-style) from the archive to the live Web. 

We provide two examples with a humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past will include advertisements from the present Web.


2008 memento of cnn.com from the Wayback Machine
First, we look at cnn.com. We can observe an archived resource from the Wayback Machine at http://web.archive.org/web/20080903204222/http://www.cnn.com/. This memento from September 16th, 2008 includes links to the 2008 presidential race between McCain-Palin and Obama-Biden. However, this memento was observed on September 28, 2012 -- during the 2012 presidential race between Romney-Ryan and Obama-Biden. The memento includes embedded JavaScript that pulls advertisements from the live Web. The advertisement included in the memento is a zombie resource that promotes the 2012 presidential debate between Romney and Obama. This drift from the expected archived time seems to provide a prophetic look at the 2012 presidential candidates in a 2008 resource. The current cnn.com homepage gives the same advertisement as the archived version.


Current cnn.com homepage as observed in 2012

A second case study comes from the IMDB movie database site. We observed the July 28th, 2011 memento of the IMDB homepage at http://web.archive.org/web/20110728165802/http://www.imdb.com/. This memento advertises the movie Cowboys and Aliens. This movie is set to open "tomorrow" according to our observed July 28th, 2011 memento. Additionally, we see the current feature movie is Captain America.

2011 memento of IMDB.com from the Wayback Machine 

According to the currently observed IMDB site, Cowboys and Aliens was released in 2011 and Captain America was released in 2011, in keeping with our observed memento. However, the ad included on the IMDB memento promotes the movie "Won't Back Down." According to IMDB, this movie won't be released until 2012. Again, we can observe a memento with references to present-day events.

Cowboys and Aliens was released in 2011
Captain America was released in 2011
Won't Back Down is scheduled to be released in 2012

When we observe the HTTP requests made while loading the mementos, there is evidence of reach into the current Web. We stored all of the HTTP headers captured while loading the memento in a text file for analysis. All of the requests should be for other archive.org resources. However, filtering out the archive.org hosts reveals requests for live-Web resources:

$ grep Host: headers.txt | grep -v archive.org
Host: ocsp.incommon.org
Host: ocsp.usertrust.com
Host: exchange.cs.odu.edu
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: core.insightexpressai.com
Host: ad.doubleclick.net
Host: ad.doubleclick.net
Host: ia.media-imdb.com
Host: ad.doubleclick.net
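The same filtering can be done programmatically when a per-host tally is more useful than a raw list. The following is a minimal Python sketch, assuming the headers are available one per line as in headers.txt; the function name `live_hosts` and the sample lines are illustrative, and the `web.archive.org` line is a made-up stand-in for the archive requests that `grep -v` filtered out above.

```python
from collections import Counter

def live_hosts(header_lines, archive_domain="archive.org"):
    """Count Host: headers that do not point at the archive."""
    counts = Counter()
    for line in header_lines:
        if not line.startswith("Host:"):
            continue
        host = line.split(":", 1)[1].strip()
        # Skip the archive itself and any of its subdomains.
        if host == archive_domain or host.endswith("." + archive_domain):
            continue
        counts[host] += 1
    return counts

sample = [
    "Host: web.archive.org",     # hypothetical archive request
    "Host: ad.doubleclick.net",
    "Host: ia.media-imdb.com",
    "Host: ad.doubleclick.net",
]
print(live_hosts(sample).most_common())
```

Unlike `grep -v archive.org`, which drops any line containing that string, this sketch compares the host name itself, so a live host that merely mentions the archive elsewhere in the line would still be counted.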


These requests from archives into the live Web are initiated by embedded JavaScript:

<iframe src="http://www.imdb.com/images/SF99c7f777fc74f1d954417f99b985a4af/a/ifb/doubleclick/expand.html#imdb2.consumer.homepage/;tile=5;sz=1008x60,1008x66,7x1;p=ns;ct=com;[PASEGMENTS];u=[CLIENT_SIDE_ORD];ord=[CLIENT_SIDE_ORD]?" ... onload="ad_utils.on_ad_load(this)"></iframe>
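To make the rewriting problem concrete, here is a small Python sketch of what it means for a URI to be relative to the archive. It is a simplification for illustration, not the Wayback Machine's actual rewriting code. The URI pattern is taken from the memento URIs quoted in this post; the names `rewrite` and `is_leak` are hypothetical.

```python
from urllib.parse import urlsplit

ARCHIVE_PREFIX = "http://web.archive.org/web/"

def rewrite(uri, timestamp):
    """Point a live-Web URI at the archive, following the memento URI pattern."""
    return f"{ARCHIVE_PREFIX}{timestamp}/{uri}"

def is_leak(uri):
    """True if dereferencing this URI would reach outside the archive."""
    host = urlsplit(uri).hostname or ""
    return not (host == "archive.org" or host.endswith(".archive.org"))

live = "http://www.imdb.com/"
archived = rewrite(live, "20110728165802")
print(archived, is_leak(archived))
print(live, is_leak(live))
```

A URI that appears in the archived HTML can be rewritten this way when the page is served. A URI that JavaScript assembles in the browser at run time never passes through such a step, so the request goes to the live host.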


During our investigation of these zombie resources, we observed that this leakage of live content into archived resources is not consistent. We noticed that some versions of some browsers would not produce the leakage; this is potentially due to the browsers' different methods of handling JavaScript and Ajax calls. In our experience, older browsers have a higher percentage of leakage, while the newer browsers demonstrate the leakage less frequently.

The CNN and IMDB mementos mentioned above were rendered in Mozilla Firefox version 3.6.3. Below are two examples of our CNN and IMDB mementos rendered in Mozilla Firefox 15.0.1. Note that the examples below attempt to load the advertisements but produce a "Not Found In Archive" message.

CNN.com memento rendered in a newer browser with no leakage.

IMDB.com memento rendered in a newer browser with no leakage.

When analyzing the headers from the newer browser, we see fewer requests for live content. More importantly, the requests differ from those made by the older browser:

$ grep Host: headers.txt | grep -v archive.org
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: ia.media-imdb.com
Host: b.scorecardresearch.com
Host: s0.2mdn.net
Host: s0.2mdn.net
Host: b.voicefive.com
Host: b.scorecardresearch.com


These mementos still attempted to load the wrong resources, albeit unsuccessfully. Essentially, these mementos are shown as incomplete instead of incorrect (and without our humorous results). The exact relationship between browser, mementos, and zombie resources will require additional investigation before we can establish a cause and solution for these leakages.

The Internet Archive is not the only archive that contains these leakages. We found an example in the following WebCite memento of cnn.com, archived on 2012-09-09.

WebCite memento of cnn.com.

The "Popular on Facebook" section of the page has activity from two of my "friends." The page that was shared was the "10 questions for Obama to answer" page, which was published on October 1st, 2012 and is shown below. It should be obvious that my "friends" shouldn't have been able to share a page that hadn't been published yet (2012-09-09 occurs before 2012-10-01). So, the WebCite page allows live-Web leakage in the cnn.com memento.

Live cnn.com resource
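The date comparison behind the WebCite example is simple enough to automate. The following Python sketch flags content dated after the capture. The function name is illustrative, and in practice the hard part is extracting a reliable date for the embedded content, which this sketch assumes is already known.

```python
from datetime import date

def is_temporally_inconsistent(memento_date, content_date):
    """Content dated after the capture cannot have been part of the capture."""
    return content_date > memento_date

memento = date(2012, 9, 9)     # WebCite capture date from the post
published = date(2012, 10, 1)  # publication date of the shared CNN page
print(is_temporally_inconsistent(memento, published))
```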

Such occurrences of leakage and zombie resources are not uncommon in today's archives. Current Web technologies such as JavaScript make a pure, unchanging capture difficult in the modern Web. However, it is useful for us as Web users and Web scientists to understand that zombies do exist in our archives.

--Justin F. Brunelle