Sunday, February 5, 2012

2012-02-05: Superbowl 46


Superbowl 46 is today, and whether you love football or just watch for the commercials, you are in for some entertainment tonight. Tonight's matchup is one of the closest in recent history.
There is no doubt that New England has a great offense led by Tom Brady. New England has an offensive passing efficiency of 7.65 yards per play compared to a league average of 5.97. The Giants, led by Eli Manning, are not far behind with a passing efficiency of 7.32 yards per play. Both teams are in the top five for offensive passing. However, the differences are more dramatic on the defensive side of the house. The Giants have given up 5.97 yards per play, which is the league average. The Patriots rank 29th in pass defense and have given up an average of 6.68 yards per play.
Running the algorithms the same way we have all year has the Patriots winning the Superbowl. The predicted margin of victory matches the Vegas line exactly, so this will be a close game. Because this is Science, here are the outputs of the algorithms, run in the exact same manner as they have been all year.
Favorite  Spread  Underdog  Discrete  Pagerank
At NE     3       NYG       NE        NE

Now, with that out of the way, let's add a bit of logic. The Superbowl is not played at either team's home field. New England is considered the home team, but they are not playing at their home stadium. They are playing at the home stadium of the Indianapolis Colts, and the quarterback for the Colts is Peyton Manning, the brother of Eli Manning, who is the quarterback for the Giants. Additionally, the Patriots are rivals of the Colts, and the home crowd is not likely to be favorably disposed to the Patriots. Following that train of thought, I swapped the home team to the Giants and ran the algorithms again. The eigenvector algorithm did not change, as it does not take the home team into account. The SVM algorithm switched its vote to the Giants, and the output of the neural network model dropped below the Vegas line.
Taking this into account, we think that the Giants will either cover the spread or win the Superbowl outright.
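
To see why an eigenvector-style ranking is blind to the home team, here is a minimal Python sketch of a PageRank-like ranking computed from nothing but winner/loser pairs. This is an illustration under stated assumptions, not the code behind the predictions above: the function name, the damping factor of 0.85, and the four-team results are all hypothetical.

```python
def pagerank(games, damping=0.85, iterations=100):
    """Rank teams from (winner, loser) pairs; every loss passes rank to the winner."""
    teams = sorted({team for game in games for team in game})
    n = len(teams)
    # For each team, the list of opponents it lost to (its "outgoing links").
    lost_to = {t: [w for w, l in games if l == t] for t in teams}
    rank = {t: 1.0 / n for t in teams}
    for _ in range(iterations):
        new_rank = {t: (1.0 - damping) / n for t in teams}
        for t in teams:
            if lost_to[t]:
                share = damping * rank[t] / len(lost_to[t])
                for winner in lost_to[t]:
                    new_rank[winner] += share
            else:
                # An undefeated team has no outgoing links; spread its rank evenly.
                for other in teams:
                    new_rank[other] += damping * rank[t] / n
        rank = new_rank
    return rank

# Hypothetical results as (winner, loser); these are not real NFL games.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "D")]
ranks = pagerank(games)
for team in sorted(ranks, key=ranks.get, reverse=True):
    print(team, round(ranks[team], 3))
```

The input carries no venue information at all, so swapping the home team cannot change the ranking.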

-- Greg Szalkowski

Tuesday, January 24, 2012

2012-01-23: Release of Warrick 2.0 Beta

After a long hiatus, the Warrick tool has been resurrected with some modifications. Warrick is a free utility for reconstructing (or recovering) a website. The original version of Warrick discovered archived versions of web resources by searching the Web Infrastructure (which includes search engine caches and the Internet Archive). It would automatically download and organize the best versions of the archived resources and package them into a copy of the deleted site.

As discussed by Warrick's creator, Frank McCown, the original version of Warrick was prone to breaking due to frequent changes to search engine APIs and archive URLs. Warrick 2.0, adapted from Dr. McCown's original code by Justin F. Brunelle, interfaces with the Memento framework via the mcurl program (developed by Ahmed AlSum). By incorporating Memento timemaps, Warrick no longer has the responsibility of directly searching and communicating with the caches and archives, or learning about new repositories. Instead, Memento handles the interface and communication with the archives, allowing Warrick to remain unaffected by API or URL changes. This makes Warrick more resistant to failures when repositories change, appear, or disappear. Memento allows Warrick to provide additional functionality, such as the ability to recover sites from a specific point in time by utilizing timemaps and the mcurl program.

Warrick 2.0 has already been helping individuals recover lost web sites. Dag Forssell reached out with the following message:

"I just Googled the idea of restoring a website from the Wayback Machine and discovered your work on Warrick. .... Perhaps you can use my project as one of your guinea pigs.

I am restoring a book by Professor of Law Hugh Gibbons and found when listing his references that he created a website on the principles of law in 2002, then abandoned it in 2006 or so, when he retired. I have downloaded 143 htm files from the Wayback Machine. The site looks complete. But of course, each file comes with its own folder (css, jpgs and such) and the links all point back to the wayback machine. Cleaning it all up will be a lot of work.

If it is in the cards for you ... to take this under your wing, I will be overjoyed."

Dag's site was successfully recovered, and the recovery helped us work out some last remaining bugs before our beta release. Dag mentioned that Warrick eliminated much of the effort on his part by deduplicating the resources downloaded from the Internet Archive and arranging them in the correct site structure. It also allowed him a deeper understanding of how the resources interact within the page. His recovered content will reportedly be available live at http://www.biologyoflaw.org/ (the website is not available at the time of this blog posting).

We are happy to announce the release of the Beta source of the project which can be downloaded from its Google Code site. Installation and usage instructions are available from the Google Code site.

Warrick is run with a series of command line flags, or options. These are largely unchanged from the original Warrick, but some flags are new. For example, the user now has the -dr and -R options. The -dr option allows the user to specify the date at which the site should be recovered. For example:

warrick.pl -dr 2004-02-01 http://www.cs.odu.edu/

will recover the ODU Computer Science homepage as close as possible to February 1st, 2004.
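
The idea of recovering "as close as possible" to a date can be sketched in a few lines of Python. This is only an illustration of the selection step, not Warrick's or mcurl's actual code: the function name and the archive.example URIs are hypothetical, and only the 2004-02-06 capture date comes from this post.

```python
from datetime import datetime

def closest_memento(mementos, target):
    """Return the (datetime, uri) pair whose datetime is nearest to target."""
    return min(mementos, key=lambda memento: abs(memento[0] - target))

# Hypothetical list of archived copies, as might be read from a timemap.
mementos = [
    (datetime(2003, 11, 20), "http://archive.example/20031120/http://www.cs.odu.edu/"),
    (datetime(2004, 2, 6), "http://archive.example/20040206/http://www.cs.odu.edu/"),
    (datetime(2004, 6, 15), "http://archive.example/20040615/http://www.cs.odu.edu/"),
]

target = datetime.strptime("2004-02-01", "%Y-%m-%d")
best = closest_memento(mementos, target)
print(best[0].date(), best[1])
```

With these sample dates, the copy from 2004-02-06 is selected because it is five days from the requested date, while the others are months away.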

-R is the resume flag, which allows a user to resume a suspended reconstruction job from a saved file.

Let's say we run the following recovery job:

warrick.pl -D MyRecoveryDirectory -k -n 100 http://www.justinfbrunelle.com/

This will recover Justin Brunelle's homepage into the directory MyRecoveryDirectory (the -D flag), convert all links from absolute to relative to the local disk (the -k flag), and stop the recovery after 100 resources are recovered (the -n flag).
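
As a rough illustration of what converting a link from absolute to relative involves, here is a small Python sketch. It is not how Warrick implements -k; the function name and the example.com URLs are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse

def to_relative(link, page_url):
    """Rewrite an absolute same-host link as a path relative to the containing page."""
    link_parts, page_parts = urlparse(link), urlparse(page_url)
    if link_parts.netloc != page_parts.netloc:
        return link  # a link to another host is left untouched
    page_dir = posixpath.dirname(page_parts.path) or "/"
    return posixpath.relpath(link_parts.path or "/", start=page_dir)

page = "http://www.example.com/docs/index.html"
print(to_relative("http://www.example.com/images/logo.png", page))
print(to_relative("http://www.example.com/docs/a.html", page))
print(to_relative("http://other.example/style.css", page))
```

Once links are relative, the recovered pages can be browsed from the local directory without requests going back to the original host.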

When Warrick completes the recovery session of 100 files, it saves the recovery state in a save file. A user can resume the state by using the -R flag as follows:

warrick.pl -R MYSAVEFILE.save

This will resume the suspended job stored in MYSAVEFILE.save. This will recover an additional 100 files.
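
The stop-and-resume behavior can be pictured as a crawl frontier that is written to disk when the resource limit is reached. The Python sketch below is a simplified model, not Warrick's save-file format: the JSON layout, the function names, and the four-page site are all hypothetical.

```python
import json
import os
import tempfile
from collections import deque

def recover(frontier, done, limit, get_links):
    """Process up to `limit` URIs and return (remaining frontier, recovered URIs)."""
    frontier, done = deque(frontier), set(done)
    processed = 0
    while frontier and processed < limit:
        uri = frontier.popleft()
        if uri in done:
            continue
        done.add(uri)
        processed += 1
        for link in get_links(uri):
            if link not in done and link not in frontier:
                frontier.append(link)
    return list(frontier), sorted(done)

def save_state(path, frontier, done):
    """Write the unfinished job to disk (assumed JSON layout)."""
    with open(path, "w") as handle:
        json.dump({"frontier": frontier, "done": done}, handle)

def load_state(path):
    """Read a job written by save_state."""
    with open(path) as handle:
        state = json.load(handle)
    return state["frontier"], state["done"]

# Hypothetical four-page site standing in for real archive lookups.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}

def get_links(uri):
    return SITE.get(uri, [])

save_path = os.path.join(tempfile.mkdtemp(), "job.save")
frontier, done = recover(["/"], [], 2, get_links)   # first session stops after 2
save_state(save_path, frontier, done)

saved_frontier, saved_done = load_state(save_path)  # later: resume the job
frontier2, done2 = recover(saved_frontier, saved_done, 100, get_links)
print(done2)
```

The saved state only needs two things: which URIs are still waiting in the frontier and which have already been recovered.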

For a visual example, let's look at one of the aforementioned commands and demonstrate how it can recover a page.

warrick.pl -dr 2004-02-01 http://www.cs.odu.edu/

We can visit the current (as of 2012-01-23) ODU CS website (http://www.cs.odu.edu/) to see the following representation:


To get an idea of what Warrick will recover, we can observe the ODU CS homepage archived at the Internet Archive on 2004-02-06.

After running Warrick, we can view the reconstructed page in my local directory.

This resource is nearly identical to the copy at the Internet Archive. The branding at the top of the page has been removed to keep the representation as faithful as possible to the original resource at the time of archiving. However, we can see from the reporting that all 10 of the recovered resources that make up the recovered page came from the Internet Archive.

#############################################
Memento Timegate Accesses: 11
Internet Archive Contributions: 10
Bing Contributions: 0
Google Contributions: 0
WebCitation Contributions: 0
Diigo Contributions: 0
UK Archives Contributions: 0
URIs obtained from lister Queries: 0
####
Total recoveries completed: 10
Number of cache resources used: 0
Number of resources overwritten: 0
Number of avoided overwrites: 0
Total failed recoveries: 1
Images recovered: 8
HTML pages recovered: 1
Other resources recovered: 1
URIs left in the Frontier: 0
#############################################



More examples can be viewed in the README file available in the source archive or at Warrick's Google Code Wiki.

Please note that this is only a beta version of the software. Also, it is only runnable via Perl on a Linux command line. A new version of warrick.cs.odu.edu is in development and will be released soon. This web interface will let users run Warrick from a browser, allowing tech-savvy and non-tech-savvy users alike to benefit from Warrick.

If you utilize Warrick to recover a web site, we are very interested in learning about your experience; this will help us improve Warrick for future users. Please reach out to us via email by joining the WarrickRecovery Google Group (warrickrecovery@googlegroups.com) to learn how you may help.


--Justin F. Brunelle

Sunday, January 22, 2012

2012-01-22: 2011 NFL Season Conference Championship


The NFL Conference championship games are today. Our models have a tendency to reward teams that can pass the ball well, as passing efficiency correlates with wins rather well. Therefore, it is no surprise that two out of the three models predict New England will win over Baltimore. However, the Neural Network is predicting that it will be a close game and that New England will not cover the spread of 7 points.
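
For readers unfamiliar with spreads: a favorite "covers" only when it wins by more than the spread. A tiny Python sketch makes the distinction between winning and covering concrete; the function name and the scores are hypothetical.

```python
def covers_spread(favorite_points, underdog_points, spread):
    """True when the favorite wins by more than the spread."""
    return favorite_points - underdog_points > spread

# Hypothetical final scores against a 7-point spread.
print(covers_spread(24, 20, 7))  # favorite wins by 4
print(covers_spread(30, 20, 7))  # favorite wins by 10
```

In the first case the favorite wins the game but does not cover; in the second it does both.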

The San Francisco / New York game is going to be a good game to watch. Both teams are very close but the Giants have the edge on passing efficiency.

Favorite  Spread  Underdog  Discrete  Pagerank
At NE     4       BAL       NE        BAL
At SF     1       NYG       NYG       SF


-- Greg Szalkowski

Sunday, January 1, 2012

2012-01-01: 2011 NFL Season Week 17

The last week of the regular season games is here. Week 17 traditionally exhibits greater statistical dispersion than the other weeks. Teams that have locked in playoff spots will be resting the starting players and teams that do not have a chance at the playoffs may be looking for a better draft pick for next year.

Our algorithms have once again picked Green Bay to win, but they will most likely rest Aaron Rodgers and most of the starters, and Detroit will win the game. Green Bay is an enigma this year: they are 14-1 so far, yet they have given up more yards than they have gained over the year, which invites some interesting analysis.




Favorite  Spread  Underdog  Discrete  Pagerank
At PHI    10      WAS       PHI       PHI
At ATL    14      TB        ATL       ATL
SF        5       At STL    SF        SF
At MIN    6       CHI       CHI       CHI
At GB     8       DET       GB        GB
At NYG    4       DAL       NYG       NYG
At NO     4       CAR       NO        NO
At HOU    5       TEN       TEN       TEN
BAL       2       At CIN    BAL       BAL
PIT       7       CLE       PIT       PIT
At JAX    3       IND       JAX       JAX
At MIA    3       NYJ       MIA       NYJ
At NE     11      BUF       NE        NE
At OAK    3       SD        OAK       SD
At DEN    3       KC        DEN       KC
At ARI    3       SEA       ARI       SEA


-- Greg Szalkowski

Saturday, December 17, 2011

2011-12-15: 2011 NFL Season Week 15

So far this year all three of the prediction algorithms are 68% correct straight up. This is better than the predictions of most of the NFL "experts" such as the guys at ESPN. Last year we ended up right below 70% correct as well. Breaking the 70% barrier over the season seems to be rather hard to do, as seen on the Prediction Tracker. Looking into the statistics of the games we missed reveals some interesting information. In the majority of those games, the losing team had better box scores but still lost the game. We had thought that incorporating the betting line data this year would have had an impact, but the accuracy of the straight up predictions is not significantly better than last year.
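
"Correct straight up" simply means the picked team won, regardless of the spread. As a minimal sketch of how such a percentage could be tallied, assuming picks and results are kept as dictionaries keyed by game (the names and the three sample games are hypothetical, not our data):

```python
def straight_up_accuracy(picks, winners):
    """Fraction of games in which the picked team actually won."""
    correct = sum(1 for game, pick in picks.items() if winners.get(game) == pick)
    return correct / len(picks)

# Hypothetical picks and results for three games.
picks = {"game1": "A", "game2": "C", "game3": "E"}
winners = {"game1": "A", "game2": "D", "game3": "E"}
print(f"{straight_up_accuracy(picks, winners):.0%}")
```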

The season isn't over yet and anything can happen so here are the predictions for week 15.


Favorite  Spread  Underdog  Discrete  Pagerank
DAL       7       at TB     DAL       DAL
at NYG    10      WAS       NYG       NYG
GB        9       at KC     GB        GB
NO        9       at MIN    NO        NO
at CHI    3       SEA       CHI       SEA
at BUF    2       MIA       BUF       BUF
at HOU    4       CAR       HOU       HOU
TEN       4       at IND    TEN       TEN
CIN       7       at STL    CIN       CIN
at OAK    4       DET       OAK       OAK
NE        9       at DEN    NE        NE
at PHI    3       NYJ       PHI       NYJ
at ARI    3       CLE       ARI       ARI
at SD     4       BAL       BAL       BAL
PIT       2       at SF     PIT       PIT


-- Greg Szalkowski

Wednesday, December 14, 2011

2011-12-14 Python & Memento Presentation for the ODU ACM

Earlier this semester, I was invited to present Python at an ODU ACM meeting. I presented a brief overview of the Python language and followed up with a code walk through of the code I use to parse Memento timemaps in my current research.

Python, of course, has advantages and disadvantages compared to other languages. Since most ODU undergrads have experience with C++, the presentation compared Python with C++. Python's advantages include a fast development cycle and an extensive collection of community libraries. Its primary disadvantage compared to C++ is execution speed. My experience is that Python is sometimes over 100 times slower.

Python's basic syntax and semantics are straightforward, so the presentation focused on the Python equivalents of commonly-used C++ constructs and the differences between static (C++) and dynamic (Python) typing. Python's implementation of high-level data types (lists, dictionaries, tuples, and sets) and functional code were compared to the complexity of the C++ equivalents.
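
For readers who did not attend, a few lines of Python give the flavor of these features. This snippet is written for this post as an illustration and is not taken from the presentation slides.

```python
# Built-in high-level types need no headers or template declarations.
languages = ["C++", "Python", "Perl"]                  # list
lengths = {name: len(name) for name in languages}      # dictionary
point = (3, 4)                                         # tuple
x, y = point                                           # tuple unpacking
letters = set("mississippi")                           # set of unique characters

# Functional style: map a lambda over a range.
squares = list(map(lambda n: n * n, range(5)))

# Dynamic typing: the same name can be rebound to a value of another type.
value = 42
value = "forty-two"

print(lengths, letters, squares, value)
```

The C++ equivalents would involve std::vector, std::map, std::pair, and std::set, each with explicit element types.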



To bring all the pieces together, I did a code walk through of the python.py module I use to parse Memento timemaps (see the Memento Introduction and Internet Draft for more information). The module has two classes. The TimeMap class is a parser and dictionary for timemap data. The TimeMapTokenizer class is a tokenizer for link-style timemaps.

To load a timemap, a new instance of TimeMap is created using the timemap's URI, which is the constructor's only argument. A TimeMapTokenizer instance returns individual tokens, simplifying the parsing code in the get_next_link function. TimeMap implements the __getitem__ function, allowing it to act as a Python dictionary. TimeMapTokenizer implements the __iter__ and next functions, which allow the use of Python iteration constructs over the list of tokens.
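
The overall shape of such a module can be sketched as follows. This is a simplified reconstruction for illustration, not the module from the presentation: it parses a timemap passed in as a string rather than fetching a URI, keys the dictionary by memento datetime (an assumption), and uses Python 3's __next__ where Python 2 uses next. The archive.example URI is hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

class TimeMapTokenizer:
    """Iterate over tokens: <uri>, quoted strings, names, and ; , = punctuation."""
    _TOKEN = re.compile(r'<[^>]*>|"[^"]*"|[;,=]|[^\s;,=<>"]+')

    def __init__(self, text):
        self._tokens = self._TOKEN.findall(text)
        self._position = 0

    def __iter__(self):
        return self

    def __next__(self):  # spelled `next` in Python 2
        if self._position >= len(self._tokens):
            raise StopIteration
        token = self._tokens[self._position]
        self._position += 1
        return token

class TimeMap:
    """Parse a link-style timemap and expose mementos as datetime -> URI."""

    def __init__(self, text):
        self._mementos = {}
        uri, attributes, name = None, {}, None
        for token in TimeMapTokenizer(text):
            if token.startswith("<"):          # start of a new link entry
                uri, attributes = token[1:-1], {}
            elif token.startswith('"'):        # quoted attribute value
                if name is not None:
                    attributes[name] = token[1:-1]
                name = None
            elif token == ",":                 # end of the current entry
                self._store(uri, attributes)
                uri = None
            elif token not in (";", "="):      # attribute name such as rel
                name = token
        self._store(uri, attributes)

    def _store(self, uri, attributes):
        if uri and "memento" in attributes.get("rel", "").split() and "datetime" in attributes:
            self._mementos[parsedate_to_datetime(attributes["datetime"])] = uri

    def __getitem__(self, when):
        return self._mementos[when]

    def __len__(self):
        return len(self._mementos)

sample = (
    '<http://www.cs.odu.edu/>;rel="original",\n'
    '<http://archive.example/20040206/http://www.cs.odu.edu/>'
    ';rel="memento";datetime="Fri, 06 Feb 2004 00:00:00 GMT"'
)
timemap = TimeMap(sample)
print(timemap[datetime(2004, 2, 6, tzinfo=timezone.utc)])
```

Because the tokenizer matches quoted strings as single tokens, the comma inside the datetime value is not mistaken for an entry separator.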

— Scott G. Ainsworth

2011-12-14: CS 495/595 Web Server Development for Spring 2012

The only WS-DL related class that will be offered in spring 2012 is CS 495/595 "Web Server Development". I had planned to offer CS 751/851 "Introduction to Digital Libraries", but I've taught that the last two springs and it has been a while since I've taught the web server development class (the last offering was actually from Martin Klein in spring 2010).

The premise of this course is that the best way to really get to know HTTP is to build a fully-functional web server from scratch in the language of your choice. That sounds simple enough, but it becomes quite challenging, in part because if you do a poor job at design at the beginning you have to live with the consequences the entire semester. On the other hand, do a good job up front and each assignment will just drop into place (hello, software design). Along the way, you'll also become quite familiar with reading RFCs and the REST architectural model.
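
To give a taste of where such a project starts, here is a minimal Python sketch of the first step: parsing a request and composing a response. It is not course material or a reference solution; the function names are made up for this example, and it handles only a bare GET with none of the features a full server needs.

```python
def parse_request(raw):
    """Split raw request bytes into method, target, version, and a header dict."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ")  # ValueError if malformed
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers

def build_response(status, reason, body=b"", content_type="text/plain"):
    """Compose a complete HTTP/1.1 response message as bytes."""
    head = (
        f"HTTP/1.1 {status} {reason}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return head.encode("iso-8859-1") + body

def handle(raw):
    """Map one raw request to one raw response."""
    try:
        method, target, version, headers = parse_request(raw)
    except ValueError:
        return build_response(400, "Bad Request")
    if method != "GET":
        return build_response(501, "Not Implemented")
    return build_response(200, "OK", f"You asked for {target}\n".encode("utf-8"))

print(handle(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n").decode())
```

A real server would call something like handle from a socket accept loop. Keeping parsing and response building separate from the network code is one example of the up-front design that makes later assignments easier.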

Take a look at past offerings of the class for an idea of what the structure will be. The CRNs are 35757 (CS 495) and 35758 (CS 595). The class will be on Tuesdays, 4:20 -- 7:00 pm in r. 2120.

--Michael

2012-01-09 edit: The class homepage is now available.