Saturday, December 17, 2011

2011-12-15: 2011 NFL Season Week 15

So far this year all three of the prediction algorithms are 68% correct straight up. This is better than the predictions of most of the NFL "experts" such as the guys at ESPN. Last year we ended up right below 70% correct as well. Breaking the 70% barrier over the season seems to be rather hard to do, as seen on the Prediction Tracker. Looking into the statistics of the games we predicted incorrectly reveals some interesting information. In the majority of those games, the losing team had the better box score but still lost the game. We had thought that incorporating the betting line data this year would have an impact, but the accuracy of the straight up predictions is not significantly better than last year.

The season isn't over yet and anything can happen so here are the predictions for week 15.


Favorite Spread Underdog Discrete Pagerank
DAL 7 at TB DAL DAL
at NYG 10 WAS NYG NYG
GB 9 at KC GB GB
NO 9 at MIN NO NO
at CHI 3 SEA CHI SEA
at BUF 2 MIA BUF BUF
at HOU 4 CAR HOU HOU
TEN 4 at IND TEN TEN
CIN 7 at STL CIN CIN
at OAK 4 DET OAK OAK
NE 9 at DEN NE NE
at PHI 3 NYJ PHI NYJ
at ARI 3 CLE ARI ARI
at SD 4 BAL BAL BAL
PIT 2 at SF PIT PIT


-- Greg Szalkowski

Wednesday, December 14, 2011

2011-12-14: Python & Memento Presentation for the ODU ACM

Earlier this semester, I was invited to present Python at an ODU ACM meeting. I presented a brief overview of the Python language and followed up with a walkthrough of the code I use to parse Memento timemaps in my current research.

Python, of course, has advantages and disadvantages compared to other languages. Since most ODU undergrads have experience with C++, the presentation compares Python with C++. Python's advantages include a fast development cycle and an extensive collection of community libraries. Its primary disadvantage compared to C++ is execution speed. My experience is that Python is sometimes over 100 times slower.

Python's basic syntax and semantics are straightforward, so the presentation focused on the Python equivalents of commonly used C++ constructs and the differences between static (C++) and dynamic (Python) typing. Python's implementation of high-level data types (lists, dictionaries, tuples, and sets) and functional code were compared to the complexity of the C++ equivalents.
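As a small illustration of the built-in data types and dynamic typing mentioned above, here is a sketch that is not taken from the presentation slides; the function name and the sample sentence are made up for this example.

```python
def word_counts(text):
    """Count words with a dictionary; no type declarations are needed."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = word_counts("the quick fox jumps over the lazy dog the end")

# A list of (word, count) tuples, most frequent first, ties broken alphabetically.
top_two = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:2]

# A set built directly from the dictionary's keys.
vocabulary = set(counts)

# Functional style: map a lambda over a list.
lengths = list(map(lambda word: len(word), ["the", "quick", "fox"]))

# Dynamic typing: the same name can be rebound to a value of another type.
value = 42
value = "forty-two"
```

The C++ equivalent would typically need explicit container types (for example a map, a vector of pairs, and a set) declared up front, which is the contrast the talk drew.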



To bring all the pieces together, I did a code walkthrough of the python.py module I use to parse Memento timemaps (see the Memento Introduction and Internet Draft for more information). The module has two classes. The TimeMap class is a parser and dictionary for timemap data. The TimeMapTokenizer class is a tokenizer for link-style timemaps.

To load a timemap, a new instance of TimeMap is created using the timemap's URI, which is the constructor's only argument. A TimeMapTokenizer instance returns individual tokens, simplifying the parsing code in the get_next_link function. TimeMap implements the __getitem__ method, allowing it to act as a Python dictionary. TimeMapTokenizer implements the __iter__ and next methods, which allow the use of Python iteration constructs over the list of tokens.
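The two-class structure described above can be sketched roughly as follows. This is not the module from the talk: the class bodies, the regular expression, and the sample link-format text are assumptions made for illustration. The sketch takes timemap text instead of a URI so that it runs without network access, and it uses the Python 3 spelling __next__ for the iterator method.

```python
import re


class TimeMapTokenizer:
    """Splits link-format timemap text into tokens (illustrative only)."""

    # A token is a <URI>, a quoted string, a bare word such as rel,
    # or one of the separators ; , =
    _TOKEN = re.compile(r'<[^>]*>|"[^"]*"|[^\s;,="<>]+|[;,=]')

    def __init__(self, text):
        self._tokens = self._TOKEN.findall(text)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._tokens):
            raise StopIteration
        token = self._tokens[self._pos]
        self._pos += 1
        return token


class TimeMap:
    """Parses timemap text and exposes memento URIs keyed by datetime string."""

    def __init__(self, text):
        self._mementos = {}
        uri, attrs, key = None, {}, None
        for token in TimeMapTokenizer(text):
            if token.startswith('<'):
                uri = token[1:-1]
            elif token == ',':
                # A comma outside quotes ends the current link.
                self._store(uri, attrs)
                uri, attrs, key = None, {}, None
            elif token in (';', '='):
                continue
            elif key is None:
                key = token
            else:
                attrs[key] = token.strip('"')
                key = None
        self._store(uri, attrs)

    def _store(self, uri, attrs):
        # Keep only links whose rel value includes "memento".
        if uri is not None and 'memento' in attrs.get('rel', '').split():
            self._mementos[attrs['datetime']] = uri

    def __getitem__(self, datetime):
        return self._mementos[datetime]


# Hypothetical two-link timemap used only to exercise the sketch.
SAMPLE = (
    '<http://example.com/>; rel="original", '
    '<http://archive.example.org/20111101000000/http://example.com/>; '
    'rel="memento"; datetime="Tue, 01 Nov 2011 00:00:00 GMT"'
)

timemap = TimeMap(SAMPLE)
memento_uri = timemap['Tue, 01 Nov 2011 00:00:00 GMT']
```

Because the tokenizer is an iterator, the parser can consume it with an ordinary for loop, and because TimeMap defines __getitem__, callers can use the same square-bracket lookup they would use on a dictionary.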

— Scott G. Ainsworth

2011-12-14: CS 495/595 Web Server Development for Spring 2012

The only WS-DL related class that will be offered in spring 2012 is CS 495/595 "Web Server Development". I had planned to offer CS 751/851 "Introduction to Digital Libraries", but I've taught that the last two springs and it has been a while since I've taught the web server development class (the last offering was actually from Martin Klein in spring 2010).

The premise of this course is that the best way to really get to know HTTP is to build a fully-functional web server from scratch in the language of your choice. That sounds simple enough, but it becomes quite challenging, in part because if you do a poor job at design at the beginning you have to live with the consequences the entire semester. On the other hand, do a good job up front and each assignment will just drop into place (hello, software design). Along the way, you'll also become quite familiar with reading RFCs and the REST architectural model.

Take a look at past offerings of the class for an idea of what the structure will be. The CRNs are 35757 (CS 495) and 35758 (CS 595). The class will be on Tuesdays, 4:20 -- 7:00 pm in r. 2120.

--Michael

2012-01-09 edit: The class homepage is now available.

Thursday, December 8, 2011

2011-12-08: Summer Microsoft Internship

It all started in the San Francisco airport while I was waiting for my luggage on my way to the PDA2011 conference. The recruiter from Microsoft called to inform me that I had been accepted to intern at Microsoft Silicon Valley this summer. I was ecstatic, and after a couple of months of bureaucracy and a ton of documents I was ready to leave Norfolk by the end of May. Since I hadn’t been on an adventure or a trip for a long time, and since I would definitely need a car in California for the three months of the summer, I decided to drive my car all the way across the continent. I have always wanted to make a road trip like that, where I can stop in every city or town along the way, check out the attractions and eat the authentic local cuisine.

At the same time, our colleague and best friend Moustafa Aly managed to secure a job at Amazon’s engineering office in San Francisco. So when he learned I was going to drive all the way there he told me: “Forget the plane, I will join you!”

We left Norfolk on the 24th and set the car’s odometer to 0. Since we are information retrieval and social networking people, we decided to make our status updates and check-ins on Facebook the trip’s record keeper. We picked the route, filled up the car and drove. From Norfolk, stopping at Richmond and Nashville, we drove through a tornado while passing through Tennessee, almost ran out of gas in the middle of nowhere in Texas, changed the clock twice in one day, ate the best steak I have ever had in Texas and the best burritos on earth in Las Cruces, played with rockets at White Sands Missile Range, passed over the Hoover Dam and burned out the car’s AC compressor in the Nevada desert. We finally made it to Las Vegas, where we wanted to spend an entire day relaxing. The next day we started driving again, and after 9 more hours we made it to San Francisco, finishing 3559.6 miles in 5.5 days.

Working at Microsoft Silicon Valley definitely has its perks. The location was amazing and the engineers there are really incredible. I joined the Office 365 server-side team for PowerPoint, where I shared my office with another intern from UC Berkeley. Working with this team I had the most liberty I have had in years of working for companies. We sat together and set the goals I needed to reach during the internship, and they gave me complete freedom to pick the way I was going to build it, which is more my style of working. I was supposed to start the implementation of a certain fraction of the distribution and investigate two other things, but to my surprise they liked what I did with the first task so much that they decided to modify my internship goals: finish this project completely, reach ship quality and release it in the next version. With this I went through all the phases of software development, from meeting with managers, architects and program managers, to setting the design, to development, and finally to quality and integration testing. Finally I had to demo my work to the three department managers to see if it could be incorporated in the next shipping release, and to my delight they were fascinated by it and it will be shipped!

On the first day I attended the orientation, where they gave us an overview of what we would be doing this summer and how we were going to be evaluated. Our mentors then came to take us, and I was introduced to my team, the PowerPoint team. Immediately after that I was introduced to the available projects and I chose the one that appealed to me most. I was then granted permission to access the codebase. Imagine having the source code of both PowerPoint and the server cloud back-end; it felt awesome! For the next two weeks I tried to break into the thousands of lines of code and produced a proof-of-concept prototype showing that I was on the right track. By the end of the first week I had set my internship goals with my mentor, but after the fast prototype I produced I was called to a meeting with both the test and the product management teams, in which I represented the development team. They decided to change my goals completely: I would build the entire feature and its backend support from scratch and have the opportunity to ship it. Knowing the task at hand, rebuilding the PowerPoint backend in the cloud with an interface matching the latest award-winning rich-client application, I had to go back to the basics. I had several one-on-ones with the PowerPoint client-side development team to understand, piece by piece, the functionality of each module of the application. The problem with a project like PowerPoint is that it is fairly old and fairly stable, with more than 20 years of development and thousands of lines of legacy code. I was completely lost in the beginning, but my mentor didn't let me stumble much; I was practically staying in his office the first couple of weeks. We used C++ and C in the backend, with JavaScript and C# for the matching interface. This was the trickiest part: matching functionality between two very different frameworks. At a certain point I found a severe gap in the design document related to the functionality. I talked with my manager, and he told me that a change like the one I wanted in the design document needed to be escalated. A couple of hours later I was sitting in a room full of Microsoft's elite developers, testers, PMs and managers, the least experienced of whom had 7 years of work experience under his belt... and me! That is what I loved about Microsoft: even though I was just an intern, I owned the project and they appreciated that. I explained my case, it was approved, and the design document was changed! I was so proud of myself that day.

The atmosphere within the office was relaxing, cool, upbeat and always challenging. I can fairly say I was spoiled this summer. I was residing in the corporate housing complex, where I got a spacious, fully furnished studio apartment with a maid service that came to clean weekly! Courts, a swimming pool and a huge hot tub were all provided for free within the apartment complex. Every other week the recruiters and the PR managers organized an event, party or outing for all the interns on campus. We went hiking, bowling and to the movies, and they even flew us to Seattle to visit the headquarters for the summer intern event. They paid for the flight tickets, the luxury hotel and even a car rental. Steven Sinofsky gave us a wonderful presentation in which they showed us classified sneak peeks of the all-new, amazing Windows 8, and I was genuinely impressed. At the company store we got lots of t-shirts, games and gadgets with our employee discount. After that they rented the zoo for us, since we were about 1000 interns from all over the country, and they got us the Dave Matthews Band and gave each one of us a brand new Xbox 360 with Kinect!


It was definitely unique and rewarding to work with all those interns from the top universities all over the country: MIT, UC Berkeley, Stanford, etc. I asked around and found that I was the only representative from ODU, so I was definitely proud and tried to behave. The other interns and I became friends, and since most of us were residing in the same apartment complex we gathered almost every night, and on the weekends we went and discovered the city and the surrounding area. Unfortunately I didn’t join them on the Yosemite hiking/camping trip, as I was sick that day. One day we all decided to wear suits and sunglasses all day at work and call it "Brogramming" day. Someone took a photo of us and it went viral on Twitter and Facebook!

In conclusion I feel honored and blessed to have been able to work at this wonderful, fascinating place with all those extremely intelligent colleagues. My manager/team lead told me one thing on my first day that I believe changed everything. He said, "You were only an intern during the 2-hour orientation session; now consider yourself a full-time software engineer and own your work." This definitely helped me to shine, participate, own my work and suggest enhancements, which were actually considered, and we changed the design document. Now, I can proudly say that my product is currently being used by millions of users; you are probably using it right now!

-- Hany SalahEldeen

Wednesday, December 7, 2011

2011-12-07: 2011 NFL Season Week 14

Week 14 of the 2011 NFL season is upon us. Talk of playoff teams and Super Bowl probabilities fills the airwaves even more than Christmas music. Sitting in traffic on the drive home from work tonight I was listening to a few on-air personalities discussing Green Bay and New England for the Super Bowl. Green Bay has already clinched a playoff berth and many people would say they are headed to the Super Bowl this year. The comment that caught my attention was that the defense of both teams has been terrible this year and the only reason they are doing well is that their offenses are so good that they can "outscore their mistakes".

This led me to think about the Colts without Peyton Manning this year. For the past 3 or 4 years the Colts, with Manning as their quarterback, have dominated the sport. It would seem that they built the entire team around Manning. The Colts would run up the score on offense and then the opposing team would be forced to pass often just to catch up. Then the Colts defense would focus on the opposing team's quarterback to keep him from making plays. Now this year without Manning the Colts have no game. Are Green Bay and New England in a similar situation?

Contemplating statistics during rush hour traffic is a good way to become a statistic so I did not get much more in depth listening to the show, but after arriving home I ran some SQL queries to check the veracity of the claims made by the radio show.

Indeed it is true that the defenses of both Green Bay and New England have given up more than the average number of yards this year. They are both almost dead last in defensive performance. Here is a list of the teams with the average number of yards given up per play on both passing and rushing plays.


Team            Yards given up per play
Atlanta         4.3638
Pittsburgh      4.7886
Baltimore       4.8224
Houston         4.9090
Cincinnati      5.1182
San Francisco   5.1645
New York        5.1786
Cleveland       5.2142
Jacksonville    5.2292
Seattle         5.3491
Tennessee       5.3568
Washington      5.4359
Detroit         5.5313
Arizona         5.5506
Miami           5.5726
Denver          5.6300
Kansas City     5.6887
Chicago         5.6887
Oakland         5.7423
Dallas          5.7464
San Diego       5.7867
Minnesota       5.8281
St. Louis       5.8436
Indianapolis    5.8923
Philadelphia    6.0145
Buffalo         6.0374
New York        6.0508
New Orleans     6.0600
Carolina        6.3130
New England     6.3642
Tampa Bay       6.4829
Green Bay       6.5041
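The SQL used for the list above is not shown in this post, so the following is only a sketch of the same kind of aggregation in Python. The record layout (defending team, yards gained on the play) and the sample values are invented for illustration and are not real NFL data.

```python
from collections import defaultdict


def yards_allowed_per_play(plays):
    """Average yards given up per play for each defending team.

    plays: iterable of (defense_team, yards_gained) covering both
    passing and rushing plays. Returns (team, average) pairs sorted
    from the stingiest defense to the most generous one.
    """
    totals = defaultdict(lambda: [0, 0])  # team -> [total yards, play count]
    for team, yards in plays:
        totals[team][0] += yards
        totals[team][1] += 1
    averages = {team: yards / count for team, (yards, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1])


# Made-up plays for two hypothetical defenses.
sample_plays = [("AAA", 3), ("AAA", 6), ("BBB", 12), ("BBB", 2), ("BBB", 4)]
ranking = yards_allowed_per_play(sample_plays)
```

In SQL the same result would come from grouping the play table by defending team and averaging the yards column.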

I have a feeling that a team with a balanced offense and a good pass defense like Pittsburgh or Baltimore could give New England and/or Green Bay a tough time in the post season but maybe we will cover that next week.

The predictions for week 14 are:

Favorite Spread Underdog Discrete Pagerank
at PIT 9 CLE PIT PIT
at BAL 7 IND BAL BAL
HOU 5 at CIN CIN HOU
at GB 15 OAK GB GB
at NYJ 7 KC NYJ NYJ
at DET 7 MIN DET DET
NO 2 at TEN NO TEN
PHI 9 at MIA PHI PHI
NE 15 at WAS NE NE
ATL 4 at CAR ATL ATL
at JAX 5 TB TB JAX
SF 3 at ARI SF SF
DEN 2 CHI DEN CHI
at SD 12 BUF SD BUF
at DAL 3 NYG DAL DAL
at SEA 4 STL SEA SEA


-- Greg Szalkowski

Thursday, December 1, 2011

2011-12-01: 2011 NFL Season Week 13

Week 13 of the 2011 NFL season is upon us. This week New England is a 20 point favorite over Indianapolis. 20 points is rather significant for a line value. In fact, since 2002 there have only been six games with a line value of 20 or greater. Of those six games, New England was the favorite in five. In none of the five games did New England cover the spread, but they came close in the 2007 game against Miami, winning by 21 points with a 22 point line value.
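For readers unfamiliar with the term, "covering the spread" means the favorite wins by more than the line value. A minimal sketch of that check follows; the function name and the returned labels are chosen here for illustration.

```python
def spread_result(favorite_margin, line):
    """Classify a game from the favorite's point of view.

    favorite_margin: favorite's points minus underdog's points.
    line: number of points by which the favorite was favored.
    """
    if favorite_margin > line:
        return "cover"
    if favorite_margin == line:
        return "push"
    return "miss"


# The 2007 New England vs. Miami game described above:
# a 21 point win against a 22 point line.
result_2007 = spread_result(21, 22)
```

A 21 point win is a comfortable victory straight up, yet it still counts as a miss against a 22 point line.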




Favorite Spread Underdog Discrete Pagerank
PHI 5 at SEA PHI SEA
TEN 3 at BUF BUF TEN
at CHI 4 KC CHI CHI
MIA 7 at OAK MIA OAK
at PIT 6 CIN PIT PIT
BAL 1 at CLE BAL BAL
NYJ 1 WAS NYJ NYJ
at HOU 7 ATL ATL HOU
CAR 6 at TB TB CAR
at NO 7 DET NO NO
at MIN 6 DEN DEN DEN
at SF 10 STL SF SF
DAL 8 at ARI DAL DAL
GB 2 NYG GB GB
NE 10 IND NE NE
SD 4 JAX SD SD


-- Greg Szalkowski

Thursday, November 24, 2011

2011-11-24: 2011 NFL Season Week 12

Happy Thanksgiving!

I apologize for posting these a little late but I have been cooking food for the past three days. When I was not cooking I was reading papers about modifications to Support Vector Machines to get some ideas to improve the accuracy of our predictions.
"Adapting Ranking SVM to Document Retrieval" concentrated on a modification of the hinge loss function when training the model to increase accuracy. "Training a Support Vector Machine in the Primal" points out that much of the literature jumps right to the dual optimization aspect of SVMs and does not pay enough attention to the primal problem. A portion of the paper mentions replacing the hinge loss function with one that is differentiable, such as the Huber loss function.
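To make the difference concrete, here is a rough sketch of the two losses as functions of the label y (+1 or -1) and the model output t. The smoothed version follows the Huber-style approximation of the hinge commonly associated with the primal-training paper; the default width h is an arbitrary choice here, and the exact form should be checked against the paper before relying on it.

```python
def hinge_loss(y, t):
    """Standard hinge loss: zero once the margin y*t reaches 1."""
    return max(0.0, 1.0 - y * t)


def huber_hinge_loss(y, t, h=0.5):
    """Differentiable approximation of the hinge loss.

    Linear for margins below 1 - h, quadratic in the band
    [1 - h, 1 + h], and zero above 1 + h, so the kink of the
    hinge at margin 1 is replaced by a smooth curve.
    """
    margin = y * t
    if margin > 1.0 + h:
        return 0.0
    if margin < 1.0 - h:
        return 1.0 - margin
    return (1.0 + h - margin) ** 2 / (4.0 * h)


# Compare the two losses at a few margins for a positive example.
comparison = [(t, hinge_loss(1, t), huber_hinge_loss(1, t)) for t in (0.0, 1.0, 2.0)]
```

The two functions agree away from the margin; only near a margin of 1 does the smoothed loss stay slightly positive, which is what makes gradient-based training in the primal straightforward.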

While experimenting with SVM training I observed an interesting data point. Using NFL statistics from 2002 to 2010, one of the training methods assigned the following weights to the teams.
1.1276 Indianapolis Colts
1.1055 New England Patriots
0.5704 Philadelphia Eagles
0.4269 Pittsburgh Steelers
0.3317 Tennessee Titans
0.3184 San Diego Chargers
0.2922 Baltimore Ravens
0.2248 New Orleans Saints
0.1976 Green Bay Packers
0.1718 Jacksonville Jaguars
0.1621 Tampa Bay Buccaneers
0.0957 Carolina Panthers
0.0946 Atlanta Falcons
0.0877 New York Jets
0.0704 Chicago Bears
0.0504 New York Giants
-0.0569 Dallas Cowboys
-0.0689 Buffalo Bills
-0.0740 Miami Dolphins
-0.0812 Houston Texans
-0.1085 Denver Broncos
-0.1752 Cincinnati Bengals
-0.2420 Kansas City Chiefs
-0.3457 Cleveland Browns
-0.3631 Minnesota Vikings
-0.3634 Seattle Seahawks
-0.4150 Arizona Cardinals
-0.5037 St. Louis Rams
-0.5121 Oakland Raiders
-0.5633 Washington Redskins
-0.6925 San Francisco 49ers
-0.7623 Detroit Lions

Well, it is time for me to break out the whipping cream and make some homemade whipped cream for the pies I baked the other day. Here are the picks for this week.

Favorite Spread Underdog Discrete Pagerank
GB 6 at DET GB GB
at DAL 8 MIA DAL DAL
at BAL 3 SF BAL BAL
at STL 5 ARI STL STL
at NYJ 1 BUF NYJ BUF
at CIN 8 CLE CIN CIN
HOU 4 at JAX HOU HOU
CAR 7 at IND CAR IND
at TEN 20 TB TEN TEN
at ATL 11 MIN ATL ATL
at OAK 6 CHI OAK CHI
at SEA 5 WAS SEA SEA
NE 6 at PHI NE PHI
at SD 6 DEN SD DEN
PIT 11 at KC PIT PIT
at NO 3 NYG NO NO


-- Greg Szalkowski

Thursday, November 17, 2011

2011-11-17: 2011 NFL Season Week 11


On Thursday Night Football this week, the NY Jets play at Denver. The Jets have a number of players on the injured list this week. Even with those injuries all three of our algorithms picked the Jets to win on Thursday.

The Jets' injury list is not as bad as some of the other teams'. Philadelphia's quarterback, Michael Vick, has two broken ribs and has not been at practice all week. Kansas City's quarterback, Matt Cassel, underwent hand surgery and will probably be out for the rest of the season.

A weakness of our algorithms is that they are heavily based on this year's performance to date. A major injury to an important player, which may or may not have an impact on game performance, is not really taken into account. That is one of the reasons we have incorporated the line data this year, hoping that the "collective intelligence" of the crowd would help point out teams that may perform differently.




Favorite Spread Underdog Discrete Pagerank
NYJ 4.5 at DEN NYJ NYJ
at ATL 4 TEN ATL TEN
BUF 3.6 at MIA MIA BUF
at BAL 5 CIN BAL BAL
JAX 7 at CLE JAX JAX
at MIN 5 OAK MIN OAK
at DET 11 CAR DET DET
at GB 18 TB GB GB
DAL 8 WAS DAL DAL
at SF 8 ARI SF SF
at STL 2 SEA SEA SEA
at CHI 5 SD CHI CHI
at NYG 7 PHI NYG PHI
at NE 8 KC NE NE


-- Greg Szalkowski

Thursday, November 10, 2011

2011-11-10: 2011 NFL Season Week 10

Thursday Night Football is back! The match-up for tonight features the San Diego Chargers at home vs the Oakland Raiders.

This is a very close matchup according to the stats. San Diego has an offensive pass efficiency of 7.2 and Oakland 6.7. Oakland has a better run game but not by much. The defensive ratings are almost exactly the same with San Diego leading by a little bit.

The SVM and Neural Network both chose San Diego to win but the PageRank algorithm decided Oakland was a better choice.

Favorite Spread Underdog Discrete Pagerank
At SD 8.5 OAK SD OAK
PIT 1.5 At CIN PIT PIT
At KC 5 DEN KC KC
At IND 4 JAX JAX JAX
At DAL 3 BUF DAL DAL
HOU 7 At TB HOU HOU
At CAR 6 TEN CAR TEN
At MIA 5 WAS MIA WAS
At ATL 6 NO ATL NO
At CHI 6 DET CHI CHI
At CLE 4 STL CLE STL
At PHI 11 ARI PHI PHI
BAL 4 At SEA BAL BAL
NYG 3 At SF SF SF
NE 10 At NYJ NE NYJ
At GB 14 MIN GB GB

-- Greg Szalkowski

2011-11-10: Day in the Life of a Computer Scientist

Old Dominion University has a freshman computer science course that focuses on what it means to be a computer scientist. This course discusses career opportunities and current research, and serves to debunk myths and misconceptions about the field of computer science. Such myths include: we never talk to humans, we code our entire lives away, and we are nocturnal. I was invited to be a guest lecturer for the class last night. Even though the last myth is sometimes true, I did my best to touch on each of these talking points during the presentation embedded below.




The first topics I spoke about in the presentation generated the majority of questions and discussion. I spoke first about the digital preservation work being performed in the WS-DL group at ODU. This, of course, included discussing the ArchiveFacebook, Warrick, and Memento projects at a cursory level. During our discussion, we hit on the issues of copyright and web crawling, and why, as computer scientists, we find these problems interesting. We briefly talked about revisitation policies and change-rate studies of web pages that are important for search engines and archival methods. (Interested readers should direct their attention to Cho and Garcia-Molina's work (1999) for a canonical study on recrawl and page change rates.) We also discussed why computing theory (not just development) is important in the current research being performed in the academic and industrial communities.

The remainder of my talk was a description of what I do on a daily basis as a professional computer scientist. I mentioned that I work at The MITRE Corporation as a developer and researcher, and discussed what my job entails. For example, I practice Agile engineering, work with people on a daily basis, and probably spend less than a quarter of my time in actual development. The remainder of my time is spent in testing cycles, working with customers to find direction for products, writing documentation, and other "non-coding" aspects of software development. Further, I discussed that MITRE is a unique company in that it is a Federally Funded Research and Development Center (FFRDC) and supports the US government in an advisory role. This point illustrated that there is a variety of opportunities available to computer scientists, and not all of them are at traditional corporations.

My lecture was meant to illustrate that a professional developer doesn't sit in a dark cubicle all night hammering out code or go weeks without human interaction. More importantly, this presentation provided examples of work being done in industry and academia, and showed the students how the degree they are earning will benefit them in their careers.


--Justin F. Brunelle

Saturday, November 5, 2011

2011-11-04: 2011 NFL Season Week 9

Week 9 is the last of the bye weeks for this year. Two games that have a higher level of chatter this week and should be good games to watch are the New England vs Giants game and the Pittsburgh vs Baltimore game.

The Patriots with Brady and the Giants with Manning both have potent veteran quarterbacks, and this game will probably hinge on the passing efficiency of both teams. The offensive passing efficiency of both teams is comparable at about 7.8 yards; however, the Giants' pass defense is better, with only 5.9 yards given up compared to 7.5 for New England.

The rhetoric for the Baltimore vs Pittsburgh game has been rather lively. The Baltimore defense is possibly one of the best this year, with a pass defense of only 4.8 yards given up, although the offense has not been stellar, with a pass efficiency of 5.8 yards. Pittsburgh's pass offense is better, rated at 7.02 yards, while their pass defense, though still decent at 5.1 yards, is not as good as Baltimore's.

Favorite Spread Underdog Discrete Pagerank
ATL 0.5 At IND ATL ATL
At NO 12 TB NO NO
At HOU 9 CLE HOU HOU
At BUF 4 NYJ BUF BUF
At KC 3 MIA KC KC
SF 4.5 At WAS SF SF
At DAL 6 SEA DAL DAL
At OAK 5 DEN OAK OAK
At TEN 3 CIN TEN TEN
At ARI 7 STL ARI STL
At NE 6 NYG NE NYG
At SD 2 GB GB GB
At PIT 3 BAL PIT BAL
At PHI 8 CHI PHI PHI

-- Greg Szalkowski

2011-11-05: Agile Engineering - ODU's ACM Meeting

I was invited to present an overview on Agile Development to Old Dominion University's ACM chapter. More specifically, I gave an overview of the Scrum method. My work in MITRE's Agile Engineering department has allowed me to practice Agile methodologies in the work force. Through this presentation, I shared my experiences with the members of the ACM.







Agile engineering's main focus is a shift from a linear development model. The Waterfall model is the classic example of a linear process model. Agile focuses on a cyclic and adaptive model. One of the main focuses of Agile is to receive and incorporate user feedback into the development process in order to produce a better product for the user. Also, it allows the product owner to garner greater control over a project.

Each cycle in Agile includes all of the traditional development steps: Requirements, Design, Implementation, Verification, and Assessment/Maintenance. These cycles are sometimes called sprints. At the conclusion of each sprint, a fully releasable product should be available. That is, the end of the sprint produces a product that has been through all of the necessary development steps and can be sold as a subset of the end-goal product. This provides the benefit of having a complete and deliverable product even if funding is cut or production must be halted.

An Agile development model provides the benefit of Failing Early. This means the development team can encounter errors earlier in the development process and solve them when the costs are lower. An overly simplistic example would be the selection of a database. If MySQL is chosen at the beginning of a project using Agile, the development team would know earlier in the process whether it was suitable for the solution. However, in the Waterfall model, it is possible to not understand the requirements until too late in the process to make a cheap switch.

An ODU WS-DL alumnus (Carlton Northern) has been instrumental in releasing a handbook for implementing Agile methods. This handbook provides guidelines for implementing Agile methods in the government (specifically the DoD) environment.

These resources should serve as an introduction to Agile methods and the benefits of using this development model.


-- Justin F. Brunelle

Saturday, October 29, 2011

2011-10-28: 2011 NFL Season Week 8


I am back from San Diego and while I ran into some computer problems while I was there, thankfully the results of my trip were much better than the results of last week's predictions.

Our discrete winner predictor is based on a Sequential Minimal Optimization (SMO) method for training the Support Vector Machine (SVM). In our experiments, the SVM has proven to be one of the best binary classifiers for predicting the winner/loser of NFL games.

As I mentioned a few weeks ago, this year we have incorporated the betting line data into the classification model as a form of collective intelligence. The betting line data quickly began to dominate the output of the prediction model followed by passing efficiency and turnovers in importance to the outcome. The result of favoring the betting line is that the classifier usually follows the favorite and when there are a number of upsets like last week, then our results are below expectations.

Indeed many of the experts did not fare that well either last week. This led me to think about how "the experts" and the hypothetical average NFL fan make their choices. Are the fans influencing the betting line with their bets, or is the line influencing the bets of the fans? Some form of endogeneity may be rearing its head and interfering with the model.

Washington and Buffalo will be playing in Toronto this week.  
Favorite Spread Underdog Discrete Pagerank
At TEN 3 IND TEN TEN
At HOU 7.2 JAX HOU HOU
At CAR 3.4 MIN CAR CAR
NO 6.8 At STL NO NO
At BAL 9.3 ARI BAL BAL
At NYG 8.4 MIA NYG NYG
At BUF 4.4 WAS BUF BUF
At DEN 1.7 DET DEN DET
At PIT 0.8 NE NE NE
At SF 5.6 CLE SF SF
CIN 2.7 At SEA CIN CIN
At PHI 4.2 DAL PHI DAL
SD 0.8 At KC SD KC

-- Greg Szalkowski

Sunday, October 23, 2011

2011-10-22: 2011 NFL Season Week 7

I have been on travel in San Diego all this week and I have had computer issues the entire time. Therefore I am posting these picks at the last minute and do not have much in the way of commentary especially since I was on an airplane during most of the games this past Sunday. I would like to thank my wife for typing in the commands I told her over the phone so that we could run the algorithms in order to get the picks done. I would not have been able to do it without her.

Favorite Spread Underdog Discrete Pagerank
TB 2.2 CHI TB CHI
At CAR 4.9 WAS CAR WAS
SD 0.3 At NYJ SD NYJ
At CLE 2.7 SEA CLE SEA
At TEN 5 HOU TEN TEN
At MIA 1.3 DEN MIA DEN
At DET 3.6 ATL ATL DET
At OAK 5.5 KC OAK OAK
PIT 4.9 At ARI PIT PIT
At DAL 10.3 STL DAL DAL
GB 1.5 At MIN GB GB
At NO 4.5 IND NO NO
BAL 4 At JAX BAL JAX
-- Greg Szalkowski

Sunday, October 16, 2011

2011-10-14: 2011 NFL Season Week 6

Our neural network predictor was 68% correct straight up this past week but overall our results were not awe-inspiring. Two of the games that almost everyone got wrong were the Eagles-Bills and Seahawks-Giants games. In both games the favorite lost and one of the crucial stats was interceptions. Michael Vick of the Eagles threw four interceptions and Eli Manning threw three for the Giants. This is completely out of character for both quarterbacks.

So far this year our Support Vector Machine (SVM) predictor has tracked the favorites very closely. With the addition of the line data this year, the line value has driven the output of the SVM. Ignoring the line values, passing efficiency and turnovers forced by the defense have been two of the most dominant statistics.

Predictions for week 6:

Favorite Spread Underdog Discrete Pagerank
At GB 14.5 STL GB GB
At PIT 9.5 JAX PIT PIT
PHI -0.7 At WAS PHI WAS
At DET 3.4 SF DET SF
At ATL -0.3 CAR ATL ATL
At CIN 3.2 IND CIN CIN
At NYG 4.7 BUF NYG BUF
At BAL 5.6 HOU BAL BAL
At OAK 6.4 CLE OAK OAK
At NE 8.6 DAL NE DAL
NO -4.5 At TB NO NO
At CHI -1.7 MIN CHI MIN
At NYJ 3.8 MIA MIA NYJ
-- Greg Szalkowski

Friday, October 7, 2011

2011-10-06: Week 5 2011 NFL Season

Week 4 performance was rather pleasing. Straight up and against the spread were 80% and 75% correct. Buffalo lost with that last minute field goal, and why Philadelphia fell apart in the second half and lost a 20 point lead has been the subject of numerous commentators' discussions. Hopefully the predictions continue to perform at this level, but pessimism indicates that they will regress to the mean.

Week 5 of the NFL season means the commencement of bye weeks. This week's teams on bye are the Baltimore Ravens, Cleveland Browns, Dallas Cowboys, Miami Dolphins, St. Louis Rams and Washington Redskins.

For comparison purposes we have included one of the better performing algorithms from the past two years. The PageRank algorithm that we modified to indicate strong teams averaged 68% for straight up predictions over the past two years. A more detailed explanation is provided in one of our previous posts.
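
As a rough illustration of how PageRank can be turned into a team ranking, the sketch below treats every game as a link from the loser to the winner, so rank flows toward teams that beat other well-ranked teams. This is plain PageRank by power iteration only; the modifications we made to indicate strong teams are not reproduced here, and the function name, the damping factor and the sample results are all assumptions invented for the example.

```python
def team_pagerank(games, damping=0.85, iterations=100):
    """Rank teams with plain PageRank; each game is a (winner, loser) pair.

    Assumption for this sketch: a loss is a link from the loser to the winner.
    """
    teams = sorted({team for game in games for team in game})
    # losses[t] lists the teams that beat t (the outgoing links of t).
    losses = {team: [] for team in teams}
    for winner, loser in games:
        losses[loser].append(winner)

    n = len(teams)
    rank = {team: 1.0 / n for team in teams}
    for _ in range(iterations):
        # Rank held by undefeated teams (no outgoing links) is spread evenly.
        dangling = sum(rank[t] for t in teams if not losses[t])
        new_rank = {t: (1.0 - damping) / n + damping * dangling / n for t in teams}
        for loser in teams:
            for winner in losses[loser]:
                new_rank[winner] += damping * rank[loser] / len(losses[loser])
        rank = new_rank
    return rank


# Hypothetical results, invented for the example: (winner, loser).
sample_games = [("GB", "CHI"), ("GB", "DET"), ("DET", "CHI"), ("CHI", "MIN")]
ranks = team_pagerank(sample_games)

# Pick the higher-ranked team of a matchup as the predicted winner.
predicted_winner = max(("GB", "MIN"), key=ranks.get)
```

A prediction is then just a comparison of the two teams' ranks, which is why this approach needs no box score data at all, only who beat whom.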

The predictions for Week 5:

Favorite Line Underdog Discrete PageRank
At IND 6.3 KC IND KC
At MIN 4.6 ARI MIN ARI
At BUF 1 PHI PHI BUF
At HOU 3.4 OAK HOU OAK
At CAR 0.2 NO NO NO
CIN 3.2 At JAC JAC JAC
At PIT 1 TEN PIT TEN
At NYG 9.7 SEA NYG NYG
At SF 2 TB TB SF
At NE 11.3 NYJ NE NYJ
SD 1.3 At DEN SD DEN
GB 2.9 At ATL GB GB
At DET 5.3 CHI CHI DET

-- Greg Szalkowski

Sunday, October 2, 2011

2011-10-02: 2011 NFL Season Under Way

The 2011 NFL season is underway and we are ready to put some of our improved algorithms to the test. Last year we primarily used box score data for our predictions. This resulted in adequate performance but nothing spectacular.

This year we are increasing the collective intelligence quotient in our algorithm by incorporating betting line data and line movement. The purpose of the betting line is to make the sportsbooks money by splitting the betting population in half. The line will move as a result of betting pressure from the betting population. For example, suppose the favorite is favored by 5 points. Many bettors may feel that the favorite is not that good and place bets on the underdog. With an unbalanced wager profile the sportsbook has the potential to lose money, so it will move the line until the incoming bets are equal on each side. This movement is a form of collective intelligence of the betting population.

Another change this year is that in addition to choosing the winner as a discrete value (winner or loser), we will also predict the line value as a continuous variable. This line value is what we think the line should be. If the favorite team is favored by 5 points and we predict 3 points, it may be wise to bet on the underdog. However, if the favorite team is favored by 2 points and we predict 7 points, the favorite is the better option.
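
The comparison between the posted line and our predicted line boils down to a simple rule. The snippet below is a minimal sketch of that rule, not the code behind the predictions; the function name and the "no edge" label for equal lines are assumptions made for the example.

```python
def pick_side(market_line, predicted_line):
    """Return which side the predicted line favors against the market line.

    Both lines are the points by which the favorite is expected to win.
    """
    if predicted_line > market_line:
        return "favorite"  # we expect a bigger win than the market does
    if predicted_line < market_line:
        return "underdog"  # we expect a closer game than the market does
    return "no edge"


print(pick_side(5, 3))  # underdog
print(pick_side(2, 7))  # favorite
```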

Without further ado, here is what we are looking at for week 4:
Time Favorite Line Underdog Discrete
10/2 1:00 ET At Dallas 3.1 Detroit Dallas
10/2 1:00 ET New Orleans 2.5 At Jacksonville New Orleans
10/2 1:00 ET At Philadelphia 8.6 San Francisco Philadelphia
10/2 1:00 ET Washington 2.2 At St. Louis Washington
10/2 1:00 ET Tennessee 4.3 At Cleveland Tennessee
10/2 1:00 ET At Cincinnati 2.3 Buffalo Buffalo
10/2 1:00 ET Minnesota 3.7 At Kansas City Kansas City
10/2 1:00 ET Carolina 0.7 At Chicago Chicago
10/2 1:00 ET At Houston 0.9 Pittsburgh Houston
10/2 4:05 ET Atlanta 2.0 At Seattle Atlanta
10/2 4:05 ET NY Giants 0.3 At Arizona NY Giants
10/2 4:15 ET At San Diego 5.2 Miami San Diego
10/2 4:15 ET At Green Bay 9.5 Denver Green Bay
10/2 4:15 ET New England 4.5 At Oakland New England
10/2 8:25 ET At Baltimore 6.8 NY Jets Baltimore
10/3 8:35 ET At Tampa Bay 1.3 Indianapolis Tampa Bay
--Greg Szalkowski

Wednesday, September 14, 2011

2011-09-14: Dissertation Completed

I am very happy to write about the successful completion of my dissertation work in the Computer Science Department at Old Dominion University.
My dissertation is titled "Using the Web Infrastructure for Real Time Recovery of Missing Web Pages" and, as the title suggests, it makes several contributions in the areas of digital data preservation and information retrieval. In brief, the dissertation evaluates multiple techniques for a "just-in-time" approach to web page preservation. We, for example, investigate the suitability of lexical signatures and web page titles to rediscover missing content. These two methods are based on old copies of the pages provided by the Memento framework. We also analyze the performance of tags that users have created to annotate pages as well as the most salient terms derived from a page's link neighborhood as methods to find missing pages.
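
For readers unfamiliar with lexical signatures: the idea is to pick the few terms that best characterize an old copy of a page and send them to a search engine as a query. The sketch below scores terms with a simple TF-IDF and keeps the top k. It is only an illustration under stated assumptions, not the dissertation's implementation: the function names are made up, and the tiny in-memory corpus stands in for the much larger source a real system would need to estimate how common each term is.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus is a list of other page texts, used only to estimate how
    common each term is (an assumption made for this sketch).
    """
    term_counts = Counter(tokenize(page_text))
    documents = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(documents) + 1  # count the page itself as one document

    scores = {}
    for term, count in term_counts.items():
        doc_freq = 1 + sum(term in doc for doc in documents)
        scores[term] = count * math.log(n_docs / doc_freq)
    # Sort by score (highest first), breaking ties alphabetically.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


# Invented example texts.
old_copy = "memento timemap memento archive timemap memento the web the"
corpus = ["the web is large", "the news of the day", "the archive of news"]
signature = lexical_signature(old_copy, corpus, k=2)
print(" ".join(signature))  # memento timemap
```

Terms that appear everywhere (like "the") score zero, so the signature is made of terms that are frequent on the page but rare elsewhere.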

On the practical side, the dissertation introduces Synchronicity, a Firefox add-on that implements all evaluated methods for web page recovery. It catches 404 "Page not Found" errors when they occur and offers alternatives in real-time, while the user is browsing. I concluded writing my thesis in June, defended on July 18th and got the degree officially awarded in August 2011.


It goes without saying this work would not have been possible without the outstanding support from my dissertation committee. It consisted of the internal members Dr. Michele C. Weigle, Dr. Yaohang Li and Dr. Mohammad Zubair and the external members Dr. Herbert Van de Sompel and Dr. Robert Sanderson.

I am deeply grateful to my advisor Dr. Michael L. Nelson for his eternal patience and superior guidance and mentoring. He truly is a role model for all aspiring academics (and he took me to a Hokies football game).

I am now looking back at six years of taking classes (MS and Ph.D. level), passing diagnostic and candidacy exams, conducting countless experiments, publishing over 20 research papers (and writing even more), teaching two classes, giving numerous guest lectures and I can finally give an answer to the ever annoying question: "When are you going to be done?".

As my next step, I am very excited to join the Research Library at the Los Alamos National Laboratory as a Postdoctoral Researcher. I will work with Herbert and Rob on Memento and on making time-based access of web resources more convenient and, of course, will enjoy the green and red chili!

--
martin

Sunday, August 28, 2011

2011-08-28: KDD 2011 Trip Report

Author: Carlton Northern

The SIGKDD 2011 conference took place August 21 - 24 at the Hyatt Manchester in San Diego, CA.  Researchers from all over the world interested in knowledge discovery and data mining were in attendance.  This conference in particular has a heavy statistical analysis flavor and many presentations were math intensive.

I was invited to present my master's project research at the Mining Data Semantics (MDS2011) Workshop of KDD.  In this paper, we present an approach to find social media profiles of people from an organization.  This is possible due to the links created between members of an organization. For instance, co-workers or students will likely friend each other, creating hyperlinks between their respective accounts.  These links, if public, can be mined and used to disambiguate other profiles that may share the same names as those individuals we are searching for.  The following figure shows the number of profiles found from the ODU Computer Science student body for each respective social media site and the links found between them.


This picture represents the actual students themselves and the links between them.  Black nodes are undergrads, green nodes are grads, and red nodes are members of the WS-DL research group.



These are the slides:

Here is the paper:

I've synopsized some of the interesting presentations from the conference:


Stephen Boyd - Stanford University "From Embedded Real-Time to Large-Scale Distributed".  Stephen Boyd's talk focused on his current research area of convex optimization.  He explained that convex optimization is a mathematical technique in which many complex problems of model fitting, resource allocation, engineering design, etc. can be transformed to a simple convex optimization problem to be solved and then transformed back into the original problem to get the solution.  He went on to explain how this can be implemented in anything from real-time embedded systems, such as a hard disk drive head seek problem, to large distributed systems, such as California's power grid.


Amol Ghoting - IBM "NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce".  With Hadoop you write a map function and a reduce function, where you can map anything to a (key, value) pair.  The problem with Hadoop is that it has a two-stage data flow which can be cumbersome for programming.  Also, job scheduling and data management are handled by the user.  Lastly, code reuse and portability are diminished.  This toolkit tries to make the key features of Hadoop available to developers but without a Hadoop-specific implementation.  NIMBLE actually decouples algorithm computation from data management, parallel communications and control.  It does this by using a series of basic datasets and basic tasks that create a DAG.  Tasks can spawn other tasks.  With this structure in place, simultaneous data and task parallelism is achievable.
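
To make the two-stage data flow concrete, here is a small in-process imitation of map and reduce over (key, value) pairs, using word counting as the task.  It is only a sketch: it uses neither the Hadoop API nor NIMBLE, and the function names and input lines are made up for the example.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (key, value) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run the two stages in-process: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


lines = ["data mining on MapReduce", "machine learning on MapReduce"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts["mapreduce"])  # 2
```

Anything more involved than one map followed by one reduce has to be chained by hand out of jobs like this one, which is the kind of bookkeeping a toolkit that builds a DAG of tasks is meant to take over.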

David Haussler – UC Santa Cruz “Cancer Genomics”.  The cost of DNA sequencing has dropped dramatically.  It was following Moore’s law but is now falling 10-fold every two years.  Entire genomes can now be sequenced cheaply.  Created the Cancer Genome Atlas.  10,000 tumors will be sequenced in the next two years using this Atlas.  Cancer genome sequencing will soon be a standard clinical practice.  Because each person’s DNA is different, and each tumor resulting from a person’s DNA is different, a huge computational processing problem looms in the near future.


Ahmed Metwally - Google. "Estimating the number of people behind an IP Address".  Most research assumes that there is 1 person using 1 IP address, but this is not the case.  The number of users behind an IP address also changes over time; for instance, a hotel hosting a conference will have many more users than usual, possibly sharing the same IP address.  So, how would one estimate the number of these users in a non-intrusive way?  One method is to look at trusted cookie counts.  Another method is to look at diverse traffic.  Google caps traffic volume per IP to stop people from gaming the system using the same IP address.  Google knows how many users share an IP address when they are logged in with a username and password to Google's sites.  However, some of Google's traffic is from users that don't have a Google account.  This research is for those who want to filter users without asking them for any identification, thus preserving their privacy.  This method is currently being used at Google for determining click fraud.


D. Sculley - Google "Detecting Adversarial Advertisements in the Wild".  An adversarial advertiser would be an advertiser that uses Google AdWords or AdSense to advertise misleading products like counterfeit goods or scams.  Most ads are good; only a small number are bad.  They use in-house trained people to hand-build rule-based models.  Allowing these people to hand-build the rules gave a great incentive and improved morale rather than just having them do repetitive tasks over and over again.  Automated methods are being used as well, but this part of the presentation went right over my head.


Chunyu Luo - University of Tennessee "Enhanced Investment Decisions in P2P Lending: An Investor Composition Perspective".  In this paper, they are trying to decide which loans are worthwhile to invest in; in other words, what makes a good loan?  They use a bipartite investment network with investors on one side, investees on the other, and loans as the edges between them.  Each loan can be considered a composition of many investors.  The idea is that by looking at the past performance of the other investors of a given loan, you can improve your prediction of the return rate for that loan.  They performed an experiment on a dataset from prosper.com.  The composition method far outperformed the average return on investment.
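
The core idea can be sketched in a few lines: score a loan by the track record of the investors who have already put money into it.  The paper's actual model was not something I could reconstruct from my notes, so the unweighted average below is purely my own simplification, and all names and numbers are invented.

```python
def predict_loan_return(loan_investors, past_returns):
    """Estimate a loan's return as the mean past return of its investors.

    loan_investors: investors who have already funded the loan.
    past_returns: investor -> list of returns on that investor's earlier loans.
    Investors with no history are ignored (an assumption of this sketch).
    """
    averages = [
        sum(past_returns[investor]) / len(past_returns[investor])
        for investor in loan_investors
        if past_returns.get(investor)
    ]
    if not averages:
        return None  # nothing to base a prediction on
    return sum(averages) / len(averages)


# Invented data: returns each investor earned on earlier loans.
past_returns = {"ann": [0.08, 0.10], "bob": [-0.20, 0.02], "cy": [0.06]}
loans = {"loan_a": ["ann", "cy"], "loan_b": ["bob", "dee"]}

predictions = {
    loan: predict_loan_return(investors, past_returns)
    for loan, investors in loans.items()
}
best_loan = max(predictions, key=predictions.get)
```

Here loan_a is backed by investors with a good history and loan_b by one with a poor history, so loan_a comes out as the better candidate.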


Susan Imberman - College of Staten Island "From Market Baskets to Mole Rats:  Using Data Mining Techniques to Analyze RFID Data Describing Laboratory Animal Behavior".  This paper presents the data mining techniques used in analyzing RFID data from a colony of Mole Rats.  Much like we use RFID in cars for tolls like EZ Pass, they are using RFID on Mole Rats, and when the animals pass specific points of the colony (a series of pipes and rooms) they collect that sample.  They used k-means clustering, which showed animal place preference.  They used an adjacency matrix to get an idea of which Mole Rats liked to be near one another.  This created 3 distinct subgraphs which corresponded well to the different colony structure of Mole Rats: queen workers, large workers and small workers.  Next they applied market basket analysis, which finds items commonly bought together in grocery store transactions, to the repeated behavior of the Mole Rats.
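
An adjacency matrix of this kind can be built by counting how often two animals are read at the same place at the same time.  The snippet below is my own minimal sketch of that counting step, not the authors' code; the record layout (animal, location, time slot) and the sample reads are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def colocation_counts(reads):
    """Count how often each pair of animals is seen at the same place and time.

    reads: iterable of (animal, location, time_slot) tuples.
    Returns a dict mapping a sorted (animal_a, animal_b) pair to a count,
    which is a sparse form of a symmetric adjacency matrix.
    """
    present = defaultdict(set)
    for animal, location, time_slot in reads:
        present[(location, time_slot)].add(animal)

    counts = defaultdict(int)
    for animals in present.values():
        for pair in combinations(sorted(animals), 2):
            counts[pair] += 1
    return dict(counts)


# Invented RFID reads.
reads = [
    ("r1", "nest", 0), ("r2", "nest", 0), ("r3", "tunnel", 0),
    ("r1", "nest", 1), ("r2", "nest", 1),
    ("r2", "tunnel", 2), ("r3", "tunnel", 2),
]
adjacency = colocation_counts(reads)
print(adjacency[("r1", "r2")])  # 2
```

Pairs with high counts are the animals that tend to stay near one another, and groups of such pairs are what show up as subgraphs.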

After the conference ended on Wednesday, Hurricane Irene was on track for a direct hit to Hampton Roads.  My flight was scheduled to arrive in Norfolk Friday night which was cutting it very close to the storm hitting on Saturday.  So I decided to extend the trip till Monday and ride out the storm here in sunny San Diego.  In total, I managed to miss a hurricane, a tornado, an earthquake, and a swamp fire.  I think I made a good decision...