Showing posts from 2011

2011-12-15: 2011 NFL Season Week 15

So far this year all three of the prediction algorithms are 68% correct straight up. This is better than the predictions of most of the NFL "experts" such as the guys at ESPN. Last year we ended up right below 70% correct as well. Breaking the 70% barrier over the season seems to be rather hard to do, as seen on the Prediction Tracker. Looking into the statistics of those games reveals some interesting information. In the majority of those games, the losing team had better box scores but still lost the game. We had thought that incorporating the betting line data this year would have had an impact, but the accuracy of the straight-up predictions is not significantly better than last year. The season isn't over yet and anything can happen, so here are the predictions for week 15.

Favorite  Spread  Underdog  Discrete  Pagerank
DAL       7       at TB     DAL       DAL
at NYG    10      WAS       NYG       NYG

2011-12-14: Python & Memento Presentation for the ODU ACM

Earlier this semester, I was invited to present Python at an ODU ACM meeting. I presented a brief overview of the Python language and followed up with a walkthrough of the code I use to parse Memento timemaps in my current research. Python, of course, has advantages and disadvantages compared to other languages. Since most ODU undergrads have experience with C++, the presentation presents Python with respect to C++. Python's advantages include a fast development cycle and an extensive collection of community libraries. Its primary disadvantage compared to C++ is execution speed. My experience is that Python is sometimes over 100 times slower. Python's basic syntax and semantics are straightforward, so the presentation focused on the Python equivalents of commonly used C++ constructs and the differences between static (C++) and dynamic (Python) typing. Python's implementation of high-level data types (lists, dictionaries, tuples, and sets) and functional code
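To give a flavor of what parsing a Memento timemap in Python can look like, here is a minimal sketch. It is not the code from the presentation: the function name `parse_timemap`, the regular expressions, and the sample URIs are all assumptions made for illustration, and it assumes the timemap is in link format with RFC 1123 style `datetime` values.

```python
import re
from datetime import datetime

# Hypothetical patterns: one link-format entry looks like
#   <URI>; rel="..."; datetime="..."
ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')
PARAM = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return sorted (datetime, uri) pairs for entries whose rel includes 'memento'."""
    mementos = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(PARAM.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            # %a and %b are locale-dependent; this assumes an English/C locale.
            when = datetime.strptime(attrs["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when, uri))
    return sorted(mementos)


# Made-up sample timemap; the archive host is a placeholder.
sample = '''<http://example.com/>;rel="original",
<http://archive.example.org/20111201120000/http://example.com/>;rel="last memento";datetime="Thu, 01 Dec 2011 12:00:00 GMT",
<http://archive.example.org/20110101000000/http://example.com/>;rel="first memento";datetime="Sat, 01 Jan 2011 00:00:00 GMT"'''

for when, uri in parse_timemap(sample):
    print(when.isoformat(), uri)
```

The dictionary and list handling here is the kind of thing that takes noticeably more code in C++, which is the trade-off the presentation describes.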

2011-12-14: CS 495/595 Web Server Development for Spring 2012

The only WS-DL related class that will be offered in spring 2012 is CS 495/595 "Web Server Development". I had planned to offer CS 751/851 "Introduction to Digital Libraries", but I've taught that the last two springs and it has been a while since I've taught the web server development class (the last offering was actually from Martin Klein in spring 2010). The premise of this course is that the best way to really get to know HTTP is to build a fully functional web server from scratch in the language of your choice. That sounds simple enough, but it becomes quite challenging, in part because if you do a poor job of design at the beginning you have to live with the consequences the entire semester. On the other hand, do a good job up front and each assignment will just drop into place (hello, software design). Along the way, you'll also become quite familiar with reading RFCs and the REST architectural model. Take a look at past offerings of the c

2011-12-08: Summer Microsoft Internship

It all started in San Francisco airport while waiting to get my luggage on my way to the PDA2011 conference. The recruiter from Microsoft called to inform me that I had been accepted to intern at Microsoft Silicon Valley this summer. I was ecstatic, and after a couple of months of bureaucracy and a ton of documents I was ready to leave Norfolk by the end of May. Since I hadn’t been on an adventure or a trip for a long time, and since I would definitely need a car in California for the three months of the summer, I decided to drive my car all the way across the continent. I have always wanted to make a road trip like that, where I can stop in every city or town along the way, check out the attractions and sample the authentic local cuisine. At the same time, our colleague and best friend Moustafa Aly managed to secure a job at Amazon’s engineering office in San Francisco. So when he found out I was going to drive all the way there he told me: “forget the plane, I will join you!” We left Nor

2011-12-07: 2011 NFL Season Week 14

Week 14 of the 2011 NFL season is upon us. Talk of playoff teams and Super Bowl probabilities fills the airwaves even more than Christmas music. Sitting in traffic on the drive home from work tonight I was listening to a few on-air personalities discussing Green Bay and New England for the Super Bowl. Green Bay has already clinched a playoff berth and many people would say they are headed to the Super Bowl this year. The comment that caught my attention was that the defense for both teams has been terrible this year and the only reason they are doing well is that their offenses are so good that they can "outscore their mistakes". This led me to think about the Colts without Peyton Manning this year. For the past 3 or 4 years the Colts with Manning as their quarterback have dominated the sport. It would seem that they built the entire team around Manning. The Colts would run up the score on offense and then the opposing team would be forced to attempt to pass often jus

2011-12-01: 2011 NFL Season Week 13

Week 13 of the 2011 NFL season is upon us. This week New England is a 20-point favorite over Indianapolis. Twenty points is rather significant for a line value. In fact, since 2002 there have only been six games with a line value of 20 or greater. Of those six games, New England was the favorite in five. In none of the five games did New England cover the spread, but they came close in the 2007 game against Miami, winning by 21 points with a 22-point line value.

Favorite  Spread  Underdog  Discrete  Pagerank
PHI       5       at SEA    PHI       SEA
TEN       3       at BUF    BUF       TEN
at CHI    4       KC        CHI       CHI
MIA       7       at OAK    MIA       OAK
at PIT    6       CIN       PIT       PIT
BAL       1       at CLE    BAL       BAL
NYJ       1       WAS       NYJ       NYJ
at HOU    7       ATL       ATL

2011-11-24: 2011 NFL Season Week 12

Happy Thanksgiving! I apologize for posting these a little late but I have been cooking food for the past three days. When I was not cooking I was reading papers about modifications to Support Vector Machines (SVMs) to get some ideas to improve the accuracy of our predictions. "Adapting Ranking SVM to Document Retrieval" concentrated on a modification of the hinge loss function when training the model to increase accuracy. "Training a Support Vector Machine in the Primal" points out that much of the literature jumps right to the dual optimization aspect of SVMs and does not pay enough attention to the primal problem. A portion of the paper mentions replacing the hinge loss function with one that is differentiable, such as the Huber loss function. While experimenting with SVM training I observed an interesting data point. Using NFL statistics from 2002 to 2010, one of the training methods assigned the following weights to the teams.

1.1276 Indianapolis Colts
1.1055 New England Pa
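As a rough illustration of the hinge loss and a differentiable, Huber-style replacement mentioned above, here is a minimal sketch. The function names and the smoothing width `h` are assumptions chosen for illustration; this is one common smoothed form, not necessarily the exact formulation used in either paper or in our training code.

```python
def hinge_loss(y, t):
    """Standard hinge loss for a label y in {-1, +1} and a model output t."""
    return max(0.0, 1.0 - y * t)


def huber_hinge_loss(y, t, h=0.5):
    """Smoothed hinge loss: quadratic within h of the margin, linear below it.

    h > 0 is a hypothetical smoothing width; smaller values track the
    plain hinge loss more closely.
    """
    margin = y * t
    if margin > 1.0 + h:
        return 0.0                      # confidently correct: no penalty
    if margin < 1.0 - h:
        return 1.0 - margin             # same linear penalty as the hinge loss
    return (1.0 + h - margin) ** 2 / (4.0 * h)  # smooth quadratic bridge


for output in (-1.0, 0.0, 1.0, 2.0):
    print(output, hinge_loss(1, output), huber_hinge_loss(1, output))
```

The two pieces of the smoothed version meet with matching values and slopes at `1 - h` and `1 + h`, which is what makes it usable with gradient-based optimization of the primal problem.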

2011-11-17: 2011 NFL Season Week 11

Thursday Night Football: this week the NY Jets play at Denver. The Jets have a number of players on the injured list this week. Even with those injuries, all three of our algorithms picked the Jets to win on Thursday. The Jets' injury list is not as bad as those of some other teams. Philadelphia's quarterback, Vick, has two broken ribs and has not been at practice all week. Kansas City's quarterback, Matt Cassel, underwent hand surgery and will probably be out for the rest of the season. A weakness of our algorithms is that they are heavily based on this year's performance to date. A major injury to an important player, which may or may not have an impact on game performance, is not really taken into account. That is one of the reasons we have incorporated the line data this year, hoping that the "collective intelligence" of the crowd would help to point out teams that may perform differently. Favorite Spread Underdog Discrete

2011-11-10: 2011 NFL Season Week 10

Thursday Night Football is back! The matchup for tonight features the San Diego Chargers at home vs the Oakland Raiders. This is a very close matchup according to the stats. San Diego has an offensive pass efficiency of 7.2 and Oakland 6.7. Oakland has a better run game but not by much. The defensive ratings are almost exactly the same, with San Diego leading by a little bit. The SVM and Neural Network both chose San Diego to win but the PageRank algorithm decided Oakland was a better choice.

Favorite  Spread  Underdog  Discrete  Pagerank
at SAN    8.5     OAK       SAN       OAK
PIT       1.5     at CIN    PIT       PIT
at KC     5       DEN       KC        KC
at IND    4       JAX       JAX       JAX
at DAL    3       BUF       DAL       DAL
HOU       7       at TB     HOU       HOU
at CAR    6       TEN       CAR       TEN
at MIA    5       WAS       MIA       W

2011-11-10: Day in the Life of a Computer Scientist

Old Dominion University has a freshman computer science course that focuses on what it means to be a computer scientist. This course discusses career opportunities and current research, and serves to debunk myths and misconceptions about the field of computer science. Such myths include: we never talk to humans, we code our entire lives away, and we are nocturnal. I was invited to be a guest lecturer for the class last night. Even though the last myth is sometimes true, I did my best to touch on each of these talking points during the presentation embedded below. [Slides: Day in the Life of a Computer Scientist, from Justin Brunelle] The first topics I spoke about in the presentation generated the majority of questions and discussion. I spoke first about the digital preservation work being performed in the WS-DL group at ODU. This, of course, included discussing the ArchiveFacebook, Warrick, and Memento projects at a cursory level. During our discussion, we hit on th

2011-11-04: 2011 NFL Season Week 9

Week 9 is the last of the bye weeks for this year. Two games that have a higher level of chatter this week and should be good games to watch are the New England vs Giants game and the Pittsburgh vs Baltimore game. The Patriots with Brady and the Giants with Manning both have potent veteran quarterbacks, and this game will probably turn on the passing efficiency of both teams. The offensive passing efficiency of both teams is comparable at about 7.8 yards; however, the Giants' pass defense is better, with only 5.9 yards given up compared to 7.5 for New England. The rhetoric for the Baltimore vs Pittsburgh game has been rather lively. The Baltimore defense is possibly one of the best this year, with a pass defense of only 4.8 yards given up, although the offense has not been stellar, with a pass efficiency of 5.8 yards. Pittsburgh's pass offense is better, rated at 7.02 yards, while their pass defense, though still decent, is not as good as Baltimore's, with a rating of 5.1 yards.

2011-11-05: Agile Engineering - ODU's ACM Meeting

I was invited to present an overview of Agile Development to Old Dominion University's ACM chapter. More specifically, I gave an overview of the Scrum method. My work in MITRE's Agile Engineering department has allowed me to practice Agile methodologies in the workforce. Through this presentation, I shared my experiences with the members of the ACM. [Slides: Agile Engineering - ODU ACM, from Justin Brunelle] Agile engineering's main focus is a shift away from a linear development model. The Waterfall model is the classic example of a linear process model. Agile focuses on a cyclic and adaptive model. One of the main goals of Agile is to receive and incorporate user feedback into the development process in order to produce a better product for the user. It also allows the product owner to exert greater control over a project. Each cycle in Agile includes all of the traditional development steps: Requirements, Design, Implementation, Verification, and Assessment

2011-10-28: 2011 NFL Season Week 8

I am back from San Diego and while I ran into some computer problems while I was there, thankfully the results of my trip were much better than the results of last week's predictions. Our discrete winner predictor is based on a Sequential Minimal Optimization (SMO) method for training the Support Vector Machine (SVM). In our experiments, the SVM has proven to be one of the best binary classifiers for predicting the winner/loser of NFL games. As I mentioned a few weeks ago, this year we have incorporated the betting line data into the classification model as a form of collective intelligence. The betting line data quickly began to dominate the output of the prediction model, followed by passing efficiency and turnovers in importance to the outcome. The result of favoring the betting line is that the classifier usually follows the favorite, and when there are a number of upsets like last week, our results are below expectations. Indeed many of the experts did not fare that w

2011-10-22: 2011 NFL Season Week 7

I have been on travel in San Diego all this week and I have had computer issues the entire time. Therefore I am posting these picks at the last minute and do not have much in the way of commentary, especially since I was on an airplane during most of the games this past Sunday. I would like to thank my wife for typing in the commands I told her over the phone so that we could run the algorithms in order to get the picks done. I would not have been able to do it without her.

Favorite  Spread  Underdog  Discrete  Pagerank
TB        2.2     CHI       TB        CHI
at CAR    4.9     WAS       CAR       WAS
SD        0.3     at NYJ    SD        NYJ
at CLE    2.7     SEA       CLE       SEA
at TEN    5       HOU       TEN       TEN
at MIA    1.3     DEN       MIA       DEN
at DET    3.6     ATL       ATL       DET
at OAK    5.5     KC        OAK       OAK
PIT       4.9

2011-10-14: 2011 NFL Season Week 6

Our neural network predictor was 68% correct straight up this past week but overall our results were not awe-inspiring. Two of the games that almost everyone got wrong were the Eagles-Bills and Seahawks-Giants games. In both games the favorite lost and one of the crucial stats was interceptions. Michael Vick of the Eagles threw four interceptions and Eli Manning threw three for the Giants. This is completely out of character for both quarterbacks. So far this year our Support Vector Machine (SVM) predictor has tracked the favorites very closely. With the addition of the line data this year, the line value has driven the output of the SVM. Ignoring the line values, passing efficiency and turnovers forced by the defense have been two of the most dominant statistics. Predictions for week 6:

Favorite  Spread  Underdog  Discrete  Pagerank
at GB     14.5    STL       GB        GB
at PIT    9.5     JAX       PIT       PIT
PHI

2011-10-06: Week 5 2011 NFL Season

Week 4 performance was rather pleasing. Our straight-up and against-the-spread predictions were 80% and 75% correct, respectively. Buffalo lost on that last-minute field goal, and why Philadelphia fell apart in the second half and lost a 20-point lead has been the subject of numerous commentators' discussions. Hopefully the predictions continue to perform at this level, but pessimism indicates that they will regress to the mean. Week 5 of the NFL season means the commencement of bye weeks. This week's teams on bye are the Baltimore Ravens, Cleveland Browns, Dallas Cowboys, Miami Dolphins, St. Louis Rams and Washington Redskins. For comparison purposes we have included one of the better performing algorithms from the past two years. The PageRank algorithm that we modified to indicate strong teams averaged 68% for straight-up predictions over the past two years. A more detailed explanation is provided in one of our previous posts. The predictions for Week 5: Favorite Line Unde

2011-10-02: 2011 NFL Season Under Way

The 2011 NFL season is underway and we are ready to put some of our improved algorithms to the test. Last year we primarily used box score data for our predictions. This resulted in adequate performance but nothing spectacular. This year we are increasing the collective intelligence quotient in our algorithm by incorporating betting line data and line movement. The purpose of the betting line is to make the sportsbooks money by splitting the betting population in half. The line will move as a result of betting pressure from the betting population. For example, suppose the favorite is favored by 5 points. Many bettors may feel that the favorite is not that good and place bets on the underdog. With an unbalanced wager profile the sportsbook has the potential to lose money, so it will move the line until the incoming bets are equal on each side. This movement is a form of collective intelligence of the betting population. Another change this year is that in addition to choosing
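The line movement described above can be sketched as a toy rule. This is only an illustration of the idea, not how any sportsbook actually sets its lines: the function name `move_line`, the half-point `step`, and the `tolerance` threshold are all hypothetical.

```python
def move_line(line, bets_on_favorite, bets_on_underdog, step=0.5, tolerance=0.05):
    """Nudge the point spread toward the side attracting less money.

    line: points by which the favorite is currently favored.
    bets_on_favorite / bets_on_underdog: money wagered on each side.
    step and tolerance are made-up parameters for this sketch.
    """
    total = bets_on_favorite + bets_on_underdog
    if total == 0:
        return line
    imbalance = (bets_on_favorite - bets_on_underdog) / total
    if imbalance > tolerance:
        return line + step   # too much money on the favorite: widen the spread
    if imbalance < -tolerance:
        return line - step   # too much money on the underdog: shrink the spread
    return line              # roughly balanced: leave the line alone


# The favorite opens at 5 points, but most of the money comes in on the underdog.
print(move_line(5.0, bets_on_favorite=300, bets_on_underdog=700))
```

The direction and size of the move is the signal we are interested in: it reflects where the betting population disagrees with the opening line.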

2011-09-14: Dissertation Completed

I am very happy to write about the successful completion of my dissertation work in the Computer Science Department at Old Dominion University. My dissertation is titled "Using the Web Infrastructure for Real Time Recovery of Missing Web Pages" and, as the title suggests, it makes several contributions in the areas of digital data preservation and information retrieval. In brief, the dissertation evaluates multiple techniques for a "just-in-time" approach to web page preservation. We, for example, investigate the suitability of lexical signatures and web page titles to rediscover missing content. These two methods are based on old copies of the pages provided by the Memento framework. We also analyze the performance of tags that users have created to annotate pages as well as the most salient terms derived from a page's link neighborhood as methods to find missing pages. On the practical side, the dissertation introduces Synchronicity, a Firefox add-on

2011-08-28: KDD 2011 Trip Report

Author: Carlton Northern

The SIGKDD 2011 conference took place August 21 - 24 at the Hyatt Manchester in San Diego, CA. Researchers from all over the world interested in knowledge discovery and data mining were in attendance. This conference in particular has a heavy statistical analysis flavor and many presentations were math intensive. I was invited to present my master's project research at the Mining Data Semantics (MDS2011) Workshop of KDD. In this paper, we present an approach to find social media profiles of people from an organization. This is possible due to the links created between members of an organization. For instance, co-workers or students will likely friend each other, creating hyperlinks between their respective accounts. These links, if public, can be mined and used to disambiguate other profiles that may share the same names as those individuals we are searching for. The following figure shows the number of profiles found from the ODU Computer Science st

2011-08-28: Fall 2011 WS-DL Classes

The Web Science and Digital Libraries Research Group is offering two classes for the fall 2011 semester. CS 895 Web-Based Information Retrieval will be offered on Tuesdays, 4:20-7:00 in room 2120 of the ECS building. This class will use the recent Croft, Metzler & Strohman book as the required text, and the Manning, Raghavan, & Schutze book as the recommended text. By choosing the former book as the primary guide for the course, we are intentionally providing a strong engineering component to the class (i.e., a level of coding and development is expected) as opposed to just a theoretical exploration of information retrieval. CS 751/851 Introduction to Digital Libraries is not a prerequisite, but it would help to be familiar with the material covered in that class. Dr. Weigle will be teaching CS 795/895 Information Visualization on Thursdays, 9:30-12:15 in room 2120 of the ECS building. This class is a follow-on to the CS 796/896 Visual Analytics Seminar from last