Sunday, January 13, 2013

/a week for work and ransac


Too Much Work

My first week back at Big Blue in 2013

Last week was my first week back at Big Blue in 2013, and as one can imagine I was fairly busy playing catch-up: mostly model production readiness tests, sprinkled with some meetings and emails. Production readiness in this case entailed pushing as much code as possible to run "in database", and there was a lot of code. Go PostgreSQL...

/ransac

As a result of the stupid hours put in at work, I contributed only minimal time to iMobi, split between two things: the video tutorial (partially complete), and an algorithm called RANSAC (http://en.wikipedia.org/wiki/RANSAC) that uses the Kinect's point cloud to identify planes, and hopefully where the floor is.
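For the curious, the basic idea is simple: repeatedly fit a plane to three random points and keep the fit with the most inliers. Below is a rough Octave sketch of that loop, not the actual iMobi code; the point matrix, iteration count, and tolerance are all placeholders.

% RANSAC plane fit -- a rough sketch only, not the iMobi implementation.
% P is an Nx3 matrix of Kinect points (x, y, z).
function [best_n, best_d, best_in] = ransac_plane(P, iters, tol)
  best_in = [];
  for i = 1:iters
    perm = randperm(rows(P));
    p1 = P(perm(1),:); p2 = P(perm(2),:); p3 = P(perm(3),:);   % 3 random points
    n = cross(p2 - p1, p3 - p1);          % normal of the candidate plane
    if norm(n) < eps, continue; end       % skip degenerate (collinear) samples
    n = n / norm(n);
    d = -dot(n, p1);                      % plane equation: n.x + d = 0
    dist = abs(P * n' + d);               % point-to-plane distance for every point
    inliers = find(dist < tol);
    if numel(inliers) > numel(best_in)    % keep the plane with the most support
      best_in = inliers;  best_n = n;  best_d = d;
    end
  end
end

% e.g. [n, d, idx] = ransac_plane(cloud, 200, 0.02);   % 2 cm tolerance
% The largest, most horizontal plane found this way is a decent guess for the floor.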

More on those two next week.

Wednesday, January 2, 2013

Modeling American Football

/the Problem

Analysts in my company received a challenge to build a model that can predict wins and losses in the NFL.  

Gain Understanding

A crucial step

Knowing absolutely nothing about the sport, I decided to try my hand at the problem; how hard could it be? My first task was to collect enough data for training, test, and validation sets. I started by extracting game outcomes for the 2008-2011 seasons from Pro-Football-Reference.com, and used the same website to gather basic statistics for each team. The variables behind the SRS (Simple Rating System), which measure a team's offensive and defensive strength relative to the average NFL team, became the first data points. Before any modeling could take place, I needed to understand the game better, so I spent a couple of hours reading blogs on which statistics best represent a team's likely performance, and a couple more just reading about various aspects of the game.


In an ideal world my model would include individual player-level data, but the scope of that collection exercise quickly exceeded my available bandwidth. Instead I decided to consider only aggregate team statistics.

The term "SRS" seemed to pop up a lot, so I wanted to start there. I organized the training examples into a Y vector that contained the outcome of each game. Next I arranged the data I had gathered into a matrix where each column represented a feature: SRS, SoS, OSRS, DSRS, etc. Each row of the matrix contained those statistics for both teams in a game, so one row might look like T1_SRS, T2_SRS, T1_Home, T2_Home, etc., as sketched below.
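A toy version of that layout in Octave, with invented numbers just to show the shape of the thing:

% Two made-up games, purely to illustrate the layout described above.
% Y: 1 if the first team won the game, 0 otherwise.
Y = [1; 0];

% X: one row per game, both teams' aggregate statistics side by side.
%    [T1_SRS T1_OSRS T1_DSRS T1_SoS T1_Home  T2_SRS T2_OSRS T2_DSRS T2_SoS T2_Home]
X = [  4.6    3.1     1.5     0.8    1       -2.3    0.2    -2.5    -1.1    0;
      -0.7    1.9    -2.6     2.0    0        6.4    4.0     2.4     0.3    1 ];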

Exploration

Follow the white rabbit

I tried several variations of two machine learning algorithms: first I constructed a decision tree that uses information entropy for pruning, and then I tried a neural network. I have coded a custom neural network that uses backpropagation to learn the weights, but for exploration I just threw the data into SPSS. I like SPSS because it is user friendly and I can get results quickly, but I also found myself severely limited by the software. For example, I was unable to add more than two layers, and I did not see a way to play around with the bias term. That being said, the neural network outperformed the decision tree, so I decided to refine the model using neural networks.
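As a reference point, the pruning criterion is just the information entropy of the labels in a node. A minimal Octave version, my own sketch rather than anything SPSS does internally, looks like this:

% Information entropy of a binary label vector, in bits -- the pruning
% criterion mentioned above (a sketch, not the SPSS internals).
function h = entropy_bits(y)
  p = mean(y);                          % fraction of positive labels (wins)
  if p == 0 || p == 1
    h = 0;                              % a pure node carries no uncertainty
  else
    h = -p*log2(p) - (1-p)*log2(1-p);
  end
end

% e.g. entropy_bits([1 1 0 1]) is about 0.81 bits; a split (or a branch kept
% after pruning) should reduce the children's weighted entropy.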


While I was able to obtain 93% "accuracy" on the training and test sets, I only saw a slight improvement over a naive model on the holdout sample. This was proving to be more difficult than I first imagined. A quick look at the learning curves revealed the problem: I was severely overfitting the training data. This is a problem often caused by "variance", or "noise" in the target, and sometimes the quickest way to improve the model is to gather more data...so back to Google.
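The learning curves themselves are easy to produce: train on progressively larger slices of the training set and compare training error with validation error. Something along these lines, where train_model and error_rate are stand-ins for whatever model and metric are in play (Xtrain, Ytrain, Xval, Yval are assumed to exist already):

% Learning-curve sketch: train on growing subsets and compare the errors.
sizes   = round(linspace(50, rows(Xtrain), 10));
err_tr  = zeros(size(sizes));
err_val = zeros(size(sizes));
for k = 1:numel(sizes)
  m = sizes(k);
  model      = train_model(Xtrain(1:m,:), Ytrain(1:m));       % placeholder trainer
  err_tr(k)  = error_rate(model, Xtrain(1:m,:), Ytrain(1:m)); % placeholder metric
  err_val(k) = error_rate(model, Xval, Yval);                 % fixed validation set
end
plot(sizes, err_tr, '-o', sizes, err_val, '-x');
legend('training error', 'validation error');
% A large, persistent gap between the two curves is the overfitting /
% high-variance signature described above; curves that converge at a high
% error would point to bias instead.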

First I extracted three additional years of data from the initial website; next I pulled in the Vegas odds on each of the games, hoping to make use of the professional betting establishment's sentiment. Finally I dug deeper into the football blogosphere and came across http://www.advancednflstats.com/, which includes both team efficiency ratings and predictions in an easy-to-extract format. The inclusion of this new data gave my models a significant boost: I am now correctly identifying 100% of the cases in the training and test sets, and a percentage on the validation set good enough to go to Vegas with.
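Practically, the new sources just become extra columns in the feature matrix; the variable names below are placeholders for the Vegas line and the efficiency ratings:

% Bolt the new data onto the existing feature matrix as extra columns.
% vegas_line, t1_eff and t2_eff are placeholder column vectors, one row per game.
X = [X, vegas_line, t1_eff, t2_eff];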

Next steps

Refinement

There is likely a lot of collinearity in the data, so I want to reconstruct the model using my own neural network in Octave, where I have more flexibility with the architecture, and try unsupervised ML techniques to address the collinearity. Hopefully these steps will improve the model's performance.
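One candidate for the unsupervised step is PCA on the standardized features: project onto the top principal components and feed those nearly uncorrelated inputs to the network. A sketch of the plan, assuming X is the feature matrix built earlier (this is not code I have run yet):

% PCA sketch to tackle the collinearity -- a plan, not tested code.
% Standardize so no single statistic dominates the components.
mu = mean(X);  sd = std(X);
Xs = (X - repmat(mu, rows(X), 1)) ./ repmat(sd, rows(X), 1);

% Principal components from the covariance of the standardized features.
[V, D] = eig(cov(Xs));
[latent, order] = sort(diag(D), 'descend');
V = V(:, order);

cum_var = cumsum(latent) / sum(latent);
k = find(cum_var >= 0.95, 1);     % keep enough components for ~95% of the variance
X_reduced = Xs * V(:, 1:k);       % decorrelated inputs for the neural network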