The Problem
Analysts in my company were challenged to build a model that can predict wins and losses in the NFL.
Gain Understanding
A crucial step
Knowing absolutely nothing about the sport, I decided to try my hand at the problem; how hard could it be? My first task was to collect enough data for training, test, and validation sets. I started by extracting outcomes for the 2008-2011 seasons from the website Pro-Football-Reference.com. Using the same website, I also gathered basic statistics for each team. The variables created within the SRS (Simple Rating System), which calculate team offensive and defensive strength relative to average NFL team performance, became the first data points. Before any modeling could take place, I needed to understand the game more. I spent a couple of hours reading blogs on which statistics best represent a team's likely performance, and a couple more hours just reading about various aspects of the game. In an ideal world my model would include individual player-level data, but the scope of that collection exercise quickly exceeded my available bandwidth. Instead I decided to only consider aggregate team statistics.
The term "SRS" seemed to pop up a lot, so I wanted to start there. I organized each training example into a Y vector that contained the outcome of each game. Next I transposed the data I had gathered into a matrix where each column represented a feature: SRS, SoS, OSRS, DSRS, etc. Each row in the matrix contained the aforementioned statistics for both teams in a game, so one row might look like T1_SRS, T2_SRS, T1_Home, T2_Home, etc.
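To make that layout concrete, here is a minimal sketch of assembling the outcome vector Y and the feature matrix with NumPy. The team names, rating values, and game results below are invented for illustration; only the row structure (both teams' SRS components plus home/away indicators) follows the description above.

```python
import numpy as np

# Hypothetical per-team season ratings (SRS components in the style of
# Pro-Football-Reference); the numbers are made up for illustration.
team_stats = {
    "NE":  {"SRS": 7.2, "SoS": 0.5, "OSRS": 5.1, "DSRS": 2.1},
    "NYJ": {"SRS": 1.3, "SoS": 1.0, "OSRS": -0.4, "DSRS": 1.7},
}

# Each game: (home team, away team, 1 if the home team won else 0).
games = [("NE", "NYJ", 1), ("NYJ", "NE", 0)]

features = ["SRS", "SoS", "OSRS", "DSRS"]

def game_row(home, away):
    # One row pairs both teams' ratings plus home/away indicators,
    # i.e. T1_SRS, T1_SoS, ..., T2_SRS, T2_SoS, ..., T1_Home, T2_Home.
    return ([team_stats[home][f] for f in features]
            + [team_stats[away][f] for f in features]
            + [1, 0])

X = np.array([game_row(h, a) for h, a, _ in games])
Y = np.array([outcome for _, _, outcome in games])

print(X.shape)  # (2, 10): 4 features per team + 2 home indicators
print(Y)        # [1 0]
```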
Exploration
Follow the white rabbit
While I was able to obtain 93% "accuracy" on the training and test sets, I only saw a slight improvement over a naive model on the holdout sample. This was proving to be more difficult than I first imagined. A quick look at the learning curves revealed the problem: I was severely overfitting the training data. This is a problem often caused by "variance" or "noise" in the target, and sometimes the quickest way to improve the model is to gather more data...so back to Google.
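The gap between training and holdout performance is the telltale sign here. As a self-contained illustration (not the model used above), the sketch below fits a 1-nearest-neighbour classifier, which memorises its training set, on pure noise: training accuracy is perfect while holdout accuracy hovers near chance, exactly the pattern a learning curve exposes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a game-level feature matrix: pure noise with random labels,
# so no model can genuinely beat 50% -- any apparent skill is overfitting.
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

train_X, train_y = X[:150], y[:150]
val_X, val_y = X[150:], y[150:]

def knn_predict(train_X, train_y, query_X):
    """1-nearest-neighbour: predicts the label of the closest training point."""
    preds = []
    for q in query_X:
        dists = np.linalg.norm(train_X - q, axis=1)
        preds.append(train_y[np.argmin(dists)])
    return np.array(preds)

train_acc = np.mean(knn_predict(train_X, train_y, train_X) == train_y)
val_acc = np.mean(knn_predict(train_X, train_y, val_X) == val_y)

print(f"train accuracy:   {train_acc:.2f}")  # 1.00 -- perfect recall of noise
print(f"holdout accuracy: {val_acc:.2f}")    # near 0.50 -- no real signal learned
```

A large train-vs-holdout gap like this points to high variance, and gathering more data is one of the standard remedies.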
First I extracted three additional years of data from the initial website; next I pulled in the Vegas odds on each of the games, hoping to make use of the professional betting establishment's sentiment. Finally I dug deeper into the football blogosphere and came across the website http://www.advancednflstats.com/, which includes both team efficiency ratings and predictions in an easy-to-extract format. The inclusion of this new data gave my models a significant boost: I am now correctly identifying 100% of the cases in the training and test sets, with a percentage on the validation set good enough to go to Vegas with.
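Combining sources like this comes down to joining each new table onto the game records. Here is a rough sketch with pandas; the column names, spreads, and efficiency numbers are all assumptions for illustration, not the actual layout of either website's data.

```python
import pandas as pd

# Hypothetical game outcomes scraped from Pro-Football-Reference.
games = pd.DataFrame({
    "week": [1, 1], "home": ["NE", "NYJ"], "away": ["NYJ", "NE"],
    "home_win": [1, 0],
})

# Hypothetical Vegas lines, from the home team's perspective.
vegas = pd.DataFrame({
    "week": [1, 1], "home": ["NE", "NYJ"],
    "spread": [-7.0, 3.5],
})

# Hypothetical team efficiency ratings in the style of advancednflstats.com.
efficiency = pd.DataFrame({
    "team": ["NE", "NYJ"],
    "off_eff": [0.45, 0.38], "def_eff": [0.52, 0.47],
})

# Join the odds on game identity, then attach each side's efficiency ratings.
df = (games
      .merge(vegas, on=["week", "home"])
      .merge(efficiency.add_prefix("home_"), left_on="home", right_on="home_team")
      .merge(efficiency.add_prefix("away_"), left_on="away", right_on="away_team"))

print(df[["week", "home", "away", "spread", "home_off_eff", "away_off_eff"]])
```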