Wednesday, September 11, 2013

New EPL predictions

I spent the last week and a half reworking the model and building out the skeleton of a web app to better convey the results. I will expand on the web app in the near future to include a more holistic, unsupervised look at the match.



For those interested, the tools in use here include Python for the data munging, Orange (a Python ML library) for the modeling, and d3.js for the visual.




Monday, September 2, 2013

Premier League Predictions




This is my first attempt at using machine learning to predict EPL matches. There are significant improvements to be made, which I will gradually incorporate in future updates.

A brief walkthrough of the visual: the model results describe the likelihood of a favorable outcome, with zero representing a low probability of success. The 365 Odds and the data for the model itself can be sourced from this website.

Sunday, September 1, 2013

Sunday, August 25, 2013

Sunday, August 18, 2013

Extracting Sentiment

This is just a quick follow-up to the Twitter sentiment visualization. The following is a description of some of the technical challenges I faced. This is by no means a complete analysis (some might even consider it naive), but the purpose here is not to build the world's best Twitter analyzer; it is to build a framework with which one can extract tweets, process them, and begin to derive meaning.

The tools in use here include:

Python, the Twitter APIs, NLTK, a word sentiment corpus (I am using the one available via the Coursera Data Science course), and, for the visualization, Nodebox.

I began by extracting tweets. Here I pretty much just followed the instructions from the Coursera Data Science course; for detailed steps on setting up oauth2 and the necessary dependencies on a Mac, check out this earlier post.

Once I had tweets, I had to normalize them. Tweets are messy. They feature an extravagant use of vowels, non-standard English, and special characters:

    import re

    # Convert to lower case
    tweet = tweet.lower()

    # Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

    # Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

    # Collapse additional white space into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)

    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    # Trim surrounding quote characters
    tweet = tweet.strip('\'"')

    # Check if the word starts with a letter (run for each word in the tweet)
    val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", word)

    # Look for a pattern of 2 or more repeated characters, to be replaced with the character itself
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)


Because I have no idea how to format code for a blog, I will refrain from pasting in any more code here and instead just describe the process in detail.

Once I have a "clean" tweet, I use the following steps to process it:

  1. First I remove all "stop words": words such as 'is', 'are', 'the', ... Basically, any word that has no inherent emotional value is removed. While omitting stop words, I match the tweet against the word sentiment corpus mentioned earlier and, based on the total sentiment value of the tweet, assign it a 'positive', 'negative', or 'neutral' value. This was my hacked-up way of coming up with a training set (examples with which to build a model), and it could use a lot of improvement. More on that in the future.
  2. I take all tweets that have either a positive or negative sentiment and a geotag, and append them to a list.
  3. Now it is time to use the NLTK tools to extract a feature list. For more on that, see their documentation here.
  4. With features in hand, you can go ahead and train a classifier. A good example of this can be found here.
Once I have a satisfactory classifier, I store the model using pickle and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
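
Until the full code is up, here is a minimal sketch of steps 1 and 2 using only the standard library. The stop-word list, the sentiment scores, the sample tweets, and the function names are all made-up placeholders rather than the course corpus or my actual code. The NLTK feature extraction and classifier training in steps 3 and 4 are left out.

```python
import pickle

# Hypothetical stand-ins: a real run would load the course's word sentiment
# corpus and a much fuller stop-word list.
STOP_WORDS = {"is", "are", "the", "a", "for", "AT_USER", "URL"}
SENTIMENT = {"happy": 3, "love": 3, "awful": -3, "hate": -3}


def label_tweet(words):
    """Sum the corpus scores of the non-stop words and map the total to a label."""
    total = sum(SENTIMENT.get(w, 0) for w in words if w not in STOP_WORDS)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


def build_training_set(tweets):
    """Keep only geotagged tweets that were labeled positive or negative."""
    training = []
    for words, geotag in tweets:
        label = label_tweet(words)
        if geotag is not None and label != "neutral":
            training.append((words, label))
    return training


# Each tweet is (cleaned words, geotag or None); the values are invented.
tweets = [
    (["so", "happy", "for", "the", "royalbaby"], (51.5, -0.1)),
    (["hate", "the", "coverage"], None),
    (["the", "baby", "is", "here"], (40.7, -74.0)),
]
training = build_training_set(tweets)

# pickle round trip; a trained classifier object can be stored the same way
restored = pickle.loads(pickle.dumps(training))
```

Only the first tweet survives here: the second has no geotag and the third is neutral.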


Sunday, August 11, 2013

Monday, August 5, 2013

Twitter on the Royal Baby


I just completed a trial Twitter sentiment analysis. The larger circle represents the positive tweets associated with #Royalbaby; the smaller circle represents negative sentiment. I used Python to extract an hour's worth of tweets and Nodebox to construct the visual.

Sunday, May 19, 2013

Thank you JJ

Finally, a solid sci-fi concept that balances fun, action, and philosophy into a visually amazing package. The bonus? It is accessible to people who adhere to accepted hygiene practices and have plans for the weekend that don't involve the words "game" and "workshop". A solid movie; looking forward to seeing it once more.

Coursera - Data Science 101

I am participating in the data science course, freely available at coursera.org. I wanted to set up the environment on my Mac and bypass the virtual machine environment (I hate working on a virtual machine). Here are some of the extra steps needed to get the course working on a Mac (I am using Snow Leopard).

  1. Download and install Python 2.7 from the Python website
  2. You will need to install oauth2-1.5.211 in order to access the Twitter stream.  Download here
  3. Install the new library by navigating in the command line to the directory containing the file "setup.py" inside the oauth2 folder and typing: sudo python setup.py install (enter your password when prompted)
    1. I received an error at this point complaining about not being able to locate the setuptools package.  If you also see this error, use the following steps to rectify it:
      1. Search for and download setuptools-0.6c11-py2.7.egg
      2. In the command line run sudo sh setuptools-0.6c11-py2.7.egg (password again)
  4. Once setuptools has been installed, retry the installation of oauth2-1.5.211
  5. Follow the rest of the directions as outlined on the course website.
Hope this helps anyone who had trouble! 



Sunday, January 13, 2013

/a week for work and ransac


Too Much Work

My first week back at Big Blue in 2013

Last week was my first week back at Big Blue in 2013, and as one can imagine I was fairly busy playing catch-up: mostly model production readiness tests, sprinkled with some meetings and emails. Production readiness in this case entailed pushing as much code as possible to run "in database", and there was a lot of code. Go PostgreSQL...

/ransac

As a result of the stupid hours put in at work, I contributed only minimal time to iMobi. The video tutorial (partially complete) was one contribution; the other was an algorithm called RANSAC (http://en.wikipedia.org/wiki/RANSAC), which uses the Kinect's point cloud to identify planes and, hopefully, where the floor is.

More on those two next week.
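
In the meantime, the basic RANSAC idea can be sketched in a few lines: repeatedly fit a plane to three random points and keep the plane that the most points agree with. This is a toy version run on synthetic points, not the iMobi code or real Kinect data; the function names, iteration count, and distance threshold are assumptions picked for illustration.

```python
import random


def plane_from_points(p1, p2, p3):
    """Return (unit normal, d) for the plane n.x + d = 0, or None if degenerate."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    # The cross product of two in-plane vectors gives the plane normal
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = sum(c * c for c in n) ** 0.5
    if length < 1e-9:  # the three points are (nearly) collinear
        return None
    n = [c / length for c in n]
    d = -sum(n[i] * p1[i] for i in range(3))
    return n, d


def ransac_plane(points, iterations=200, threshold=0.02, seed=0):
    """Return the candidate plane with the most inliers, plus those inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        # A point is an inlier if it lies within `threshold` of the plane
        inliers = [p for p in points
                   if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) < threshold]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers


# Synthetic "floor" near z = 0 plus clutter above it (invented data)
rng = random.Random(1)
floor = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-0.005, 0.005))
         for _ in range(200)]
clutter = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0.2, 1.5))
           for _ in range(50)]
plane, inliers = ransac_plane(floor + clutter)
```

On this data the winning plane should have a normal close to the z axis, with most of the floor points as inliers. Picking out which detected plane is actually the floor in a real scene is a separate problem.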

Wednesday, January 2, 2013

Modeling American Football

/the Problem

Analysts in my company received a challenge to build a model that can predict wins and losses in the NFL.  

Gain Understanding

A crucial step

Knowing absolutely nothing about the sport, I decided to try my hand at the problem; how hard could it be? My first task was to collect enough data for training, test and validation sets. I started by extracting outcomes for the 2008 - 2011 seasons from the website Pro-Football-Reference.com. Using the same website I also gathered basic statistics for each team. The variables created within the SRS (Simple Rating System), which calculate team offensive and defensive strength relative to average NFL team performance, became the first data points. Before any modeling could take place, I needed to understand the game more. I spent a couple of hours reading blogs on which statistics best represent a team's likely performance, and a couple more hours just reading about various aspects of the game.


In an ideal world my model would include individual player-level data, but the scope of this type of collection exercise quickly exceeded my available bandwidth. Instead I decided to consider only aggregate team statistics.

The term "SRS" seemed to pop up a lot so I wanted to start there.  I organized each training example into a Y vector that contained the outcome of each game.  Next I transposed the data I had gathered into a matrix where each column represented a feature: SRS, SoS, OSRS, DSRS etc.  Each row in the matrix contained the aforementioned statistics for each team in a game.  So one row might look like T1_SRS, T2_SRS, T1_Home, T2_Home etc.
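
As a rough illustration of that layout, the sketch below builds the feature rows and the Y vector for two games. The team names, the ratings, and the helper names are invented placeholders; the real values came from the scraped data. I am also assuming here that an outcome of 1 means team 1 won.

```python
FEATURES = ["SRS", "SoS", "OSRS", "DSRS"]

# Made-up ratings for two placeholder teams
team_stats = {
    "TeamA": {"SRS": 5.0, "SoS": 0.5, "OSRS": 3.0, "DSRS": 2.0},
    "TeamB": {"SRS": -1.0, "SoS": -0.5, "OSRS": 1.0, "DSRS": -2.0},
}


def column_names():
    """Column labels in the same order that game_row fills them."""
    names = []
    for name in FEATURES:
        names += ["T1_" + name, "T2_" + name]
    return names + ["T1_Home", "T2_Home"]


def game_row(team1, team2, team1_is_home):
    """One matrix row: each statistic for both teams, then the home flags."""
    row = []
    for name in FEATURES:
        row.append(team_stats[team1][name])
        row.append(team_stats[team2][name])
    row.append(1 if team1_is_home else 0)
    row.append(0 if team1_is_home else 1)
    return row


# (team 1, team 2, is team 1 at home?, outcome); outcome 1 = team 1 won
games = [("TeamA", "TeamB", True, 1), ("TeamB", "TeamA", True, 0)]
X = [game_row(t1, t2, home) for t1, t2, home, _ in games]
y = [outcome for _, _, _, outcome in games]
```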

Exploration

Follow the white rabbit





I tried several variations of two machine learning algorithms: first I constructed a decision tree that uses information entropy to prune. Next I tried a neural network. I have coded a custom neural network that uses backpropagation to learn the weights, but for exploration I just threw it into SPSS. I like SPSS because it is user friendly and I can get results quickly, but I also found myself severely limited by the software. For example, I was unable to add more than two layers, and I did not see a way to play around with the bias term. That being said, the neural network outperformed the decision tree algorithm, so I decided to work on refining the model using neural networks.


While I was able to obtain a 93% "accuracy" on the training and test sets, I only saw a slight improvement over a naive model with the holdout sample. This was proving to be more difficult than I first imagined. A quick look at the learning curves revealed the problem: I was severely overfitting the training data. This is a problem often caused by "variance" or "noise" in the target, and sometimes the quickest way to improve the model is to gather more data... so back to Google.
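
For anyone unfamiliar with the entropy measure mentioned above, here is a small sketch of how it is commonly computed and used to score a candidate split in a decision tree. It is not the tree I built; the function names and the win/loss labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = float(len(labels))
    return -sum((count / total) * math.log(count / total, 2)
                for count in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of its child groups."""
    total = float(len(labels))
    remainder = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - remainder
```

An even mix of wins and losses has an entropy of one bit, and a split that separates them perfectly recovers all of it as information gain.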

First I extracted three additional years of data from the initial website; next I pulled in the Vegas odds on each of the games, hoping to make use of the professional betting establishment's sentiment. Finally I dug deeper into the football blogosphere and came across the website http://www.advancednflstats.com/, which includes both team efficiency ratings and predictions in an easy-to-extract format. The inclusion of this new data gave my models a significant boost, and I am now correctly identifying 100% of the cases in the training and test sets and a percentage on the validation set good enough to go to Vegas with.

Next steps

Refinement

There is likely a lot of collinearity in the data. I want to reconstruct the model using my neural network in Octave so I can have more flexibility with the architecture, and try unsupervised ML techniques to address the collinearity. Hopefully these steps will improve the model performance.
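
A quick first check for collinearity, before reaching for anything fancier, is to look at pairwise correlations between feature columns. The sketch below does this in plain Python; the column values, the function names, and the 0.9 cut-off are arbitrary choices for illustration, not results from my data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def collinear_pairs(columns, threshold=0.9):
    """List the column pairs whose absolute correlation reaches the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Invented feature columns, one value per game
columns = {
    "T1_SRS": [1.0, 2.0, 3.0, 4.0],
    "T1_OSRS": [0.9, 2.1, 2.9, 4.2],
    "T1_Home": [1, 0, 1, 0],
}
pairs = collinear_pairs(columns)
```

Here only the T1_OSRS and T1_SRS pair is flagged. Highly correlated pairs like this are candidates for dropping or combining before the network is retrained.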