Friday, March 21, 2014

Updated Prediction app


Noise

Over the last few months I have been developing a new application. Its purpose is to visualize Premier League trends and predictions by leveraging the freakishly awesome D3.js library found at http://d3js.org/. I call the app Noise, as it is intended both to counteract the noise associated with sports punditry and to recognize the inherent complexity of tracking and measuring team performance and fan engagement. You can view the site here.

I have built two features, with a third in the works. The first is a simple social media tracker (currently Twitter; I plan to add additional sites such as reddit in the near future). It provides a snapshot of the current volume of conversation about each team.

The second feature is a statistical model trained to predict the outcome of each match. The model compares the home team's home performance (average wins, goals, corners, and shots on target) against the away team's away performance across the same metrics. Each team's statistics are computed both as an expanding average over the season and as a rolling average over the last three games.
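To make that concrete, here is a minimal sketch of how those two kinds of features could be computed with pandas. The column names and the tiny match log are hypothetical placeholders, not the project's actual schema.

    import pandas as pd

    # Hypothetical match log: one row per team per match, in date order
    matches = pd.DataFrame({
        'team':  ['Arsenal', 'Arsenal', 'Arsenal', 'Arsenal'],
        'goals': [2, 0, 3, 1],
    })

    by_team = matches.groupby('team')

    # Expanding average: mean over every match played so far this season
    matches['goals_season_avg'] = by_team['goals'].transform(
        lambda s: s.expanding().mean())

    # Rolling average: mean over the last three matches only
    matches['goals_last3_avg'] = by_team['goals'].transform(
        lambda s: s.rolling(3, min_periods=1).mean())

The same pattern applies to the other metrics (wins, corners, shots on target), split into home-only and away-only match logs.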

If anyone is interested in the model, or wants to contribute to the code, the project is located on GitHub.






Wednesday, September 11, 2013

New EPL predictions

I spent the last week and a half reworking the model and building out the skeleton of a web app to better convey the results. I will expand on the web app in the near future to include a more holistic, unsupervised look at the match.



For those interested, the tools in use here include Python for the data munging, Orange (a Python machine learning library) for the modeling, and d3.js for the visualization.
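As a rough illustration of the modeling step, here is a minimal Orange sketch. It uses the current Orange 3 API (the original work would have used the Orange 2.x API of the time), and the file name and choice of learner are assumptions for illustration, not the actual project code.

    import Orange

    # Hypothetical feature table produced by the data munging step
    data = Orange.data.Table("matches.tab")

    # Train a simple learner and predict on a few rows
    learner = Orange.classification.LogisticRegressionLearner()
    model = learner(data)
    predictions = model(data[:5])
    print(predictions)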




Monday, September 2, 2013

Premier League Predictions




This is my first attempt at using machine learning to predict EPL matches. There are significant improvements to be made, which I will gradually incorporate in future updates.

A brief walkthrough of the visual: the model results describe the likelihood of a favorable outcome, with zero representing a low probability of success. The 365 Odds and the data for the model itself can be sourced from this website.
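For context, bookmaker odds can be read as probabilities and compared with the model's output. The snippet below is my own illustration (the figures are made up, and I am assuming the odds shown are in decimal format); it is not part of the model itself.

    # Hypothetical decimal odds for one match: home win / draw / away win
    odds = {'home': 2.10, 'draw': 3.40, 'away': 3.75}

    # Raw implied probabilities are simply 1 / decimal odds
    implied = {k: 1.0 / v for k, v in odds.items()}

    # They sum to more than 1 because of the bookmaker's margin (the overround),
    # so normalize before comparing with the model's probabilities
    total = sum(implied.values())
    probabilities = {k: v / total for k, v in implied.items()}
    print(probabilities)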

Sunday, August 18, 2013

Extracting Sentiment

This is just a quick follow-up to the Twitter sentiment visualization, describing some of the technical challenges I faced. It is by no means a complete analysis (some might even consider it naive), but the purpose here is not to build the world's best Twitter analyzer; rather, it is to build a framework with which one can extract tweets, process them, and begin to derive meaning.

The tools in use here include:

Python, the Twitter APIs, NLTK, a word sentiment corpus (I am using the one available via the Coursera Data Science course), and NodeBox for the visualization.

I began by extracting tweets. Here I pretty much just followed the instructions from the Coursera Data Science course; for detailed steps on setting up the OAuth protocol and the necessary dependencies on a Mac, check out this earlier post.

Once I had tweets, I had to normalize them. Tweets are messy: they feature an extravagant use of vowels, non-standard English, and special characters:

    import re

    def clean_tweet(tweet):
        # Convert to lower case
        tweet = tweet.lower()

        # Convert www.* or https?://* to URL
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

        # Convert @username to AT_USER
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

        # Collapse additional white space
        tweet = re.sub(r'[\s]+', ' ', tweet)

        # Replace #word with word
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

        # Trim surrounding quotes
        return tweet.strip('\'"')

    # Keep only tokens that start with a letter
    def is_valid_word(word):
        return re.search(r'^[a-zA-Z][a-zA-Z0-9]*$', word) is not None

    # Look for a run of two or more identical characters and replace it with
    # a single instance of that character (e.g. 'huuuungry' -> 'hungry')
    repeat_pattern = re.compile(r'(.)\1{1,}', re.DOTALL)

    def collapse_repeats(word):
        return repeat_pattern.sub(r'\1', word)


Because I have no idea how to format code nicely for a blog, I will refrain from pasting in more code here and instead just describe the process in detail.

Once I have a "clean" tweet, I use the following steps to process it:

  1. First I remove all "stop words" ('is', 'are', 'the', ...), i.e. any word that has no inherent emotional value. While omitting stop words, I match the tweet against the word sentiment corpus mentioned earlier and, based on the total sentiment value of the tweet, assign it a 'positive', 'negative', or 'neutral' label. This was my hacked-up way of coming up with a training set (examples with which to build a model) and could use a lot of improvement. More on that in the future.
  2. I take all tweets that have either a positive or negative sentiment and a geotag and append them to a list.
  3. Now it is time to use the NLTK tools to extract a feature list. For more on that, see their documentation here.
  4. With features in hand, you can go ahead and train a classifier; a good example of this can be found here.
Once I have a satisfactory classifier, I store the model with pickle and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
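To make steps 1 to 4 and the pickling concrete, here is a minimal sketch of how the pieces could fit together with NLTK. The lexicon, example tweets, feature scheme, and file name are all illustrative placeholders, not the project's actual code, and step 2's geotag filter is omitted for brevity.

    import pickle
    import nltk
    from nltk.corpus import stopwords

    # nltk.download('stopwords')  # one-time setup
    stop_words = set(stopwords.words('english'))

    # Hypothetical word-sentiment lexicon: {word: score}, e.g. an AFINN-style file
    lexicon = {'brilliant': 3, 'win': 2, 'awful': -3, 'lose': -2}

    def label_tweet(words):
        # Step 1: score the tweet against the lexicon and assign a label
        score = sum(lexicon.get(w, 0) for w in words if w not in stop_words)
        return 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'

    def extract_features(words):
        # Step 3: a simple bag-of-words feature dict for NLTK
        return {'contains({})'.format(w): True for w in words if w not in stop_words}

    # Steps 2 and 4: build a training set from labelled tweets and train a classifier
    tweets = [['brilliant', 'win', 'today'], ['awful', 'performance', 'lose']]
    train_set = [(extract_features(t), label_tweet(t)) for t in tweets
                 if label_tweet(t) != 'neutral']
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Store the trained model with pickle, as described above
    with open('classifier.pickle', 'wb') as f:
        pickle.dump(classifier, f)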


Sunday, August 11, 2013