Sunday, May 19, 2013

Thank you JJ

Finally, a solid sci-fi concept that balances fun, action, and philosophy in a visually amazing package. The bonus? It is accessible to people who adhere to accepted hygiene practices and have plans for the weekend that don't involve the words "game" and "workshop". A solid movie; I'm looking forward to seeing it once more.

Coursera - Data Science 101

I am participating in the data science course, freely available at coursera.com.  I wanted to set up the environment on my Mac and bypass the virtual machine environment (I hate working on a virtual machine).  Here are some of the extra steps needed to get the course working on a Mac; I am using Snow Leopard.

  1. Download and install Python 2.7 from the Python website
  2. You will need to install oauth2-1.5.211 in order to access the Twitter stream.  Download here
  3. Install the new library by navigating in the command line to the directory containing the file "setup.py" inside the oauth2 folder and typing: sudo python setup.py install (enter your password when prompted)
    1. I received an error at this point complaining that the setuptools package could not be located.  If you also see this error, use the following steps to fix it:
      1. Search for and download setuptools-0.6c11-py2.7.egg
      2. In the command line run: sudo sh setuptools-0.6c11-py2.7.egg (password again)
  4. Once setuptools has been installed, retry the oauth2-1.5.211 installation (a quick sanity check is sketched just below this list)
  5. Follow the rest of the directions as outlined on the course website.
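
If you want a quick way to confirm that oauth2 installed correctly (and that your Twitter keys work) before starting the assignment, a minimal script along these lines should do it. The four credential strings are placeholders you would fill in from your own Twitter developer account, and the verify_credentials endpoint is just one convenient thing to hit; treat this as a rough sketch rather than the course's own script.

    # sanity_check.py -- rough sketch, not the official course script
    # Fill in the four placeholder strings with your own Twitter API credentials.
    import oauth2 as oauth

    CONSUMER_KEY = "your consumer key"
    CONSUMER_SECRET = "your consumer secret"
    ACCESS_TOKEN = "your access token"
    ACCESS_SECRET = "your access token secret"

    consumer = oauth.Consumer(key=CONSUMER_KEY, secret=CONSUMER_SECRET)
    token = oauth.Token(key=ACCESS_TOKEN, secret=ACCESS_SECRET)
    client = oauth.Client(consumer, token)

    # Ask Twitter to verify the credentials; a 200 status means both the
    # oauth2 install and your keys are working.
    resp, content = client.request(
        "https://api.twitter.com/1.1/account/verify_credentials.json", "GET")
    print resp.status
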
Hope this helps anyone who had trouble! 



Sunday, January 13, 2013

/a week for work and ransac


Too Much Work

My first week back at Big Blue in 2013

Last week was my first week back at Big Blue in 2013, and as one can imagine I was fairly busy playing catch-up: mostly model production readiness tests, sprinkled with some meetings and emails.  Production readiness in this case entailed pushing as much code as possible to run "in database", and there was a lot of code. Go PostgreSQL...

/ransac

As a result of the stupid hours put in at work, I contributed only minimal time to iMobi, split between two things: the (partially complete) video tutorial, and an algorithm called RANSAC (http://en.wikipedia.org/wiki/RANSAC) that uses the Kinect's point cloud to identify planes, and hopefully where the floor is.
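
For anyone curious, the basic RANSAC loop for plane fitting is simple enough to sketch. The version below is in Python with NumPy rather than the C# I am using with the Kinect SDK, and the iteration count and distance threshold are made-up values you would tune against real depth data.

    import numpy as np

    def ransac_plane(points, n_iters=200, threshold=0.02):
        """Fit a plane to an (N, 3) point cloud with RANSAC.
        Returns ((normal, d), inlier_mask) for the plane normal.x + d = 0."""
        best_inliers, best_model = None, None
        n_points = points.shape[0]
        for _ in range(n_iters):
            # 1. Sample three points at random and fit a candidate plane
            sample = points[np.random.choice(n_points, 3, replace=False)]
            normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
            norm = np.linalg.norm(normal)
            if norm < 1e-8:          # degenerate (collinear) sample, skip it
                continue
            normal /= norm
            d = -normal.dot(sample[0])
            # 2. Count points within `threshold` meters of the candidate plane
            dist = np.abs(points.dot(normal) + d)
            inliers = dist < threshold
            # 3. Keep the candidate with the most inliers
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers, best_model = inliers, (normal, d)
        return best_model, best_inliers

With the Kinect, the idea would be to run something like this over the depth points and treat the dominant, roughly horizontal plane as the floor candidate.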

More on those two next week.

Wednesday, January 2, 2013

Modeling American Football

/the Problem

Analysts in my company received a challenge to build a model that can predict wins and losses in the NFL.  

Gain Understanding

A crucial step

Knowing absolutely nothing about the sport, I decided to try my hand at the problem; how hard could it be? My first task was to collect enough data for training, test, and validation sets. I started by extracting outcomes for the 2008 - 2011 seasons from the website Pro-Football-Reference.com.  Using the same website I also gathered basic statistics for each team.  The variables created within the SRS (Simple Rating System), which measure team offensive and defensive strength relative to average NFL team performance, became the first data points. Before any modeling could take place, I needed to understand the game more. I spent a couple of hours reading blogs on which statistics best represent a team's likely performance, and a couple more hours just reading about various aspects of the game.


In an ideal world my model would include individual player-level data, but the scope of that type of collection exercise quickly exceeded my available bandwidth.  Instead I decided to only consider aggregate team statistics.

The term "SRS" seemed to pop up a lot so I wanted to start there.  I organized each training example into a Y vector that contained the outcome of each game.  Next I transposed the data I had gathered into a matrix where each column represented a feature: SRS, SoS, OSRS, DSRS etc.  Each row in the matrix contained the aforementioned statistics for each team in a game.  So one row might look like T1_SRS, T2_SRS, T1_Home, T2_Home etc.

Exploration

Follow the white rabbit





I tried several variations of two machine learning algorithms: first I constructed a decision tree that uses information entropy to prune; next I tried a neural network.  I have coded a custom neural network that uses backpropagation to learn the weights, but for exploration I just threw the data into SPSS.  I like SPSS because it is user friendly and I can get results quickly, but I also found myself severely limited by the software.  For example, I was unable to add more than two layers, and I did not see a way to play around with the bias term.  That being said, the neural network outperformed the decision tree, so I decided to refine the model using neural networks.
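
For anyone who wants to reproduce this kind of exploration without SPSS, the same comparison can be sketched in a few lines of Python with scikit-learn. This is not what I ran; it assumes the hypothetical X and Y from the earlier sketch, and the split ratio and network size are arbitrary.

    # explore.py -- rough scikit-learn equivalent of the SPSS exploration
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

    # Decision tree split on information entropy (pruning here is just a depth cap)
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=5)
    tree.fit(X_train, y_train)

    # Small feed-forward neural network trained with backpropagation
    net = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000)
    net.fit(X_train, y_train)

    print("tree:", tree.score(X_test, y_test))
    print("net: ", net.score(X_test, y_test))

In practice you would also want to standardize the features before feeding them to the network, but the point of the sketch is just the side-by-side comparison.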


While I was able to obtain 93% "accuracy" on the training and test sets, I only saw a slight improvement over a naive model on the holdout sample.  This was proving to be more difficult than I first imagined.  A quick look at the learning curves revealed the problem: I was severely overfitting the training data.  This is a high-variance problem, often aggravated by noise in the target, and sometimes the quickest way to improve the model is to gather more data...so back to Google.
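
Learning curves are easy to generate if you are working in Python rather than SPSS. The sketch below (again assuming scikit-learn and the hypothetical X and Y from earlier) plots training versus cross-validation accuracy as the training set grows; a wide, persistent gap between the two curves is the signature of overfitting.

    # learning_curves.py -- illustrative only; uses the hypothetical X, Y from above
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import learning_curve
    from sklearn.neural_network import MLPClassifier

    sizes, train_scores, cv_scores = learning_curve(
        MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000),
        X, Y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8))

    plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
    plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation accuracy")
    plt.xlabel("training examples")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()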

First I extracted three additional years of data from the initial website; next I pulled in the Vegas odds on each of the games, hoping to make use of the professional betting establishment's sentiment.  Finally I dug deeper into the football blogosphere and came across the website http://www.advancednflstats.com/, which includes both team efficiency ratings and predictions in an easy-to-extract format. The inclusion of this new data gave my models a significant boost: I am now correctly identifying 100% of the cases in the training and test sets, and a percentage on the validation set good enough to take to Vegas.

Next steps

Refinement

There is likely a lot of collinearity in the data.  I want to reconstruct the model using my own neural network in Octave so I have more flexibility with the architecture, and try unsupervised ML techniques to address the collinearity.  Hopefully these steps will improve the model's performance.
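
One standard unsupervised option for collinear features is principal component analysis: project the correlated rating columns onto a smaller set of orthogonal components and feed those to the network instead. A quick sketch, in Python for illustration (the real version would live in Octave, and the variance threshold is arbitrary):

    # decorrelate.py -- PCA sketch for collinear features
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Standardize first so each rating contributes on the same scale
    X_std = StandardScaler().fit_transform(X)

    # Keep enough orthogonal components to explain ~95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_std)

    print(X.shape[1], "features reduced to", X_reduced.shape[1], "components")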

Sunday, December 23, 2012

a new sensor

/theKinect

First Impressions



I just received the Microsoft Kinect for Developers from Amazon.  I intend to use it as the primary sensor for my robot; it has a lot of interesting features including an IR sensor, depth sensing, audio, a camera, and a fantastic set of APIs.  For the cost (200.00 U.S. dollars) it seems like the easiest place to start. After the unboxing I downloaded the Kinect SDK and Developer Toolkit
(http://www.microsoft.com/en-us/kinectforwindows/develop/developer-downloads.aspx)
and spent a couple of days playing with the code.  The two main development languages are C# and C++, and there are numerous examples using both; I chose C# to get started.


As a side note, I have never coded in C# before, but I was able to get up to speed with the help of some tutorials posted by Microsoft
(http://channel9.msdn.com/Series/C-Sharp-Fundamentals-Development-for-Absolute-Beginners).
I spent a couple of hours with them and was able to figure the rest out from there.

My first step was simply to turn the sensors on and load the video into a window; next I played around with the face tracking APIs. In all it was fairly easy to get things up and running and to start doing some really cool things with the data streams the Kinect has on offer.  Later on I will post some videos along with actual code.

Saturday, December 22, 2012

iMobi

Winter [vacation] is coming 

iMobi - this idea has been forming over the past two years.  I took the +Coursera +Machine Learning course by Andrew Ng and later two +Udacity courses (AI for Robotics and CS101).


Concept

A to B to C 

A robot that can sense its surroundings, build models based on those sensor inputs, and make decisions based on those models.  I want to drop it in a room, or outside in a desert, or on another planet.....just kidding, but that would be cool...and have it start learning about its environment.  I have no idea if I can ever finish this alone, but I hope my rather simplistic outline will provide a structure for the systems I will need to develop.

Here is what I have in mind: picture a human brain and its perceptions of the world.  It can perform everything from relatively simple tasks, such as recognizing a handwritten "t", to fairly complex tasks like understanding irony or making an apple pie.  One might be tempted to think the brain's circuitry operates on a scale where less complex tasks are assigned to less complex circuitry while the more complex tasks are left to complicated computational units.  As it turns out this is not the case; each computational unit in the brain is no more or less complicated than the others, and the brain makes sense of the world by layering simple computations. The computational unit that recognizes irony relies on input from dozens of other units that have already done their job: something had to process the visual images, something had to recognize a spoken word, something had to recognize grammar, etc. Each unit performs its task and eventually we understand irony.


I will borrow an idea from +Ray Kurzweil's book, How to Create a Mind, and call these computational units recognizers. In a similar way, perhaps the only path from input to model to decision is to take a page out of nature's book and construct separate models, or "recognizers", for different tasks.