Sunday, August 18, 2013

Extracting Sentiment

This is just a quick follow-up to the Twitter sentiment visualization. The following is a description of some of the technical challenges I faced. This is by no means a complete analysis - some might even consider it naive - but the purpose here is not to build the world's best Twitter analyzer, but rather to build a framework with which one can extract tweets, process them, and begin to derive meaning.

The tools in use here include:

Python, the Twitter APIs, NLTK, a word sentiment corpus (I am using the one available via the Coursera Data Science course), and, for the visualization, NodeBox.

I began by extracting tweets - here I pretty much just followed the instructions in the Coursera Data Science course. For detailed steps on setting up the OAuth2 protocol and the necessary dependencies on a Mac, check out this earlier post.

Once I had tweets, I had to normalize them. Tweets are messy - they feature an extravagant use of vowels, non-standard English, and special characters:

    import re

    # Convert to lower case
    tweet = tweet.lower()

    # Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

    # Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

    # Collapse additional whitespace
    tweet = re.sub(r'[\s]+', ' ', tweet)

    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    # Trim surrounding quotes
    tweet = tweet.strip('\'"')

    # Check whether a word starts with a letter
    val = re.search(r'^[a-zA-Z][a-zA-Z0-9]*$', word)

    # Look for a run of 2 or more identical characters and
    # replace it with the character itself
    pattern = re.compile(r'(.)\1{1,}', re.DOTALL)
    word = pattern.sub(r'\1', word)

Because I have no idea how to format code nicely for a blog, I will refrain from pasting in any more code here, and instead just describe the process in detail.

Once I have a "clean" tweet, I use the following steps to process it:

  1. First I remove all "stop words" - words like 'is', 'are', 'the', and so on - basically any word that has no inherent emotional value. While omitting stop words, I match the tweet against the word sentiment corpus mentioned earlier and, based on the total sentiment value of the tweet, assign it a 'positive', 'negative', or 'neutral' label. This was my hacked-up way of coming up with a training set - examples with which to build a model - and it could use a lot of improvement. More on that in the future.
  2. I take all tweets that have a positive or negative sentiment as well as a geotag and append them to a list.
  3. Now it is time to use the NLTK tools to extract a feature list.  For more on that - see their documentation here.
  4. With features in hand, you can go ahead and train a classifier - a good example of this can be found here.
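As a rough sketch of steps 1 and 3: score a tweet against a sentiment lexicon, then turn it into the word-presence feature dict that NLTK's classifiers expect. The tiny lexicon and stop-word list below are illustrative stand-ins for the real corpus, and the function names are my own:

```python
# Placeholder stop-word list; in practice use a full list (e.g. NLTK's).
STOP_WORDS = {'is', 'are', 'the', 'a', 'an', 'and', 'i', 'so'}

# AFINN-style word scores; the real sentiment corpus has thousands of entries.
SENTIMENT = {'love': 3, 'great': 3, 'happy': 2, 'hate': -3, 'awful': -3}

def label_tweet(words):
    """Sum lexicon scores over non-stop words and map the total to a label."""
    score = sum(SENTIMENT.get(w, 0) for w in words if w not in STOP_WORDS)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

def extract_features(words):
    """Word-presence features in the dict format NLTK classifiers take."""
    return {'contains(%s)' % w: True for w in words if w not in STOP_WORDS}

words = 'i love the great outdoors'.split()
print(label_tweet(words))        # positive
print(extract_features(words))
```

Pairs of `(extract_features(words), label)` built this way are exactly the shape of training data that `nltk.NaiveBayesClassifier.train()` accepts.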
Once I have a satisfactory classifier, I store the model using pickle and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
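Persisting the model is a standard pickle round-trip. In the sketch below a plain dict stands in for the trained NLTK classifier; any picklable object works the same way, and the filename is arbitrary:

```python
import pickle

# Stand-in for the trained classifier object.
classifier = {'great': 'positive', 'awful': 'negative'}

# Store the model to disk...
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# ...and load it back later to classify new tweets.
with open('classifier.pickle', 'rb') as f:
    restored = pickle.load(f)

print(restored == classifier)  # True
```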