Sunday, August 18, 2013

Extracting Sentiment

This is just a quick follow-up to the Twitter sentiment visualization. The following is a description of some of the technical challenges I faced. This is by no means a complete analysis - some might even consider it naive - but the purpose here is not to build the world's best Twitter analyzer, but rather to build a framework with which one can extract tweets, process them, and begin to derive meaning.

The tools in use here include:

Python, the Twitter APIs, NLTK, a word sentiment corpus (I am using the one available via the Coursera Data Science course), and, for the visualization, NodeBox.

I began by extracting tweets - here I pretty much just followed the instructions in the Coursera Data Science course. For detailed steps on setting up the OAuth2 protocol and the necessary dependencies on a Mac, check out this earlier post.

Once I had tweets, I had to normalize them. Tweets are messy - they feature an extravagant use of vowels, non-standard English, and special characters:

    import re

    # Convert to lower case
    tweet = tweet.lower()

    # Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

    # Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

    # Collapse additional whitespace
    tweet = re.sub(r'[\s]+', ' ', tweet)

    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    # Trim surrounding quotes
    tweet = tweet.strip('\'"')

    # Check whether a word starts with a letter
    val = re.search(r'^[a-zA-Z][a-zA-Z0-9]*$', word)

    # Look for a run of 2 or more identical characters and
    # replace it with the character itself
    pattern = re.compile(r'(.)\1{1,}', re.DOTALL)
    word = pattern.sub(r'\1', word)

Because I have no idea how to format code nicely for a blog, I will refrain from pasting in any more code here, and instead just describe the process in detail.

Once I have a "clean" tweet, I use the following steps to process it:

  1. First I remove all "stop words" - words like 'is', 'are', 'the', and so on - basically any word that has no inherent emotional value. While omitting stop words, I match the tweet against the word sentiment corpus mentioned earlier and, based on the total sentiment value of the tweet, assign it a 'positive', 'negative', or 'neutral' label. This was my hacked-up way of coming up with a training set - examples with which to build a model - and it could use a lot of improvement. More on that in the future.
  2. I take all tweets that have a positive or negative sentiment as well as a geotag and append them to a list.
  3. Now it is time to use the NLTK tools to extract a feature list.  For more on that - see their documentation here.
  4. With features in hand, you can go ahead and train a classifier - a good example of this can be found here.
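As a rough sketch of steps 1 and 3: score a tweet against a sentiment lexicon, then turn it into the word-presence feature dict that NLTK's classifiers expect. The tiny lexicon and stop-word list below are illustrative stand-ins for the real corpus, and the function names are my own:

```python
# Placeholder stop-word list; in practice use a full list (e.g. NLTK's).
STOP_WORDS = {'is', 'are', 'the', 'a', 'an', 'and', 'i', 'so'}

# AFINN-style word scores; the real sentiment corpus has thousands of entries.
SENTIMENT = {'love': 3, 'great': 3, 'happy': 2, 'hate': -3, 'awful': -3}

def label_tweet(words):
    """Sum lexicon scores over non-stop words and map the total to a label."""
    score = sum(SENTIMENT.get(w, 0) for w in words if w not in STOP_WORDS)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

def extract_features(words):
    """Word-presence features in the dict format NLTK classifiers take."""
    return {'contains(%s)' % w: True for w in words if w not in STOP_WORDS}

words = 'i love the great outdoors'.split()
print(label_tweet(words))        # positive
print(extract_features(words))
```

Pairs of `(extract_features(words), label)` built this way are exactly the shape of training data that `nltk.NaiveBayesClassifier.train()` accepts.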
Once I have a satisfactory classifier, I store the model using pickle and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
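Persisting the model is a standard pickle round-trip. In the sketch below a plain dict stands in for the trained NLTK classifier; any picklable object works the same way, and the filename is arbitrary:

```python
import pickle

# Stand-in for the trained classifier object.
classifier = {'great': 'positive', 'awful': 'negative'}

# Store the model to disk...
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# ...and load it back later to classify new tweets.
with open('classifier.pickle', 'rb') as f:
    restored = pickle.load(f)

print(restored == classifier)  # True
```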