aModerate: August 2013

This is just a quick follow-up to the Twitter sentiment visualization. The following is a description of some of the technical challenges I faced. This is by no means a complete analysis - some might even consider it naive - but the purpose here is not to build the world's best Twitter analyzer but rather to build a framework with which one can extract tweets, process them, and begin to derive meaning.

The tools in use here include:

Python, the Twitter APIs, NLTK, a word sentiment corpus (I am using the one available via the Coursera Data Science course), and, for the visualization, NodeBox.

I began by extracting tweets. Here I pretty much just followed the instructions from the Coursera Data Science course. For detailed steps on setting up the OAuth2 protocol and the necessary dependencies on a Mac, check out this earlier post.

Once I had tweets, I had to normalize them. Tweets are messy - they feature an extravagant use of vowels, nonstandard English, and special characters:

import re

#Convert to lower case
tweet = tweet.lower()

#Convert www.* or https?://* to URL
tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

#Convert @username to AT_USER
tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

#Remove additional white spaces
tweet = re.sub(r'[\s]+', ' ', tweet)

#Replace #word with word
tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

#Trim surrounding quotes
tweet = tweet.strip('\'"')

#Check that the word starts with a letter and contains only letters and digits
val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", word)

#Look for a pattern of 2 or more repeated characters, to be replaced with the character itself
pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
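
The last two snippets work on individual words rather than on the whole tweet. A minimal sketch of how they might be applied is below. The function names are made up for illustration, and the replacement string is an assumption: this version keeps two copies of a repeated character so that words like "good" survive, whereas substituting r"\1" would collapse each run to a single character.

```python
import re

# Same two patterns as in the snippets above.
WORD_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9]*$")
REPEAT_RE = re.compile(r"(.)\1{1,}", re.DOTALL)


def collapse_repeats(word):
    """Shrink any run of 2 or more identical characters down to exactly 2."""
    # Assumption: keep two copies; use r"\1" to keep only one.
    return REPEAT_RE.sub(r"\1\1", word)


def clean_words(tweet):
    """Split a normalized tweet into words and keep only the word-like ones."""
    words = []
    for raw in tweet.split():
        # Strip surrounding punctuation before collapsing repeats.
        word = collapse_repeats(raw.strip("'\"?,.!"))
        if WORD_RE.search(word):
            words.append(word)
    return words


print(clean_words("sooooo happyyyy about the goal!!! 2nite"))
```

Tokens that do not start with a letter, such as "2nite", are dropped by the word check.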

Because I have no idea how to format code for a blog, I will refrain from pasting in the full code here and instead just describe the process in detail.

Once I have a "clean" tweet, I use the following steps to process it:

1. First, I remove all "stop words" - words such as 'is', 'are', 'the', ... - basically any word that has no inherent emotional value. While omitting stop words, I match the tweet against the word sentiment corpus mentioned earlier and, based on the total sentiment value of the tweet, assign it a 'positive', 'negative', or 'neutral' label. This was my hacked-up way of coming up with a training set (examples with which to build a model), and it could use a lot of improvement. More on that in the future.
2. I take all tweets that have either a positive or negative sentiment and a geotag, and append them to a list.
3. Now it is time to use the NLTK tools to extract a feature list. For more on that, see their documentation here.
4. With features in hand, you can go ahead and train a classifier. A good example of this can be found here.
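
A rough sketch of the labeling, filtering, and feature-extraction steps above might look like the following. It is not the actual code behind this post: the stop-word set, the sentiment scores, the "text" and "geo" keys, and all function names are placeholders chosen for illustration.

```python
# Placeholder data: the real stop-word list and sentiment lexicon are not shown in the post.
STOP_WORDS = {"i", "is", "are", "the", "this", "a", "on"}
SENTIMENT = {"love": 3, "great": 3, "hate": -3, "awful": -3}


def label_tweet(text):
    """Drop stop words, sum lexicon scores, and map the total to a label."""
    words = [w for w in text.split() if w not in STOP_WORDS]
    score = sum(SENTIMENT.get(w, 0) for w in words)
    if score > 0:
        return words, "positive"
    if score < 0:
        return words, "negative"
    return words, "neutral"


def build_training_set(tweets):
    """Keep only geotagged tweets whose label is positive or negative."""
    training = []
    for tweet in tweets:
        words, label = label_tweet(tweet["text"])
        if label != "neutral" and tweet.get("geo") is not None:
            training.append((words, label))
    return training


def word_features(words, vocabulary):
    """Word-presence dictionary, the featureset shape NLTK classifiers accept."""
    present = set(words)
    return {"contains(%s)" % w: (w in present) for w in vocabulary}


# Hypothetical tweets, already normalized.
tweets = [
    {"text": "i love this great game", "geo": (51.55, -0.11)},
    {"text": "the match is on", "geo": (51.55, -0.11)},
    {"text": "i hate this", "geo": None},
]
training = build_training_set(tweets)
vocabulary = sorted({w for words, _ in training for w in words})
featuresets = [(word_features(words, vocabulary), label) for words, label in training]
print(featuresets)
```

Only the first tweet survives here: the second is neutral and the third has no geotag.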

Once I have a satisfactory classifier, I store the model using pickle and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
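
For illustration, training, storing, and reloading a classifier could be sketched as below. This assumes NLTK's Naive Bayes classifier, which the post does not name as the one actually used, and the training examples, file name, and feature function are invented toy values.

```python
import os
import pickle
import tempfile

import nltk  # third-party: pip install nltk

# Toy labeled examples (hypothetical); a real training set would be far larger.
LABELED = [
    ("love great", "positive"),
    ("happy love", "positive"),
    ("hate awful", "negative"),
    ("sad hate", "negative"),
]
VOCAB = sorted({w for text, _ in LABELED for w in text.split()})


def features(text):
    """Mark each vocabulary word as present or absent in the text."""
    words = set(text.split())
    return {"contains(%s)" % w: (w in words) for w in VOCAB}


train_set = [(features(text), label) for text, label in LABELED]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Store the trained model on disk.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")
with open(model_path, "wb") as f:
    pickle.dump(classifier, f)

# Later: load the model back and classify a new tweet.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored.classify(features("love this goal")))
```

Only unpickle model files you created yourself, since loading a pickle can execute arbitrary code.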
