The tools in use here include:
Python, the Twitter APIs, NLTK, a word sentiment corpus (I am using the one available via the Coursera Data Science course), and, for the visualization, NodeBox.
I began by extracting tweets. Here I pretty much just followed the instructions in the Coursera Data Science course; for detailed steps on setting up the OAuth2 protocol and the necessary dependencies on a Mac, check out this earlier post.
Once I had tweets, I had to normalize them. Tweets are messy: they feature an extravagant use of vowels, non-standard English, and special characters:
import re

# Convert to lower case
tweet = tweet.lower()
# Convert www.* or https?://* to URL
tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
# Convert @username to AT_USER
tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
# Collapse runs of whitespace into a single space
tweet = re.sub(r'[\s]+', ' ', tweet)
# Replace #word with word
tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
# Trim surrounding quotes
tweet = tweet.strip('\'"')

# Check that a word is alphanumeric and starts with a letter
val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", word)
# Look for a character repeated 2 or more times, to collapse it to a single character
pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
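
Tying those word-level fragments together, here is a minimal sketch of how they can be applied per word (the helper name clean_words is mine, purely for illustration):

def clean_words(tweet):
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    words = []
    for word in tweet.split():
        # Collapse any run of a repeated character down to the character itself
        word = pattern.sub(r"\1", word)
        # Keep only alphanumeric words that start with a letter
        if re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", word):
            words.append(word)
    return words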
Because I have no idea how to format code for a blog, I will refrain from pasting in more code here and instead just describe the process in detail.
Once I have a "clean" tweet, I use the following steps to process it:
- First I remove all "stop words" ('is', 'are', 'the', and so on): basically, any word that has no inherent emotional value. While omitting stop words, I match the tweet against the word sentiment corpus mentioned earlier and, based on the tweet's total sentiment value, label it 'positive', 'negative', or 'neutral'. This was my hacked-up way of coming up with a training set (examples with which to build a model) and it could use a lot of improvement; more on that in the future. A rough sketch of this labeling step appears after this list.
- I take all tweets that have either a positive or negative sentiment and a geotag and append them to a list.
- Now it is time to use the NLTK tools to extract a feature list. For more on that, see their documentation here.
- With features in hand, you can go ahead and train a classifier; a good example of this can be found here, and a second sketch follows this list.
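
Here is that labeling sketch. It assumes stopwords is a set of stop words and sentiment_scores is a dict mapping each word in the sentiment corpus to a numeric score (both names are illustrative, not my exact code), and it reuses the clean_words helper sketched above:

def label_tweet(tweet, stopwords, sentiment_scores):
    # Sum the corpus scores of every non-stop-word in the tweet
    score = sum(sentiment_scores.get(word, 0)
                for word in clean_words(tweet)
                if word not in stopwords)
    if score > 0:
        return 'positive'
    elif score < 0:
        return 'negative'
    return 'neutral'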
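
And here is a minimal sketch of the feature extraction and training steps, following the standard NLTK bag-of-words recipe rather than my exact code (it assumes word_features, a list of known words, and labeled_tweets, a list of (word list, label) pairs, already exist):

import nltk

def extract_features(tweet_words, word_features):
    # One boolean feature per known word: is it present in this tweet?
    words = set(tweet_words)
    return dict(('contains(%s)' % w, w in words) for w in word_features)

# training_set = [(extract_features(words, word_features), label)
#                 for (words, label) in labeled_tweets]
# classifier = nltk.NaiveBayesClassifier.train(training_set)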
Once I have a satisfactory classifier, I store the model using the pickle module and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
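
Saving and reloading the model is the standard pickle dance (the file name is illustrative):

import pickle

# Persist the trained classifier to disk
with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# Load it back later to classify new tweets
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)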