A visualization of twitter and the English Premier League:
Sunday, August 18, 2013
Extracting Sentiment
This is just a quick follow-up to the Twitter sentiment visualization. The following is a description of some of the technical challenges I faced. This is by no means a complete analysis - some might even consider it naive - but the purpose here is not to build the world's best Twitter analyzer, but rather to build a framework with which one can extract tweets, process them, and begin to derive meaning.
The tools in use here include:
Python, the Twitter APIs, NLTK, a word-sentiment corpus (I am using the one available via the Coursera Data Science course), and, for the visualization, Nodebox
I began by extracting tweets - here I pretty much just followed the instructions in the Coursera Data Science course. For detailed steps on setting up the OAuth2 protocol and the necessary dependencies on a Mac, check out this earlier post.
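The extraction step leaves you with a file of raw JSON, one tweet per line (the format the course's streaming script produces). A minimal sketch of pulling the text out of such a file - the function name and file layout are my own assumptions, not part of any library:

```python
import json

def load_tweets(path):
    """Read a file with one JSON object per line and collect the text
    of each actual tweet. Lines without a 'text' field (rate-limit
    notices, deletions) and garbled partial lines are skipped."""
    tweets = []
    with open(path) as f:
        for line in f:
            try:
                data = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            if isinstance(data, dict) and 'text' in data:
                tweets.append(data['text'])
    return tweets
```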
Once I had tweets, I had to normalize them. Tweets are messy - they feature an extravagant use of vowels, non-standard English, and special characters:
import re

# Convert to lower case
tweet = tweet.lower()
# Convert www.* or https?://* to URL
tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
# Convert @username to AT_USER
tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
# Remove additional white spaces
tweet = re.sub(r'[\s]+', ' ', tweet)
# Replace #word with word
tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
# Trim surrounding quotes
tweet = tweet.strip('\'"')
# Check if the word starts with a letter
val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", word)
# Look for a run of 2 or more of the same character and replace it with the character itself
pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
word = pattern.sub(r"\1", word)
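Putting the snippets above together, the whole cleaning pass can be sketched as one function. The name normalize_tweet is my own for illustration; the steps are exactly the ones shown:

```python
import re

def normalize_tweet(tweet):
    """Apply the normalization steps described above to a raw tweet."""
    tweet = tweet.lower()                                             # lower case
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # links -> URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)                      # mentions -> AT_USER
    tweet = re.sub(r'[\s]+', ' ', tweet)                              # collapse whitespace
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)                        # strip the # from hashtags
    return tweet.strip('\'"')                                         # trim surrounding quotes
```

For example, normalize_tweet('@BBC What a #goal!  https://t.co/x') returns 'AT_USER what a goal! URL'.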
Because I have no idea how to format code for a blog, I will refrain from pasting in more code here and instead just describe - in detail - the process.
Once I have a "clean" tweet, I use the following steps to process it:
- First, I remove all "stop words" ('is', 'are', 'the', ...) - basically, any word that has no inherent emotional value. While omitting stop words, I match the tweet against the word-sentiment corpus mentioned earlier and, based on the total sentiment value of the tweet, assign it a 'positive', 'negative', or 'neutral' label. This was my hacked-up way of coming up with a training set - examples with which to build a model - and it could use a lot of improvement. More on that in the future.
- I take all tweets that have either a positive or negative sentiment and a geotag and append them to a list.
- Now it is time to use the NLTK tools to extract a feature list. For more on that - see their documentation here.
- With features in hand, you can go ahead and train a classifier - a good example of this can be found here.
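The labelling and feature-extraction steps above can be sketched in a few lines. The tiny lexicon here is a stand-in for the full word-sentiment corpus (a word-to-score mapping), and the function names are mine; the feature dicts are in the shape NLTK's classifiers expect:

```python
# Stand-ins for the real stop-word list and word-sentiment corpus
STOP_WORDS = {'is', 'are', 'the', 'a', 'an', 'and', 'to', 'of'}
LEXICON = {'brilliant': 3, 'win': 2, 'terrible': -3, 'lose': -2}

def label_tweet(words):
    """Sum lexicon scores over non-stop words and map the total
    to a 'positive', 'negative', or 'neutral' label."""
    score = sum(LEXICON.get(w, 0) for w in words if w not in STOP_WORDS)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

def extract_features(words):
    """Bag-of-words feature dict, the format NLTK classifiers accept."""
    return {w: True for w in words if w not in STOP_WORDS}
```

Pairs of (extract_features(words), label_tweet(words)) can then be fed to nltk.NaiveBayesClassifier.train, as in the NLTK documentation linked above.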
Once I have a satisfactory classifier, I store the model using pickle and start classifying new tweets. In the coming weeks I will upload the full code (a hack job if there ever was one) to GitHub.
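Storing and reloading the model is a couple of lines with the standard-library pickle module (NLTK classifiers pickle fine; the helper names and file name here are arbitrary):

```python
import pickle

def save_model(model, path='classifier.pickle'):
    # Serialize the trained classifier to disk
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model(path='classifier.pickle'):
    # Restore the classifier so new tweets can be classified
    with open(path, 'rb') as f:
        return pickle.load(f)
```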
Monday, August 5, 2013
Twitter on the Royal Baby
I just completed a trial Twitter sentiment analysis. The larger circle represents the positive tweets associated with #Royalbaby; the smaller circle represents negative sentiment. I used Python to extract an hour's worth of tweets and Nodebox to construct the visual.