As we mentioned at the beginning of this workshop, textblob will allow us to do sentiment analysis in a very simple way. We will also use the re library from Python, which is used to work with regular expressions. For this, I'll provide you with two utility functions to: a) clean the text (which means that mentions, links, and any character that is not alphanumeric are replaced with a space), and b) create a classifier to analyze the polarity of each tweet after cleaning the text in it. I won't explain in detail how the cleaning function works, since that would take too long and it is better understood from the official re documentation.
The code that I'm providing is:
from textblob import TextBlob
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    links and special characters using regex.
    '''
    # Raw string so that the regex escapes are not treated as string escapes:
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
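To make the cleaning step more concrete, here is a small sketch that applies the same regular expression to a sample string. The sample tweet is made up for illustration (it is not taken from the dataset), and the function is repeated only so that the snippet runs on its own:

```python
import re

def clean_tweet(tweet):
    # Same pattern as in the workshop: mentions, non-alphanumeric
    # characters and links are each replaced by a space, and the
    # final split/join collapses the repeated whitespace.
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return ' '.join(re.sub(pattern, " ", tweet).split())

# Hypothetical tweet text, invented for this example:
sample = "RT @FoxNews: Great news! Read more at https://t.co/abc123 #MAGA"

cleaned = clean_tweet(sample)
print(cleaned)  # RT Great news Read more at MAGA
```

Note that the mention, the punctuation, the link and the hash symbol are gone, while the hashtag word itself is kept.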
The way it works is that textblob already provides a trained analyzer (cool, right?). Textblob can work with different machine learning models used in natural language processing. If you want to train your own classifier (or at least check how it works) feel free to check the following link. It might be relevant since we're working with a pre-trained model (for which we don't know the data that was used).
Anyway, getting back to the code, we will just add an extra column to our data. This column will contain the sentiment analysis, and we can display the dataframe to see the update:
# We create a column with the result of the analysis:
data['SA'] = np.array([ analize_sentiment(tweet) for tweet in data['Tweets'] ])
# We display the updated dataframe with the new column:
display(data.head(10))
Obtaining the new output:
| | Tweets | len | ID | Date | Source | Likes | RTs | SA |
---|---|---|---|---|---|---|---|---|
0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 | 1 |
1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 | 1 |
2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 | 0 |
3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 | 0 |
4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 | 1 |
5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 | 1 |
6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 | 1 |
7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 | 1 |
8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 | 1 |
9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 | 0 |
As we can see, the last column contains the sentiment analysis (SA). We now just need to check the results.
To have a simple way to verify the results, we will count the number of neutral, positive and negative tweets and extract the percentages.
# We construct lists with classified tweets:
pos_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] > 0]
neu_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] == 0]
neg_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] < 0]
Now that we have the lists, we just print the percentages:
# We print percentages:
print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(data['Tweets'])))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(data['Tweets'])))
print("Percentage of negative tweets: {}%".format(len(neg_tweets)*100/len(data['Tweets'])))
Obtaining the following result:
Percentage of positive tweets: 51.0%
Percentage of neutral tweets: 27.0%
Percentage of negative tweets: 22.0%
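As a side note, the same percentages can be obtained in a single pass instead of building three lists. This is only a sketch using the standard library; the labels list is made up and stands in for the data['SA'] column, and the helper name sentiment_percentages is hypothetical:

```python
from collections import Counter

# Made-up sentiment labels, standing in for the data['SA'] column:
labels = [1, 1, 0, -1, 1, 0, -1, 1]

def sentiment_percentages(labels):
    """Return the percentage of positive, neutral and negative labels."""
    counts = Counter(labels)
    total = len(labels)
    names = (("positive", 1), ("neutral", 0), ("negative", -1))
    if total == 0:
        # Avoid dividing by zero when there are no tweets at all.
        return {name: 0.0 for name, _ in names}
    return {name: counts[value] * 100 / total for name, value in names}

percentages = sentiment_percentages(labels)
for name, pct in percentages.items():
    print("Percentage of {} tweets: {}%".format(name, pct))
```

With the dataframe used in this workshop, data['SA'].value_counts(normalize=True) should give the same proportions as fractions.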
We have to consider that we're working only with the 200 most recent tweets from D. Trump (last updated: September 2nd). For more accurate results we can consider more tweets. An interesting exercise (an invitation to the readers) is to analyze the polarity of the tweets from different sources; it may turn out that, by considering only the tweets from one source, the polarity comes out more positive or more negative. Anyway, I hope you found this interesting.
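For readers who want to take up the invitation to compare sources, one possible starting point is to average the sentiment label per source. The sketch below uses only the standard library; the (source, label) pairs are invented for illustration and the helper name mean_sentiment_by_source is hypothetical:

```python
from collections import defaultdict

# Made-up (source, sentiment) pairs, standing in for the
# 'Source' and 'SA' columns of the dataframe:
records = [
    ("Twitter for iPhone", 1),
    ("Twitter for iPhone", -1),
    ("Twitter for iPhone", 1),
    ("Media Studio", 0),
    ("Media Studio", 1),
]

def mean_sentiment_by_source(records):
    """Average the -1/0/1 sentiment labels for each source."""
    totals = defaultdict(lambda: [0, 0])  # source -> [sum of labels, count]
    for source, label in records:
        totals[source][0] += label
        totals[source][1] += 1
    return {source: s / n for source, (s, n) in totals.items()}

by_source = mean_sentiment_by_source(records)
for source, mean in by_source.items():
    print("{}: {:.2f}".format(source, mean))
```

With the workshop's dataframe, data.groupby('Source')['SA'].mean() should give the equivalent result. A mean closer to 1 suggests mostly positive tweets from that source, and a mean closer to -1 mostly negative ones.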
As we saw, we can extract, manipulate, visualize and analyze data in a very simple way with Python. I hope that this leaves the reader with some curiosity for further exploration using these tools.