Astronomer and budding data scientist Julia Silge has been using R for less than a year, but judging by the posts on her blog, she has already become very proficient at using it to analyze some interesting data sets. She has posted detailed analyses of water consumption data and health care indicators from the Utah Open Data Catalog, religious affiliation data from the Association of Statisticians of American Religious Bodies, and demographic data from the American Community Survey (that's the same dataset we mentioned on Monday).
In a two-part series, Julia analyzed another interesting dataset: her own archive of 10,000 tweets. (Julia provides all the R code for her analyses, so you can download your own Twitter archive and follow along.) In part one, Julia imports her Twitter archive into R, which takes just one line of code:
tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)
She then uses the lubridate package to clean up the timestamps, and the ggplot2 package to create some simple charts of her Twitter activity. This chart takes just a few lines of R code and shows her Twitter activity over time categorized by type of tweet (direct tweets, replies, and retweets).
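As a rough sketch of what those steps might look like (this is not Julia's actual code; see her post for that), the example below parses timestamps with lubridate, labels each tweet by type, and draws a histogram with ggplot2. The column names `timestamp`, `in_reply_to_status_id` and `retweeted_status_id` are assumptions about the archive's layout, and the four rows are made-up stand-ins for the real data.

```r
library(lubridate)
library(ggplot2)

# Toy stand-in for the archive: column names are assumed, values are made up.
tweets <- data.frame(
  timestamp = c("2015-01-05 14:03:11 +0000",
                "2015-02-17 09:41:52 +0000",
                "2015-02-20 22:15:30 +0000",
                "2015-03-02 18:27:04 +0000"),
  in_reply_to_status_id = c(NA, "123", NA, NA),
  retweeted_status_id   = c(NA, NA, "456", NA),
  stringsAsFactors = FALSE
)

# Parse the timestamp strings into date-times.
tweets$timestamp <- ymd_hms(tweets$timestamp)

# Label each tweet as a retweet, a reply, or a direct tweet.
tweets$type <- ifelse(!is.na(tweets$retweeted_status_id), "retweet",
               ifelse(!is.na(tweets$in_reply_to_status_id), "reply",
                      "direct tweet"))

# Histogram of tweet counts over time, filled by type of tweet.
p <- ggplot(tweets, aes(x = timestamp, fill = type)) +
  geom_histogram(bins = 12) +
  labs(x = "Time", y = "Number of tweets", fill = "Type of tweet")
print(p)
```

With a real archive, the `data.frame()` call would be replaced by the `read.csv()` line above, and the number of bins adjusted to taste.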
The really interesting part of the analysis comes in part two, where Julia uses the tm package (which provides a number of text mining functions for R) and the syuzhet package (which includes the NRC Word-Emotion Association Lexicon) to analyze the sentiment of her tweets. Scoring all 10,000 tweets for "anger", "fear", "surprise" and other emotions, and generating a positive and negative sentiment score for each, is as simple as this one line of R code:
mySentiment <- get_nrc_sentiment(tweets$text)
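To get a feel for what that call returns, here is a small sketch that runs `get_nrc_sentiment()` on two made-up sentences instead of a real archive. The sentences and the names `example_text` and `scores` are illustrative only.

```r
library(syuzhet)

# Made-up example sentences standing in for tweets$text.
example_text <- c("I love this beautiful sunny day",
                  "I am afraid of the terrible storm")

scores <- get_nrc_sentiment(example_text)

# One row per input string; one column per emotion,
# plus overall "negative" and "positive" columns.
print(dim(scores))
print(colnames(scores))
```

Each cell is a count of words in that string that the lexicon associates with the given emotion or polarity, so a single tweet can score on several emotions at once.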
Using those sentiment scores, Julia was easily able to summarize the sentiments expressed in her tweet history:
and create this time series chart showing her negative and positive sentiment scores over time:
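A summary and a time series along these lines could be put together roughly as follows. This is only a sketch under stated assumptions, not Julia's code: the `scored` data frame holds made-up values in the shape of a few `get_nrc_sentiment()` columns, and the `timestamp` column and monthly averaging are hypothetical choices for illustration.

```r
library(ggplot2)

# Made-up scores shaped like a subset of get_nrc_sentiment() output,
# with a hypothetical timestamp column attached.
scored <- data.frame(
  timestamp = as.POSIXct(c("2015-01-05", "2015-01-20",
                           "2015-02-17", "2015-02-20"), tz = "UTC"),
  joy      = c(1, 0, 2, 0),
  fear     = c(0, 1, 0, 0),
  negative = c(0, 2, 0, 1),
  positive = c(2, 0, 3, 1)
)

# Total count for each emotion across all tweets.
emotion_totals <- colSums(scored[, c("joy", "fear")])
print(emotion_totals)

# Average negative and positive score per calendar month.
scored$month <- format(scored$timestamp, "%Y-%m")
monthly <- aggregate(cbind(negative, positive) ~ month,
                     data = scored, FUN = mean)

# Reshape to long form so both series can share one plot.
monthly_long <- data.frame(
  month     = as.Date(paste0(rep(monthly$month, 2), "-01")),
  sentiment = rep(c("negative", "positive"), each = nrow(monthly)),
  score     = c(monthly$negative, monthly$positive)
)

# Line chart of average sentiment score over time.
p <- ggplot(monthly_long, aes(x = month, y = score, colour = sentiment)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Average score", colour = "Sentiment")
print(p)
```

Averaging by month smooths out the tweet-to-tweet noise; a different window (weekly, quarterly) would work the same way by changing the format string.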
If you've been thinking about applying sentiment analysis to some text data, you might find that with R it's easier than you think! Try it using your own Twitter archive by following along with Julia's posts linked below.
data science ish: Ten Thousand Tweets; Joy to the World, and also Anticipation, Disgust, Surprise...