Analyze party platforms in R the tidy way
Let’s face it: when it comes to politics, this country is exceedingly polarized. What I would like to do is quantify that polarization. To do this, I will use the platform documents that each political party creates during presidential election years. I’ll be using the tidy packages (i.e., dplyr, tidyr, tidytext), and the data comes from Comparative Agendas, which collects and organizes data from archived sources to track policy outcomes across countries.
library(ggplot2)
library(tidytext)
library(topicmodels)
library(dplyr)
library(wordcloud)
library(RColorBrewer)
library(tidyr)
library(scales)
library(stringr)
library(circlize) # provides chordDiagram() and circos.par(), used further down
platforms <- read.csv("US-parties-platforms.csv", header = TRUE, stringsAsFactors = FALSE)
dems <- read.csv("democratic-platform.csv", header = TRUE, stringsAsFactors = FALSE)
# create index for platform by year and unnest tokens
platform_words <- platforms %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, description)
# get rid of "__" in platform data
platform_words <- platform_words %>%
  mutate(word = str_extract(word, "[a-z]+"))
dem_words <- dems %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, description)
# remove stopwords
data(stop_words)
platform_words <- platform_words %>%
  anti_join(stop_words, by = "word")
dem_words <- dem_words %>%
  anti_join(stop_words, by = "word")
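To see what the str_extract() cleanup step does, here is a small sketch on made-up tokens. The token values are hypothetical stand-ins for what unnest_tokens() might produce, not rows from the platform data.

```r
library(stringr)

# Hypothetical tokens (assumed for illustration, not from the dataset)
tokens <- c("__jobs", "economy", "tax_cuts", "1948")

# Keep only the first run of lowercase letters in each token
cleaned <- str_extract(tokens, "[a-z]+")
cleaned
```

Note that the pattern keeps only the first alphabetic run, so a token like "tax_cuts" is cut down to "tax", and a token with no lowercase letters becomes NA. Depending on the data, it may be worth dropping those NA rows before counting words.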
Now that the dataset is tokenized, with stop words removed, we can begin to analyze it. First, I want to look at the sentiment within each party’s platforms and how it progresses over time. For this, we will use the Bing sentiment dictionary from the tidytext package.
# get and plot sentiment over time
# get_sentiments("bing") returns the Bing lexicon (columns: word, sentiment)
bing <- get_sentiments("bing")
republicansent <- platform_words %>%
  inner_join(bing, by = "word") %>%
  count(year, index = linenumber %/% 45, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
demsent <- dem_words %>%
  inner_join(bing, by = "word") %>%
  count(year, index = linenumber %/% 45, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
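The integer division in the index column is what groups source rows into chunks before positive and negative words are netted out. Here is a minimal sketch of the same idea on toy data. The words and the mini lexicon are made up (this is not the real Bing lexicon), the chunk size is 3 instead of 45, and I use pivot_wider(), the current tidyr replacement for spread().

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the tokenized data: six source rows, one word each
toy_words <- tibble(
  linenumber = 1:6,
  word = c("freedom", "crisis", "prosperity", "failure", "threat", "peace")
)

# Hand-made mini lexicon, assumed for illustration only
toy_lexicon <- tibble(
  word = c("freedom", "prosperity", "peace", "crisis", "failure", "threat"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative")
)

toy_sent <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  # rows 1-2 fall in chunk 0, rows 3-5 in chunk 1, row 6 in chunk 2
  count(index = linenumber %/% 3, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net sentiment per chunk
  mutate(sentiment = positive - negative)

toy_sent
```

Because linenumber starts at 1, the first chunk holds one row fewer than the others. With a chunk size of 45 that hardly matters, but it is visible in a toy example like this one.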
And we can use ggplot to visualize the results and compare Democratic sentiment to Republican sentiment.
ggplot(republicansent, aes(index, sentiment)) +
  geom_col(fill = "darkgoldenrod2") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment of Republican Party Platforms, 1948 - 2016",
       y = "Sentiment", x = "")

ggplot(demsent, aes(index, sentiment)) +
  geom_col(fill = "darkgoldenrod2") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment of Democratic Party Platforms, 1948 - 2016",
       y = "Sentiment", x = "")

It appears that Democratic platforms are more likely to contain negative sentiments, especially in that dip in the center. So let’s find the Democratic platform with the highest proportion of negative sentiments. First, we get the list of negative words from the Bing dictionary. Then, we isolate the platforms by year and count the number of words in each. This will allow us to normalize our result based on the length of each platform. Finally, we count the negative words in each platform and divide by the total number of words in that platform.
# find the most negative platform by year
# obtain the negative words from bing
bingneg <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
# isolate each year's platform and count words for each
wordtotal <- dem_words %>%
  group_by(year) %>%
  summarise(words = n())
# count negative words in each platform and divide by total per platform
dem_neg <- dem_words %>%
  semi_join(bingneg, by = "word") %>%
  group_by(year) %>%
  summarise(negativewords = n()) %>%
  left_join(wordtotal, by = "year") %>%
  mutate(ratio = negativewords / words)
dem_neg
# A tibble: 18 x 4
year negativewords words ratio
<int> <int> <int> <dbl>
1 1948 98 2041 0.0480
2 1952 173 4255 0.0407
3 1956 404 6277 0.0644
4 1960 429 7765 0.0552
5 1964 354 9737 0.0364
6 1968 399 8125 0.0491
7 1972 825 12688 0.0650
8 1976 597 10174 0.0587
9 1980 893 18964 0.0471
10 1984 1169 18041 0.0648
11 1988 133 2370 0.0561
12 1992 298 4141 0.0720
13 1996 503 9131 0.0551
14 2000 621 10931 0.0568
15 2004 459 8127 0.0565
16 2008 720 12501 0.0576
17 2012 612 13213 0.0463
18 2016 813 13380 0.0608
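Indeed, the 1992 Democratic platform has the highest share of negative words (7.2%), and 1984 (6.5%) is close behind, roughly tied with 1972. The 1988 platform (5.6%) sits nearer the middle of the range. The 1984 through 1992 platforms were written during the Reagan-Bush years.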
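To check which years stand out, the printed table can be re-entered and sorted by ratio. This is just a sketch that copies the counts from the output above. The name dem_neg_tbl is mine; with the real data you could sort dem_neg directly.

```r
library(dplyr)

# Counts copied from the printed tibble above
dem_neg_tbl <- tibble(
  year = seq(1948, 2016, by = 4),
  negativewords = c(98, 173, 404, 429, 354, 399, 825, 597, 893,
                    1169, 133, 298, 503, 621, 459, 720, 612, 813),
  words = c(2041, 4255, 6277, 7765, 9737, 8125, 12688, 10174, 18964,
            18041, 2370, 4141, 9131, 10931, 8127, 12501, 13213, 13380)
)

# Rank platforms by their share of negative words
top_neg <- dem_neg_tbl %>%
  mutate(ratio = negativewords / words) %>%
  arrange(desc(ratio)) %>%
  head(3)

top_neg
```

Sorted this way, the three most negative platforms are 1992, 1972 and 1984, in that order.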
Word Frequencies
Now I want to look at the words embedded in the respective platforms to see if there is a different emphasis from one to the other. We can do this with the wordcloud package.
# get the most frequent words
rep_frequencies <- platform_words %>%
  count(word, sort = TRUE)
dem_frequencies <- dem_words %>%
  count(word, sort = TRUE)
# generate wordclouds based on frequencies
wordcloud(words = rep_frequencies$word, freq = rep_frequencies$n,
          scale = c(3, .1), min.freq = 50,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word frequencies in Republican Party platforms

wordcloud(words = dem_frequencies$word, freq = dem_frequencies$n,
          scale = c(3, .1), min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Word frequencies in Democratic Party platforms

These seem to make sense. Republicans tend to be much more focused than Democrats on the size of government, while Democrats generally focus more on things like education, healthcare and civil rights.
The Bing lexicon works fine for determining positive and negative sentiment, but there are others that we can use. The AFINN dictionary also classifies terms as positive or negative, but instead of a label it assigns each term a score that ranges from -5 to 5. Alternatively, we can use the NRC lexicon, which labels terms as positive or negative and also maps them to 8 moods, or emotions (i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, trust).
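To make the difference concrete, here is a rough sketch of how a scored lexicon in the AFINN style could be used. The words, years and scores below are made up for illustration and are not taken from the real AFINN lexicon. With the real one you would join against get_sentiments("afinn"), where, as far as I know, recent tidytext versions name the score column value.

```r
library(dplyr)

# Toy tokens for two platform years (assumed for illustration)
toy_words <- tibble(
  year = c(1988, 1988, 1988, 1992, 1992, 1992),
  word = c("hope", "crisis", "strong", "failure", "crisis", "hope")
)

# Hand-made scores in the AFINN style (integers between -5 and 5)
toy_afinn <- tibble(
  word = c("hope", "strong", "crisis", "failure"),
  value = c(2, 2, -3, -2)
)

# Sum and average the word scores per year
toy_scores <- toy_words %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(year) %>%
  summarise(score = sum(value), mean_score = mean(value))

toy_scores
```

Unlike the Bing approach, where every matched word counts as plus or minus one, a scored lexicon lets strongly negative words weigh more than mildly negative ones.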
So let’s use the NRC dictionary and then visualize the results with the chordDiagram() function from the circlize package. In order to create a chord diagram that isn’t too busy, we can group the platforms by decade.
# attach nrc lexicon to party platforms and group them
dem_nrc <- dem_words %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(decade = ifelse(year %in% 1948, "1940s",
                  ifelse(year %in% 1950:1959, "1950s",
                  ifelse(year %in% 1960:1969, "1960s",
                  ifelse(year %in% 1970:1979, "1970s",
                  ifelse(year %in% 1980:1989, "1980s",
                  ifelse(year %in% 1990:1999, "1990s",
                  ifelse(year %in% 2000:2009, "2000s", "2010s"))))))))
rep_nrc <- platform_words %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(decade = ifelse(year %in% 1948, "1940s",
                  ifelse(year %in% 1950:1959, "1950s",
                  ifelse(year %in% 1960:1969, "1960s",
                  ifelse(year %in% 1970:1979, "1970s",
                  ifelse(year %in% 1980:1989, "1980s",
                  ifelse(year %in% 1990:1999, "1990s",
                  ifelse(year %in% 2000:2009, "2000s", "2010s"))))))))
# set proportionality of moods
dem_decade <- dem_nrc %>%
  count(sentiment, decade) %>%
  group_by(decade, sentiment) %>%
  summarise(sentiment_sum = sum(n)) %>%
  ungroup()
rep_decade <- rep_nrc %>%
  count(sentiment, decade) %>%
  group_by(decade, sentiment) %>%
  summarise(sentiment_sum = sum(n)) %>%
  ungroup()
# visualize platforms using nrc lexicon
cols <- brewer.pal(8, "Dark2")
grid.col = c("1940s" = cols[1], "1950s" = cols[2], "1960s" = cols[3],
             "1970s" = cols[4], "1980s" = cols[5], "1990s" = cols[6],
             "2000s" = cols[7], "2010s" = cols[8], "anger" = "grey",
             "anticipation" = "grey", "disgust" = "grey", "fear" = "grey",
             "joy" = "grey", "sadness" = "grey", "surprise" = "grey",
             "trust" = "grey")
# set gaps for dems' platforms
circos.par(gap.after = c(rep(5, length(unique(dem_decade[[1]])) - 1), 15,
                         rep(5, length(unique(dem_decade[[2]])) - 1), 15))
chordDiagram(dem_decade, grid.col = grid.col, transparency = .2)
title("Mood of Democratic Platforms by Decade")

# clear and reset gaps for reps' platforms
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(rep_decade[[1]])) - 1), 15,
                         rep(5, length(unique(rep_decade[[2]])) - 1), 15))
# plot the Republican data here (rep_decade, not dem_decade)
chordDiagram(rep_decade, grid.col = grid.col, transparency = .2)
title("Mood of Republican Platforms by Decade")

Trust is very clearly the most dominant mood in the party platforms. That’s not all that surprising since all politicians want people to trust them.
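As a side note, the nested ifelse() calls used above to assign decades could probably be replaced with integer arithmetic. Here is a small sketch, assuming the year column only holds four-digit election years like the ones in the table earlier:

```r
# Election years covered by the platforms, 1948 through 2016
years <- seq(1948, 2016, by = 4)

# Drop the last digit with integer division, then rebuild the decade label
decade <- paste0((years %/% 10) * 10, "s")

# Tabulate how many platforms fall in each decade
table(decade)
```

Inside the pipeline this would be mutate(decade = paste0((year %/% 10) * 10, "s")), which gives the same eight labels from "1940s" to "2010s" without hard-coding the ranges.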
The post Analyze party platforms in R the tidy way appeared first on my (mis)adventures in R programming.