“I’m a Republican because…”, visualized with R

[This article was first published on Offensive Politics » R, and kindly contributed to R-bloggers.]

The GOP recently relaunched its main web site with a new design and numerous interactive and social features like Facebook integration, blogs, etc. Of particular interest is the GOP Faces section, which asks users to submit a photo and answer the question “Why are you a Republican?” Not being a Republican, I was curious to see if there were any common themes among the submissions that would lead to insights about being a Republican and GOP.com user. Not excited about actually reading all 180 reasons, I instead used R to download, transform, analyze and visualize the data for me.

I used two packages (XML and plyr) to fetch and extract the reasons, and then tm to filter stop words and identify commonly used terms. Finally, I used ggplot2, the invaluable ggplot2 book, and a helpful post from the R-help mailing list to produce the visualization.

R code

library(XML)
library(plyr)
library(ggplot2)
library(tm)
 
# fetch & parse the HTML
doc <- htmlParse("http://gop.com/index.php/learn/republican_faces/",isURL = TRUE)
# pull the matching A elements of CSS class tipz
nodes <- getNodeSet(doc, "//a[@class='tipz']")
# extract the 'title' attribute 
titles <- sapply(nodes, function(x) xmlAttrs(x)[["title"]])
# clean up the title attribute 
titles <- sub("^[^:]+::","",titles)
# create the corpus and document-term matrix
# (lowercase everything, drop numbers, drop English stop words)
co <- Corpus(VectorSource(titles))
tdm <- DocumentTermMatrix(co, control=list(tolower=TRUE, removeNumbers=TRUE, stopwords=TRUE))
# extract the terms that occur exactly 1, 2, 3 or 4 times
levels <- c(1,2,3,4)
df <- ldply(levels, function(x) data.frame(freq=x,term=findFreqTerms(tdm,x,x))) 
# assign random non-repeating x coordinates to the terms
df$x <- sample(1:nrow(df),nrow(df), replace=FALSE)
# place each term near its frequency on y, with jitter to reduce overlap
df$y <- df$freq + rnorm(nrow(df))
 
# clear standard graph options (thanks mike lawrence on r-help)
# theme() and element_blank() replace the older opts() and theme_blank()
clear <- theme(
         legend.position = 'none'
         , panel.grid.minor = element_blank()
         , panel.grid.major = element_blank()
         , panel.background = element_blank()
         , axis.line = element_blank()
         , axis.text.x = element_blank()
         , axis.text.y = element_blank()
         , axis.ticks = element_blank()
         , axis.title.x = element_blank()
         , axis.title.y = element_blank()
 )
 
p <- ggplot(df,aes(x=x,y=y,colour=freq,label=term,size=freq)) + geom_text() + coord_polar()+ clear 
ggsave("because.png",p,dpi=72,scale=1.3)
ggsave("because.pdf", p)

And the output:

Click for a page-sized PDF, or the raw terms and frequency counts.

The most common term is ‘freedom’, followed by ‘equal’ and ‘pro’. After those come ‘personal’, ‘government’, ‘people’, ‘school’, ‘family’, and ‘believe’. A more robust analysis could use term extraction (pro family, pro life, anti government) or stemming, and then feed the results into a better visualization. That would take more than the 10 minutes I’ve spent so far, so I’m leaving that as an exercise for somebody else.
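For anyone picking up that exercise, here is a minimal sketch of the term-extraction idea: counting two-word phrases (bigrams) so that ‘pro’ stays attached to the word that follows it. It uses base R only. The `reasons` vector and the `bigrams` helper are made up for illustration; in practice the input would be the `titles` vector from the code above. It is one possible approach, not something I ran on the real data.

```r
# Hypothetical sample data; the real input would be the `titles` vector.
reasons <- c("I am pro life and pro family",
             "Pro family, pro freedom",
             "I believe in personal freedom")

# Hypothetical helper: split one string into lowercase word bigrams.
bigrams <- function(text) {
  # keep only letters and spaces, lowercase, then split on runs of spaces
  words <- strsplit(tolower(gsub("[^[:alpha:] ]", "", text)), " +")[[1]]
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  # pair each word with its successor
  paste(head(words, -1), tail(words, -1))
}

# Count bigrams across all reasons, most frequent first.
counts <- sort(table(unlist(lapply(reasons, bigrams))), decreasing = TRUE)
head(counts)
```

The resulting table could be turned into a data frame of phrases and frequencies and plotted the same way as the single terms. Stop words are not filtered here, so phrases like ‘i am’ would still need to be dropped.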

As it is I have the most common answer as to why GOP.com visitors are Republicans: freedom. I think that’s probably why anybody belongs to any political party, but without a corpus from other parties I suppose we’ll never know.
