“I’m a Republican because…”, visualized with R

October 15, 2009

(This article was first published on Offensive Politics » R, and kindly contributed to R-bloggers)

The GOP recently relaunched its main web site with a new design and numerous interactive and social features like Facebook integration, blogs, etc. Of particular interest is the GOP Faces section, which asks users to submit a photo and answer the question “Why are you a Republican?” Not being a Republican, I was curious to see if there were any common themes among the submissions that would lead to insights about being a Republican and GOP.com user. Not excited about actually reading all 180 reasons, I instead used R to download, transform, analyze and visualize the data for me.

I used two packages, XML and plyr, to fetch and extract the reasons, then tm to filter stop words and identify commonly used terms. Finally, I used ggplot2, the invaluable ggplot2 book, and a helpful post from the R-help mailing list to produce the visualization.

R code

library(XML)
library(plyr)
library(ggplot2)
library(tm)
 
# fetch & parse the HTML
doc <- htmlParse("http://gop.com/index.php/learn/republican_faces/",isURL = TRUE)
# pull the matching A elements of CSS class tipz
nodes <- getNodeSet(doc, "//a[@class='tipz']")
# extract the 'title' attribute 
titles <- sapply(nodes, function(x) xmlAttrs(x)[["title"]])
# clean up the title attribute 
titles <- sub("^[^:]+::","",titles)
# create the corpus and doc term matrix
co <- Corpus(VectorSource(titles))
tdm <- DocumentTermMatrix(co, control=list(tolower=TRUE, removeNumbers=TRUE, stopwords=TRUE))
# extract the terms that occur exactly 1, 2, 3 or 4 times
levels <- c(1,2,3,4)
df <- ldply(levels, function(x) data.frame(freq=x,term=findFreqTerms(tdm,x,x))) 
# assign random non-repeating x coordinates to the terms
df$x <- sample(1:nrow(df),nrow(df), replace=FALSE)
# jitter the y coordinate around each term's frequency
df$y <- df$freq + rnorm(nrow(df))
 
# clear standard graph options (thanks mike lawrence on r-help)
clear <- theme(
         legend.position = 'none'
         , panel.grid.minor = element_blank()
         , panel.grid.major = element_blank()
         , panel.background = element_blank()
         , axis.line = element_blank()
         , axis.text.x = element_blank()
         , axis.text.y = element_blank()
         , axis.ticks = element_blank()
         , axis.title.x = element_blank()
         , axis.title.y = element_blank()
 )
 
p <- ggplot(df,aes(x=x,y=y,colour=freq,label=term,size=freq)) + geom_text() + coord_polar()+ clear 
ggsave("because.png",p,dpi=72,scale=1.3)
ggsave("because.pdf", p)

And the output:

[Figure: I'm a Republican because...]

Click for a page-sized PDF, or the raw terms and frequency counts.

The most common term is ‘freedom’, followed by ‘equal’ and ‘pro’. After those come ‘personal’, ‘government’, ‘people’, ‘school’, ‘family’, and ‘believe’. A more robust analysis could use term extraction (pro family, pro life, anti government) or stemming, and then feed the results into a better visualization. That would take more than the 10 minutes I’ve spent so far, so I’m leaving it as an exercise for somebody else.
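
For anyone picking up that exercise, here is a minimal sketch of the term-extraction idea in base R, counting adjacent word pairs such as "pro family" rather than single words. It is not part of the original analysis: the `answers` vector and the `bigrams` helper are made up for illustration, and in practice the `titles` vector from the code above would take the place of `answers`.

```r
# Hypothetical sample answers; the scraped `titles` vector would go here instead.
answers <- c(
  "I am pro family and pro life",
  "Pro life, pro family, and personal freedom",
  "Less government, more freedom"
)

# Split one answer into lowercase words and join each word with its successor.
bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens from leading punctuation
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1])
}

# Count bigrams across all answers, most frequent first.
counts <- sort(table(unlist(lapply(answers, bigrams))), decreasing = TRUE)
head(counts, 5)
```

No stop words are filtered here, so pairs like "family and" are counted alongside the interesting ones. A fuller version would probably drop pairs containing stop words before plotting.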

As it is, I have the most common answer as to why GOP.com visitors are Republicans: freedom. I think that’s probably why anybody belongs to any political party, but without a corpus from other parties I suppose we’ll never know.
