Presidential Debates 2012

October 23, 2012

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

I have been playing with the beta version of qdap utilizing the presidential debates as a data set. qdap is in a beta phase lacking documentation though I’m getting there. In previous blog posts (presidential debate 1 LINK and VP debate LINK) I demonstrated some of the capabilities of qdap. I wanted to further show some of qdap’s capabilities while seeking to provide information about the debates.

In previous posts readers made comments or emailed regarding functionality of qdap . This was extremely helpful in working out bugs that arise on various operating systems. If you have praise or methods you used to run the qdap scripts please leave a comment saying so. However, if you are having difficulty please file an issue at qdap’s home, GitHub (LINK).

In this post we’ll be looking at:

1. A faceted gantt plot for each of the speeches via gantt_plot
2. Various word statistics via word_stats
3. A venn diagram showing the overlap in word usage via trans.venn
4. A dissimilarity matrix indicating closeness in speech via dissimilarity
5. iGraph Visualization of dissimilarity

Reading in the data sets and Cleaning

library(qdap) #load qdap
# download transcript of the debate to working directory
url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)   

# load multiple files with read transcript and assign to global environment
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))
dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue"))
dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)
dat2$dialogue <- qprep(dat2$dialogue)
dat3$dialogue <- qprep(dat3$dialogue)

# Split each sentece into it's own line
dat1b <- sentSplit(dat1, "dialogue") 
dat1$person <- factor(dat1$person , levels = qcv(ROMNEY, OBAMA, LEHRER))
dat2b <- sentSplit(dat2, "dialogue")  
dat3b <- sentSplit(dat3, "dialogue") 

# Create a large data frame by the three debates times
L1 <- list(dat1b, dat2b, dat3b)
L1 <- lapply(seq_along(L1), function(i) data.frame(L1[[i]], time = paste("time", i)))
dat4 <-, L1)

#view a truncated version of the data (see also htruncdf)

Faceted Gantt Plot

#reorder factor levels
dat4$person <- factor(dat4$person, 

with(dat4, gantt_plot(dialogue, person, time, xlab = "duration(words)", scale = "free"))


Basic Word Statistics
This section utilizes the word_stats function in conjunction with ggplot2 to create a heat map for various descriptive word statistics. Below is a list of column names for the function’s default print method.

   column title description                           
1  n.tot        number of turns of talk               
2  n.sent       number of sentences                   
3  n.words      number of words                       
4  n.char       number of characters                  
5  n.syl        number of syllables                   
6  n.poly       number of polysyllables               
7  sptot        syllables per turn of talk            
8  wptot        words per turn of talk                
9  wps          words per sentence                    
10 cps          characters per sentence               
11 sps          syllables per sentence                
12 psps         poly-syllables per sentence           
13 cpw          characters per word                   
14 spw          syllables per word                    
15 n.state      number of statements                  
16      number of questions                   
17 n.exclm      number of exclamations                
18 n.incom      number of incomplete statements       
19 p.state      proportion of statements              
20      proportion of questions               
21 p.exclm      proportion of exclamations            
22 p.incom      proportion of incomplete statements   
23 n.hapax      number of hapax legomenon             
24 n.dis        number of dis legomenon               
25 grow.rate    proportion of hapax legomenon to words
26 prop.dis     proportion of dis legomenon to words  
z <- with(dat4, word_stats(dialogue, list(person, time), tot))
plot(z, low="white", high="black")
plot(z, label=TRUE, low="white", high="black", lab.digits=1)


Venn Diagram
With proper stop word use and small, variable data sets a Venn diagram can be informative. In this case the overlap is fairly strong and less informative though labels are centered. Thus labels closer in proximity are closer in words used.

with(subset(dat4, person == qcv(ROMNEY, OBAMA)), 
    trans.venn(dialogue, list(person, time), = "Presidential Debates Word Overlap 2012")


Dissimilarity Matrix

dat5 <- subset(dat4, person == qcv(ROMNEY, OBAMA))
dat5$person <- factor(dat5$person, levels = qcv(OBAMA, ROMNEY))
#a word frequency matrix inspired by the tm package's DocumentTermMatrix
with(dat5, wfm(dialogue, list(person, time)))
#with row and column sums
with(dat5, wfdf(dialogue, list(person, time), margins = TRUE))
#dissimilarity (similar to a correlation 
#The default emasure is 1 - binary or proportion overlap between grouping variable
(sim <- with(dat5, dissimilarity(dialogue, list(person, time))))
              OBAMA.time.1 OBAMA.time.2 OBAMA.time.3 ROMNEY.time.1 ROMNEY.time.2
OBAMA.time.2         0.293                                                      
OBAMA.time.3         0.257        0.303                                         
ROMNEY.time.1        0.317        0.261        0.245                            
ROMNEY.time.2        0.273        0.316        0.285         0.317              
ROMNEY.time.3        0.240        0.276        0.311         0.265         0.312

Network Graph
The use of igraph may not always be the best way to view the data but this exercise shows one way this package can be utilized. In this plot the wlabels are sized based on number of words used. The distance measures that label the edges are taken from the dissimilarity function (1 – binary). Colors are based on political party.

Z <- with(dat5, adjacency_matrix(wfm(dialogue, list(person, time))))
g <- graph.adjacency(Z$adjacency, weighted=TRUE, mode ='undirected')
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

layout1 <-
opar <- par()$mar; par(mar=rep(.5, 4)) #Give the graph lots of room
plot(g, layout=layout1)

edge.weight <- 9  #a maximizing thickness constant
z1 <- edge.weight * sim/max(sim)*sim
E(g)$width <- c(z1) #remove 0s: these won't have an edge
numformat <- function(val, digits = 2) { sub("^(-?)0.", "\\1.", sprintf(paste0("%.", digits, "f"), val)) }
z2 <- numformat(round(sim, 3), 3)
E(g)$label <- c(z2)
plot(g, layout=layout1) #check it out! 

label.size <- 15 #a maximizing label size constant
WC <- aggregate(dialogue~person +time, data=dat5, function(x)  sum(word.count(x), na.rm = TRUE))
WC <- WC[order(WC$person, WC$time), 3]
resize <- (log(WC)/max(log(WC)))
V(g)$label.cex <- 5 *(resize - .8)
plot(g, layout=layout1) #check it out!

V(g)$color <- ifelse(substring(V(g)$label, 1, 2)=="OB", "pink", "lightblue")

plot(g, layout=layout1)


This blog post is a rough initial analysis of the three presidential debates. It was meant as a means of demonstrating the capabilities of qdap rather than providing in depth analysis of the candidates. Please share your experiences with using qdap in a comment below and suggestions for improvement via the issues page of qdap’s github site(LINK).

For a pdf version of all the graphics created in the blog post -click here-

To leave a comment for the author, please follow the link and comment on their blog: TRinker's R Blog » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)