Presidential Debates 2012

October 23, 2012
By

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

I have been playing with the beta version of qdap utilizing the presidential debates as a data set. qdap is in a beta phase lacking documentation though I’m getting there. In previous blog posts (presidential debate 1 LINK and VP debate LINK) I demonstrated some of the capabilities of qdap. I wanted to further show some of qdap’s capabilities while seeking to provide information about the debates.

In previous posts readers made comments or emailed regarding functionality of qdap . This was extremely helpful in working out bugs that arise on various operating systems. If you have praise or methods you used to run the qdap scripts please leave a comment saying so. However, if you are having difficulty please file an issue at qdap’s home, GitHub (LINK).

In this post we’ll be looking at:

1. A faceted gantt plot for each of the speeches via gantt_plot
2. Various word statistics via word_stats
3. A venn diagram showing the overlap in word usage via trans.venn
4. A dissimilarity matrix indicating closeness in speech via dissimilarity
5. iGraph Visualization of dissimilarity

Reading in the data sets and Cleaning

library(qdap) #load qdap
# download transcript of the debate to working directory
url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)   

# load multiple files with read transcript and assign to global environment
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))
dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue"))
dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)
dat2$dialogue <- qprep(dat2$dialogue)
dat3$dialogue <- qprep(dat3$dialogue)

# Split each sentece into it's own line
dat1b <- sentSplit(dat1, "dialogue") 
dat1$person <- factor(dat1$person , levels = qcv(ROMNEY, OBAMA, LEHRER))
dat2b <- sentSplit(dat2, "dialogue")  
dat3b <- sentSplit(dat3, "dialogue") 

# Create a large data frame by the three debates times
L1 <- list(dat1b, dat2b, dat3b)
L1 <- lapply(seq_along(L1), function(i) data.frame(L1[[i]], time = paste("time", i)))
dat4 <- do.call(rbind, L1)

#view a truncated version of the data (see also htruncdf)
truncdf(dat4)

Faceted Gantt Plot

#reorder factor levels
dat4$person <- factor(dat4$person, 
    levels=qcv(terms="OBAMA ROMNEY CROWLEY LEHRER QUESTION SCHIEFFER"))

with(dat4, gantt_plot(dialogue, person, time, xlab = "duration(words)", scale = "free"))

rm3

Basic Word Statistics
This section utilizes the word_stats function in conjunction with ggplot2 to create a heat map for various descriptive word statistics. Below is a list of column names for the function’s default print method.

   column title description                           
1  n.tot        number of turns of talk               
2  n.sent       number of sentences                   
3  n.words      number of words                       
4  n.char       number of characters                  
5  n.syl        number of syllables                   
6  n.poly       number of polysyllables               
7  sptot        syllables per turn of talk            
8  wptot        words per turn of talk                
9  wps          words per sentence                    
10 cps          characters per sentence               
11 sps          syllables per sentence                
12 psps         poly-syllables per sentence           
13 cpw          characters per word                   
14 spw          syllables per word                    
15 n.state      number of statements                  
16 n.quest      number of questions                   
17 n.exclm      number of exclamations                
18 n.incom      number of incomplete statements       
19 p.state      proportion of statements              
20 p.quest      proportion of questions               
21 p.exclm      proportion of exclamations            
22 p.incom      proportion of incomplete statements   
23 n.hapax      number of hapax legomenon             
24 n.dis        number of dis legomenon               
25 grow.rate    proportion of hapax legomenon to words
26 prop.dis     proportion of dis legomenon to words  
z <- with(dat4, word_stats(dialogue, list(person, time), tot))
z$ts
z$gts
plot(z, low="white", high="black")
plot(z, label=TRUE, low="white", high="black", lab.digits=1)

heatmap

Venn Diagram
With proper stop word use and small, variable data sets a Venn diagram can be informative. In this case the overlap is fairly strong and less informative though labels are centered. Thus labels closer in proximity are closer in words used.

with(subset(dat4, person == qcv(ROMNEY, OBAMA)), 
    trans.venn(dialogue, list(person, time), 
    title.name = "Presidential Debates Word Overlap 2012")
)

venn

Dissimilarity Matrix

dat5 <- subset(dat4, person == qcv(ROMNEY, OBAMA))
dat5$person <- factor(dat5$person, levels = qcv(OBAMA, ROMNEY))
#a word frequency matrix inspired by the tm package's DocumentTermMatrix
with(dat5, wfm(dialogue, list(person, time)))
#with row and column sums
with(dat5, wfdf(dialogue, list(person, time), margins = TRUE))
#dissimilarity (similar to a correlation 
#The default emasure is 1 - binary or proportion overlap between grouping variable
(sim <- with(dat5, dissimilarity(dialogue, list(person, time))))
              OBAMA.time.1 OBAMA.time.2 OBAMA.time.3 ROMNEY.time.1 ROMNEY.time.2
OBAMA.time.2         0.293                                                      
OBAMA.time.3         0.257        0.303                                         
ROMNEY.time.1        0.317        0.261        0.245                            
ROMNEY.time.2        0.273        0.316        0.285         0.317              
ROMNEY.time.3        0.240        0.276        0.311         0.265         0.312

Network Graph
The use of igraph may not always be the best way to view the data but this exercise shows one way this package can be utilized. In this plot the wlabels are sized based on number of words used. The distance measures that label the edges are taken from the dissimilarity function (1 – binary). Colors are based on political party.

library(igraph)
Z <- with(dat5, adjacency_matrix(wfm(dialogue, list(person, time))))
g <- graph.adjacency(Z$adjacency, weighted=TRUE, mode ='undirected')
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

set.seed(3952)
layout1 <- layout.auto(g)
opar <- par()$mar; par(mar=rep(.5, 4)) #Give the graph lots of room
plot(g, layout=layout1)

edge.weight <- 9  #a maximizing thickness constant
z1 <- edge.weight * sim/max(sim)*sim
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
numformat <- function(val, digits = 2) { sub("^(-?)0.", "\\1.", sprintf(paste0("%.", digits, "f"), val)) }
z2 <- numformat(round(sim, 3), 3)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out! 

label.size <- 15 #a maximizing label size constant
WC <- aggregate(dialogue~person +time, data=dat5, function(x)  sum(word.count(x), na.rm = TRUE))
WC <- WC[order(WC$person, WC$time), 3]
resize <- (log(WC)/max(log(WC)))
V(g)$label.cex <- 5 *(resize - .8)
plot(g, layout=layout1) #check it out!

V(g)$color <- ifelse(substring(V(g)$label, 1, 2)=="OB", "pink", "lightblue")

plot(g, layout=layout1)
tkplot(g)

igr

This blog post is a rough initial analysis of the three presidential debates. It was meant as a means of demonstrating the capabilities of qdap rather than providing in depth analysis of the candidates. Please share your experiences with using qdap in a comment below and suggestions for improvement via the issues page of qdap’s github site(LINK).

For a pdf version of all the graphics created in the blog post -click here-


To leave a comment for the author, please follow the link and comment on his blog: TRinker's R Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.