Presidential Debates with qdap-beta

October 4, 2012

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers)

qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.

Getting qdap to work may take some finagling because the package relies on the openNLP package, so you have to make sure the correct version of Java is installed. I know the package can be installed on all three major operating systems. You’ll also notice quickly that it relies on the tm, ggplot2, and wordcloud packages.
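Before installing, it can help to see which of these dependencies are already on your machine. The following is a minimal sketch in base R, assuming only the package names mentioned above; the variable names (deps, have, missing_pkgs, java_path) are made up for illustration, and the Java check only confirms that an executable is on the PATH, not that it is the version openNLP needs.

```r
# Packages the post says qdap relies on
deps <- c("openNLP", "tm", "ggplot2", "wordcloud")

# TRUE where a package is already installed in the current library paths
have <- deps %in% rownames(installed.packages())
names(have) <- deps
print(have)

# Report which ones still need install.packages()
missing_pkgs <- deps[!have]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
}

# openNLP needs Java; this only checks that a java executable is on the PATH
java_path <- Sys.which("java")
message("java found at: ", if (nzchar(java_path)) java_path else "<not found>")
```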

Installing qdap
Here’s the GitHub link for qdap (LINK) and the install instructions:

# install.packages("devtools")
library(devtools)  # provides install_github()
install_github("trinker/qdap")  # current devtools expects "user/repo"

Note: I display the graphics here as .png files but recommend .pdf or .svg, as the image is much clearer.

Getting and cleaning transcripts of the debate

library(qdap)

url_dl("pres.deb1.docx")  # downloads a docx file of the debate to the working directory
# the read.transcript function reads in a docx file
# special thanks to Bryan Goodrich for his work on this
dat <- read.transcript("pres.deb1.docx", col.names=c("person", "dialogue"))
dat$dialogue <- bracketX(dat$dialogue)  # removes bracketed (non-dialogue) text
dat$dialogue <- symbol_change(dat$dialogue)  # changes symbols to words (e.g., % = percent)
dat$dialogue <- num_replace(dat$dialogue)  # changes numbers to word form (compliments of John Fox)
dat$dialogue <- scrubber(gsub("-", " ", dat$dialogue))  # replaces dashes with spaces, then cleans the text
# sentSplit splits turns of talk into sentences
# special thanks to Dason Kurkiewicz for his work on this
dat2 <- sentSplit(dat, "dialogue", stem.col=FALSE)
htruncdf(dat2)   # view a truncated version of the data (see also truncdf)
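To make the cleaning steps concrete without installing qdap, here is a rough base R approximation on a made-up two-turn transcript. It is a sketch of the idea only, not how bracketX, symbol_change, scrubber, or sentSplit are actually implemented, and it skips the number-to-word step entirely.

```r
# Toy transcript; the text is made up for illustration
dat <- data.frame(
  person   = c("LEHRER", "ROMNEY"),
  dialogue = c("Welcome. [applause] Let's begin.",
               "We need 5% growth - not less."),
  stringsAsFactors = FALSE
)

x <- dat$dialogue
x <- gsub("\\[[^]]*\\]", "", x)              # drop bracketed, non-dialogue text
x <- gsub("%", " percent", x, fixed = TRUE)  # spell out a symbol
x <- gsub("-", " ", x, fixed = TRUE)         # replace dashes with spaces
x <- gsub("\\s+", " ", trimws(x))            # collapse repeated whitespace
dat$dialogue <- x

# Split each turn into sentences at ., ? or ! followed by whitespace
sents <- strsplit(dat$dialogue, "(?<=[.?!])\\s+", perl = TRUE)
dat2 <- data.frame(
  person   = rep(dat$person, lengths(sents)),
  dialogue = unlist(sents),
  stringsAsFactors = FALSE
)
print(dat2)
```

Each sentence keeps the speaker of the turn it came from, which is what lets the later functions group by person.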

Wordclouds (relies on Ian Fellows’ wordcloud package)

#first put a unique character between words we want to keep together
dat2$dia2 <- mgsub(c("Governor Romney", "President Obama", "middle class"), 
    c("Governor~Romney", "President~Obama", "middle~class"), dat2$dialogue)
# the word cloud by grouping variable function
# (trans.cloud here; named trans_cloud in later qdap releases)
with(dat2, trans.cloud(dia2, person, proportional = TRUE,
    target.words = list(health=c("health", "insurance", "medic", "obamacare", "hospital"), 
        economic = c("econom", "jobs", "unemploy", "business", "banks", 
            "budget", "market", "paycheck"),
        foreign = c("war ", "terror", "foreign"),
        class = c("middle~class", "poor", "rich"),
        opponent = c("romney ", "obama")),
    cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
    legend = c("health", "economic", "foreign", "class", "opponent"),
    stopwords=exclude(Top25Words, "he", "I"), char2space = "~"))

Visuals of the function
[Figures: wordcloud 1, wordcloud 2, wordcloud 3]
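The `~` trick used before the word cloud call is easy to try in base R. The sketch below uses a hypothetical helper, multi_gsub, that stands in for qdap's mgsub (it is not qdap's implementation), and a made-up sentence. It shows why joining words with a placeholder keeps a phrase together as one token, and how the placeholder can be turned back into a space for display, which is presumably what char2space = "~" does in the call above.

```r
# Hypothetical helper: apply several fixed-string replacements in turn
multi_gsub <- function(pattern, replacement, x) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

txt <- "President Obama and Governor Romney discussed the middle class."
joined <- multi_gsub(
  c("Governor Romney", "President Obama", "middle class"),
  c("Governor~Romney", "President~Obama", "middle~class"),
  txt
)

# Splitting on spaces now keeps each phrase as a single token
tokens <- strsplit(joined, " ", fixed = TRUE)[[1]]
print(tokens)

# For display, the placeholder is turned back into a space
labels <- gsub("~", " ", tokens, fixed = TRUE)
print(labels)
```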

Gantt Plot of the dialogue over time
Obviously (when you see the output), this uses Hadley Wickham’s ggplot2.

# special thanks to Andrie de Vries for his work on this function
with(dat2, gantt_plot(dialogue, person,  xlab = "duration(words)", x.tick=TRUE,
    minor.line.freq = NULL, major.line.freq = NULL, rm.horiz.lines = FALSE))

Visualization of the Gantt Plot
[Figure: Gantt Plot]
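The x-axis of the Gantt plot is duration in words, so each turn of talk occupies a span from where the previous turn ended to that point plus its own word count. Here is a minimal base R sketch of that bookkeeping on made-up turns; the column names n_words, start, and end are assumptions for illustration and not necessarily what gantt_plot uses internally.

```r
# Toy turns of talk (made-up text)
turns <- data.frame(
  person   = c("LEHRER", "OBAMA", "ROMNEY", "OBAMA"),
  dialogue = c("Good evening and welcome",
               "Thank you Jim",
               "It is an honor to be here",
               "Let me respond"),
  stringsAsFactors = FALSE
)

# Words per turn: split on whitespace and count the pieces
turns$n_words <- lengths(strsplit(turns$dialogue, "\\s+"))

# Each turn starts where the previous one ended
turns$end   <- cumsum(turns$n_words)
turns$start <- turns$end - turns$n_words
print(turns[, c("person", "start", "end")])
```

Drawing one horizontal segment per row, from start to end at the height of the speaker, gives the same kind of picture as the plot above.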

Formality scores (how formal a person’s language is)
This concept comes from:

Heylighen, F., & Dewaele, J.-M. (2002). Variation in the 
    contextuality of language: An empirical measure. Foundations 
    of Science, 7(3), 293–340. doi:10.1023/A:1019661126744

Because this is a slower function, the code can be run in parallel. It first uses openNLP to tag the part of speech of every word.

# in parallel: about 1:20 (min:sec) on an 8 GB RAM, 8-core i7 machine
v <- with(dat2, formality(dialogue, person, plot=TRUE, parallel=TRUE))
# not in parallel: about 4 minutes on an 8 GB RAM i7 machine
v <- with(dat2, formality(dialogue, person, plot=TRUE)) 

# note you can resupply the output from formality back
# to formality and change arguments.  This avoids the need for
# openNLP, saving time.
with(dat2, formality(v, person, plot=TRUE, bar.colors=c("Dark2")))

Output and plot from the formality function

  person word.count formality
1 ROMNEY       4068     61.82
2 LEHRER        765     61.31
3  OBAMA       3595     58.30
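For readers who want to see what goes into a score like the ones above, the F-measure in Heylighen and Dewaele (2002) is computed from part-of-speech frequencies, each expressed as a percentage of all words: F = (noun + adjective + preposition + article - pronoun - verb - adverb - interjection + 100) / 2. Below is a minimal sketch of that formula in base R. The function name formality_score and the counts are hypothetical, and qdap's own formality function may differ in details such as how it tags and groups parts of speech.

```r
# F-score from Heylighen & Dewaele (2002), computed from part-of-speech counts.
# `counts` is a named vector covering every word; categories outside the two
# groups below (e.g. conjunctions) only enter the denominator.
formality_score <- function(counts) {
  formal     <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  pct <- 100 * counts / sum(counts)   # frequencies as percent of all words
  unname((sum(pct[formal]) - sum(pct[contextual]) + 100) / 2)
}

# Hypothetical counts for one speaker
counts <- c(noun = 30, adjective = 10, preposition = 12, article = 8,
            pronoun = 15, verb = 18, adverb = 5, interjection = 2)
f <- formality_score(counts)
print(f)
```

The score runs from 0 to 100, with higher values meaning more nouns, adjectives, prepositions, and articles relative to pronouns, verbs, adverbs, and interjections.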

