qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.
Getting qdap to work may take some finagling because the package relies on the openNLP package, so you have to make sure you have the correct version of Java installed. I know the package can be installed on all three major operating systems. You'll also notice quickly that it relies on the tm, ggplot2, and wordcloud packages as well.
Here’s the github link for qdap (LINK) and install instructions:
# install.packages("devtools")
library(devtools)
install_github("qdap", "trinker")
Note: I display the graphics here as .png files but recommend .pdf or .svg, as the image is much clearer.
Getting and cleaning transcripts of the debate
library(qdap)

# downloads a docx file of the debate to the working directory
url_dl("pres.deb1.docx")

# the read.transcript function allows reading in of a docx file
# special thanks to Bryan Goodrich for his work on this
dat <- read.transcript("pres.deb1.docx", col.names = c("person", "dialogue"))
truncdf(dat)
left.just(dat)

# removes brackets (non-dialogue)
dat$dialogue <- bracketX(dat$dialogue)
# changes symbols to words (i.e., % = percent)
dat$dialogue <- symbol_change(dat$dialogue)
# changes numbers to word form (compliments of John Fox)
dat$dialogue <- num_replace(dat$dialogue)
# removes dashes
dat$dialogue <- scrubber(gsub("-", " ", dat$dialogue))

# sentSplit splits turns of talk into sentences
# special thanks to Dason Kurkiewicz for his work on this
dat2 <- sentSplit(dat, "dialogue", stem.col = FALSE)

# view a truncated version of the data (see also truncdf)
htruncdf(dat2)
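To give a feel for what the bracket and symbol steps are doing, here is a rough base R sketch. The helper names strip_brackets and symbols_to_words and the sample sentence are made up for illustration; this is not how qdap implements bracketX or symbol_change, just the general idea.

```r
# Illustrative base R helpers (hypothetical names); NOT qdap's implementations.

# Drop anything inside [...] or (...), then tidy up the whitespace.
strip_brackets <- function(x) {
  x <- gsub("\\[[^]]*\\]|\\([^)]*\\)", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# Spell out a couple of common symbols as words.
symbols_to_words <- function(x) {
  x <- gsub("%", " percent ", x, fixed = TRUE)
  x <- gsub("&", " and ", x, fixed = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Made-up sample text, not taken from the transcript.
raw <- "We cut taxes [applause] by 5% & created jobs (crosstalk) last year."
clean <- symbols_to_words(strip_brackets(raw))
print(clean)
```

The qdap functions handle many more cases than this; the sketch only shows why the cleaning happens before any word counting.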
Wordclouds (relies on Ian Fellows’ wordcloud package)
# first put a unique character between words we want to keep together
dat2$dia2 <- mgsub(
    c("Governor Romney", "President Obama", "middle class"),
    c("Governor~Romney", "President~Obama", "middle~class"),
    dat2$dialogue
)

# the word cloud by grouping variable function
with(dat2, trans.cloud(dia2, person, proportional = TRUE,
    target.words = list(
        health = c("health", "insurance", "medic", "obamacare", "hospital"),
        economic = c("econom", "jobs", "unemploy", "business", "banks",
            "budget", "market", "paycheck"),
        foreign = c("war ", "terror", "foreign"),
        class = c("middle~class", "poor", "rich"),
        oponent = c("romney ", "obama")
    ),
    cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
    legend = c("health", "economic", "foreign", "class", "oponent"),
    stopwords = exclude(Top25Words, "he", "I"),
    char2space = "~"))
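The "~" trick is easier to see on a tiny example. Here is a minimal base R sketch of the same idea: glue a phrase together so it survives splitting on spaces, then turn the joiner back into a space for display. The function name protect_phrases and the sample sentence are hypothetical, and this is not qdap's mgsub or trans.cloud code.

```r
# Illustrative only (hypothetical helper): join multiword phrases with "~"
# before tokenizing, then swap "~" back to a space for display.
protect_phrases <- function(x, phrases, joiner = "~") {
  for (p in phrases) {
    x <- gsub(p, gsub(" ", joiner, p, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

# Made-up sample sentence.
txt <- "The middle class matters to President Obama"
prot <- protect_phrases(txt, c("middle class", "President Obama"))

# Splitting on spaces now keeps each protected phrase as one token.
tokens <- strsplit(prot, " ", fixed = TRUE)[[1]]

# Restore the space for plotting labels (the role char2space plays above).
labels <- gsub("~", " ", tokens, fixed = TRUE)
print(labels)
```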
Visuals of the trans.cloud function
Gantt Plot of the dialogue over time
As will be obvious once you see the output, this uses Hadley Wickham's ggplot2.
# special thanks to Andrie de Vries for his work on this function
with(dat2, gantt_plot(dialogue, person, xlab = "duration(words)",
    x.tick = TRUE, minor.line.freq = NULL, major.line.freq = NULL,
    rm.horiz.lines = FALSE))
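Since the x-axis is measured in words rather than clock time, each turn of talk can be thought of as a segment running from a start word to an end word. A minimal base R sketch of that bookkeeping, using a made-up toy data frame (speakers A and B are placeholders, and this is my assumption about the general idea, not gantt_plot's actual internals):

```r
# Toy data, invented for illustration; not from the debate transcript.
turns <- data.frame(
  person = c("A", "B", "A", "B"),
  dialogue = c("Good evening", "Thank you very much indeed",
               "Let us begin", "I agree completely"),
  stringsAsFactors = FALSE
)

# Words per turn, counted by splitting on whitespace.
n_words <- lengths(strsplit(turns$dialogue, "\\s+"))

# Each turn ends at the running word total and starts where the last one ended.
turns$end <- cumsum(n_words)
turns$start <- turns$end - n_words

print(turns[, c("person", "start", "end")])
```

Segments like these could then be drawn per speaker, for example with ggplot2's geom_segment.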
Visualization of the Gantt Plot
Formality scores (how formal a person’s language is)
This concept comes from:
Heylighen, F., & Dewaele, J.-M. (2002). Variation in the contextuality of language: An empirical measure. Foundations of Science, 7(3), 293–340. doi:10.1023/A:1019661126744
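The F-measure in that paper combines part-of-speech frequencies, each expressed as a percentage of all words: nouns, adjectives, prepositions, and articles push the score up, while pronouns, verbs, adverbs, and interjections pull it down. Here is a minimal sketch of the arithmetic, with a hypothetical function name and made-up counts; qdap's formality function may differ in how it tags and groups words.

```r
# F = (noun + adjective + preposition + article
#      - pronoun - verb - adverb - interjection + 100) / 2
# where each term is a percentage of all words.
# Hypothetical helper for illustration; not qdap's formality().
f_score <- function(counts) {
  formal <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  pct <- 100 * counts / sum(counts)  # counts must cover every word
  (sum(pct[formal]) - sum(pct[contextual]) + 100) / 2
}

# Made-up part-of-speech counts for a 100-word text.
toy <- c(noun = 28, adjective = 10, preposition = 12, article = 8,
         pronoun = 10, verb = 20, adverb = 6, interjection = 2, other = 4)
print(f_score(toy))
```

A score of 50 means the two groups balance out; higher values indicate more formal language.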
The code can be run in parallel because this is a slower function. It uses openNLP to first map parts of speech for every word.
# parallel: about 1:20 on an 8 GB RAM, 8-core i7 machine
v <- with(dat2, formality(dialogue, person, plot = TRUE, parallel = TRUE))

# about 4 minutes on an 8 GB RAM i7 machine
v <- with(dat2, formality(dialogue, person, plot = TRUE))

# note you can resupply the output from formality back
# to formality and change arguments. This avoids the need for
# openNLP, saving time.
with(dat2, formality(v, person, plot = TRUE, bar.colors = c("Dark2")))
Output and plot from the formality function
  person word.count formality
1 ROMNEY       4068     61.82
2 LEHRER        765     61.31
3  OBAMA       3595     58.30