After the presidential debates I used the beta version of qdap to provide some initial surface-level analysis (LINK to Presidential Debates with qdap-beta). In the comments of that post, annon (a commenter) linked to an analysis/visualization that uses bubbles to show the proportion of words, with colors and labels showing each candidate’s usage (LINK). I initially liked the graphic, but it was the shape and colors that appealed to me. Closer inspection reveals that it is hard to get information about the smaller words and that the bubbles make comparing across words difficult. I decided to attempt a visualization for the vice presidential debate using qdap and ggplot2.
I decided to use themes rather than individual words, grouping similar words together. This approach uses a qdap function called termco. Here are the function’s arguments:
termco(text.var, grouping.var=NULL, match.list, short.term = FALSE, ignore.case = TRUE, lazy.term = TRUE, elim.old = TRUE, zero.replace = 0, output = "percent", digits = 2)
Basically, you supply a list of named character vectors (our themes) to this function, along with the dialogue (the debate text) and a grouping variable (person), and it outputs a list containing several data frames. You can get raw counts, percents/proportions, or a combination of both, by grouping variable (person) for each theme.
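To make the idea of counts and percents by group concrete, here is a minimal base R sketch. It is not how termco is implemented; the toy data, the `themes` list and the `count_terms` helper are all made up for illustration, and the percent is simply theme hits divided by that person's word count.

```r
# Toy data standing in for the debate transcript (hypothetical, not the real text)
toy <- data.frame(
  person   = c("A", "B", "A", "B"),
  dialogue = c("jobs and the budget matter",
               "health insurance costs too much",
               "the market needs jobs",
               "we need doctors and jobs"),
  stringsAsFactors = FALSE
)

# Themes in the same shape as a match.list: named character vectors
themes <- list(
  economic = c(" jobs", " budget", " market"),
  health   = c(" health", " insurance", " doctor")
)

# Count fixed-string hits of every term in `terms` within one string
count_terms <- function(text, terms) {
  text <- paste0(" ", tolower(text), " ")  # pad so leading/trailing spaces can match
  sum(vapply(terms, function(term) {
    hits <- gregexpr(term, text, fixed = TRUE)[[1]]
    sum(hits > 0)  # gregexpr returns -1 when there is no match
  }, integer(1)))
}

# Raw counts: one row per person, one column per theme
by_person <- split(toy$dialogue, toy$person)
raw <- t(sapply(by_person, function(lines) {
  sapply(themes, function(terms) count_terms(paste(lines, collapse = " "), terms))
}))

# Word counts per person, then the percent of words that hit each theme
word_count <- sapply(by_person, function(lines) {
  length(strsplit(paste(lines, collapse = " "), "\\s+")[[1]])
})
pct <- round(100 * raw / word_count, 2)

raw
pct
```

The real function does considerably more (argument handling, printing, the combined raw-and-percent table), so treat this only as a picture of the arithmetic.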
The important part is the themes we supply to match.list. This function relies on gregexpr, meaning it does partial matching, so there are some things you’ll want to think about when supplying the themes:
- If you want to find "read" but not "bread" or "reading", use leading and trailing white space, as in " read "
- If you want to find any word that starts with "read", use leading white space only, as in " read"
- The leading-space form will also find "ready", so if you want only forms of the word "read" you’ll have to be explicit and put each form in the vector with leading and trailing white space, i.e. " read ", " reads ", " reader" (reader and readers), " reading "
- If you use " obama" and " obamacare", termco will count obamacare twice; instead use " obama " and " obamacare ", or just " obama"
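The padding rules above can be checked directly with gregexpr in base R. This is a small sketch using made-up sentences; `count_hits` is a hypothetical helper, not a qdap function.

```r
# Count non-overlapping fixed-string matches of `pattern` in `text`
count_hits <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when nothing matches
}

text <- " i read the bread label while reading to a reader who was ready "

count_hits("read", text)    # 5: read, bread, reading, reader, ready
count_hits(" read", text)   # 4: read, reading, reader, ready
count_hits(" read ", text)  # 1: the whole word read only

# The double-counting problem: " obama" also matches inside " obamacare"
text2 <- " obama defended obamacare "
count_hits(" obama", text2) + count_hits(" obamacare", text2)    # 3: obamacare counted twice
count_hits(" obama ", text2) + count_hits(" obamacare ", text2)  # 2: each counted once
```

Note that the text itself is padded with a space at each end; without that, a padded term could never match the first or last word.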
The basic form for the list of vectors supplied to match.list is:
target.words <- list(
    theme_1 = c(),
    theme_2 = c(),
    theme_n = c()
)
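Before handing a list like this to match.list, it can help to check whether any term sits inside another term of the same theme, since that is what causes double counting. The helper below is a hypothetical sketch in base R (`nested_terms` is not part of qdap), and the example themes are invented.

```r
# Hypothetical helper (not part of qdap): for each theme, report terms that
# are substrings of other terms in the same theme
nested_terms <- function(match_list) {
  lapply(match_list, function(terms) {
    found <- character(0)
    for (short in terms) {
      for (long in terms) {
        if (short != long && grepl(short, long, fixed = TRUE)) {
          found <- c(found, paste0("'", short, "' is inside '", long, "'"))
        }
      }
    }
    found
  })
}

target.words <- list(
  people = c(" obama", " obamacare"),
  health = c(" health", " insurance")
)

res <- nested_terms(target.words)
res$people  # flags ' obama' inside ' obamacare'
res$health  # character(0): nothing nested
```

An empty result does not guarantee clean counts (terms can still overlap in the text itself), but it catches the most common mistake.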
Let’s look at the results with some themes I examined for the VP debate:
library(qdap)

url_dl("vpres.deb1.docx")  # downloads a docx file of the debate to the working directory
dat <- read.transcript("vpres.deb1.docx", col.names = c("person", "dialogue"))
truncdf(dat)
left.just(dat)
dat$dialogue <- qprep(dat$dialogue)
dat2 <- sentSplit(dat, "dialogue")
htruncdf(dat2)  # view a truncated version of the data (see also truncdf)
dat2$person <- factor(Trim(dat2$person))

# the themes we're looking at
# (termco is only as good as the researcher who supplied these themes)
tw2 <- list(
    health = c(" health", " insurance", " medic", "obamacare", " hospital", " doctor"),
    economic = c(" econom", " jobs", " unemploy", " business", " banks", " mortgage",
        " budget", " market", " paycheck", " wall street"),
    foreign = c(" war ", " terror", " foreign", "iran", "iraq", "sanctions", "nuclear",
        "al qaida", "libya", "netanyahu", "israel", "africa", "afgha", " embassy", "russia"),
    democratic_people = c("the president", " obama ", " obamas", " obama's", "biden",
        "the vice president", "mister vice president"),
    republican_people = c("my friend", " ryan", "romney"),
    obama_any_name = c("obama ", "obamas", "obama's", "the president"),
    "romney",  # you don't have to name a vector of length 1
    obama_by_name = c("obama ", "obamas", "obama's")
)

(a <- with(dat2, termco(dialogue, person, tw2, short.term = TRUE)))
names(a)  # see what else is in the termco object
a$raw     # raw numbers of use
a$prop    # proportions or percentages of use
a$rnp     # default print for termco
plot(a)
For a txt version of the data frame that termco produces click here
Creating the graphic of the themes via ggplot2
library(ggplot2)
library(reshape2)

dat3 <- melt(a$raw[-2, ], id = qcv(person, word.count))  # drop the moderator
dat3$labs <- melt(a$rnp[-2, ], id = qcv(person, word.count))[, 4]
# order the themes by the larger of the two candidates' proportions
dat3$variable <- factor(dat3$variable,
    levels = names(sort(apply(a$prop[-2, -c(1:2)], 2, max))))
# label positions and colors; as written, each assignment overwrites the one
# before it, so every label ends up at y = 65.75 in black
dat3$loc <- dat3$value - 6.5; dat3$loc <- 7; dat3$loc <- 65.75
dat3$cols <- rep("white", 16); dat3$cols <- "black"

ggplot(dat3, aes(x = variable, y = value, fill = person)) +
    geom_bar(position = "dodge", stat = "identity") +
    coord_flip() +
    theme_bw() +
    theme(legend.position = c(.91, 0.07),
        legend.background = element_rect(color = "grey60"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
    ylab("Occurrences") +
    xlab("Theme") +
    scale_fill_manual(values = c("#0000FF", "#FF0000"), name = "Candidate",
        guide = guide_legend(reverse = TRUE)) +
    geom_text(aes(label = labs, y = loc, x = variable), size = 5,
        position = position_dodge(width = 0.9), color = dat3$cols) +
    # counts are numeric, so the y axis needs a continuous scale
    scale_y_continuous(expand = c(0, 0), breaks = seq(0, 80, 20))
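The factor re-leveling step in the code above is what sorts the bars. Here is a small base R sketch of the same idea on invented numbers; the `prop` data frame and its values are made up, and the long format is built by hand rather than with melt so the example has no package dependencies.

```r
# Hypothetical proportions: one row per person, one column per theme
prop <- data.frame(
  person   = c("A", "B"),
  health   = c(0.50, 0.20),
  economic = c(1.10, 2.30),
  foreign  = c(0.90, 0.70)
)

# Largest value per theme across people, sorted ascending
theme_max <- apply(prop[, -1], 2, max)
lvls <- names(sort(theme_max))

# Long format: one row per person/theme pair, with themes as an ordered factor
long <- data.frame(
  person   = rep(prop$person, times = 3),
  variable = factor(rep(names(prop)[-1], each = 2), levels = lvls),
  value    = unlist(prop[, -1], use.names = FALSE)
)

levels(long$variable)  # "health" "foreign" "economic"
```

Because ggplot2 draws discrete axes in level order, and coord_flip() puts the last level at the top, the theme with the largest maximum ends up as the top bar.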
For a pdf version of the output click here
Discussion of the results
I first ran a search to see who used the name Obama the most, and I saw that Vice President Biden used the name only once. I initially concluded (wrongly) that he was focused on himself; after all, the point of the vice presidential debate is to sell your boss as the winner. After more inspection of the terminology (via word clouds), I found that Biden refers to President Obama as “The President”. This must be an inner-circle mark of respect that’s so ingrained in the Vice President that using “Mr. Obama” or “President Obama” just doesn’t happen for him.
I also noticed Ryan pushed the economic theme hard. Vice President Biden discussed the opposition quite a bit as well.
This was a quick and dirty demo. I didn’t put a tremendous amount of thought into the themes; the aim was to demonstrate how qdap can help a researcher represent themes numerically and visually.