qdap 1.3.1 Release: Demoing Dispersion Plots, Sentiment Analysis, Easy Hash Lookups, Boolean Searches and More…

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers.]

We’re very pleased to announce the release of qdap 1.3.1.

This is the latest installment of the qdap package available at CRAN. Several important updates have occurred since the 1.1.0 release, most notably the addition of two vignettes and some generic view methods.

The new vignettes include:

  1. An Introduction to qdap
  2. qdap-tm Package Compatibility

The former is a detailed HTML-based guide that overviews the intended use of qdap functions. The second vignette explains how to move between qdap and tm package forms as qdap moves toward greater compatibility with this seminal R text mining package.

To install use:

install.packages("qdap")

Some of the changes in versions 1.2.0-1.3.1 include:


Generic Methods

  • scores generic method added to view scores from select qdap objects.
  • counts generic method added to view counts from select qdap objects.
  • proportions generic method added to view proportions from select qdap objects.
  • preprocessed generic method added to view preprocessed data from select qdap objects.

These methods allow the user to grab particular parts of qdap objects in a consistent fashion. Most of these methods also pick up a corresponding plot method. This fits the qdap philosophy that results should be easy to grab and easy to visualize. For instance:

(x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person))

## methods
scores(x)
plot(scores(x))
counts(x)
plot(counts(x))
proportions(x)
plot(proportions(x))
truncdf(preprocessed(x), 15)
plot(preprocessed(x))
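
For readers curious how this kind of accessor-plus-plot pairing can be wired up, here is a minimal base R sketch of the S3 pattern. It is not qdap's implementation: the class names `my_result` and `my_counts` and the generic `counts_of` are hypothetical and chosen so they do not clash with qdap's own `counts`.

```r
# Hypothetical constructor: wraps word counts in an object of class "my_result".
new_result <- function(words) {
  structure(list(counts = table(words), n = length(words)), class = "my_result")
}

# A generic accessor and its method; the method returns a data frame that
# carries an extra class so that plot() can dispatch on it.
counts_of <- function(x, ...) UseMethod("counts_of")

counts_of.my_result <- function(x, ...) {
  out <- as.data.frame(x$counts, stringsAsFactors = FALSE)
  class(out) <- c("my_counts", "data.frame")
  out
}

# plot() method for the accessor's return value.
plot.my_counts <- function(x, ...) {
  barplot(x$Freq, names.arg = x$words, ...)
}

res <- new_result(c("fun", "dumb", "fun", "liar"))
counts_of(res)          # prints as an ordinary data frame
# plot(counts_of(res))  # dispatches to plot.my_counts
```

Because the accessor's return value has its own class, `plot(counts_of(res))` works without the user needing to know how the object is stored internally.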

Demoing Some of the New Features

We’d like to take the time to highlight some of the development that has happened in qdap in the past several months:

Dispersion Plots

 wrds <- freq_terms(pres_debates2012$dialogue, stopwords = Top200Words)

## Add leading/trailing spaces if desired
wrds2 <- spaste(wrds)

## Use `~~` to maintain spaces
wrds2 <- c(" governor~~romney ", wrds2[-c(3, 12)])

## Plot
with(pres_debates2012 , dispersion_plot(dialogue, wrds2, rm.vars = time, 
    color="black", bg.color="white")) 

 with(rajSPLIT, dispersion_plot(dialogue, c("love", "night"),
    bg.color = "black", grouping.var = list(fam.aff, sex),
    color = "yellow", total.color = "white", horiz.color="grey20")) 

Word Correlation

 library(tm)
data("crude")
oil_cor1 <- apply_as_df(crude, word_cor, word = "oil", r=.7)
plot(oil_cor1) 

 oil_cor2 <- apply_as_df(crude, word_cor, word = qcv(texas, oil, money), r=.7)
plot(oil_cor2, ncol=2)
 

Easy Hash Table

A Small Example

 lookup(1:5, data.frame(1:4, 11:14))

## [1] 11 12 13 14 NA

## Leave alone elements w/o a match
lookup(1:5, data.frame(1:4, 11:14), missing = NULL) 

## [1] 11 12 13 14  5
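
For comparison, the same key-value behavior can be sketched in base R with `match`. This is only an illustration of the idea, not how `lookup` is implemented; `lookup_sketch` and its `keep_unmatched` argument are hypothetical names.

```r
# Base R sketch of a two-column key lookup (hypothetical helper, not qdap code).
lookup_sketch <- function(terms, key, keep_unmatched = FALSE) {
  idx <- match(terms, key[[1]])   # position of each term in the key column
  out <- key[[2]][idx]            # NA where there was no match
  if (keep_unmatched) {
    # Leave elements without a match as they were
    out[is.na(idx)] <- terms[is.na(idx)]
  }
  out
}

key <- data.frame(1:4, 11:14)
lookup_sketch(1:5, key)
lookup_sketch(1:5, key, keep_unmatched = TRUE)
```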

Scaled Up to 3 Million Records

key <- data.frame(x=1:2, y=c("A", "B"))

##   x y
## 1 1 A
## 2 2 B

big.vec <- sample(1:2, 3000000, replace = TRUE)
out <- lookup(big.vec, key)
out[1:20]

## On my system 3 million records in:
## Time difference of 24.5534 secs

Binary Operator Version

 codes <- list(A=c(1, 2, 4),
    B = c(3, 5),
    C = 7,
    D = c(6, 8:10))

1:12 %l% codes

##  [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D" NA  NA 

1:12 %l+% codes

##  [1] "A"  "A"  "B"  "A"  "B"  "D"  "C"  "D"  "D"  "D"  "11" "12" 
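
To see what a named list key amounts to, the sketch below flattens `codes` into a two-column key and matches against it in base R. It reproduces the two results above but is only an illustration; the column names `value` and `label` are our own and say nothing about how `%l%` and `%l+%` work internally.

```r
codes <- list(A = c(1, 2, 4),
    B = c(3, 5),
    C = 7,
    D = c(6, 8:10))

# Flatten the named list: one row per value, labeled with its list name
key <- data.frame(
    value = unlist(codes, use.names = FALSE),
    label = rep(names(codes), lengths(codes)),
    stringsAsFactors = FALSE
)

# Unmatched elements become NA (compare with 1:12 %l% codes)
labels <- key$label[match(1:12, key$value)]
labels

# Unmatched elements keep their original value (compare with 1:12 %l+% codes)
filled <- ifelse(is.na(labels), as.character(1:12), labels)
filled
```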

Simple-Quick Boolean Searches

We’ll be demoing this capability on the qdap data set DATA:

##        person                                 state
## 1         sam         Computer is fun. Not too fun.
## 2        greg               No it's not, it's dumb.
## 3     teacher                    What should we do?
## 4         sam                  You liar, it stinks!
## 5        greg               I am telling the truth!
## 6       sally                How can we be certain?
## 7        greg                      There is no way.
## 8         sam                       I distrust you.
## 9       sally           What are you talking about?
## 10 researcher         Shall we move on?  Good then.
## 11       greg I'm hungry.  Let's eat.  You already? 

First a brief explanation from the documentation:

terms – A character string(s) to search for. The terms are arranged in a single string with AND (use AND or && to connect terms together) and OR (use OR or || to allow for searches of either set of terms). Spaces may be used to control what is searched for. For example, using " I " on c("I'm", "I want", "in") will result in FALSE TRUE FALSE, whereas "I" will match all three (if case is ignored).

Let’s see how it works. We’ll start with " I ORliar&&stinks". This will find sentences that contain " I ", or that contain both "liar" and "stinks".

 boolean_search(DATA$state, " I ORliar&&stinks")

## The following elements meet the criteria:
## [1] 4 5 8

boolean_search(DATA$state, " I &&.", values=TRUE)

## The following elements meet the criteria:
## [1] "I distrust you."

boolean_search(DATA$state, " I OR.", values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I am telling the truth!"              
## [4] "There is no way."                     
## [5] "I distrust you."                      
## [6] "Shall we move on?  Good then."        
## [7] "I'm hungry.  Let's eat.  You already?"

boolean_search(DATA$state, " I &&.")

## The following elements meet the criteria:
## [1] 8 
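
The AND/OR logic in these searches can be approximated in a few lines of base R. The sketch below is not the `boolean_search` source. It assumes that terms are matched as fixed strings, that case is ignored, and that each text element is padded with a space on both sides so that " I " can match at the start of a sentence. The function name `bool_sketch` is hypothetical, and the sentences are typed in from the DATA listing above.

```r
state <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
    "What should we do?", "You liar, it stinks!", "I am telling the truth!",
    "How can we be certain?", "There is no way.", "I distrust you.",
    "What are you talking about?", "Shall we move on?  Good then.",
    "I'm hungry.  Let's eat.  You already?")

bool_sketch <- function(text, terms) {
  # Assumption: pad with spaces and ignore case
  padded <- paste0(" ", tolower(text), " ")
  # Split into OR groups first, so AND binds more tightly than OR
  or_groups <- strsplit(terms, "OR|\\|\\|")[[1]]
  hits <- vapply(or_groups, function(group) {
    and_terms <- tolower(strsplit(group, "AND|&&")[[1]])
    # An element satisfies the group only if it contains every AND term
    Reduce(`&`, lapply(and_terms, function(term) {
      grepl(term, padded, fixed = TRUE)
    }))
  }, logical(length(text)))
  # An element meets the criteria if any OR group is satisfied
  which(rowSums(matrix(hits, nrow = length(text))) > 0)
}

bool_sketch(state, " I ORliar&&stinks")
bool_sketch(state, " I &&.")
```

On these eleven sentences the sketch selects elements 4, 5 and 8 for the first search and element 8 for the second, in line with the output shown above.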

Exclusion as Well

boolean_search(DATA$state, " I ||.", values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I am telling the truth!"              
## [4] "There is no way."                     
## [5] "I distrust you."                      
## [6] "Shall we move on?  Good then."        
## [7] "I'm hungry.  Let's eat.  You already?"

boolean_search(DATA$state, " I ||.", exclude = c("way", "truth"), values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I distrust you."                      
## [4] "Shall we move on?  Good then."        
## [5] "I'm hungry.  Let's eat.  You already?"  

Binary Operator Version

 dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4)

##        x y
## 1  Doggy 1
## 2  Hello 2
## 3 Hi Dog 3
## 4  Zebra 4

z <- data.frame(z =c("Hello", "Dog"))

##       z
## 1 Hello
## 2   Dog

dat[dat$x %bs% paste(z$z, collapse = "OR"), ]  

Polarity (Sentiment)

The polarity function is an extension of the work originally done by Jeffrey Breen, with some accompanying plotting methods. For more information see the Introduction to qdap Vignette.

 poldat2 <- with(mraja1spl, polarity(dialogue,
    list(sex, fam.aff, died)))
colsplit2df(scores(poldat2))[, 1:7]

##     sex fam.aff  died total.sentences total.words ave.polarity sd.polarity
## 1     f     cap FALSE             158        1810  0.076422846   0.2620359
## 2     f     cap  TRUE              24         221  0.042477906   0.2087159
## 3     f    mont  TRUE               4          29  0.079056942   0.3979112
## 4     m     cap FALSE              73         717  0.026496626   0.2558656
## 5     m     cap  TRUE              17         185 -0.159815603   0.3133931
## 6     m   escal FALSE               9         195 -0.152764808   0.3131176
## 7     m   escal  TRUE              27         646 -0.069421082   0.2556493
## 8     m    mont FALSE              70         952 -0.043809741   0.3837170
## 9     m    mont  TRUE             114        1273 -0.003653114   0.4090405
## 10    m    none FALSE               7          78  0.062243180   0.1067989
## 11 none    none FALSE               5          18 -0.281649658   0.4387579

The Accompanying Plotting Methods

plot(poldat2)

 plot(scores(poldat2))   
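
As a point of reference for what dictionary-based sentiment scoring looks like at its simplest, here is a base R sketch in the spirit of a count-based approach: positive word matches minus negative word matches per sentence. It makes no attempt to reproduce what `polarity` computes. The word lists are tiny hypothetical stand-ins for a real sentiment lexicon, and `count_score` is our own name.

```r
# Hypothetical toy word lists; a real analysis would use a full lexicon.
positive_words <- c("fun", "good", "truth")
negative_words <- c("dumb", "liar", "stinks", "distrust")

count_score <- function(sentence) {
  # Lowercase, replace punctuation with spaces, split on whitespace
  cleaned <- gsub("[^[:alpha:][:space:]']", " ", tolower(sentence))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # Positive matches minus negative matches
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

sentences <- c("Computer is fun. Not too fun.",
    "You liar, it stinks!",
    "I am telling the truth!")

vapply(sentences, count_score, integer(1), USE.NAMES = FALSE)
```

Note that this naive count scores "Not too fun." as positive because it ignores the negation, which is exactly the kind of limitation that motivates a more careful scoring function.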

Question Type

 dat <- c("Kate's got no appetite doesn't she?",
    "Wanna tell Daddy what you did today?",
    "You helped getting out a book?", "umm hum?",
    "Do you know what it is?", "What do you want?",
    "Who's there?", "Whose?", "Why do you want it?",
    "Want some?", "Where did it go?", "Was it fun?")

left_just(preprocessed(question_type(dat))[, c(2, 6)])

##    raw.text                             q.type
## 1  Kate's got no appetite doesn't she?  doesnt
## 2  Wanna tell Daddy what you did today? what
## 3  You helped getting out a book?       implied_do/does/did
## 4  Umm hum?                             unknown
## 5  Do you know what it is?              do
## 6  What do you want?                    what
## 7  Who's there?                         who
## 8  Whose?                               whose
## 9  Why do you want it?                  why
## 10 Want some?                           unknown
## 11 Where did it go?                     where
## 12 Was it fun?                          was

x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person)

scores(x)

##       person tot.quest    what    how   shall implied_do/does/did
## 1       greg         1       0      0       0             1(100%)
## 2 researcher         1       0      0 1(100%)                   0
## 3      sally         2  1(50%) 1(50%)       0                   0
## 4    teacher         1 1(100%)      0       0                   0
## 5        sam         0       0      0       0                   0

plot(scores(x), high="orange")

 


These are a few of the more recent developments in qdap. We would encourage readers to dig into the new vignettes and start using qdap for various Natural Language Processing tasks. If you have suggestions or find a bug you are welcome to:

  • submit suggestions and bug-reports at: https://github.com/trinker/qdap/issues
  • send a pull request on: https://github.com/trinker/qdap

  • For a complete list of changes see qdap’s NEWS.md

    Development Version
    github

