When I was working on a large survey among people with spinal cord injuries/disorders, the survey designers decided to assess the exact details of the respondent’s spinal injury or disorder with an open-ended question so that people could describe it “in their own words.” As you might have guessed, the data were mostly meaningless and as such unusable. But many hypotheses and research questions dealt with looking at differences between people who sustained an injury and those with a disorder affecting their spinal cord. We couldn’t even begin to test or examine any of those in the course of our study. It was unbelievably frustrating, because we could have gotten the information we needed with a single question and some categorical responses. We could have then asked people to supplement their answer with the open-ended question. Most would skip it, but we’d have the data we needed to answer our questions.
In my job, I inherited the results of a large survey involving a variety of dental professional populations. Once again, certain items that could have been addressed with a few closed-ended questions were instead asked as open-ended questions, and not many of the responses are useful. The item that inspired this blog post assessed the types of products dental assistants are involved in purchasing, which can include anything from office supplies to personal protective equipment to large equipment (X-ray machine, etc.). Everyone had a different way of articulating what they were involved with purchasing, some simply saying “all dental supplies” or “everything,” while others gave more specific details. In total, about 400 people responded to this item, which is a lot of data to dig through. But thanks to my new experience with text mining in R, I was able to make some sense of the responses. Mind you, a lot of the responses still can’t be understood, but it’s at least something.
I decided I could probably begin to categorize responses by slicing the responses up into the individual words. I can generate counts overall as well as by respondent, and use this to begin examining and categorizing the data. Finally, if I need some additional context, I can ask R to give me all responses that contain a certain word.
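As a rough illustration of those three steps, here is a minimal sketch in base R. The responses are made up for the example (they are not from the survey), and the splitting rule is a simplification of what a proper tokenizer does.

```r
# Hypothetical free-text responses; the real survey data are not shown here.
responses <- data.frame(
  id = 1:4,
  text = c("All dental supplies",
           "Gloves, masks and office supplies",
           "Everything",
           "X-ray equipment and gloves"),
  stringsAsFactors = FALSE
)

# Lowercase each response and split it on anything that is not a letter or digit.
tokens <- strsplit(tolower(responses$text), "[^a-z0-9]+")

# One row per respondent-word pair, dropping any empty strings left by the split.
words <- data.frame(
  id = rep(responses$id, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
words <- words[words$word != "", ]

# Overall counts, most frequent first.
overall <- sort(table(words$word), decreasing = TRUE)

# Counts by respondent, as a respondent-by-word table.
by_resp <- table(words$id, words$word)

# All responses that contain a given word.
with_gloves <- responses[grepl("\\bgloves\\b", responses$text, ignore.case = TRUE), ]
```

Printing `overall`, `by_resp`, and `with_gloves` shows the three views: which words come up most, who used them, and the full responses behind a word of interest.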
Because I don’t own these data, I can’t share them on my blog. But I can demonstrate what I did with different text data. To do this, I’ll use one of my all-time favorite books, The Wonderful Wizard of Oz by L. Frank Baum, which is available through Project Gutenberg. The gutenbergr package will let me download the full text.
library(gutenbergr)
gutenberg_works(title == "The Wonderful Wizard of Oz")
## # A tibble: 1 x 8
##   gutenberg_id title   author   gutenberg_autho~ language gutenberg_books~
## 1           55 The Wo~ Baum, L~               42 en       Children's Lite~
## # ... with 2 more variables: rights, has_text
Now we have the gutenberg_id, which will be used to download the full text into a data frame.
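# Download the full text by its Project Gutenberg ID
WOz <- gutenberg_download(gutenberg_id = 55)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

# Use the row number as a line ID (it stands in for a respondent ID)
WOz$line <- row.names(WOz)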
To start exploring the data, I'll need to use the tidytext package to unnest tokens (words) and remove stop words that don't tell us much.
library(tidyverse)
library(tidytext)

tidy_oz <- WOz %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
Now I can begin to explore my "responses" by generating overall counts and even counts by respondent. (For the present example, I'll use my line number variable instead of the respondent id number I used in the actual dataset.)
word_counts <- tidy_oz %>%
  count(word, sort = TRUE)

resp_counts <- tidy_oz %>%
  count(line, word, sort = TRUE) %>%
  ungroup()
When I look at my overall counts, I see that the most frequently used words are the names of the main characters of the book. So from this, I could generate a category "characters." Words like "Kansas", "Emerald" and "City" are also common, and I could create a category called "places." Finally, "heart" and "brains" are common, so a category could be created to encompass what the characters are seeking. Obviously, this might not be true for every instance. It could be that someone "didn't have the heart" to tell someone something. I can try to separate out those instances by looking at the original text.
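One way to turn those categories into something countable is a hand-built lookup table joined to the word counts. The sketch below uses base R with made-up counts and a hypothetical coding scheme, so the numbers and category names are illustrative only.

```r
# Toy word counts standing in for real output; the numbers are made up.
word_counts <- data.frame(
  word = c("dorothy", "scarecrow", "emerald", "kansas", "heart", "brains", "road"),
  n = c(9, 7, 4, 3, 5, 5, 6),
  stringsAsFactors = FALSE
)

# Hand-built lookup table assigning words to categories (hypothetical coding scheme).
categories <- data.frame(
  word = c("dorothy", "scarecrow", "emerald", "kansas", "heart", "brains"),
  category = c("characters", "characters", "places", "places", "quests", "quests"),
  stringsAsFactors = FALSE
)

# Left join: keep every word, with NA for words that have no category yet.
coded <- merge(word_counts, categories, by = "word", all.x = TRUE)

# Total mentions per category; rows with a missing category are dropped by aggregate().
category_totals <- aggregate(n ~ category, data = coded, FUN = sum)

# Words still waiting for a category.
uncoded <- coded$word[is.na(coded$category)]
```

Keeping the uncoded words in view makes it easy to grow the lookup table a few words at a time instead of trying to categorize everything up front.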
heart <- WOz[grep("heart", WOz$text), ]
head(heart)
## # A tibble: 6 x 3
##   gutenberg_id text                                                    line
## 1           55 happiness to childish hearts than all other human cr~  48
## 2           55 the heartaches and nightmares are left out.            62
## 3           55 and press her hand upon her heart whenever Dorothy's~  110
## 4           55 strange people. Her tears seemed to grieve the kind~   375
## 5           55 Dorothy ate a hearty supper and was waited upon by t~  517
## 6           55 She ate a hearty breakfast, and watched a wee Munchk~  545
Unfortunately, this got me any line containing the string "heart" anywhere, including words like "hearts," "hearty" and "heartaches." Let's rewrite that command to request a whole-word match by wrapping the pattern in word boundaries (\\b).
heart <- WOz[grep("\\bheart\\b", WOz$text), ]
head(heart)
## # A tibble: 6 x 3
##   gutenberg_id text                                                    line
## 1           55 and press her hand upon her heart whenever Dorothy's~  110
## 2           55 "\"Do you suppose Oz could give me a heart?\""         935
## 3           55 brains, and a heart also; so, having tried them both~  975
## 4           55 "rather have a heart.\""                               976
## 5           55 grew to love her with all my heart. She, on her par~   992
## 6           55 now no heart, so that I lost all my love for the Mun~  1024
Now I can use this additional context to determine which instances of "heart" are about the Tin Woodman's quest.
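To see the difference between the two patterns side by side, here is a small base R sketch on made-up strings (not lines from the book). It adds ignore.case = TRUE, which the commands above do not use, so that a capitalized "Heart" is matched too.

```r
# Made-up snippets for illustration only.
snippets <- c("She has a kind heart.",
              "They ate a hearty breakfast.",
              "No more heartaches.",
              "Sweethearts for life.",
              "Heart of gold")

# Plain substring match: hits every snippet that contains "heart" anywhere.
loose <- grepl("heart", snippets, ignore.case = TRUE)

# Whole-word match: \\b marks the edge between a word character and a non-word character.
exact <- grepl("\\bheart\\b", snippets, ignore.case = TRUE)

# Compare the two results next to the text.
comparison <- data.frame(snippets, loose, exact, stringsAsFactors = FALSE)
```

In `comparison`, the loose pattern flags all five snippets, while the whole-word pattern keeps only the first and the last.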
It seems like these tools would be most useful for open-ended responses that are still fairly categorical, but they could also help with identifying themes in more narrative data.