Text Mining – Term Frequency analysis and Word Cloud creation using the tm package


Analysing the pre-conference workshop sentiments

I think back to a post I put on the NHS-R community website but never published on my own site; having been doing some textual analysis recently, I found myself referencing it again. The workhorse here is the tm package, which is still one of the main text-mining packages in R, although it assumes a non-tidy format. If you can get past that, it is still really useful for text analysis.

Installing and loading the required packages

install_or_load_pack <- function(pack){
  # Identify any requested packages that are not yet installed
  create.pkg <- pack[!(pack %in% installed.packages()[, "Package"])]
  # Install the missing packages, with dependencies
  if (length(create.pkg))
    install.packages(create.pkg, dependencies = TRUE)
  # Load every requested package
  sapply(pack, require, character.only = TRUE)
}
packages <- c("ggplot2",  "data.table", "wordcloud", "tm", "wordcloud2","scales", "tidytext", "devtools", "twitteR", "caret", "magrittr", "RColorBrewer", "tidytext", "ggdendro","tidyr", "topicmodels", "SnowballC", "gtools")
install_or_load_pack(packages)

This function was previously covered in a blog post: https://nhsrcommunity.com/blog/a-simple-function-to-install-and-load-packages-in-r/.

Here I specify the main packages for dealing with sentiment and discourse analysis in R. Libraries such as tm, wordcloud and wordcloud2 are loaded for working with this type of data.

Choosing the file to import

The file to import is a prepared CSV file. Instead of hard-coding the path, I simply use:

path <- file.choose()  # opens a file-selection dialog

This is a handy base R function which opens a file-selection dialog from R.

From this dialog I select the CSV file I want to import. Once I have selected the file and hit Open, the path variable will be filled with the location of the file to work with.
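If the script also needs to run non-interactively, one sketch is to fall back to a hard-coded path; the file name below is hypothetical:

# Use the dialog when running interactively; otherwise fall back to a
# hard-coded location ("workshop_feedback.csv" is a hypothetical name)
path <- if (interactive()) file.choose() else "workshop_feedback.csv"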

Creating the R data frame

To create the data frame I can now pass the variable path to the read_csv command:

workshop_sentiment <- read_csv(path, col_names = TRUE)

This will read the textual data from the workshops into a data frame with two columns. The first relates to what the attendees enjoyed about the workshop and the second relates to improvements that could be made.
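A quick glimpse shows the expected structure; the column names (Highlights and Improvements) are the ones used throughout the rest of the post, and the output below is abbreviated:

# Check the structure of the imported feedback
dplyr::glimpse(workshop_sentiment)
# Rows: ...
# Columns: 2
# $ Highlights   <chr> ...
# $ Improvements <chr> ...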

Separating the master data frame

The master data frame now needs to be separated into two single-column data frames, as this style of text analysis works on one column of text with one row per response. Here I use dplyr, with magrittr pipes, to divide it into two new data frames:

ws_highlights <- workshop_sentiment %>%
   select(1)

# Copy for improvements

ws_improvements <- workshop_sentiment %>%
   select(2)

The ws_highlights data frame takes the first column and the ws_improvements data frame takes the second.
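A quick, illustrative check that the split has worked:

# Each new data frame should hold a single column of free-text responses
ncol(ws_highlights)    # 1
ncol(ws_improvements)  # 1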

Function to create textual corpus

As I want to replicate this for highlights and improvements, I have created a function that could be reused with any text analysis to create what is known as a text corpus (see: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf). A corpus is a collection of documents; in our case, each response is a document.

corpus_tm <- function(x){
  # Each element of x becomes one document in the corpus
  Corpus(VectorSource(x))
}

This function creates a corpus from whatever you pass to the x parameter, typically a column of text from a data frame. The VectorSource() function treats each element of the vector as a separate document in the corpus.
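As a minimal sketch with invented responses, each element of the vector becomes one document in the corpus:

# Two invented responses become two separate documents
toy_corpus <- corpus_tm(c("Great workshop", "Would attend again"))
length(toy_corpus)   # 2
inspect(toy_corpus)  # prints both documents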

Create Corpus for Highlights and Improvements data frame

Now the function has been created, I can simply pass the two separate data frames created before to produce two corpora:

corpus_positive <- corpus_tm(ws_highlights$Highlights)
corpus_improvements <- corpus_tm(ws_improvements$Improvements)

The code block above creates a corpus for the positive (highlights) responses and another for the improvements. Both corpora will now appear in your environment.

Function to clean data in the corpus

The most common cleaning tasks when working with text data are removing things like punctuation, common English words and so on. This is something I have to repeat many times when dealing with discourse analysis, so again a function is useful:

clean_corpus <- function(corpus_to_use){
  corpus_to_use %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace) %>%
    # Force the text encoding to UTF-8
    tm_map(content_transformer(function(x) iconv(x, to = 'UTF-8', sub = 'byte'))) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(content_transformer(tolower)) %>%
    # Custom words plus the stop words again, now the text is lower case
    tm_map(removeWords, c("etc", "ie", "eg", stopwords("english")))
}

The single parameter takes the corpus object created previously and passes it through the following transformations (a toy before-and-after example follows the list):

  • Remove punctuation
  • Collapse repeated whitespace into single spaces
  • Convert the underlying text encoding to UTF-8
  • Remove numbers
  • Remove common English words (stop words)
  • Convert the text to lower case
  • Remove a custom vector of words to adjust for abbreviations such as e.g., i.e. and etc. (the English stop words are removed again here, now that everything is lower case)
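As a quick illustration, here is the cleaning function applied to one invented, messy sentence (the exact output can vary slightly between tm versions):

# Before-and-after sketch on a single made-up response
demo_corpus <- corpus_tm("The workshop was GREAT, e.g. 10/10 for content!!")
demo_clean  <- clean_corpus(demo_corpus)
as.character(demo_clean[[1]])  # roughly: "workshop great content"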

To clean the corpus objects I simply pass the original corpus objects back through this function to perform cleaning:

corpus_positive <- clean_corpus(corpus_positive)
corpus_improvements <- clean_corpus(corpus_improvements)

Inspecting one of the corpora confirms that the cleaning has been applied:
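A quick way to do this, assuming the corpus holds at least three responses:

# Print the first few cleaned documents to check the result
inspect(corpus_positive[1:3])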

Create a TermDocumentMatrix to obtain frequent terms

The term-document matrix (explained well here: https://www.youtube.com/watch?v=dE10fBCDWQc) can be built from the corpus to identify frequent terms: each row is a term, each column is a document, and each cell counts how often that term appears in that document.
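To make the structure concrete, here is a toy matrix built from two invented sentences (by default tm lower-cases the text and drops words shorter than three characters, so the output should look roughly like the comments below):

# Toy term-document matrix: rows are terms, columns are documents
toy_tdm_corpus <- corpus_tm(c("great workshop", "great venue"))
as.matrix(TermDocumentMatrix(toy_tdm_corpus))
#           Docs
# Terms      1 2
#   great    1 1
#   venue    0 1
#   workshop 1 0

However, more code is needed to turn a real matrix into a ranked frequency table: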

find_freq_terms_fun <- function(corpus_in){
  # Build the term-document matrix from the corpus
  doc_term_mat <- TermDocumentMatrix(corpus_in)
  # List every term in the matrix, from the first to the last row
  freq_terms <- findFreqTerms(doc_term_mat)[1:max(doc_term_mat$nrow)]
  terms_grouped <- doc_term_mat[freq_terms,] %>%
    as.matrix() %>%
    # Sum across documents to get each term's total count
    rowSums() %>%
    data.frame(Term = freq_terms, Frequency = .) %>%
    arrange(desc(Frequency)) %>%
    # Each term's share of all term occurrences
    mutate(prop_term_to_total_terms = Frequency / sum(Frequency))
  return(data.frame(terms_grouped))
}

This function needs some explanation. It takes a single parameter, the corpus you pass in, and first creates the doc_term_mat variable using tm's TermDocumentMatrix() function.

Next, I use the findFreqTerms function to list the terms from the first entry up to the number of rows in the matrix. These two steps are the powerhouse of the function, as they record how many times each word has been used across all the rows of text.
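Note that findFreqTerms can also take a minimum count; for example, to keep only the terms used at least twice across all responses:

# Keep only terms appearing two or more times
findFreqTerms(TermDocumentMatrix(corpus_positive), lowfreq = 2)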

The terms_grouped variable then slices the term-document matrix down to the frequent terms; this is converted to a plain matrix, and the sum of each row is calculated, i.e. the total number of times each word appears. A data frame of the terms is then created with the headings Term and Frequency.

Next, we use the power of dplyr to arrange by frequency in descending order and to add a mutated column calculating each term's proportion of all term occurrences. The return(data.frame(terms_grouped)) statement then returns the results of the function.

I then pass the two corpora (highlights and improvements) to the function I have just created to check the method works:

positive_freq_terms <- data.frame(find_freq_terms_fun(corpus_positive))
improvement_freq_terms <- data.frame(find_freq_terms_fun(corpus_improvements))

These will be built as data frames and can be viewed in RStudio's Data environment window.
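A quick head() call also shows the top rows without opening the viewer; the columns are Term, Frequency and prop_term_to_total_terms, as built by the function above:

# Peek at the most frequent highlight terms
head(positive_freq_terms)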

This has worked just as expected. You could now use ggplot2 to produce a bar chart or Pareto chart of the terms.
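A minimal sketch of such a chart, using the Term and Frequency columns created above:

# Bar chart of the ten most frequent highlight terms
library(ggplot2)
positive_freq_terms %>%
  head(10) %>%
  ggplot(aes(x = reorder(Term, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +   # horizontal bars are easier to read for terms
  labs(x = NULL, y = "Frequency", title = "Top 10 workshop highlight terms")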

Create a Word Cloud with the wordcloud2 package

R has a wordcloud package that produces relatively nice-looking word clouds, but wordcloud2 surpasses it in terms of visualisation. Using the function is easy now that I have the frequent terms data frames; with the highlights data frame it can be called with the syntax below:

wordcloud2(positive_freq_terms[, 1:2], shape = "pentagon",
           color = "random-dark")

To use the function I pass only the Term and Frequency fields of the data frame to the visualisation. There are a number of other options, which can be viewed with help("wordcloud2"). Here I use the shape and color parameters to set the display of the word cloud.
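As a variation, the same data can be rendered with a different shape and palette; all of the parameters below are documented wordcloud2 options:

# Same data with a circular layout, lighter palette and smaller text
wordcloud2(positive_freq_terms[, 1:2], shape = "circle",
           size = 0.7, color = "random-light", backgroundColor = "white")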

The result can be exported from the RStudio Viewer window using the Export function.

This word cloud relates to the pre-conference workshop in 2018. I personally thought the NHS-R conference was amazing and I was honoured to have a spot to speak amongst so many other brilliant R users.

R is so versatile – every day is like a school day when you are learning it, but what a journey.

Final thoughts

I would probably use the tidytext package now, but prior to Julia Silge and David Robinson's tidytext, the tm package was the main way to get this done in R.

However, I would recommend getting to grips with converting to and from tidy formats, as there is lots you can do with the tm package.
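As a sketch of that round trip, tidytext provides tidy() for term-document matrices and cast_tdm() to go back the other way:

# Tidy the term-document matrix: one row per term/document pair
library(tidytext)
library(dplyr)
tidy_terms <- tidy(TermDocumentMatrix(corpus_positive))
tidy_terms %>% count(term, wt = count, sort = TRUE)  # tidy-style term totals

# And cast the tidy data back to a tm TermDocumentMatrix
tdm_again <- tidy_terms %>% cast_tdm(term, document, count)

Being able to hop between the two representations means you can mix tm's cleaning tools with tidyverse summaries as needed.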
