How to create unigrams, bigrams and n-grams of App Reviews

[This article was first published on Programming with R, and kindly contributed to R-bloggers.]

This is one of the questions I hear most often from first-time NLP / text-analytics programmers (or, as the world likes to call them, "Data Scientists").

Prerequisite

For simplicity, this post assumes that you already know how to install an R package, so you've got tidytext installed on your machine.

install.packages("tidytext")

Loading the Library

Let’s start with loading the tidytext library.

library(tidytext)

Extracting App Reviews

We’ll use the R package itunesr for downloading iOS app reviews, on which we’ll perform simple text analysis (unigrams, bigrams, n-grams). The getReviews() function of itunesr helps us extract reviews of the Medium iOS app.

library(itunesr)
library(tidyverse)

# Extracting Medium iOS App Reviews
medium <- getReviews("828256236","us",1)

Overview of the Extracted App Reviews

As usual, we’ll start with a look at the head of the dataframe.

head(medium) 
##                                     Title
## 1                         Great source...
## 2                              I love it!
## 3 Medium Provide wide variety of articles
## 4                        A bargain at 50$
## 5                                 Awesome
## 6                             Love Medium
##                                        Author_URL     Author_Name
## 1  https://itunes.apple.com/us/reviews/id14871198 Helpful Program
## 2 https://itunes.apple.com/us/reviews/id622727268   tacos are lit
## 3 https://itunes.apple.com/us/reviews/id124091445   Anjan12344321
## 4 https://itunes.apple.com/us/reviews/id105720950       Judster64
## 5  https://itunes.apple.com/us/reviews/id39489978          jalton
## 6  https://itunes.apple.com/us/reviews/id26999143   girlbakespies
##   App_Version Rating
## 1        3.89      5
## 2        3.89      5
## 3        3.89      5
## 4        3.89      5
## 5        3.89      4
## 6        3.88      5
##                                                                                                                                                                                                                                                                                                Review
## 1                                                                                                                                                                                                                                            Great source for top content and food for mind and soul.
## 2                                                                                                                                                                                                                                                                                                ⠀⠀⠀⠀
## 3 I am feeling happy about Medium yearly subscription, Each penny os worth. Medium provides wide range of articles. I really like some of the authors! I am trying to start writing my own articles, this is the best forum to express your opinions and based on feedback you can improve your self.
## 4                                                                                                                                                                                                                                  The most interesting articles at your fingertips. No ads. Love it.
## 5                                                                                                                                                                                                                     Just need to be able to bookmark without crashing the app and it’ll be 5 stars.
## 6                                                                                                I am on my second month.I am getting back into writing again and Medium is a brilliant community of writers. I Highly recommend it for entertainment and an outanding information resource #READMORE
##                  Date
## 1 2019-08-04 15:09:50
## 2 2019-08-04 10:04:59
## 3 2019-08-03 03:10:22
## 4 2019-08-01 14:40:14
## 5 2019-07-31 23:56:41
## 6 2019-07-31 03:15:44

Now we know that there are two text columns of interest: Title and Review.

To make our n-gram analysis a bit more meaningful, we’ll extract only the positive (5-star) reviews to see what good things people are writing about the Medium iOS app. To make better sense of the filter we have to use, let’s look at the split of Rating.

table(medium$Rating)
## 
##  1  3  4  5 
##  5  5  5 34

So, 5-star reviews are the major component of the reviews we extracted, and we’re good to go filtering on them. We’ll pick Review and keep only the rows where Rating == 5. Since tidytext needs a dataframe (or tibble) to work on, we’ll put these 5-star reviews in a column of a new dataframe.

reviews <- data.frame(txt = medium$Review[medium$Rating==5],
                      stringsAsFactors = FALSE)

Tokens

Tokenization in NLP is the process of splitting a text corpus by some splitting unit: it could be word tokens, sentence tokens, or the output of some more advanced algorithm that splits a conversation. Here, we’ll simply do word tokenization.

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  head()
##        word
## 1     great
## 1.1  source
## 1.2     for
## 1.3     top
## 1.4 content
## 1.5     and

As you can see above, unnest_tokens() is the function that helps us in this tokenization process. Since it supports the %>% pipe operator, its first argument is a dataframe or tibble; the second argument, output, is the name of the new column where the tokenized words will be put; and the third argument, input, is the column the input text is read from.
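As an aside, the sentence tokenization mentioned earlier uses the same function; a minimal sketch on the same reviews dataframe:

```r
# Tokenize the reviews into sentences instead of words
reviews %>% 
  unnest_tokens(output = sentence, input = txt, token = "sentences") %>% 
  head()
```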

These, then, are the unigrams of the Medium iOS app reviews. As with many other data science projects, data like this isn’t useful until it’s visualized in a way that reveals insights.

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  count(word, sort = TRUE) 
## # A tibble: 444 x 2
##    word         n
##    <chr>    <int>
##  1 the         45
##  2 i           35
##  3 and         34
##  4 of          27
##  5 to          27
##  6 a           18
##  7 it          14
##  8 medium      14
##  9 this        13
## 10 articles    12
## # … with 434 more rows

Looking at the most frequent unigrams, we end up with the, i, and and. This is one of those places where we need to remove stopwords.

Stopword Removal

Fortunately, tidytext helps us remove stopwords by providing a dataframe of stopwords drawn from multiple lexicons. With that, we can use anti_join to keep only the words that are present in the left dataframe (the tokenized reviews) but not in the right one (stop_words).

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) 
## Joining, by = "word"
## # A tibble: 280 x 2
##    word         n
##    <chr>    <int>
##  1 medium      14
##  2 articles    12
##  3 app          9
##  4 reading      9
##  5 content      6
##  6 love         5
##  7 read         5
##  8 article      4
##  9 enjoy        4
## 10 i’ve         4
## # … with 270 more rows

With the stop words removed, we now see a much better representation of the most frequently appearing unigrams in the reviews.
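Incidentally, you can see which lexicons the bundled stop_words dataframe draws on (tidytext ships the SMART, snowball, and onix lists) with a quick one-liner:

```r
# Count how many stop words each lexicon contributes
table(stop_words$lexicon)
```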

Unigram Visualization

We’ve got our data in the shape we want, so let’s go ahead and visualize it. To keep the pipeline intact, I’m not creating any temporary object to store the previous output; I’ll simply continue the same chain. Also, too many bars (words) wouldn’t make sense (except to produce a shabby plot), so we’ll filter down to the top 10 words.

reviews %>% 
  unnest_tokens(output = word, input = txt) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  slice(1:10) %>% 
  ggplot() + geom_bar(aes(word, n), stat = "identity", fill = "#de5833") +
  theme_minimal() +
  labs(title = "Top unigrams of Medium iOS App Reviews",
       subtitle = "using Tidytext in R",
       caption = "Data Source: itunesr - iTunes App Store")
## Joining, by = "word"

Bigrams & N-grams

Now that we’ve got the core code for unigram visualization set up, we can slightly modify it to extract n-grams: just add the arguments token = "ngrams" and n = 2 to the tokenization step. Use n = 2 for bigrams, n = 3 for trigrams, or whatever n interests you. But remember, large n values may not be as useful as smaller ones.

Doing this naively has a catch, though: the stop-word removal we used above relied on anti_join, which won’t work here, since each token is now a bigram (a two-word combination separated by a space). So, we’ll separate the bigram into its two words, filter out the stop words in both word1 and word2, and then unite them back, which gives us the bigrams after stop-word removal. This is the process you’ll typically have to carry out when dealing with n-grams.

reviews %>% 
  unnest_tokens(word, txt, token = "ngrams", n = 2) %>% 
  separate(word, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  unite(word,word1, word2, sep = " ") %>% 
  count(word, sort = TRUE) %>% 
  slice(1:10) %>% 
  ggplot() + geom_bar(aes(word, n), stat = "identity", fill = "#de5833") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Top Bigrams of Medium iOS App Reviews",
       subtitle = "using Tidytext in R",
       caption = "Data Source: itunesr - iTunes App Store")

Summary

This particular exercise may not reveal many meaningful insights, since we started with little data, but the technique is really useful when you have a decent-sized text corpus: simple unigram and bigram (n-gram) analysis can reveal something business-worthy, say in customer service, app development, or many other use cases.
