Text featurization with the Microsoft ML package

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text, but in this case you have a lot more control over the features used to represent the text.

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews of 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star reviews in the dataset.) Here's one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It's great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indispensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract n-grams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use the resulting n-gram features (weighted by TF-IDF in the code below) as predictors in a linear model:

transformRule = list(
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    # ngramLength = 2: include not only "Azure" and "AD", but also "Azure AD"
    # skipLength = 1: allow up to one skipped token inside an n-gram,
    #   so "Azure Active Directory" also yields "Azure Directory"
    wordFeatureExtractor = ngramCount(
      weighting = "tfidf",
      ngramLength = 2,
      skipLength = 1),
    language = "English"
  ),
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)
  )
)

# Train the model, applying the transforms defined above
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression" # numeric regression on the rating, not binary classification
)
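
To make the n-gram settings a little more concrete, here is a minimal base-R sketch of what extracting one- and two-word sequences with a skip length of 1 might look like. The helper skip_bigrams is a hypothetical name invented for this illustration; it is not part of MicrosoftML, and its whitespace tokenization and lower-casing are simplifying assumptions rather than a description of how featurizeText works internally.

```r
# Hypothetical base-R helper, for illustration only (not a MicrosoftML function).
# Returns every single token, plus every pair of tokens separated by at most
# `skip` intervening tokens, joined with "|".
skip_bigrams <- function(text, skip = 1) {
  # Simplifying assumption: lower-case and split on whitespace
  tokens <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  n <- length(tokens)
  pairs <- character(0)
  for (i in seq_len(n)) {
    for (gap in 0:skip) {
      j <- i + 1 + gap  # index of the second token in the pair
      if (j <= n) {
        pairs <- c(pairs, paste(tokens[i], tokens[j], sep = "|"))
      }
    }
  }
  c(tokens, pairs)
}

grams <- skip_bigrams("Azure Active Directory")
print(grams)
```

With skip = 1 the result includes "azure|directory" alongside the adjacent pairs; with skip = 0 only adjacent pairs are produced.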

We can then look at the coefficients associated with these n-gram features to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:

boring       -7.647399
waste        -7.537471
not          -6.355953
nothing      -6.149342
money        -5.386262
bad          -5.377981
no           -5.210301
worst        -5.051558
poorly       -4.962763
disappointed -4.890280

Similarly, the top 10 words or word-pairs associated with a positive rating are:

will      3.073104
the|best  3.265797
love      3.290348
life      3.562267
wonderful 3.652950
,|and     3.762862
you       3.889580
excellent 3.902497
my        4.454115
great     4.552569
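
Ranked lists like the two above come from sorting the coefficients. As a rough sketch, assume the coefficients are available as a named numeric vector. Here it is called coefs (a hypothetical name) and is filled by hand with a few of the values listed above; how the vector is pulled out of the fitted model is not shown.

```r
# A handful of the coefficients listed above, in arbitrary order.
# `coefs` is a hypothetical name used only for this illustration.
coefs <- c(great = 4.552569, boring = -7.647399, my = 4.454115,
           waste = -7.537471, excellent = 3.902497, not = -6.355953)

most_negative <- head(sort(coefs), 3)                     # smallest first
most_positive <- head(sort(coefs, decreasing = TRUE), 3)  # largest first

print(most_negative)
print(most_positive)
```

Sorting in both directions keeps the strongest terms at the top of each list, which is usually the easiest way to read them.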

Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function:

sentimentScores <- rxFeaturize(
  data = data,
  mlTransforms = getSentiment(
    vars = list(SentimentScore = "REVIEW_TEXT")
  )
)

As we would expect, a negative sentiment (in the 0-0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment (0.5-1.0) is associated with the 4- and 5-star reviews.

[Figure: sentiment boxplots]
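
One simple way to check this relationship is to turn both the sentiment score and the star rating into a negative/positive label and see how often they agree. The sketch below uses two hypothetical helpers, sentiment_label and rating_label, and a few made-up scores and ratings purely for illustration; treating a score of exactly 0.5 as positive is an assumption, since the ranges above do not say which side it falls on.

```r
# Hypothetical helpers, for illustration only.
# Assumption: a score of exactly 0.5 counts as positive.
sentiment_label <- function(score) ifelse(score < 0.5, "negative", "positive")

# 1- and 2-star reviews are negative, 4- and 5-star reviews are positive
# (there are no 3-star reviews in the dataset).
rating_label <- function(rating) ifelse(rating <= 2, "negative", "positive")

# Made-up values, not taken from the dataset
scores  <- c(0.12, 0.91, 0.48, 0.77)
ratings <- c(1, 5, 4, 4)

# Share of reviews where the two labels agree
agreement <- mean(sentiment_label(scores) == rating_label(ratings))
print(agreement)
```

On real data, the scores would come from the SentimentScore column produced by getSentiment rather than a hand-typed vector.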

You can find more details on this analysis, including the Microsoft R code, at the link below.

Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)
