Using Partial Least Squares to conduct relative importance analysis in Displayr

June 15, 2017

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Partial Least Squares can be used to find drivers of consumer preference

Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable.

There are a number of different approaches to relative importance analysis, including Relative Weights and Shapley Regression, as described here and here. In this blog post I briefly describe an alternative method: Partial Least Squares. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables exceeds the number of observations.
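To make the point about having more predictors than observations concrete, here is a minimal sketch on simulated data. The sizes n and p, the variable names and the data themselves are invented for illustration and have nothing to do with the brand data set used later. Ordinary least squares cannot estimate every coefficient in this situation, whereas plsr from the pls package returns a coefficient for each predictor.

```r
library(pls)

set.seed(1)
n <- 10   # observations (hypothetical)
p <- 20   # predictors (hypothetical), deliberately more than n

# Simulated predictors; only the first two actually drive the outcome
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.1)
sim <- data.frame(y = y, X)

# Ordinary least squares: some coefficients cannot be estimated and come back NA
ols <- lm(y ~ ., data = sim)
n.missing <- sum(is.na(coef(ols)))

# PLS compresses the predictors into 2 components first, so every predictor gets a coefficient
fit <- plsr(y ~ ., data = sim, ncomp = 2)
pls.coef <- coef(fit)

n.missing          # number of coefficients lm could not estimate
length(pls.coef)   # number of coefficients returned by plsr
```

The choice of 2 components here is arbitrary; picking that number by cross-validation is covered later in the post.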

Partial Least Squares

PLS is a dimension reduction technique with some similarity to principal component analysis. The predictor variables are mapped to a smaller set of variables, and within that smaller space we perform a regression against the outcome variable. In contrast to principal component analysis, where the dimension reduction ignores the outcome variable, the PLS procedure aims to choose new mapped variables that maximally explain the outcome variable.
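The contrast with principal component analysis can be sketched on simulated data. In the example below (all names and numbers are made up for illustration), two high-variance predictors are unrelated to the outcome and one low-variance predictor drives it. I fit a one-component principal component regression with pcr and a one-component PLS model with plsr, both from the pls package, and compare how strongly each first component correlates with the outcome.

```r
library(pls)

set.seed(123)
n <- 1000
sim <- data.frame(
  noise1 = rnorm(n, sd = 3),   # high variance, unrelated to y
  noise2 = rnorm(n, sd = 3),   # high variance, unrelated to y
  signal = rnorm(n, sd = 1)    # low variance, but drives y
)
sim$y <- sim$signal + rnorm(n, sd = 0.1)

# One-component models: PCR picks its component without looking at y, PLS uses y
pcr.fit <- pcr(y ~ ., data = sim, ncomp = 1)
pls.fit <- plsr(y ~ ., data = sim, ncomp = 1)

# Correlation between each model's first component score and the outcome
pcr.cor <- cor(scores(pcr.fit)[, 1], sim$y)
pls.cor <- cor(scores(pls.fit)[, 1], sim$y)
c(PCA = abs(pcr.cor), PLS = abs(pls.cor))
```

Because the first principal component chases the high-variance noise columns, it should tell us little about y, while the first PLS component should track the signal column closely. The data were constructed to exaggerate the difference; on real data the gap is usually less dramatic.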

Loading the example data

First I’ll add some data with Insert > Data Set > URL and paste in this link:

Dragging Brand preference onto the page from the Data tree on the left produces a table showing the breakdown of the respondents by category. This includes a Don’t Know category that doesn’t fit in the ordered scale from Love to Hate. To remove Don’t Know, I click on Brand preference in the Data tree on the left and then click on Value Attributes. Changing Missing Values for the Don’t Know category to Exclude from analyses produces the table below.

Creating the PLS model

Partial least squares is easy to run with a few lines of code. Select Insert > R Output and enter the following snippet of code into the R CODE box:

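dat = data.frame(pref, Q5r0, Q5r1, Q5r2, Q5r3, Q5r4, Q5r5, Q5r6, Q5r7, Q5r8,
                  Q5r9, Q5r10, Q5r11, Q5r12, Q5r13, Q5r14, Q5r15, Q5r16, Q5r17,
                  Q5r18, Q5r19, Q5r20, Q5r21, Q5r22, Q5r23, Q5r24, Q5r25, Q5r26,
                  Q5r27, Q5r29, Q5r28, Q5r30, Q5r31, Q5r32, Q5r33)

# Load the modelling package and the two helper packages
library(pls)
library(flipFormat)
library(flipTransformations)
# Convert the categorical variables to numeric
dat = AsNumeric(ProcessQVariables(dat), binary = FALSE, remove.first = FALSE)
# Fit the PLS model with cross-validation
pls.model = plsr(pref ~ ., data = dat, validation = "CV")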

The first line selects pref as the outcome variable (strength of preference for a brand) and then adds 34 predictor variables, each indicating whether the respondent perceives the brand to have a particular characteristic. These variables can be dragged across from the Data tree on the left.

Next, the three libraries containing useful functions are loaded. The pls package contains the function that estimates the PLS model, and our own publicly available packages, flipFormat and flipTransformations, provide functions that help us transform and tidy the data. Since the R pls package requires numeric inputs, I convert the variables from categorical to numeric.

In the final line above the plsr function does the work and creates pls.model.
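As a rough illustration of the categorical-to-numeric conversion mentioned above, here is a base R sketch. It does not use flipTransformations. I am assuming that AsNumeric with binary = FALSE behaves broadly like this, producing one numeric column per variable rather than a set of indicator columns; the package documentation is the authority on its exact behavior. The five level names are hypothetical, since the article only names the end points Love and Hate.

```r
# Hypothetical five-point scale, ordered from least to most favorable
scale.levels <- c("Hate", "Dislike", "Neutral", "Like", "Love")

# A few made-up responses stored as an ordered factor
pref.example <- factor(c("Love", "Like", "Hate", "Neutral", "Love"),
                       levels = scale.levels, ordered = TRUE)

# as.numeric() on a factor returns the position of each response within the levels
pref.numeric <- as.numeric(pref.example)
pref.numeric
```

The numeric codes depend entirely on the order of the levels, which is one reason why a category such as Don’t Know needs to be excluded before the conversion.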

Automatically Selecting the Dimensions

The following few lines find the optimal number of dimensions and then refit the model with that number:

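# Find the number of dimensions with the lowest cross-validation error
cv = RMSEP(pls.model)
# Subtract 1 because the first entry is the intercept-only model (zero components)
best.dims = which.min(cv$val[estimate = "adjCV", , ]) - 1
# Rerun the model with that number of components
pls.model = plsr(pref ~ ., data = dat, ncomp = best.dims)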
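The selection step can be tried out on simulated data, as in the sketch below. The data, the sizes n and p and all variable names are invented. The sketch pulls out the adjusted cross-validation errors as a plain vector, whose first element belongs to the intercept-only model, and picks the number of components with the lowest error. The max() guard is my own addition and is not in the code above: it covers the case where the intercept-only model has the lowest error, which would otherwise ask for zero components.

```r
library(pls)

set.seed(42)
n <- 60   # observations (hypothetical)
p <- 8    # predictors (hypothetical)
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - X[, 2] + rnorm(n)
sim <- data.frame(y = y, X)

# Fit with cross-validation so that RMSEP can report out-of-sample error
fit <- plsr(y ~ ., data = sim, validation = "CV")
cv <- RMSEP(fit)

# Adjusted CV error for the single response, one value per model size;
# the first element is the intercept-only model
adj.cv <- cv$val["adjCV", 1, ]

# Position of the minimum, minus 1 to convert position into number of components
best <- unname(which.min(adj.cv)) - 1
best <- max(best, 1)   # guard added for this sketch: keep at least one component

# Refit with the chosen number of components
final.fit <- plsr(y ~ ., data = sim, ncomp = best)
best
```

Cross-validation splits the data randomly, so the chosen number of components can change from run to run unless a seed is set.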

Producing the Output

Finally, we extract the useful information and format the output:

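# Extract the regression coefficients (one per predictor)
coefficients = coef(pls.model)
# Rescale so that the absolute values sum to 100
sum.coef = sum(abs(coefficients))
coefficients = coefficients * 100 / sum.coef
# Attach tidy labels (dropping the outcome, pref) and sort from most positive to most negative
names(coefficients) = TidyLabels(Labels(dat)[-1])
coefficients = sort(coefficients, decreasing = TRUE)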

The regression coefficients are normalized so that their absolute values sum to 100. The labels are then added and the result is sorted in decreasing order.
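The normalization is easy to check by hand on a small vector. The coefficient values and the names attrA to attrD below are made up purely for illustration and are not results from the brand data.

```r
# Hypothetical raw coefficients, not taken from the model above
raw <- c(attrA = 0.30, attrB = 0.10, attrC = -0.20, attrD = -0.40)

# Divide by the sum of absolute values (1.0 here) and scale to 100
scaled <- raw * 100 / sum(abs(raw))

# Signs are preserved, so negative drivers stay negative after scaling
sorted <- sort(scaled, decreasing = TRUE)
sorted
```

Each score can then be read as a signed share of the total importance: in this toy example attrD accounts for 40 of the 100 points and acts in the negative direction.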

The results below show that Reliable and Fun are positive predictors of preference, Unconventional and Sleepy are negative predictors and Tough has little relevance.

You can perform this analysis for yourself in Displayr.
