[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers].

Partial Least Squares Regression:

This week I will be doing some consulting around Structural Equation Modeling (SEM) techniques to solve a unique business problem.  We are trying to identify customer preference for various products, and traditional regression is not adequate because of the high dimensionality of the data set along with the multicollinearity of the variables.  Partial Least Squares (PLS) is a powerful and effective method for handling these sorts of problematic data sets.

Principal components regression is one option we will explore, but in doing background research I have found that PLS may be a better option.  We will look at both PLS regression and PLS path analysis.  I don’t believe traditional SEM will be of value at this point, as we don’t have a good enough feel for, or theory about, the latent structure to make assumptions about it.  Also, because of the number of variables in the data set, we would be stretching SEM techniques to the limit.  An interesting discussion of this limitation can be found in Haenlein, M. & Kaplan, A., 2004, “A Beginner’s Guide to Partial Least Squares Analysis”, Understanding Statistics, 3(4), 283-297.
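To make the motivation concrete, here is a minimal base-R sketch of why ordinary regression breaks down when there are more (and highly correlated) predictors than observations, and how principal components regression sidesteps the problem.  The data are simulated purely for illustration (they are not the consulting data), and the choice of 3 components is an assumption.

```r
# Simulated data (an assumption for illustration only):
# 30 observations, 40 strongly correlated predictors driven by 3 hidden factors.
set.seed(42)
n <- 30; p <- 40
latent <- matrix(rnorm(n * 3), n, 3)
X <- latent %*% matrix(rnorm(3 * p), 3, p) + matrix(rnorm(n * p, sd = 0.1), n, p)
y <- as.numeric(latent %*% c(2, -1, 0.5) + rnorm(n, sd = 0.2))

# Ordinary least squares: more coefficients than observations,
# so lm() cannot estimate all of them and reports NA for the excess.
ols <- lm(y ~ X)
n_unestimated <- sum(is.na(coef(ols)))

# Principal components regression: regress y on the first 3 component scores.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]
pcr <- lm(y ~ scores)
pcr_r2 <- summary(pcr)$r.squared
```

Note that the principal components are chosen without looking at y at all; PLS differs in that it picks components for their covariance with the response, which is one reason it may be the better option here.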

Of course, I want to do this in R, and a couple of packages exist.  My favorites come from the researcher Gaston Sanchez, who also provides an ebook and tutorials on his website, gastonsanchez.com.

# load the plsdepot package and a dataset
> library(plsdepot)
Warning message:
package ‘plsdepot’ was built under R version 3.0.1
> data(vehicles)

> names(vehicles)
 [1] "diesel"      "turbo"       "two.doors"   "hatchback"   "wheel.base"
 [6] "length"      "width"       "height"      "curb.weight" "eng.size"
[11] "horsepower"  "peak.rpm"    "price"       "symbol"      "city.mpg"
[16] "highway.mpg"

This data has 16 variables and 30 observations.  It is included in the plsdepot package.

One of the interesting things about PLS regression is you can have multiple response variables and plsdepot can accommodate that type of analysis.  In this case, I just want to analyze one Y variable and that will be price.

One of the quirks of the package is that the predictors and responses need to be separated, i.e. the response variable column(s) should go at the end of your data frame.  To do that I simply run this elegant bit of code I found somewhere:
# put the variable price (column 13) at the end
cars = vehicles[ ,c(1:12,14:16,13)]
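If you would rather not hard-code column positions, the same reordering can be done by name.  The sketch below uses a tiny made-up data frame (the `toy` object and its values are hypothetical stand-ins for `vehicles`) so that it runs without the package.

```r
# Toy data frame standing in for `vehicles` (hypothetical columns and values)
toy <- data.frame(width    = c(64.1, 66.2),
                  price    = c(13495, 16500),
                  city.mpg = c(21, 19))

# Move the response to the end by name rather than by position
response  <- "price"
reordered <- toy[, c(setdiff(names(toy), response), response)]
names(reordered)
```

This keeps working if columns are later added to or dropped from the data frame, whereas the positional index would silently pick the wrong column.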

Here is the code to build the model and in this case we will examine it with 3 components (latent variables) by using comps=3.  We use plsreg1 in this case because we are only interested in price.  If we had multiple Ys then plsreg2 would be used.

> pls1 = plsreg1(cars[, 1:15], cars[, 16, drop = FALSE], comps = 3)

# what options are available in plsreg1?
> pls1

$x.scores     X-scores (T-components)
$x.loads      X-loadings
$y.scores     Y-scores (U-components)
$y.loads      Y-loadings
$cor.xyt      score correlations
$raw.wgs      raw weights
$mod.wgs      modified weights
$std.coefs    standard coefficients
$reg.coefs    regular coefficients
$R2           R-squared
$R2Xy         explained variance of X-y by T
$y.pred       y-predicted
$resid        residuals
$T2           T2 hotelling
$Q2           Q2 cross validation

There is a lot in this package and I highly recommend going through the excellent tutorials to understand more.  At any rate, let me highlight a couple of things.

# R2 for each of our components
> pls1$R2
        t1         t2         t3
0.70474894 0.11708109 0.09872169
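The per-component values are easier to judge as a running total.  This small sketch simply re-enters the three R2 values printed above and accumulates them; with the fitted model at hand, `cumsum(pls1$R2)` should give the same thing.

```r
# R2 values copied from the output above
r2 <- c(t1 = 0.70474894, t2 = 0.11708109, t3 = 0.09872169)

# Cumulative share of the variance in price explained by 1, 2 and 3 components
cum_r2 <- cumsum(r2)
round(cum_r2, 3)
```

So the first component does most of the work (about 70%), and all three together account for roughly 92% of the variance in price.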

# correlations plot; notice which variables are highly correlated with price
> plot(pls1)

# plot each observation, predicted versus actual
> plot(cars$price, pls1$y.pred, type = "n", xlab = "Original", ylab = "Predicted")
> title("Comparison of responses", cex.main = 0.9)
# reference line: points on it are predicted perfectly
> abline(a = 0, b = 1, col = "gray85", lwd = 2)
# label each point with its row number
> text(cars$price, pls1$y.pred, col = "#5592e3")

This is not a bad start.  We would have to continue to look at different numbers of components to identify the best model and to see if the latent variables make sense from a practical standpoint.
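
One common way to compare different numbers of components is leave-one-out cross-validation: refit the model without each observation in turn, predict the held-out value, and sum the squared prediction errors (PRESS).  The sketch below is a hand-rolled base-R version of single-response PLS (the NIPALS algorithm) run on simulated data; it is not plsdepot's implementation, and the function names `pls1_fit` and `loo_press` are hypothetical.

```r
# Fit single-response PLS by NIPALS on centred/scaled predictors.
# Returns a function that predicts y for new rows of X.
pls1_fit <- function(X, y, ncomp) {
  X  <- as.matrix(X)
  mx <- colMeans(X); sx <- apply(X, 2, sd); my <- mean(y)
  E  <- scale(X, center = mx, scale = sx)   # predictors, deflated as we go
  f  <- y - my                              # response, deflated as we go
  W  <- P <- matrix(0, ncol(X), ncomp)
  cvec <- numeric(ncomp)
  for (h in seq_len(ncomp)) {
    w   <- crossprod(E, f)                  # direction of maximal covariance with y
    w   <- w / sqrt(sum(w^2))
    tsc <- as.numeric(E %*% w)              # component scores
    tt  <- sum(tsc^2)
    pl  <- crossprod(E, tsc) / tt           # X-loadings
    cvec[h] <- sum(f * tsc) / tt            # y-loading
    E <- E - tcrossprod(tsc, pl)            # remove the component from X
    f <- f - tsc * cvec[h]                  # and from y
    W[, h] <- w; P[, h] <- pl
  }
  beta <- W %*% solve(crossprod(P, W), cvec)  # coefficients on scaled predictors
  function(newX) {
    Z <- scale(as.matrix(newX), center = mx, scale = sx)
    as.numeric(my + Z %*% beta)
  }
}

# Leave-one-out PRESS for models with 1..max_comp components
loo_press <- function(X, y, max_comp) {
  sapply(seq_len(max_comp), function(k) {
    errs <- sapply(seq_along(y), function(i) {
      predict_fn <- pls1_fit(X[-i, , drop = FALSE], y[-i], k)
      y[i] - predict_fn(X[i, , drop = FALSE])
    })
    sum(errs^2)
  })
}

# Simulated data (an assumption; not the vehicles data)
set.seed(1)
n <- 30; p <- 6
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)       # induce collinearity
y <- 3 * X[, 1] - 2 * X[, 3] + rnorm(n)

press <- loo_press(X, y, max_comp = 4)
best  <- which.min(press)                   # component count with the lowest PRESS
```

The Q2 element listed in the plsreg1 output above serves the same purpose within plsdepot, so in practice you would read that off the fitted object rather than rolling your own.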