pcLasso: a new method for sparse regression


I’m excited to announce that my first package has been accepted to CRAN! The package pcLasso implements principal components lasso, a new method for sparse regression which I’ve developed with Rob Tibshirani and Jerry Friedman. In this post, I will give a brief overview of the method and some starter code. (For an in-depth description and elaboration of the method, please see our arXiv preprint. For more details on how to use the package, please see the package’s vignette.)

Let’s say we are in the standard supervised learning setting, with design matrix $X \in \mathbb{R}^{n \times p}$ and response $y \in \mathbb{R}^n$. Let the singular value decomposition (SVD) of $X$ be $X = UDV^T$, and let the diagonal entries of $D$ be $d_1 \geq d_2 \geq \dots \geq d_m$. Principal components lasso solves the optimization problem

$$\underset{\beta_0, \beta}{\text{minimize}} \quad \frac{1}{2}\|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1 + \frac{\theta}{2} \beta^T V D_{d_1^2 - d_j^2} V^T \beta,$$

where $\lambda$ and $\theta$ are non-negative hyperparameters, and $D_{d_1^2 - d_j^2}$ is the diagonal matrix with entries $d_1^2 - d_1^2, d_1^2 - d_2^2, \dots, d_1^2 - d_m^2$. The predictions this model gives for new data $x$ are $\hat{y} = \hat{\beta}_0 + x^T \hat{\beta}$.
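To make the quadratic penalty concrete, here is a minimal sketch in base R (my own illustration, not package code) of how the penalty matrix $V D_{d_1^2 - d_j^2} V^T$ can be formed from the SVD:

set.seed(42)
Xs <- scale(matrix(rnorm(100 * 10), nrow = 100), center = TRUE, scale = FALSE)
sv <- svd(Xs)
d <- sv$d                      # singular values, d1 >= d2 >= ...
W <- diag(d[1]^2 - d^2)        # D_{d1^2 - dj^2}; its first diagonal entry is 0
P <- sv$v %*% W %*% t(sv$v)    # quadratic penalty is (theta/2) * t(beta) %*% P %*% beta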

This optimization problem seems a little complicated, so let me try to motivate it. Notice that if we replace $D_{d_1^2 - d_j^2}$ with the identity matrix, then since $V$ is orthogonal the optimization problem reduces to

$$\underset{\beta_0, \beta}{\text{minimize}} \quad \frac{1}{2}\|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1 + \frac{\theta}{2} \|\beta\|_2^2,$$

which we recognize as the optimization problem that the elastic net solves. So we are doing something similar to the elastic net.

To be more specific: we can think of $\beta$ as the coordinates of the coefficient vector in the standard basis for $\mathbb{R}^p$. Then $V^T \beta$ gives the coordinates of the same coefficient vector in the basis of principal component (PC) directions of the design matrix $X$. Because the quadratic penalty sandwiches $D_{d_1^2 - d_j^2}$, whose entries increase down the diagonal, between $V$ and $V^T$ instead of the identity matrix, we are doing shrinkage in the principal components space in a way that (i) leaves the component along the first PC direction unpenalized, and (ii) shrinks components along later PC directions (those with smaller singular values) more strongly toward 0.
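We can check this interpretation numerically by continuing the sketch above: the quadratic penalty computed in the original basis equals a weighted sum of squared PC coordinates, with weight 0 on the first PC.

beta <- rnorm(10)
z <- t(sv$v) %*% beta          # coordinates of beta in the PC basis
c(t(beta) %*% P %*% beta)      # penalty in the original basis...
sum((d[1]^2 - d^2) * z^2)      # ...equals this weighted sum (weight on PC 1 is 0)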

This method extends easily to grouped features (whether the groups are overlapping or non-overlapping). Assume that our features come in $K$ groups. For each $k = 1, \dots, K$, let $X_k$ denote the reduced design matrix consisting of the columns of $X$ belonging to group $k$, and let its SVD be $X_k = U_k D_k V_k^T$. Let the diagonal entries of $D_k$ be $d_{k1} \geq d_{k2} \geq \dots$, and let $D_{d_{k1}^2 - d_{kj}^2}$ be the diagonal matrix with diagonal entries $d_{k1}^2 - d_{k1}^2, d_{k1}^2 - d_{k2}^2, \dots$. Let $\beta_k$ denote the reduced coefficient vector corresponding to the features in group $k$. Then pcLasso solves the optimization problem

$$\underset{\beta_0, \beta}{\text{minimize}} \quad \frac{1}{2}\|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1 + \frac{\theta}{2} \sum_{k=1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.$$
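In the same spirit as the sketches above (again my own illustration, not package internals), the group penalty is just the single-group penalty summed over per-group SVDs:

group_penalty <- function(X, beta, groups) {
  sum(sapply(groups, function(g) {
    svk <- svd(scale(X[, g], center = TRUE, scale = FALSE))
    dk <- svk$d
    zk <- t(svk$v) %*% beta[g]   # group coefficients in the group's PC basis
    sum((dk[1]^2 - dk^2) * zk^2) # zero weight on each group's first PC
  }))
}
group_penalty(Xs, beta, list(1:5, 6:10))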

Now for some basic code. Let’s make some fake data:

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)

Just like glmnet in the glmnet package, the pcLasso function fits the model for a sequence of $\lambda$ values which do not have to be user-specified. The user, however, does have to specify the $\theta$ parameter:

library(pcLasso)
fit <- pcLasso(X, y, theta = 10)

We can use the generic predict function to obtain the predictions this fit makes on new data. For example, the following code extracts the predictions that pcLasso makes at the 5th $\lambda$ value for the first 3 rows of our training data:

predict(fit, X[1:3, ])[, 5]
# [1]  0.002523773  0.004959471 -0.014095065

The code above assumes that all our features belong to one big group. If our features come in groups, pcLasso can take advantage of that structure via the groups option. groups should be a list of length $K$, with groups[[k]] being a vector of the column indices which belong to group $k$. For example, if features 1-5 belong to one group and features 6-10 belong to another group:

groups <- list(1:5, 6:10)
groups
# [[1]]
# [1] 1 2 3 4 5
#
# [[2]]
# [1]  6  7  8  9 10
fit <- pcLasso(X, y, theta = 10, groups = groups)
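The groups can also overlap, i.e. the same column index may appear in more than one group. For example (assuming your installed version of the package supports overlapping groups; see the vignette):

groups_overlap <- list(1:7, 5:10)   # features 5-7 belong to both groups
fit <- pcLasso(X, y, theta = 10, groups = groups_overlap)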

The function cv.pcLasso fits pcLasso and picks the optimal $\lambda$ value via cross-validation. The output of the cv.pcLasso function can also be used to predict on new data:

fit <- cv.pcLasso(X, y, theta = 10)
predict(fit, X[1:3, ], s = "lambda.min")
# [1] -0.01031697 -0.01031697 -0.01031697
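By analogy with cv.glmnet, I would expect the fitted cv.pcLasso object to also support the one-standard-error rule and a plot of the CV curve; both are assumptions on my part, so check the package documentation:

predict(fit, X[1:3, ], s = "lambda.1se")  # assumed option, as in glmnet
plot(fit)                                 # CV curve, if a plot method is provided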

The vignette contains significantly more detail on how to use this package. If you spot bugs, have questions, or have features that you would like to see implemented, get in touch with us!
