[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’m excited to announce that my first package has been accepted to CRAN! The package pcLasso implements principal components lasso, a new method for sparse regression which I’ve developed with Rob Tibshirani and Jerry Friedman. In this post, I will give a brief overview of the method and some starter code. (For an in-depth description and elaboration of the method, please see our arXiv preprint. For more details on how to use the package, please see the package’s vignette.)

Let’s say we are in the standard supervised learning setting, with design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ and response $\mathbf{y} \in \mathbb{R}^n$. Let the singular value decomposition (SVD) of $\mathbf{X}$ be $\mathbf{X} = \mathbf{UDV^T}$, and let the diagonal entries of $\mathbf{D}$ be $d_1 \geq d_2 \geq \dots$. Principal components lasso solves the optimization problem

\begin{aligned} \beta^* = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \frac{1}{2n}\| \mathbf{y} - \mathbf{X}\beta \|_2^2 + \lambda \| \beta\|_1 + \frac{\theta}{2} \beta^T \mathbf{VZV^T} \beta, \end{aligned}

where $\lambda$ and $\theta$ are non-negative hyperparameters, and $\mathbf{Z}$ is the diagonal matrix with entries $d_1^2 - d_1^2, d_1^2 - d_2^2, \dots$. The predictions this model gives for new data $\mathbf{X_{new}}$ are $\mathbf{X_{new}}\beta^*$.

This optimization problem seems a little complicated so let me try to motivate it. Notice that if we replace $\mathbf{Z}$ with the identity matrix, since $\mathbf{V}$ is orthogonal the optimization problem reduces to

\begin{aligned} \beta^* &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \frac{1}{2n}\| \mathbf{y} - \mathbf{X}\beta \|_2^2 + \lambda \| \beta\|_1 + \frac{\theta}{2} \beta^T \mathbf{VV^T} \beta \\ &= \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \frac{1}{2n}\| \mathbf{y} - \mathbf{X}\beta \|_2^2 + \lambda \| \beta\|_1 + \frac{\theta}{2} \|\beta\|_2^2, \end{aligned}

which we recognize as the optimization problem that elastic net solves. So we are doing something similar to elastic net.

To be more specific: we can think of $\beta$ as the coordinates of the coefficient vector in the standard basis $e_1 = \begin{pmatrix} 1 & 0 & \dots \end{pmatrix}^T, e_2 = \begin{pmatrix} 0 & 1 & 0 & \dots \end{pmatrix}^T, \dots$. Then $\mathbf{V^T}\beta$ would be the coordinates of this same coefficient vector, but where the basis comprises the principal component (PC) directions of the design matrix $\mathbf{X}$. Since we have the matrix $\mathbf{Z}$, with entries increasing down its diagonal, instead of the identity matrix sandwiched between $(\mathbf{V^T}\beta)^T$ and $\mathbf{V^T}\beta$ in the quadratic penalty, it means that we are doing shrinkage in the principal components space in a way that (i) leaves the component along the first PC direction unchanged, and (ii) shrinks components along larger PC directions more to 0.

This method extends easily to groups (whether overlapping or non-overlapping). Assume that our features come in $K$ groups. For each $k = 1, \dots, K$, let $\mathbf{X_k}$ represent the reduced design matrix corresponding to group $k$, and let its SVD be $\mathbf{X_k} = \mathbf{U_k D_k V_k^T}$. Let the diagonal entries of $\mathbf{D_k}$ be $d_{k1} \geq d_{k2} \geq \dots$, and let $\mathbf{Z_k}$ be the diagonal matrix with diagonal entries $d_{k1}^2 - d_{k1}^2, d_{k1}^2 - d_{k2}^2, \dots$. Let $\beta_k$ be the reduced coefficient vector corresponding to the features in group $k$. Then pcLasso solves the optimization problem

\begin{aligned} \beta^* = \underset{\beta \in \mathbb{R}^p}{\text{argmin}} \quad \frac{1}{2n}\| \mathbf{y} - \mathbf{X}\beta \|_2^2 + \lambda \| \beta\|_1 + \frac{\theta}{2} \sum_{k=1}^K \beta_k^T \mathbf{V_kZ_kV_k^T} \beta_k. \end{aligned}

Now for some basic code. Let’s make some fake data:

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)


Just like glmnet in the glmnet package, the pcLasso function fits the model for a sequence of $\lambda$ values which do not have to be user-specified. The user however, does have to specify the $\theta$ parameter:

library(pcLasso)
fit <- pcLasso(X, y, theta = 10)


We can use the generic predict function to obtain predictions this fit makes on new data. For example, the following code extracts the predictions that pcLasso makes on the 5th $\lambda$ value for the first 3 rows of our training data:

predict(fit, X[1:3, ])[, 5]
# [1]  0.002523773  0.004959471 -0.014095065


The code above assumes that all our features belong to one big group. If our features come in groups, pcLasso can take advantage of that by specifying the groups option. groups should be a list of length $K$, with groups[[k]] being a vector of column indices which belong to group $k$. For example, if features 1-5 belong to one group and features 6-10 belong to another group:

> groups <- list(1:5, 6:10)
> groups
# [[1]]
#  [1]  1  2  3  4  5
#
# [[2]]
#  [1] 6  7  8  9 10
fit <- pcLasso(X, y, theta = 10, groups = groups)


The function cv.pcLasso fits pcLasso and picks optimal $\lambda$ values via cross-validation. The output of the cv.pcLasso function can also be used to predict on new data:

fit <- cv.pcLasso(X, y, theta = 10)
predict(fit, X[1:3, ], s = "lambda.min")
# [1] -0.01031697 -0.01031697 -0.01031697


The vignette contains significantly more detail on how to use this package. If you spot bugs, have questions, or have features that you would like to see implemented, get in touch with us!