I’m writing a series of posts on various function options of the
glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.
In this post, we will look at the type.gaussian option.
For reference, here is the full signature of the
glmnet function (v3.0-2):
glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"), weights, offset = NULL, alpha = 1, nlambda = 100, lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), lambda = NULL, standardize = TRUE, intercept = TRUE, thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax * 2 + 20, nvars), exclude, penalty.factor = rep(1, nvars), lower.limits = -Inf, upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars < 500, "covariance", "naive"), type.logistic = c("Newton", "modified.Newton"), standardize.response = FALSE, type.multinomial = c("ungrouped", "grouped"), relax = FALSE, trace.it = 0, ...)
According to the official R documentation,
Two algorithm types are supported for (only)
family="gaussian". The default when
type.gaussian="covariance", and saves all inner-products ever computed. This can be much faster than
type.gaussian="naive", which loops through nobs every time an inner-product is computed. The latter can be far more efficient for
nvar >> nobssituations, or when
nvar > 500.
Generally speaking there is no need for you as the user to change this option.
How do the fitting times compare?
I ran a timing simulation to compare the function run times of
type.gaussian="covariance" for a range of number of observations (
nobs or ) and number of features (
nvar or ). The results are shown below. (For the R code that generated these plots, see here.)
This first panel of boxplots shows the time taken for
type.gaussian="naive" to complete as a fraction (or multiple) of that for
type.gaussian="covariance" (each boxplot represents 5 simulation runs). As advertised, naive runs more slowly for small values of but more quickly for large values of . The difference seems to be more stark when is larger.
This next plot shows the absolute fitting times: note the log scale on both the x and y axes.
So, what algorithms do these two options represent? What follows is based on Statistical Learning with Sparsity by Hastie, Tibshirani and Wainwright (free PDF here, p124 of the PDF, p113 of the book).
Let denote the response for observation , and let denote the value of feature for observation . Assume that the response and the features are standardized to have mean zero so that we don’t have to worry about fitting an intercept term. For each value of in lambda,
glmnet is minimizing the following objective function:
We minimize the expression above by cyclic coordinate descent. That is, we cycle through the features . For each , treat as a function of only (pretend that for are fixed), then update to the value which minimizes . It turns out that this update is very simple:
where is the soft-thresholding operator and is the partial residual.
Both of the modes minimize in this way. Where they differ is in how they keep track of the quantities needed to do the update above. From here, assume that the data has been standardized. (What follows works for unstandardized data as well but just has more complicated expressions.)
type.gaussian = "naive"
As the features are standardized, we can write the argument in the soft-thresholding operator as
where is the full residual for observation . In this mode, we keep track of the full residuals , .
- At a coordinate descent step for feature , if the coefficient doesn’t change its value, no updating of is needed. However, to get the LHS of for the next feature (), we need to make operations to compute the sum on the RHS of .
- If changes value, then we have to update the ‘s, then recompute the LHS of for the next feature using the expression on the RHS. This also takes time.
All in all, a full cycle through all variables costs operations.
type.gaussian = "covariance"
Ignoring the factor of , note that the first term on the RHS of can be written as
In this mode, we compute all the inner products ( of them) which takes operations. For each such that , we store the current values of (there are of them for each ).
- At a coordinate descent step for feature , if and the beginning of the step and its value changes, we need to update the terms with operations. Then, to calculate for the next coordinate descent step, we only need operations, where is the number of non-zero coefficients at the moment.
- As such, if no new variables become non-zero in a full cycle through the features, one full cycle takes only operations.
- If a new feature enters the model for the first time (i.e. becomes non-zero), then we need to compute and store for , which takes .
This form of updating avoids the updating needed at every step at each feature in naive mode. While we sometimes occur operations when a new variable enters the model, such events don’t happen often. Also, we have , so if is small or if , the operations for one full cycle pale in comparison with operations for naive updating.