# A deep dive into glmnet: offset

January 9, 2019
(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

I’m writing a series of posts on various options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, we will look at the offset option.

For reference, here is the start of the glmnet function's signature:

glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
    weights, offset=NULL, alpha = 1, nlambda = 100,
    lambda.min.ratio = ifelse(nobs ...

## offset

According to the official R documentation, offset should be

A vector of length nobs that is included in the linear predictor (a nobs x nc matrix for the “multinomial” family).

Its default value is NULL: in that case, glmnet internally sets the offset to be a vector of zeros having the same length as the response y.
Here is some example code for using the offset option:
library(glmnet)

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)

# fit model
fit1 <- glmnet(x, y, offset = offset)

If we specify offset in the glmnet call, then we must also supply the newoffset option when making predictions with the model. For example, if we want the predictions fit1 gives us at $\lambda = 0.1$ for the training data, not specifying newoffset will give us an error:

predict(fit1, x, s = 0.1)
# fails with an error: newoffset is required because fit1 was fit with an offset

This is the correct code:
predict(fit1, x, s = 0.1, newoffset = offset)
#                 1
#  [1,]  0.44691399
#  [2,]  0.30013292
#  [3,] -1.68825225
#  [4,] -0.49655504
#  [5,]  1.20180199
#  ...

So, what does offset actually do (or mean)? Recall that glmnet is fitting a linear model. More concretely, our data is $\{ (x_1, y_1), \dots, (x_n, y_n) \}$, where $x_j \in \mathbb{R}^p$ are the features for observation $j$ and $y_j \in \mathbb{R}$ is the response for observation $j$. For each observation, we are trying to model some variable $z_j$ as a linear combination of the features, i.e. $z_j = \beta_0 + \beta_1^T x_j$. $z_j$ is a function of $y_j$; the function depends on the context. For example,

For ordinary regression, $z_j = y_j$, i.e. the response itself.
For logistic regression, $z_j = \text{logit}(y_j) = \log \left(\dfrac{y_j}{1-y_j} \right)$.
For Poisson regression, $z_j = \log(y_j)$.

(Strictly speaking, in the logistic and Poisson cases the function is applied to the conditional mean of $y_j$ rather than to $y_j$ itself.)

So, we are trying to find $\beta_0$ and $\beta_1$ so that $\beta_0 + \beta_1^T x_j$ is a good estimate for $z_j$. If we have an offset $(e_1, \dots, e_n)$, then we are trying to find $\beta_0$ and $\beta_1$ so that $\boldsymbol{e_j} + \beta_0 + \beta_1^T x_j$ is a good estimate for $z_j$.
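
For the default gaussian family, this description can be checked numerically. The sketch below assumes the glmnet package is installed and reuses the simulated setup from earlier in the post; fit_shifted, coef_equal and pred_equal are names introduced here purely for illustration. It fits one model with an offset and a second model on the shifted response y - offset with no offset, then compares the two.

```r
library(glmnet)

set.seed(1)
n <- 50; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
offset <- rnorm(n)

# model with an offset in the linear predictor
fit1 <- glmnet(x, y, offset = offset)

# hypothetical comparison model: same features, response shifted by the offset
fit_shifted <- glmnet(x, y - offset)

# for the gaussian family, the two coefficient paths should match
coef_equal <- isTRUE(all.equal(as.matrix(coef(fit1)),
                               as.matrix(coef(fit_shifted))))

# predictions from fit1 add the offset back onto the linear part
pred1 <- predict(fit1, x, s = 0.1, newoffset = offset)
pred_shifted <- predict(fit_shifted, x, s = 0.1) + offset
pred_equal <- isTRUE(all.equal(as.numeric(pred1), as.numeric(pred_shifted)))
```

If the reasoning above holds, both coef_equal and pred_equal should come out TRUE. This equivalence is specific to ordinary regression; for the other families the offset enters inside the link function and cannot simply be subtracted from the response.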
Why might we want to use offsets? There are two primary reasons for them stated in the documentation:

Useful for the “poisson” family (e.g. log of exposure time), or for refining a model by starting at a current fit.

Let me elaborate. First, offsets are useful for Poisson regression. The official vignette has a little section explaining this; let me explain it through an example.
Imagine that we are trying to predict how many points an NBA basketball player will score per minute based on his physical attributes. If the player’s physical attributes (i.e. the covariates of our model) are denoted by $x$ and the number of points he scores in a minute is denoted by $y$, then Poisson regression assumes that
$$\begin{aligned} y &\sim \text{Poisson}(\mu(x)), \\ \log [\mu(x)] &= \beta_0 + \beta_1^T x. \end{aligned}$$
Here $\beta_0$ and $\beta_1$ are parameters of the model to be determined.
Having described the model, let’s turn to our data. For each player $1, \dots, n$, we have physical covariates $x_1, \dots, x_n$. However, instead of having each player’s points per minute, we have the number of points scored over a certain time period. For example, we might have “player 1 scored 12 points over 30 minutes” instead of “player 1 scored 0.4 points per minute”.
Offsets allow us to use our data as is. In our example above, loosely speaking 12/30 (points per minute) is our estimate for $\mu(x_1)$. Hence, 12 (points in 30 minutes) is our estimate for $30 \mu(x_1)$. In our model, $\beta_0 + \beta_1^T x_1$ is our estimate for $\log [\mu(x_1)]$, and so our estimate for $\log [30\mu(x_1)]$ would be $\beta_0 + \beta_1^T x_1 + \log 30$. The $\log 30$ term is the “offset” to get the model prediction for our data as is.
Taking this to the full dataset: if player $j$ scores $p_j$ points in $t_j$ minutes, then our offset would be the vector $(\log t_1, \dots, \log t_n)$, and the response we would feed glmnet is $(p_1, \dots, p_n)$.
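
Here is a minimal sketch of that setup on simulated data, assuming the glmnet package is installed. All of the data (x, minutes, rate, pts) and the value s = 0.05 are made up for illustration; they are not real NBA numbers. The response passed to glmnet is the raw count and the offset is the log of the minutes played.

```r
library(glmnet)

set.seed(42)
n <- 200; p <- 5
x <- matrix(rnorm(n * p), nrow = n)       # simulated "physical attributes"
minutes <- runif(n, min = 10, max = 40)   # t_j: minutes played (exposure)
rate <- exp(-1 + 0.3 * x[, 1])            # assumed true points per minute
pts <- rpois(n, lambda = minutes * rate)  # p_j: points scored in t_j minutes

# response is the raw count; the offset is the log of the exposure
fit_pois <- glmnet(x, pts, family = "poisson", offset = log(minutes))

# expected points for the first 5 players over 30 and over 60 minutes
x_new <- x[1:5, , drop = FALSE]
pred_30 <- predict(fit_pois, x_new, s = 0.05, type = "response",
                   newoffset = rep(log(30), 5))
pred_60 <- predict(fit_pois, x_new, s = 0.05, type = "response",
                   newoffset = rep(log(60), 5))

# doubling the exposure should double the predicted count
ratio <- as.numeric(pred_60 / pred_30)
```

Because the offset is added on the log scale, moving newoffset from log(30) to log(60) multiplies each predicted count by 2, so every entry of ratio should equal 2 up to floating-point error. To recover a per-minute rate, one could predict with newoffset set to log(1), i.e. a vector of zeros.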
The second reason one might want to use offsets is to improve on an existing model. Continuing the example above: say we have a friend who has trained a model (not necessarily a linear model) to predict $\log [\mu(x)]$, but he did not use the player’s physical attributes. We think that we can improve on his predictions by adding physical attributes to the model. One refinement to our friend’s model could be
$$\log[\mu(x)] = \hat{\theta} + \beta_0 + \beta_1^T x,$$
where $\hat{\theta}$ is the prediction of $\log[\mu(x)]$ from our friend’s model. In this setting, the offsets are simply our friend’s predictions. For model training, we would provide the first model’s predictions on the training observations as the offset. To get predictions from the refinement on new observations, we would first compute the predictions from the first model, then use them as the newoffset option in the predict call.
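
The workflow just described might look roughly like the sketch below, again on simulated data and assuming the glmnet package is installed. The friend's model here is a plain Poisson glm on a made-up covariate w that is not a physical attribute; that choice, along with the names friend, theta_train, theta_test and fit_refined, is an assumption for illustration only. Any model that produces predictions of $\log[\mu(x)]$ would play the same role.

```r
library(glmnet)

set.seed(7)
n <- 300; p <- 5
x <- matrix(rnorm(n * p), nrow = n)  # physical attributes
w <- rnorm(n)                        # hypothetical covariate used only by the friend
y <- rpois(n, lambda = exp(0.2 + 0.5 * w + 0.3 * x[, 1]))

train <- 1:200
test <- 201:300

# the friend's model: a Poisson GLM on w alone (a stand-in for any first model)
friend <- glm(y ~ w, family = poisson(),
              data = data.frame(y = y[train], w = w[train]))

# friend's predictions of log(mu) on training and new observations
theta_train <- as.numeric(predict(friend, type = "link"))
theta_test <- as.numeric(predict(friend, newdata = data.frame(w = w[test]),
                                 type = "link"))

# refinement: the friend's predictions enter as the offset
fit_refined <- glmnet(x[train, ], y[train], family = "poisson",
                      offset = theta_train)

# predictions on new observations need the friend's predictions as newoffset
pred_test <- predict(fit_refined, x[test, ], s = 0.05, type = "response",
                     newoffset = theta_test)

# the same predictions assembled by hand: exp(theta_hat + beta_0 + beta_1^T x)
beta <- as.numeric(as.matrix(coef(fit_refined, s = 0.05)))
manual <- exp(theta_test + beta[1] + as.numeric(x[test, ] %*% beta[-1]))
```

The hand-assembled vector manual should agree with pred_test, which makes the role of the offset explicit: glmnet only estimates $\beta_0$ and $\beta_1$, while $\hat{\theta}$ is taken as given both at training time and at prediction time.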
