On leverage


Last week, in our STT5100 (applied linear models) class, I introduced the hat matrix and the notion of leverage. In the classical regression model \(\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\) (in matrix form), the ordinary least squares estimator of the parameter \(\boldsymbol{\beta}\) is \(\widehat{\boldsymbol{\beta}}=(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y}.\) The prediction can then be written \(\widehat{\boldsymbol{y}}=\boldsymbol{X}\widehat{\boldsymbol{\beta}}=\underbrace{\color{blue}{\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top}}_{\color{blue}{\boldsymbol{H}}}\boldsymbol{y},\) where \(\color{blue}{\boldsymbol{H}}\) is called the hat matrix.

The matrix is idempotent, i.e. \(\boldsymbol{H}^2={\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\textcolor{grey}{\boldsymbol{X}^\top{\boldsymbol{X}}(\boldsymbol{X}^\top\boldsymbol{X})^{-1}}\boldsymbol{X}^\top}={\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top}=\boldsymbol{H},\) so it can be interpreted as a projection matrix. Furthermore, since \(\boldsymbol{H}\boldsymbol{X}=\boldsymbol{X}\) (just do the maths), the projection is onto the subspace spanned by the columns of \(\boldsymbol{X}\), i.e. all linear combinations of those columns. One can also observe that \(\mathbb{I}-\boldsymbol{H}\) is a projection matrix. And we can write \(\boldsymbol{y}=\underbrace{\boldsymbol{H}\boldsymbol{y}}_{\widehat{\boldsymbol{y}}}+\underbrace{(\mathbb{I}-\boldsymbol{H})\boldsymbol{y}}_{\widehat{\boldsymbol{\varepsilon}}},\) where \(\widehat{\boldsymbol{y}}\) is the orthogonal projection of \(\boldsymbol{y}\) on the (linear) space of linear combinations of columns of \(\boldsymbol{X}\), and \(\widehat{\boldsymbol{y}}\perp\widehat{\boldsymbol{\varepsilon}}\), which gives the classical interpretation of residuals: they are unpredictable (at least with a linear model using variables \(\boldsymbol{X}\)).
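To make these projection properties concrete, here is a quick numerical check on a small simulated dataset (the names X0, y0 and H0 are mine, just for this illustration):

set.seed(1)
n = 20
X0 = cbind(1, runif(n), runif(n))          # design matrix, with an intercept
y0 = X0 %*% c(1, 2, -1) + rnorm(n)
H0 = X0 %*% solve(t(X0) %*% X0) %*% t(X0)  # the hat matrix
max(abs(H0 %*% H0 - H0))                   # idempotent, up to numerical error
max(abs(H0 %*% X0 - X0))                   # H X = X
sum((H0 %*% y0) * ((diag(n) - H0) %*% y0)) # fitted values orthogonal to residuals (essentially zero)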

Let’s move a bit faster now (we’ve seen many other properties last week), and consider the elements on the diagonal of matrix \(\boldsymbol{H}\). Recall that since \(\widehat{\boldsymbol{y}}=\boldsymbol{H}\boldsymbol{y}\), we have

\(\widehat{\boldsymbol{y}}_i=\boldsymbol{H}_{i,i}\boldsymbol{y}_i+\sum_{j\neq i}\boldsymbol{H}_{i,j}\boldsymbol{y}_j,\)

so entry \(\boldsymbol{H}_{i,i}\) is a measure of the influence of observation \(\boldsymbol{y}_i\) on its own prediction \(\widehat{\boldsymbol{y}}_i\).

We have seen that \(\sum_{i=1}^n\boldsymbol{H}_{i,i}=\text{trace}(\boldsymbol{H})=\text{trace}(\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top),\) which, using the cyclic property of the trace, can be written \(\sum_{i=1}^n\boldsymbol{H}_{i,i}=\text{trace}(\boldsymbol{X}^\top\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{X})^{-1})=\text{trace}(\mathbb{I}_p)=p,\) where classically \(p=k+1\), \(k\) being the number of explanatory variables. Further, since \(\boldsymbol{H}\) is idempotent, we can write (from \(\boldsymbol{H}=\boldsymbol{H}^2\)) that \(\boldsymbol{H}_{i,i}=\boldsymbol{H}_{i,i}^2 + \sum_{j\neq i}\boldsymbol{H}_{i,j}\boldsymbol{H}_{j,i}=\boldsymbol{H}_{i,i}^2 + \sum_{j\neq i}\boldsymbol{H}_{i,j}^2.\) On the one hand, since the second term is non-negative, \(\boldsymbol{H}_{i,i}\geq\boldsymbol{H}_{i,i}^2\), which implies \(\boldsymbol{H}_{i,i}\leq1\); on the other hand, as a sum of squares, \(\boldsymbol{H}_{i,i}\geq0\). Hence \(\boldsymbol{H}_{i,i}\in[0,1]\). And there was a question in class about the sharpness of these bounds.
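Numerically, on the small simulated example above, the trace is indeed the number of columns of the design matrix, and the diagonal stays within the unit interval:

sum(diag(H0))     # equals ncol(X0), i.e. p = k + 1 = 3 here
range(diag(H0))   # all leverages are between 0 and 1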

With Anscombe’s dataset, we saw that it is possible to get a leverage of 1. Using something rather similar,

df = data.frame(x = c(rep(1,10),6),   # ten observations at x = 1, and one at x = 6
                y = c(1:10,8))
plot(df)

we obtain

model = lm(y~x,data=df)
abline(model,col="red",lwd=2)
H = lm.influence(model)$hat   # leverages, i.e. the diagonal of the hat matrix
plot(1:11,H,type="h")

The very last observation, the one on the right, is extremely influential here: if we remove it, the model is completely different! And here, we reach the upper bound, \(\boldsymbol{H}_{11,11}=1\). Observe that all the other points are equally influential and, because of the constraint on the trace of the matrix, \(\boldsymbol{H}_{i,i}=1/10\) for \(i\in\{1,2,\cdots,10\}\).
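One can check those values directly (the sum of the leverages being the trace, i.e. \(p=2\) here):

round(H,3)   # ten leverages equal to 0.1, and the last one equal to 1
sum(H)       # the trace constraint: the leverages sum to p = 2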

Now, what about the lower bound? In order to get some sort of “non-influential” observations, consider the following two cases.

  • the case where one observation (the first one, in the example below) is such that \(\widehat{\boldsymbol{y}}_{i}=\boldsymbol{y}_{i}\) (perfect prediction)
  • the case where one observation (the tenth one, in the example below) is such that \(\boldsymbol{x}_{i}=\overline{\boldsymbol{x}}\) and \(\boldsymbol{y}_{i}=\overline{\boldsymbol{y}}\); from the first order condition (or normal equation), the fitted regression line always goes through the point \((\overline{\boldsymbol{x}},\overline{\boldsymbol{y}})\), as checked numerically right after this list
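The second point is easy to check numerically: with an intercept, the fitted line evaluated at the average of \(x\) returns the average of \(y\) (here on the first dataset, fitted above),

coef(model)[1] + coef(model)[2] * mean(df$x)   # prediction at the average of x...
mean(df$y)                                     # ...equals the average of y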

Let us move two observations in our dataset,

mean(c(4,rep(1,8),6))
[1] 1.8
df = data.frame(x = c(4, rep(1,8), 6, 1.8),
                y = c(predict(model, newdata=data.frame(x=4)),    # y set to the prediction of the previous model at x = 4
                      2:9, 8,
                      predict(model, newdata=data.frame(x=1.8)))) # x = 1.8 is the average of the x's, y set to its prediction

We now have
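(To reproduce the scatter plot with the new fitted line, something along these lines can be used)

plot(df)
abline(lm(y~x, data=df), col="red", lwd=2)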

If we compute the leverages, we obtain

model = lm(y~x,data=df)
H = lm.influence(model)$hat
plot(1:11,H,type="h")

so, for the first observation, the leverage actually increased (the blue part of the graph), while the observation placed at \((\overline{\boldsymbol{x}},\overline{\boldsymbol{y}})\) has the lowest influence, but it is not zero. Is it possible to reach zero?

Here, observe that for that observation (the one at the sample mean), \(\boldsymbol{H}_{i,i}=1/n\). And actually, that’s the best we can do… We can prove that, in the case of a simple regression (as above), \(\boldsymbol{H}_{i,i}=\frac{1}{n}+\frac{(x_i-\overline{x})^2}{n\text{Var}(x)},\) where \(\text{Var}(x)\) is the empirical variance, so that \(n\text{Var}(x)=\sum_j(x_j-\overline{x})^2\). This is minimal when \(x_i=\overline{x}\), in which case \(\boldsymbol{H}_{i,i}=1/n\); otherwise \(\boldsymbol{H}_{i,i}>1/n\). And this property is also valid in a multiple regression (as soon as an intercept is included in the regression, which should always be the case). To prove that result, let \(\tilde{\boldsymbol{X}}\) denote the matrix of centered explanatory variables (without the intercept column); then we can prove that \(\boldsymbol{H}_{i,i}=\frac{1}{n}+\big[\tilde{\boldsymbol{X}}(\tilde{\boldsymbol{X}}^\top\tilde{\boldsymbol{X}})^{-1}\tilde{\boldsymbol{X}}^\top\big]_{i,i},\) which is basically a matrix version of the previous equation.
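As a quick sanity check, on the modified dataset above, the closed-form expression and its centered-matrix version both match the leverages returned by R (hatvalues() returns the same thing as lm.influence()$hat):

x = df$x
1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)    # simple regression formula
Xc = scale(df$x, scale = FALSE)                         # centered covariate
1/length(x) + diag(Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc))
hatvalues(model)                                        # same values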

I can maybe add another comment on Anscombe’s data. We saw that, on the right, we did reach 1, but I did not prove it. One way to prove it is actually to focus on the remaining \(n-1\) points, on the left, which all share the same \(x\) value. We can prove that if \(\boldsymbol{X}_{i_1}=\boldsymbol{X}_{i_2}\) (two identical rows of the design matrix), then \(\boldsymbol{H}_{i_1,i_2}=\boldsymbol{X}_{i_1}^\top(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}_{i_2}=\boldsymbol{H}_{i_1,i_1},\) hence, using the relationship obtained from the idempotence of the hat matrix, \(\boldsymbol{H}_{i_1,i_1}=2\boldsymbol{H}_{i_1,i_1}^2+\sum_{j\notin\{i_1,i_2\}}\boldsymbol{H}_{i_1,j}^2,\) and thus \(\boldsymbol{H}_{i_1,i_1}\big(1-2\boldsymbol{H}_{i_1,i_1}\big)\geq0,\) i.e. \(\boldsymbol{H}_{i_1,i_1}\in[0,1/2]\) when there are two duplicated rows; with \(d\) duplicates, the same argument gives an upper bound of \(1/d\), so here, with \(n-1\) duplicates, the bound is \(1/(n-1)\). So the \(n-1\) \(\boldsymbol{H}_{i,i}\)‘s on the left are at most \(1/(n-1)\) (their sum is thus at most \(1\)), the last one is at most \(1\), and the sum of all of them has to be \(p=2\). The only possibility is that the \(n-1\) leverages on the left are all equal to \(1/(n-1)\), and the last one is equal to \(1\). So we have the values of all \(n\) \(\boldsymbol{H}_{i,i}\)‘s.
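This can be checked directly with the built-in anscombe data in R: the fourth pair (x4, y4) is the one where all but one of the \(x\)’s are identical,

hatvalues(lm(y4 ~ x4, data = anscombe))   # ten leverages equal to 1/10, and one equal to 1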

 
