# Heteroscedasticity in Regression — It Matters!

**Mad (Data) Scientist**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R’s main linear and nonlinear regression functions, **lm()** and **nls(),** report standard errors for parameter estimates under the assumption of *homoscedasticity*, a fancy word for a situation that rarely occurs in practice. The assumption is that the (conditional) variance of the response variable is the same at any set of values of the predictor variables.

Take for instance my favorite introductory example, predicting human weight from height. (It is unlikely we’d need to make such a prediction, but it serves as a quick and easy illustration.) The homoscedasticity assumption says that tall people have no more variation in weight than short people, certainly not true.

So, confidence intervals and p-values calculated by **lm()** and **nls()** are generally invalid. In this blog post, I’ll show how to remedy this problem — including some new code that I’ll provide here — and give an example showing just how far wrong one can be without a remedy.

In the linear case, the remedy is simple, but probably not widely known. Years ago, researchers such as Eickert and White derived large-sample theory for the heteroscedastic variance case, and it is coded in CRAN’s **car** and **sandwich** packages in the functions **hccm()** and **vcovHC()**, respectively. I’ll use the latter here, as its name is similar to that of R’s **vcov()** function.

The latter inputs the result of a call to **lm()** or **nls()**, and outputs the estimated covariance matrix of your estimated parameter vector. Thus the standard errors of the estimated parameters are the square roots of the diagonal elements of the matrix returned by **vcov()**. Let’s check that, using the example from the online help for **lm()**:

> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14) > trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69) > group <- gl(2, 10, 20, labels = c("Ctl","Trt")) > weight <- c(ctl, trt) > lm.D9 <- lm(weight ~ group) > summary(lm.D9) ... Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.0320 0.2202 22.850 9.55e-15 groupTrt -0.3710 0.3114 -1.191 0.249 > diag(sqrt(vcov(lm.D9))) (Intercept) groupTrt 0.2202177 0.3114349

Sure enough, the standard errors match. (Were you doubting it? 🙂 )

So, the linear case, i.e. **lm(),** is taken care of. But unfortunately, there doesn’t seem to be anything for the nonlinear case, **nls()**. The following code solves that problem:

library(minpack.lm) library(sandwich) nlsvcovhc <- function(nlslmout) { b <- coef(nlslmout) m <- nlslmout$m resid <- m$resid() hmat <- m$gradient() fakex <- hmat fakey <- resid + hmat %*% b lmout <- lm(fakey ~ fakex - 1) vcovHC(lmout) }

(The names such as **fakex**, which I should remove, are explained in the last paragraph below.) You can download this code (complete with comments, which I’ve omitted here for brevity) here. The function is applied to the output of **nlsLM()**, which is a modified version of **nls()** in the package **minpack.lm**, which has better convergence behavior.

Here’s an example, using the enzyme data set **vmkmki** from the CRAN package **nlstools**. (I removed the last 12 observations, which for reasons I won’t go into here seem to be anomalous.) The nonlinear model here is the one given in the online help for **vmkmki**.

> regftn <- function(t,u,b1,b2,b3) b1 * t / (t + b2 * (1 + u/b3)) > bstart <- list(b1=1,b2=1, b3=1) > z <- nls(v ~ regftn(S,I,b1,b2,b3),data=vmkmki,start=list(b1=1,b2=1, b3=1)) > z Nonlinear regression model model: v ~ regftn(S, I, b1, b2, b3) data: vmkmki b1 b2 b3 18.06 15.21 22.28 ... > vcov(z) b1 b2 b3 b1 0.4786776 1.374961 0.8930431 b2 1.3749612 7.568837 11.1332821 b3 0.8930431 11.133282 29.1363366

Compare that to using Eickert-White:

ttt# get new z from nlsLM(), not shown > nlsvcovhc(z) fakex1 fakex2 fakex3 fakex1 0.4708209 1.706591 2.410712 fakex2 1.7065910 10.394496 20.314688 fakex3 2.4107117 20.314688 53.086958

This is rather startling! Except for the estimated variance of the first parameter estimate, the estimated variances and covariances from Eickert-White are much larger than what **nls()**} found under the assumption of homoscedasticity. In other words, **nls()** was way off the mark. In fact, this simulation code shows that the standard errors reported by **nls()** can lead, for instance, to confidence intervals having only 60% coverage probability rather than 90%, due to heteroscedasticity, while the above code fixes the problem.

How does that code work? (Optional reading.) Most nonlinear least-squares procedures use a local linear approximation, such as in the Gauss-Newton algorithm. This results computationally in a fake **lm()** setting. As such, we are already set up to use the *delta method*. We can then apply **vcovHC().**

**leave a comment**for the author, please follow the link and comment on their blog:

**Mad (Data) Scientist**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.