Can You Say “Heteroscedasticity” 3 Times Fast?

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, we know that the assumption of homoscedasticity can’t be true, even approximately.

Typical books discuss assessment of that assumption using residual plots and the like, but then leave it at that. Rather few books mention Eicker-White theory, which develops valid asymptotic inference for heteroscedastic data. E-W is really nice, and guess what! — it’s long been available in R, in the sandwich and car packages on CRAN.  (Note that the latter package is intended for use with a book that does cover this topic, J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011.) Then instead of using R’s standard vcov() function to obtain estimated variances and covariances of the estimates of the βi, we use vcovHC() and hccm(), respectively.

One can make a similar derivation for nonlinear regression, which is available as the function nlshc() in my regtools package. The package will be associated with my own book on regression, currently in progress. (The package is currently in progress as well, but already contains several useful functions.) The rest of this post is adapted from an example in the book.

The model I chose for this simple example was E(Y | X = t) = 1 / t’β, where the distributions of the quantities can be seen in the following simulation code:

sim <- function(n,nreps) {
 b <- 1:2
 res <- replicate(nreps,{
 x <- matrix(rexp(2*n),ncol=2)
 meany <- 1 / (x %*% b)
 y <- meany + (runif(n) - 0.5) * meany
 xy <- cbind(x,y)
 xy <- data.frame(xy)
 nlout <- nls(X3 ~ 1 / (b1*X1+b2*X2),
 data=xy,start=list(b1 = 1,b2=1)) 
 b <- coef(nlout)
 vc <- vcov(nlout)
 vchc <- nlsvcovhc(nlout)
 z1 <- (b[1] - 1) / sqrt(vc[1,1])
 z2 <- (b[1] - 1) / sqrt(vchc[1,1])
 c(z1,z2)
})
 print(mean(res[1,] < 1.28))
 print(mean(res[2,] < 1.28))
}

The results were striking:

> sim(250,2500)
[1] 0.6188
[1] 0.9096

Since the true value should be 0.90 (1.28 is the 0.90 quantile of N(0,1)), the Eicker-White method is doing an outstanding job here — and R’s built-in nonlinear regression function, nls() is not. The latter function’s reported standard errors are way, way off the mark.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)