Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Another one from the series on “misspecified regression models” (started with Model Misspecification and Linear Sandwiches).
< section id="intro" class="level2">Intro
Lately I’ve been messing around with the {lmtest}
R package, a nice collection of hypothesis tests for classical linear model assumptions: linearity (of course) and heteroskedasticity (
Just to clarify, here the relevant “linearity” assumption is that the conditional mean
First attempt: residual autocorrelation
My initial intuition was that it should be possible to test functional specification through the following procedure:
- Perform linear regression with the specified functional form.
- Order the residuals according to the corresponding values of
1. - Test for serial correlation (e.g. performing a Durbin-Watson test,
lmtest::dwtest
) on the series of ordered residuals.
The idea is quite simple: if residuals exhibit some systematic pattern when plotted against
set.seed(840) x <- rnorm(1e2) y <- x^3 + rnorm(length(x)) plot(x, y) abline(lm(y ~ x))
This, I suspect, is the reason why functions such as lmtest::dwtest()
have an order.by
argument which precisely allows to sort residuals before performing the test.
Unfortunately, it turns out that such a method is not only sensitive to functional misspecification, but also to heteroskedasticity – as one can quickly verify by running a simulation using lmtest::dwtest()
.
The overall idea is interesting, and works for homoskedastic noise, but the limitation to constant variance may be a bit too stringent. For this reason I turned to a second method, which also allows to take into account the possibility of heteroskedastic noise.
< section id="second-attempt-reset-heteroskedastic-consistent-variance-estimates" class="level2">Second attempt: RESET + Heteroskedastic Consistent variance estimates
The idea of RESET tests (see ?lmtest::resettest()
) is also quite simple: if the linear model is correct, there should be relatively little gain in adding additional non-linear functions of the original covariates to the fit’s formula.
The statistical significance of these model adjustments can be tested through a standard
The code that follows illustrates this procedure with an example dataset. The following section contains a more in-depth simulation study of the property of the RESET test.
fit_cars <- lm(dist ~ speed, data = cars) with(data = cars, plot(speed, dist)) abline(fit_cars)
lmtest::resettest(fit_cars, type = "regressor", power = 2, vcov = sandwich::vcovHC )
RESET test data: fit_cars RESET = 2.32, df1 = 1, df2 = 48, p-value = 0.1344
Unfortunately, the output of lmtest::resettest
does not include the results of the extended fit, which can be useful to understand the impact of the omitted covariates on the overall model picture (independently of the RESET
In order to get some insight on the effect of misspecification, we need to manually perform the RESET fit and make the relevant comparisons:
fit_cars_sq <- lm(dist ~ speed + I(speed*speed), data = cars) with(data = cars, plot(speed, dist)) abline(fit_cars) lines(x = cars$speed, y = fitted(fit_cars_sq), col = "blue")
RESET + HC vcov: a simulation study
We consider a univariate regression problem, with a regressor
Data will consist of independent samples
#' Helper to generate data with prescribed: #' * Regressor distribution: `x` #' * Response conditional mean: `f` #' * Response conditional noise: `eps` dgp_fun <- function(x, f, eps) { function(n) { .x <- x(n) data.frame(x = .x, y = f(.x) + eps(.x)) } } #' Helper to simulate results of linear regression, with prescribed: #' * Data generating process: `dgp` #' * Sample size of simulated datasets: `n` #' * Summary function (e.g. p-value of RESET test): `summarize_fun` lm_simulate <- function(dgp, n, summarize_fun, nsim, simplify) { replicate(nsim, { data <- dgp(n) fit <- lm(y ~ x, data) summarize_fun(fit) }, simplify = simplify) } #' Helper to perform RESET test on a `lm` fit object, and plot the p-value #' distribution. The estimator for regression coefficients variance-covariance #' matrix can be set through the `vcov` argument. reset_pvalue <- function( dgp, n, # Data generating process params power = 2:3, type = "regressor", vcov = sandwich::vcovHC, # RESET params nsim = 1e3 # Simulation params ) { summarize_fun <- function(fit) lmtest::resettest(fit, power = power, type = type, vcov = vcov)$p.value p <- lm_simulate( dgp = dgp, n = n, summarize_fun = summarize_fun, nsim = nsim, simplify = TRUE ) return(data.frame( p = p, dgp = deparse(substitute(dgp)), n = n, vcov = deparse(substitute(vcov)), nsim = nsim )) }
Furthermore, we will use:
library(dplyr) library(ggplot2)
for plotting.
< section id="data-generating-processes" class="level3">Data generating processes
The data generating processes can be defined as follows:
dgp_t1 <- dgp_fun( x = rnorm, f = \(x) 0.2 * x, eps = \(x) rnorm(length(x)) ) dgp_t2 <- dgp_fun( x = rnorm, f = \(x) 0.2 * x, eps = \(x) abs(x) * rnorm(length(x)) ) dgp_t3 <- dgp_fun( x = rnorm, f = \(x) 0.2 * x^3, eps = \(x) rnorm(length(x)) )
Data generated according to these three distributions looks as follows:
bind_rows( tibble(dgp_t1(100), dgp = "dgp_t1"), tibble(dgp_t2(100), dgp = "dgp_t2"), tibble(dgp_t3(100), dgp = "dgp_t3"), ) |> ggplot(aes(x = x, y = y)) + geom_point() + geom_smooth(method = "lm", formula = y ~ x, se = F) + facet_grid(~ dgp)
RESET -value distributions
The RESET
For the ground truths
and , represents the false positive rate (or Type I Error Rate) in rejecting the null hypothesis “no functional misspecification” at a given size of the test . For a valid -value, these curves should lie on or below the straight line .For the ground truth
, represents the Power (or one minus the Type II Error Rate) in detecting functional misspecification at a given size . High values correspond to high sensitivity.
sim_data <- dplyr::bind_rows( reset_pvalue(dgp = dgp_t1, n = 10, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t1, n = 100, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t1, n = 1000, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t1, n = 10000, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t1, n = 10, vcov = stats::vcov), reset_pvalue(dgp = dgp_t1, n = 100, vcov = stats::vcov), reset_pvalue(dgp = dgp_t1, n = 1000, vcov = stats::vcov), reset_pvalue(dgp = dgp_t1, n = 10000, vcov = stats::vcov), reset_pvalue(dgp = dgp_t2, n = 10, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t2, n = 100, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t2, n = 1000, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t2, n = 10000, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t2, n = 10, vcov = stats::vcov), reset_pvalue(dgp = dgp_t2, n = 100, vcov = stats::vcov), reset_pvalue(dgp = dgp_t2, n = 1000, vcov = stats::vcov), reset_pvalue(dgp = dgp_t2, n = 10000, vcov = stats::vcov), reset_pvalue(dgp = dgp_t3, n = 10, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t3, n = 100, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t3, n = 1000, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t3, n = 10000, vcov = sandwich::vcovHC), reset_pvalue(dgp = dgp_t3, n = 10, vcov = stats::vcov), reset_pvalue(dgp = dgp_t3, n = 100, vcov = stats::vcov), reset_pvalue(dgp = dgp_t3, n = 1000, vcov = stats::vcov), reset_pvalue(dgp = dgp_t3, n = 10000, vcov = stats::vcov) ) sim_data |> mutate(n_label = paste("n", n, sep = " = ")) |> ggplot(aes(p, color = vcov)) + stat_ecdf() + scale_color_discrete("vcov") + scale_x_continuous("p-value", labels = scales::percent) + scale_y_continuous("Empirical CDF", labels = scales::percent) + geom_abline(slope = 1, intercept = 0, linetype = "dashed") + facet_grid(n_label ~ dgp, ) + ggtitle( "p-value distribution of RESET test", paste("nsim", max(sim_data$nsim), sep = " = ") )
The plots illustrate qualitatively the behavior of the RESET test with and without the vcov
correction for noise heteroskedasticity. Various remarks:
The test with the standard
stats::vcov
estimator is sensitive not only to pure functional misspecification ( ), but also to pure heteroskedastic noise ( ).The
sandwich::vcovHC
estimator leads to an asymptotically correct Type I Error Rate in the case, but to a somewhat lower sensitivity (with respect tostats::vcov
) in the case.We need to keep in mind that
sandwich::vcovHC
only provides asymptotically correct variance-covariance estimates. Thus, for small , the -value distribution of the RESET test using thesandwich::vcovHC
can also be distorted (even in the perfectly specified case ).
Conclusions
This post explained how to perform model validation checks that are sensitive to functional misspecification, but relatively robust to heteroskedasticity.
The general idea is to extend the original model, allowing for more general functional forms in the conditional mean of the response, and test whether such extension significantly improves the fit. The catch is that, when performing the latter test, we need to somehow keep into account the possibility of heteroskedastic noise.
This idea is readily implemented with RESET tests for linear models: one can simply use a variance-covariance estimator for regression coefficients that is robust to heteroskedasticity. In R, this can be achieved with a single line of code, using lmtest::resettest(vcov = sandwich::vcovHC)
.
With some effort, one may be able to generalize such a procedure to any parametric model fitted by Maximum Likelihood Estimation, since a sandwich estimator is available also in this more general case (see e.g. the presentation of sandwich estimators in this paper by D.A. Freedman).
Footnotes
Here I’m implicitly assuming that we have a single
, but a similar logic should also apply to multivariate regression.↩︎With enough data, the RESET test would likely test positive for a variety of misspecifications, but that doesn’t mean that such misspecification are necessarily relevant from a modeling perspective. Here, for instance, a large coefficient for
with a -score of two s could be more worrying than a minuscule coefficient with a -score of five s.↩︎Sometimes also referred to as “second order misspecification”.↩︎
The code is a bit unelegant 😬 but it works.↩︎
Reuse
< section class="quarto-appendix-contents" id="quarto-citation">Citation
@online{gherardi2023, author = {Gherardi, Valerio}, title = {Testing Functional Specification in Linear Regression}, date = {2023-07-11}, url = {https://vgherard.github.io/posts/2023-07-11-testing-functional-specification-in-linear-regression/}, langid = {en} }
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.