lmSubsets: Exact variable-subset selection in linear regression

[This article was first published on Achim Zeileis, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The R package lmSubsets for flexible and fast exact variable-subset selection is introduced and illustrated in a weather forecasting case study.


Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). “lmSubsets: Exact Variable-Subset Selection in Linear Regression for R.” Journal of Statistical Software, 93(3), 1-21. doi:10.18637/jss.v093.i03


An R package for computing the all-subsets regression problem is presented. The proposed algorithms are based on computational strategies recently developed. A novel algorithm for the best-subset regression problem selects subset models based on a predetermined criterion. The package user can choose from exact and from approximation algorithms. The core of the package is written in C++ and provides an efficient implementation of all the underlying numerical computations. A case study and benchmark results illustrate the usage and the computational efficiency of the package.



Illustration: Variable selection in weather forecasting

Advances in numerical weather prediction (NWP) have played an important role in the increase of weather forecast skill over the past decades. Numerical models simulate physical systems that operate at a large, typically global, scale. The horizontal (spatial) resolution is limited by the computational power available today and hence, typically, the NWP outputs are post-processed to correct for local and unresolved effects in order to obtain forecasts for specific locations. So-called model output statistics (MOS) develops a regression relationship based on past meteorological observations of the variable to be predicted and forecasted NWP quantities at a certain lead time. Variable-subset selection is often employed to determine which NWP outputs should be included in the regression model for a specific location.

Here, the lmSubsets package is used to build a MOS regression model predicting temperature at Innsbruck Airport, Austria, based on data from the Global Ensemble Forecast System. The data frame IbkTemperature contains 1824 daily cases for 42 variables: the temperature at Innsbruck Airport (observed), 36 NWP outputs (forecasted), and 5 deterministic time trend/season patterns. The NWP variables include quantities pertaining to temperature (e.g., 2-meter above ground, minimum, maximum, soil), precipitation, wind, and fluxes, among others.

First, package and data are loaded and the few missing values are omitted for simplicity.

data("IbkTemperature", package = "lmSubsets")
IbkTemperature <- na.omit(IbkTemperature)

A simple output model for the observed temperature (temp) is constructed, which will serve as the reference model. It consists of the 2-meter temperature NWP forecast (t2m), a linear trend component (time), as well as seasonal components with annual (sin, cos) and bi-annual (sin2, cos2) harmonic patterns.

MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2,
  data = IbkTemperature)

When looking at summary(MOS0) or the coefficient table below, it can be observed that despite the inclusion of the NWP variable t2m, the coefficients for the deterministic components remain significant, which indicates that the seasonal temperature fluctuations are not fully resolved by the numerical model.

(Intercept)-345.252 ** (109.212)-666.584 ***(95.349)-661.700 ***(95.225)
t2m0.318 ***(0.016)0.055    (0.029)             
time0.132 *  (0.054)0.149 ** (0.047)0.147 ** (0.047)
sin-1.234 ***(0.126)0.522 ***(0.147)0.811 ***(0.120)
cos-6.329 ***(0.164)-0.812 ** (0.273)             
sin20.240 *  (0.110)-0.794 ***(0.119)-0.870 ***(0.118)
cos2-0.332 ** (0.109)-1.067 ***(0.101)-1.128 ***(0.097)
sshnf             0.016 ***(0.004)0.018 ***(0.004)
vsmc             20.200 ***(3.115)20.181 ***(3.106)
tmax2m             0.145 ***(0.037)0.181 ***(0.023)
st             1.077 ***(0.051)1.142 ***(0.043)
wr             0.450 ***(0.109)0.505 ***(0.103)
t2pvu             0.064 ***(0.011)0.149 ***(0.028)
mslp                          -0.000 ***(0.000)
p2pvu                          -0.000 ** (0.000)
AIC9493.602         8954.907         8948.182         
BIC9537.650         9031.992         9025.267         
RSS19506.469         14411.122         14357.943         
Sigma3.281         2.825         2.820         
R-squared0.803         0.854         0.855         
*** p < 0.001; ** p < 0.01; * p < 0.05.

Next, the reference model is extended with selected regressors taken from the remaining 35 NWP variables.

MOS1_best <- lmSelect(temp ~ ., data = IbkTemperature,
  include = c("t2m", "time", "sin", "cos", "sin2", "cos2"),
  penalty = "BIC", nbest = 20)
MOS1 <- refit(MOS1_best)

Best-subset regression with respect to the BIC criterion is employed to determine pertinent veriables in addition to the regressors already used in MOS0. The 20 best submodels are computed and the selected variables can be visualized by image(MOS1_best, hilite = 1) (see below) while the corresponding BIC values can be visualized by plot(MOS1_best). All in all, these 20 best models are very similar with only a few variables switching between being included and excluded. Using the refit() method the best submodel can be extracted and fitted via lm(). Summary statistics are shown in the table above. Overall, the model MOS1 improves the model fit considerably compared to the basic MOS0 model.


Finally, an all-subsets regression is conducted instead of the cheaper best-subsets regression. It considers all 41 variables without any restrictions to determine what is the best model in terms of BIC that could be found for this data set.

MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature)
MOS2 <- refit(lmSelect(MOS2_all, penalty = "BIC"))

Again, the best model is refitted with lm() to facilitate further inspections, see above for the summary table.

The best-BIC models MOS1 and MOS2 both have 13 regressors. The deterministic trend and all but one of the harmonic seasonal components are retained in MOS2 even though they are not forced into the model (as in MOS1). In addition, MOS1 and MOS2 share six NWP outputs relating to temperature (tmax2m, st, t2pvu), pressure (mslp, p2pvu), hydrology (vsmc, wr), and heat flux (sshnf). However, and most remarkably, MOS1 does not include the direct 2-meter temperature output from the NWP model (t2m). In fact, t2m is not included by any of the 20 submodels (sizes 8 to 27) shown by image(MOS2_all, size = 8:27, hilite = 1, hilite_penalty = "BIC") whereas the temperature quantities tmax2m, st, t2pvu are included by all. (Additionally, plot(MOS2_all) would show the associated BIC and residual sum of squares across the different model sizes.) The summary statistics reveal that both MOS1 and MOS2 significantly improve over the simple reference model MOS0, with MOS2 being only slightly better than MOS1.


To leave a comment for the author, please follow the link and comment on their blog: Achim Zeileis.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)