# lmSubsets: Exact variable-subset selection in linear regression

**Achim Zeileis**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The R package lmSubsets for flexible and fast exact variable-subset selection is introduced and illustrated in a weather forecasting case study.

## Citation

Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). “lmSubsets: Exact Variable-Subset Selection in Linear Regression for R.” *Journal of Statistical Software*, **93**(3), 1-21. doi:10.18637/jss.v093.i03

## Abstract

An R package for computing the all-subsets regression problem is presented. The proposed algorithms are based on computational strategies recently developed. A novel algorithm for the best-subset regression problem selects subset models based on a predetermined criterion. The package user can choose from exact and from approximation algorithms. The core of the package is written in C++ and provides an efficient implementation of all the underlying numerical computations. A case study and benchmark results illustrate the usage and the computational efficiency of the package.

## Software

https://CRAN.R-project.org/package=lmSubsets

## Illustration: Variable selection in weather forecasting

Advances in numerical weather prediction (NWP) have played an important role in the increase of weather forecast skill over the past decades. Numerical models simulate physical systems that operate at a large, typically global, scale. The horizontal (spatial) resolution is limited by the computational power available today and hence, typically, the NWP outputs are post-processed to correct for local and unresolved effects in order to obtain forecasts for specific locations. So-called model output statistics (MOS) develops a regression relationship based on past meteorological observations of the variable to be predicted and forecasted NWP quantities at a certain lead time. Variable-subset selection is often employed to determine which NWP outputs should be included in the regression model for a specific location.

Here, the `lmSubsets`

package is used to build a MOS regression model predicting temperature at Innsbruck Airport, Austria, based on data from the Global Ensemble Forecast System. The data frame `IbkTemperature`

contains 1824 daily cases for 42 variables: the temperature at Innsbruck Airport (observed), 36 NWP outputs (forecasted), and 5 deterministic time trend/season patterns. The NWP variables include quantities pertaining to temperature (e.g., 2-meter above ground, minimum, maximum, soil), precipitation, wind, and fluxes, among others.

First, package and data are loaded and the few missing values are omitted for simplicity.

library("lmSubsets") data("IbkTemperature", package = "lmSubsets") IbkTemperature <- na.omit(IbkTemperature)

A simple output model for the observed temperature (`temp`

) is constructed, which will serve as the reference model. It consists of the 2-meter temperature NWP forecast (`t2m`

), a linear trend component (`time`

), as well as seasonal components with annual (`sin`

, `cos`

) and bi-annual (`sin2`

, `cos2`

) harmonic patterns.

MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2, data = IbkTemperature)

When looking at `summary(MOS0)`

or the coefficient table below, it can be observed that despite the inclusion of the NWP variable `t2m`

, the coefficients for the deterministic components remain significant, which indicates that the seasonal temperature fluctuations are not fully resolved by the numerical model.

MOS0 | MOS1 | MOS2 | ||||

(Intercept) | -345.252 ** | (109.212) | -666.584 *** | (95.349) | -661.700 *** | (95.225) |

t2m | 0.318 *** | (0.016) | 0.055 | (0.029) | ||

time | 0.132 * | (0.054) | 0.149 ** | (0.047) | 0.147 ** | (0.047) |

sin | -1.234 *** | (0.126) | 0.522 *** | (0.147) | 0.811 *** | (0.120) |

cos | -6.329 *** | (0.164) | -0.812 ** | (0.273) | ||

sin2 | 0.240 * | (0.110) | -0.794 *** | (0.119) | -0.870 *** | (0.118) |

cos2 | -0.332 ** | (0.109) | -1.067 *** | (0.101) | -1.128 *** | (0.097) |

sshnf | 0.016 *** | (0.004) | 0.018 *** | (0.004) | ||

vsmc | 20.200 *** | (3.115) | 20.181 *** | (3.106) | ||

tmax2m | 0.145 *** | (0.037) | 0.181 *** | (0.023) | ||

st | 1.077 *** | (0.051) | 1.142 *** | (0.043) | ||

wr | 0.450 *** | (0.109) | 0.505 *** | (0.103) | ||

t2pvu | 0.064 *** | (0.011) | 0.149 *** | (0.028) | ||

mslp | -0.000 *** | (0.000) | ||||

p2pvu | -0.000 ** | (0.000) | ||||

AIC | 9493.602 | 8954.907 | 8948.182 | |||

BIC | 9537.650 | 9031.992 | 9025.267 | |||

RSS | 19506.469 | 14411.122 | 14357.943 | |||

Sigma | 3.281 | 2.825 | 2.820 | |||

R-squared | 0.803 | 0.854 | 0.855 | |||

*** p < 0.001; ** p < 0.01; * p < 0.05. |

Next, the reference model is extended with selected regressors taken from the remaining 35 NWP variables.

MOS1_best <- lmSelect(temp ~ ., data = IbkTemperature, include = c("t2m", "time", "sin", "cos", "sin2", "cos2"), penalty = "BIC", nbest = 20) MOS1 <- refit(MOS1_best)

Best-subset regression with respect to the BIC criterion is employed to determine pertinent veriables in addition to the regressors already used in `MOS0`

. The 20 best submodels are computed and the selected variables can be visualized by `image(MOS1_best, hilite = 1)`

(see below) while the corresponding BIC values can be visualized by `plot(MOS1_best)`

. All in all, these 20 best models are very similar with only a few variables switching between being included and excluded. Using the `refit()`

method the best submodel can be extracted and fitted via `lm()`

. Summary statistics are shown in the table above. Overall, the model `MOS1`

improves the model fit considerably compared to the basic `MOS0`

model.

Finally, an all-subsets regression is conducted instead of the cheaper best-subsets regression. It considers all 41 variables without any restrictions to determine what is the best model in terms of BIC that could be found for this data set.

MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature) MOS2 <- refit(lmSelect(MOS2_all, penalty = "BIC"))

Again, the best model is refitted with `lm()`

to facilitate further inspections, see above for the summary table.

The best-BIC models `MOS1`

and `MOS2`

both have 13 regressors. The deterministic trend and all but one of the harmonic seasonal components are retained in `MOS2`

even though they are not forced into the model (as in `MOS1`

). In addition, `MOS1`

and `MOS2`

share six NWP outputs relating to temperature (`tmax2m`

, `st`

, `t2pvu`

), pressure (`mslp`

, `p2pvu`

), hydrology (`vsmc`

, `wr`

), and heat flux (`sshnf`

). However, and most remarkably, `MOS1`

does not include the direct 2-meter temperature output from the NWP model (`t2m`

). In fact, `t2m`

is not included by any of the 20 submodels (sizes 8 to 27) shown by `image(MOS2_all, size = 8:27, hilite = 1, hilite_penalty = "BIC")`

whereas the temperature quantities `tmax2m`

, `st`

, `t2pvu`

are included by all. (Additionally, `plot(MOS2_all)`

would show the associated BIC and residual sum of squares across the different model sizes.) The summary statistics reveal that both `MOS1`

and `MOS2`

significantly improve over the simple reference model `MOS0`

, with `MOS2`

being only slightly better than `MOS1`

.

**leave a comment**for the author, please follow the link and comment on their blog:

**Achim Zeileis**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.