When fitting a multiple linear regression model to data, a natural question is whether the model can be simplified by excluding variables. There are automatic procedures for undertaking these tests, but some people prefer a more manual approach to variable selection rather than pressing a button and taking what comes out.
When there are a large number of variables it is awkward to work through each one manually when deciding how to simplify to a more parsimonious model. The dropterm function in the MASS package for R removes some of this tedium by considering the outcome of dropping each model term one at a time.
To illustrate this, consider the cpus data set in the MASS package, which contains a relative performance measure and characteristics of 209 CPUs. We load the package first to make the data (and the dropterm function) available:
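Loading the package is a single call:

```r
library(MASS)   # provides the cpus data set and the dropterm() function
```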
We first fit a linear model with six explanatory variables:
cpu.mod1 <- lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
The dropterm function requires a fitted model, which we saved in the last command, and optionally a test with which to compare the initial model against each of the alternative models that have one less variable. Here we choose an F test:
> dropterm(cpu.mod1, test = "F")
Single term deletions

Model:
perf ~ syct + mmin + mmax + cach + chmin + chmax
       Df Sum of Sq    RSS    AIC F Value     Pr(F)
<none>              727002 1718.3
syct    1     27995 754997 1724.2   7.779  0.005793 ** 
mmin    1    252211 979213 1778.5  70.078 9.416e-15 ***
mmax    1    271147 998149 1782.5  75.339 1.326e-15 ***
cach    1     75962 802964 1737.0  21.106 7.640e-06 ***
chmin   1       358 727360 1716.4   0.100  0.752632    
chmax   1    163396 890398 1758.6  45.400 1.640e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The output from the function call indicates that we could exclude the chmin variable, re-fit the model, and then continue with the same checking process.
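A convenient way to carry out that re-fit is the update function, which modifies an existing model formula rather than writing it out again in full (the model name cpu.mod2 is our choice here, not from the original):

```r
library(MASS)

# Original six-variable model
cpu.mod1 <- lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

# Drop the non-significant chmin term and re-fit
cpu.mod2 <- update(cpu.mod1, . ~ . - chmin)

# Repeat the single-term-deletion check on the simplified model
dropterm(cpu.mod2, test = "F")
```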
The dropterm function considers each variable in turn and calculates what the change in the residual sum of squares would be if that variable were excluded from the model. There is a link between this F test and the t test that appears in the model summary: for a term with a single degree of freedom, the F statistic is the square of the corresponding t statistic, because of the relationship between the two distributions. For this model we have:
> summary(cpu.mod1)

Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax,
    data = cpus)

Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,	Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: < 2.2e-16
Let us consider the syct variable. The t statistic in the model summary is 2.789, and squaring this value gives 7.779, which is the F statistic produced by the dropterm function.
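We can confirm this relationship directly in R by extracting the t statistic from the summary and the F statistic from the dropterm table and comparing them (the column names "t value" and "F Value" match the printed output above):

```r
library(MASS)

cpu.mod1 <- lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

# t statistic for syct from the model summary
t.syct <- summary(cpu.mod1)$coefficients["syct", "t value"]

# F statistic for syct from the single-term deletion table
F.syct <- dropterm(cpu.mod1, test = "F")["syct", "F Value"]

# The squared t statistic equals the F statistic
all.equal(t.syct^2, F.syct)
```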