**R – Real Data**, and kindly contributed to R-bloggers)

Recently I have been working on a Kaggle competition where participants are tasked with predicting Russian housing prices. In developing a model for the challenge, I came across a few methods for selecting the best regression

model for a given dataset. Let’s load up some data and take a look.

library(ISLR) set.seed(123) swiss <- data.frame(datasets::swiss) dim(swiss)

```
## [1] 47 6
```

head(swiss)

```
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
```

This data provides data on Swiss fertility and some socioeconomic

indicators. Suppose we want to predict fertility rate. Each observation

is as a percentage. In order to assess our prediction ability, create

test and training data sets.

test <- swiss[sample(nrow(swiss),5),] test <- data.frame(test) train <- swiss[-(sample(nrow(swiss), 5)),] train <- data.frame(train)

Next, have a quick look at the data.

par(mfrow=c(2,2)) plot(train$Education, train$Fertility) plot(train$Catholic, train$Fertility) plot(train$Infant.Mortality, train$Fertility) hist(train$Fertility)

Looks like there could be some interesting relationships here. The

following block of code will take a model formula (or model matrix) and

search for the best combination of predictors.

library(leaps) leap <- regsubsets(Fertility~., data = train, nbest = 10) leapsum <- summary(leap) plot(leap, scale = 'adjr2')

According to the adjusted R-squared value (larger is better), the best

two models are: those with all predictors and all predictors less

Examination. Both have adjusted R-squared values around .64 – a decent fit. Fit these models so we can evaluate them further.

swisslm <- lm(Fertility~., data = train) swisslm2 <- lm(Fertility~.-Examination, data = train) #use summary() for a more detailed look.

First we'll compute the mean squared error (MSE).

mean(swisslm$residuals^2)

```
## [1] 45.44987
```

mean(swisslm2$residuals^2)

```
## [1] 47.16062
```

Looks like the model with all predictors is marginally better. We're

looking for a low value of MSE here. We can also look at the Akaike

information criterion (AIC) for more information. Lower is better here

as well.

library(MASS) step1 <- stepAIC(swisslm, direction = "both")

step1$anova

```
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Final Model:
## Fertility ~ Education + Catholic + Infant.Mortality
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 36 1908.895 172.2976
## 2 - Examination 1 71.85140 37 1980.746 171.8495
## 3 - Agriculture 1 78.11025 38 2058.856 171.4739
```

Here, the model without `Examination`

scores lower than the full

model. It seems that both models are evenly matched, though I might be

inclined to use the full model since the MSE is lower.

Since we sampled our original data to make the train and test datasets,

the difference in these tests is subject to change based on the training

data used. I encourage anyone who wants to test themselves to change the

`set.seed`

at the beginning and evaluate the above results again. Even

better: use your own data!

If you learned something or have questions feel free to leave a comment

or shoot me an email.

Happy modeling,

Kiefer

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Real Data**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...