
For this regression analysis in R, we will use the Boston housing data, which has a total of 506 observations and 14 variables.

In this dataset, medv is the response variable, and the remaining 13 variables are the predictors.

We want to build a regression model that predicts medv from the other predictor variables.

All of the variables are numeric except one, chas, which is a factor.

First, we need to check for multicollinearity; for that, exclude the factor variable and examine the pairwise correlations among the numeric predictors.

In this dataset, some of the predictor pairs are highly correlated, and this may lead to inaccurate or unstable results.
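A minimal sketch of this check, using pairs.panels() from the psych package (loaded below) and dropping the factor column chas:

library(mlbench)
library(psych)
data("BostonHousing")
# chas (column 4) is a factor, so exclude it before computing correlations
pairs.panels(BostonHousing[, -4])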


## How to avoid collinearity issues?

Collinearity can lead to overfitting.

The first option is to fit ridge regression, which shrinks the coefficients toward zero (but never exactly to zero) to prevent overfitting, while keeping all variables in the model.

The second option is lasso regression, which shrinks regression coefficients, with some shrunk to zero. Thus, it also helps with feature selection.

The third option is elastic net regression, a mix of ridge and lasso.

The elastic net penalty reduces to the ridge penalty when alpha equals 0 and to the lasso penalty when alpha equals 1.
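Concretely, glmnet minimizes the penalized sum of squares

$$\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]$$

so alpha = 0 leaves only the squared (ridge) penalty and alpha = 1 only the absolute-value (lasso) penalty.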

Elastic net models are more flexible: when we fit one, the best model might turn out to be 20% ridge and 80% lasso, or some other combination of the two penalties.


## Regression analysis in R

library(caret)    # model training and cross-validation
library(glmnet)   # ridge, lasso, and elastic net models
library(mlbench)  # BostonHousing dataset
library(psych)    # pairs.panels correlation plots

### Getting Data

data("BostonHousing")
data <- BostonHousing
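A quick check confirms the structure described above:

dim(data)   # 506 rows, 14 columns
str(data)   # all columns numeric except chas, which is a factor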

### Data Partition

set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind == 1, ]
test  <- data[ind == 2, ]

Next, set custom control parameters for 10-fold cross-validation repeated five times.

custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = TRUE)

### Linear Model

set.seed(1234)
lm <- train(medv ~ ., train, method = 'lm', trControl = custom)
lm
Linear Regression
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results:
RMSE     Rsquared  MAE
4.23222  0.778488  3.032342
Tuning parameter 'intercept' was held constant at a value of TRUE

You can see that the RMSE is 4.23 and the R-squared is 0.78. The number = 10 means 10-fold cross-validation: nine folds are used to train the model and one fold to test the error, and the whole procedure is repeated five times.


summary(lm)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-10.1018  -2.3528  -0.7279   1.7047  27.7868
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.742808   5.653389   4.554 7.37e-06 ***
crim         -0.165452   0.036018  -4.594 6.15e-06 ***
zn            0.047202   0.015401   3.065 0.002352 **
indus         0.013377   0.067401   0.198 0.842796
chas1         1.364633   0.947288   1.441 0.150630
nox         -13.065313   4.018576  -3.251 0.001264 **
rm            5.072891   0.468889  10.819  < 2e-16 ***
age          -0.028573   0.013946  -2.049 0.041247 *
dis          -1.421107   0.208908  -6.803 4.66e-11 ***
rad           0.260863   0.070092   3.722 0.000232 ***
tax          -0.013556   0.004055  -3.343 0.000922 ***
ptratio      -0.906744   0.139687  -6.491 3.03e-10 ***
b             0.008912   0.002986   2.985 0.003040 **
lstat        -0.335149   0.056920  -5.888 9.40e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.192 on 339 degrees of freedom
Multiple R-squared:  0.7874,     Adjusted R-squared:  0.7793
F-statistic: 96.59 on 13 and 339 DF,  p-value: < 2.2e-16

The variables without a star (here indus and chas1) are not statistically significant.

### Ridge Regression

set.seed(1234)
ridge <- train(medv ~ ., train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0,
                                      lambda = seq(0.0001, 1, length = 5)),
               trControl = custom)
ridge
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results across tuning parameters:
lambda    RMSE      Rsquared   MAE
0.000100  4.242204  0.7782278  3.008339
0.250075  4.242204  0.7782278  3.008339
0.500050  4.242204  0.7782278  3.008339
0.750025  4.248536  0.7779462  3.012397
1.000000  4.265479  0.7770264  3.023091

Tuning parameter ‘alpha’ was held constant at a value of 0

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were alpha = 0 and lambda = 0.50005.

You can see that alpha is 0 because we are fitting ridge regression, and the selected lambda is 0.50005.
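The selected values can also be read directly from the fitted object:

ridge$bestTune   # alpha = 0, lambda = 0.50005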

#### Plot Results

plot(ridge)


Increasing lambda beyond this point increases the error; the appropriate lambda here is about 0.5.

plot(ridge$finalModel, xvar = "lambda", label = TRUE)

The x-axis shows log lambda; when log lambda reaches about 9, all the coefficients have been shrunk essentially to zero.

plot(ridge$finalModel, xvar = "dev", label = TRUE)

This plot shows the coefficients against the fraction of deviance explained: up to about 60% the model explains the data well with small coefficients, and beyond that point the coefficients inflate rapidly, which can indicate overfitting.


plot(varImp(ridge, scale = TRUE))

The most important variables appear at the top of the plot and the least important ones at the bottom.

### Lasso Regression

set.seed(1234)
lasso <- train(medv ~ ., train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 1,
                                      lambda = seq(0.0001, 1, length = 5)),
               trControl = custom)
lasso
glmnet
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results across tuning parameters:
lambda    RMSE      Rsquared   MAE
0.000100  4.230700  0.7785841  3.025998
0.250075  4.447615  0.7579974  3.135095
0.500050  4.611916  0.7438984  3.285522
0.750025  4.688806  0.7406668  3.362630
1.000000  4.786658  0.7366188  3.445216

Tuning parameter ‘alpha’ was held constant at a value of 1

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were alpha = 1 and  lambda = 1e-04.

In this case, the best lambda is close to zero, which means the optimal lasso model applies almost no shrinkage.
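Because lambda is nearly zero, you can inspect which coefficients (if any) have been shrunk all the way to zero with:

coef(lasso$finalModel, s = lasso$bestTune$lambda)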


#### Plot Results

plot(lasso)
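The commands that follow refer to an elastic net model named en, whose fitting code does not appear above. A minimal sketch of fitting one, assuming the same lambda grid as before and a search over alpha between 0 and 1 (the exact grid is an assumption):

set.seed(1234)
# assumed grid: alpha searched over [0, 1], same lambda sequence as above
en <- train(medv ~ ., train,
            method = 'glmnet',
            tuneGrid = expand.grid(alpha = seq(0, 1, length = 10),
                                   lambda = seq(0.0001, 1, length = 5)),
            trControl = custom)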
best <- en$finalModel
coef(best, s = en$bestTune$lambda)

You can extract the coefficients of the best model with the commands above.

### Prediction

The final chosen model, here called fm (presumably the elastic net en), is used to predict on the training and test data, and the RMSE is computed for each:

p1 <- predict(fm, train)
sqrt(mean((train$medv-p1)^2))
4.108671
p2 <- predict(fm, test)
sqrt(mean((test$medv-p2)^2))
6.14675

## Conclusion

Looking at the RMSE, the lowest value comes from the elastic net model. Elastic net regression avoids the multicollinearity issue and provides the best model here.
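A sketch of how the four models can be compared on their resampling results, using caret's resamples() (assumes the en model from the sketch above):

# lm, ridge, lasso, and en are the train objects fit above
model_list <- list(Linear = lm, Ridge = ridge, Lasso = lasso, ElasticNet = en)
res <- resamples(model_list)
summary(res)                 # cross-validated RMSE, R-squared, and MAE per model
bwplot(res, metric = "RMSE") # boxplots of RMSE across resamples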
