**Data Science, Data Mining and Predictive Analytics**, and kindly contributed to R-bloggers)

Continuing on the below post, I am going to use a gradient boosted machine model to predict combined miles per gallon for all 2019 motor vehicles.

Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles

The raw data is located on the EPA government site

The variables/features I am using for the models are: Engine displacement (size), number of cylinders, transmission type, number of gears, air inspired method, regenerative braking type, battery capacity Ah, drivetrain, fuel type, cylinder deactivate, and variable valve.

There are 1253 vehicles in the dataset (does not include pure electric vehicles) summarized below.

`fuel_economy_combined eng_disp num_cyl transmission`

Min. :11.00 Min. :1.000 Min. : 3.000 A :301

1st Qu.:19.00 1st Qu.:2.000 1st Qu.: 4.000 AM : 46

Median :23.00 Median :3.000 Median : 6.000 AMS: 87

Mean :23.32 Mean :3.063 Mean : 5.533 CVT: 50

3rd Qu.:26.00 3rd Qu.:3.600 3rd Qu.: 6.000 M :148

Max. :58.00 Max. :8.000 Max. :16.000 SA :555

SCV: 66

num_gears air_aspired_method

Min. : 1.000 Naturally Aspirated :523

1st Qu.: 6.000 Other : 5

Median : 7.000 Supercharged : 55

Mean : 7.111 Turbocharged :663

3rd Qu.: 8.000 Turbocharged+Supercharged: 7

Max. :10.000

regen_brake batt_capacity_ah

No :1194 Min. : 0.0000

Electrical Regen Brake: 57 1st Qu.: 0.0000

Hydraulic Regen Brake : 2 Median : 0.0000

Mean : 0.3618

3rd Qu.: 0.0000

Max. :20.0000

drive cyl_deactivate

2-Wheel Drive, Front :345 Y: 172

2-Wheel Drive, Rear :345 N:1081

4-Wheel Drive :174

All Wheel Drive :349

Part-time 4-Wheel Drive: 40

fuel_type

Diesel, ultra low sulfur (15 ppm, maximum): 28

Gasoline (Mid Grade Unleaded Recommended) : 16

Gasoline (Premium Unleaded Recommended) :298

Gasoline (Premium Unleaded Required) :320

Gasoline (Regular Unleaded Recommended) :591

variable_valve

N: 38

Y:1215

Starting with an untuned base model:

`trees <- 1200`

m_boosted_reg_untuned <- gbm(

formula = fuel_economy_combined ~ .,

data = train,

n.trees = trees,

distribution = "gaussian"

)

`> summary(m_boosted_reg_untuned)`

var rel.inf

eng_disp eng_disp 41.26273684

batt_capacity_ah batt_capacity_ah 24.53458898

transmission transmission 11.33253784

drive drive 8.59300859

regen_brake regen_brake 8.17877824

air_aspired_method air_aspired_method 2.11397865

num_gears num_gears 1.90999021

fuel_type fuel_type 1.65692562

num_cyl num_cyl 0.22260369

variable_valve variable_valve 0.11043532

cyl_deactivate cyl_deactivate 0.08441602

> boosted_stats_untuned

RMSE Rsquared MAE

2.4262643 0.8350367 1.7513331

The untuned GBM model performs better than the multiple linear regression model, but worse than the random forest.

I am going to tune the GBM by running a grid search:

`#create hyperparameter grid`

hyper_grid <- expand.grid(

shrinkage = seq(.07, .12, .01),

interaction.depth = 1:7,

optimal_trees = 0,

min_RMSE = 0

)

#grid search

for (i in 1:nrow(hyper_grid)) {

set.seed(123)

gbm.tune <- gbm(

formula = fuel_economy_combined ~ .,

data = train_random,

distribution = "gaussian",

n.trees = 5000,

interaction.depth = hyper_grid$interaction.depth[i],

shrinkage = hyper_grid$shrinkage[i],

)

hyper_grid$optimal_trees[i] <- which.min(gbm.tune$train.error)

hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$train.error))

cat(i, "\n")

}

The hyper grid is 42 rows which is all combinations of shrinkage and interaction depths specified above.

`> head(hyper_grid)`

shrinkage interaction.depth optimal_trees min_RMSE

1 0.07 1 0 0

2 0.08 1 0 0

3 0.09 1 0 0

4 0.10 1 0 0

5 0.11 1 0 0

6 0.12 1 0 0

After running the grid search, it is apparent that there is overfitting. This is something to be very careful about. I am going to run a 5 fold cross validation to estimate out of bag error vs MSE. After running the 5 fold CV, this is the best model that does not overfit:

`> m_boosted_reg <- gbm(`

formula = fuel_economy_combined ~ .,

data = train,

n.trees = trees,

distribution = "gaussian",

shrinkage = .09,

cv.folds = 5,

interaction.depth = 5

)

best.iter <- gbm.perf(m_boosted_reg, method = "cv")

pred_boosted_reg_ <- predict(m_boosted_reg,n.trees=1183, newdata = test)

mse_boosted_reg_ <- RMSE(pred = pred_boosted_reg, obs = test$fuel_economy_combined) ^2

boosted_stats<-postResample(pred_boosted_reg,test$fuel_economy_combined)

The fitted black curve above is MSE and the fitted green curve is the out of bag estimated error. 1183 is the optimal amount of iterations.

`> pred_boosted_reg <- predict(m_boosted_reg,n.trees=1183, newdata = test)`

> mse_boosted_reg <- RMSE(pred = pred_boosted_reg, obs = test$fuel_economy_combined) ^2

> boosted_stats<-postResample(pred_boosted_reg,test$fuel_economy_combined)

> boosted_stats

RMSE Rsquared MAE

1.8018793 0.9092727 1.3334459

> mse_boosted_reg

3.246769

The tuned gradient boosted model performs better than the random forest with a MSE of 3.25 vs 3.67 for the random forest.

`> summary(res)`

Min. 1st Qu. Median Mean 3rd Qu. Max.

-5.40000 -0.90000 0.00000 0.07643 1.10000 9.10000

50% of the predictions are within 1 MPG of the EPA Government Estimate.

The largest residuals are exotics and a hybrid which are the more unique data points in the dataset.

`> tmp[which(abs(res) > boosted_stats[1] * 3), ] `

Division Carline fuel_economy_combined pred_boosted_reg

642 HYUNDAI MOTOR COMPANY Ioniq Blue 58 48.5

482 KIA MOTORS CORPORATION Forte FE 35 28.7

39 Lamborghini Aventador Coupe 11 17.2

40 Lamborghini Aventador Roadster 11 17.2

**leave a comment**for the author, please follow the link and comment on their blog:

**Data Science, Data Mining and Predictive Analytics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...