**Data Science, Data Mining and Predictive Analytics**, and kindly contributed to R-bloggers)

I am going to use regression, decision trees, and the random forest algorithm to predict combined miles per gallon for all 2019 motor vehicles. The raw data is located on the EPA government site

After preliminary diagnostics, exploration and cleaning I am going to start with a multiple linear regression model.

The variables/features I am using for the models are: Engine displacement (size), number of cylinders, transmission type, number of gears, air inspired method, regenerative braking type, battery capacity Ah, drivetrain, fuel type, cylinder deactivate, and variable valve.

There are 1253 vehicles in the dataset (does not include pure electric vehicles) summarized below.

`fuel_economy_combined eng_disp num_cyl transmission`

Min. :11.00 Min. :1.000 Min. : 3.000 A :301

1st Qu.:19.00 1st Qu.:2.000 1st Qu.: 4.000 AM : 46

Median :23.00 Median :3.000 Median : 6.000 AMS: 87

Mean :23.32 Mean :3.063 Mean : 5.533 CVT: 50

3rd Qu.:26.00 3rd Qu.:3.600 3rd Qu.: 6.000 M :148

Max. :58.00 Max. :8.000 Max. :16.000 SA :555

SCV: 66

num_gears air_aspired_method

Min. : 1.000 Naturally Aspirated :523

1st Qu.: 6.000 Other : 5

Median : 7.000 Supercharged : 55

Mean : 7.111 Turbocharged :663

3rd Qu.: 8.000 Turbocharged+Supercharged: 7

Max. :10.000

regen_brake batt_capacity_ah

No :1194 Min. : 0.0000

Electrical Regen Brake: 57 1st Qu.: 0.0000

Hydraulic Regen Brake : 2 Median : 0.0000

Mean : 0.3618

3rd Qu.: 0.0000

Max. :20.0000

drive cyl_deactivate

2-Wheel Drive, Front :345 Y: 172

2-Wheel Drive, Rear :345 N:1081

4-Wheel Drive :174

All Wheel Drive :349

Part-time 4-Wheel Drive: 40

fuel_type

Diesel, ultra low sulfur (15 ppm, maximum): 28

Gasoline (Mid Grade Unleaded Recommended) : 16

Gasoline (Premium Unleaded Recommended) :298

Gasoline (Premium Unleaded Required) :320

Gasoline (Regular Unleaded Recommended) :591

variable_valve

N: 38

Y:1215

`Call:`

lm(formula = fuel_economy_combined ~ eng_disp + transmission +

num_gears + air_aspired_method + regen_brake + batt_capacity_ah +

drive + fuel_type + cyl_deactivate + variable_valve, data = cars_19)

Residuals:

Min 1Q Median 3Q Max

-12.7880 -1.6012 0.1102 1.6116 17.3181

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 36.05642 0.82585 43.660 < 2e-16 ***

eng_disp -2.79257 0.08579 -32.550 < 2e-16 ***

transmissionAM 2.74053 0.44727 6.127 1.20e-09 ***

transmissionAMS 0.73943 0.34554 2.140 0.032560 *

transmissionCVT 6.83932 0.62652 10.916 < 2e-16 ***

transmissionM 1.08359 0.31706 3.418 0.000652 ***

transmissionSA 0.63231 0.22435 2.818 0.004903 **

transmissionSCV 2.73768 0.40176 6.814 1.48e-11 ***

num_gears 0.21496 0.07389 2.909 0.003691 **

air_aspired_methodOther -2.70781 1.99491 -1.357 0.174916

air_aspired_methodSupercharged -1.62171 0.42210 -3.842 0.000128 ***

air_aspired_methodTurbocharged -1.79047 0.22084 -8.107 1.24e-15 ***

air_aspired_methodTurbocharged+Supercharged -1.68028 1.04031 -1.615 0.106532

regen_brakeElectrical Regen Brake 12.59523 0.90030 13.990 < 2e-16 ***

regen_brakeHydraulic Regen Brake 6.69040 1.94379 3.442 0.000597 ***

batt_capacity_ah -0.47689 0.11838 -4.028 5.96e-05 ***

drive2-Wheel Drive, Rear -2.54806 0.24756 -10.293 < 2e-16 ***

drive4-Wheel Drive -3.14862 0.29649 -10.620 < 2e-16 ***

driveAll Wheel Drive -3.12875 0.22300 -14.030 < 2e-16 ***

drivePart-time 4-Wheel Drive -3.94765 0.46909 -8.415 < 2e-16 ***

fuel_typeGasoline (Mid Grade Unleaded Recommended) -5.54594 0.97450 -5.691 1.58e-08 ***

fuel_typeGasoline (Premium Unleaded Recommended) -5.44412 0.70009 -7.776 1.57e-14 ***

fuel_typeGasoline (Premium Unleaded Required) -6.01955 0.70542 -8.533 < 2e-16 ***

fuel_typeGasoline (Regular Unleaded Recommended) -6.43743 0.68767 -9.361 < 2e-16 ***

cyl_deactivateY 0.52100 0.27109 1.922 0.054851 .

variable_valveY 2.00533 0.59508 3.370 0.000775 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

standard error: 2.608 on 1227 degrees of freedom

Multiple R-squared: 0.8104, Adjusted R-squared: 0.8066

F-statistic: 209.8 on 25 and 1227 DF, p-value: < 2.2e-16

The fitted MSE is 6.8 and predicted MSE of 6.83. Some of the below residuals are too large. The extreme large residual is a Hyundai Ioniq which none of the models predict very well as it is unique vehicle (versus the other data points).

Let’s try a decision tree regression model.

`#regression tree full`

m_reg_tree_full <- rpart(formula = fuel_economy_combined ~ .,

data = train,

method = "anova",)

#regression tree tuned

m_reg_tree_trimmed <- rpart(

formula = fuel_economy_combined ~ .,

data = train,

method = "anova",

control = list(minsplit = 10, cp = .0005)

)

#rpart.plot(m_reg_tree_full)

plotcp(m_reg_tree_full)

pred_decision_tree_full <- predict(m_reg_tree_full, newdata = test)

mse_tree_full <- RMSE(pred = pred_decision_tree_full, obs = test$fuel_economy_combined) ^2

pred_decision_tree_trimmed <- predict(m_reg_tree_trimmed, newdata = test)

mse_tree_trimmed <- RMSE(pred = pred_decision_tree_trimmed, obs = test$fuel_economy_combined) ^2

plotcp(m_reg_tree_trimmed)

After tuning the decision tree the predicted MSE is 6.20 which is better than the regression model.

Finally let’s try a random forest model. The random forest should produce the best model as it will attempt to remove some of the correlation within the decision tree structure.

`#random forest`

m_random_forest_full <-randomForest(formula = fuel_economy_combined ~ ., data = train)

predict_random_forest_full <- predict(m_random_forest_full, newdata = test)

mse_random_forest_full <- RMSE(pred = predict_random_forest_full, obs = test$fuel_economy_combined) ^ 2

which.min(m_random_forest_full$mse)

#random forest tuned

m_random_forest <- randomForest(formula = fuel_economy_combined ~ ., data = train, ntree = 250)

plot(m_random_forest)

predict_random_forest <- predict(m_random_forest, newdata = test)

mse_random_forest <- RMSE(pred = predict_random_forest, obs = test$fuel_economy_combined) ^ 2

plot(tmp$test.fuel_economy_combined - tmp$r.predict_random_forrest., ylab = "residuals",main = "Random Forest")

varImpPlot(m_random_forest)

The error stabilizes at 250 trees. randomForest() by default uses 500 trees which is unnecessary.

After tuning the random forest the model has the lowest fitted and predicted MSE of 3.67 which is substantially better than the MSE of the decision tree 6.2

The random forest also has an r-squared of .9

Engine size, number of cylinders, and transmission type are the largest contributors to accuracy.

**leave a comment**for the author, please follow the link and comment on their blog:

**Data Science, Data Mining and Predictive Analytics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...