R: SVM to Predict MPG for 2019 Vehicles

[This article was first published on Data Science, Machine Learning and Predictive Analytics, and kindly contributed to R-bloggers.]

Continuing from the posts below, I am going to use a support vector machine (SVM) to predict combined miles per gallon (MPG) for all 2019 motor vehicles.

Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles

Part 2: Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles

The raw data is available on the EPA government site.

The variables/features I am using for the models are: engine displacement (size), number of cylinders, transmission type, number of gears, air aspiration method, regenerative braking type, battery capacity (Ah), drivetrain, fuel type, cylinder deactivation, and variable valve.

The dataset contains 1,253 vehicles (pure electric vehicles are excluded) and is summarized below.

     fuel_economy_combined    eng_disp        num_cyl       transmission
     Min.   :11.00         Min.   :1.000   Min.   : 3.000   A  :301
     1st Qu.:19.00         1st Qu.:2.000   1st Qu.: 4.000   AM : 46
     Median :23.00         Median :3.000   Median : 6.000   AMS: 87
     Mean   :23.32         Mean   :3.063   Mean   : 5.533   CVT: 50
     3rd Qu.:26.00         3rd Qu.:3.600   3rd Qu.: 6.000   M  :148
     Max.   :58.00         Max.   :8.000   Max.   :16.000   SA :555
                                                            SCV: 66

       num_gears                      air_aspired_method
     Min.   : 1.000   Naturally Aspirated      :523
     1st Qu.: 6.000   Other                    :  5
     Median : 7.000   Supercharged             : 55
     Mean   : 7.111   Turbocharged             :663
     3rd Qu.: 8.000   Turbocharged+Supercharged:  7
     Max.   :10.000

                     regen_brake   batt_capacity_ah
                 No        :1194   Min.   : 0.0000
     Electrical Regen Brake:  57   1st Qu.: 0.0000
     Hydraulic Regen Brake :   2   Median : 0.0000
                                   Mean   : 0.3618
                                   3rd Qu.: 0.0000
                                   Max.   :20.0000

                         drive    cyl_deactivate
     2-Wheel Drive, Front   :345  Y: 172
     2-Wheel Drive, Rear    :345  N:1081
     4-Wheel Drive          :174
     All Wheel Drive        :349
     Part-time 4-Wheel Drive: 40

                                          fuel_type
     Diesel, ultra low sulfur (15 ppm, maximum): 28
     Gasoline (Mid Grade Unleaded Recommended) : 16
     Gasoline (Premium Unleaded Recommended)   :298
     Gasoline (Premium Unleaded Required)      :320
     Gasoline (Regular Unleaded Recommended)   :591

     variable_valve
     N:  38
     Y:1215
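
The `train` and `test` data frames used in the rest of the post are not constructed in the text. As a minimal sketch, assuming a simple random 75/25 split (an assumption, not necessarily the split actually used), a 1,253-row data frame yields a 314-row test set, which matches the denominator used in the accuracy check later in the post. The data frame `vehicles` is a hypothetical stand-in filled with random numbers.

```r
# Hypothetical stand-in for the 1,253-vehicle data set (random values only).
set.seed(123)
n <- 1253
vehicles <- data.frame(
  fuel_economy_combined = round(runif(n, 11, 58)),
  eng_disp              = round(runif(n, 1, 8), 1)
)

# Assumed 75/25 random split; the original post does not show this step.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- vehicles[train_idx, ]
test  <- vehicles[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

With this split the training set has 939 rows and the test set has 314, and no vehicle appears in both.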

Starting with an untuned base model:
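    library(e1071)  # svm()
    library(caret)  # postResample()

    set.seed(123)
    # Note: as in the original run, the model is fit on `test`, so the
    # statistics below are in-sample; use data = train for a held-out estimate.
    m_svm_untuned <- svm(formula = fuel_economy_combined ~ .,
                         data    = test)

    pred_svm_untuned <- predict(m_svm_untuned, test)

    yhat <- pred_svm_untuned
    y <- test$fuel_economy_combined
    svm_stats_untuned <- postResample(yhat, y)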

    svm_stats_untuned
         RMSE  Rsquared       MAE
    2.3296249 0.8324886 1.4964907
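
For reference, the three numbers returned by `postResample()` can be reproduced in base R. The sketch below assumes caret's definitions for numeric outcomes: RMSE and MAE of the prediction errors, and R-squared as the squared Pearson correlation between predictions and observations. The helper name `regression_stats` and the toy vectors are hypothetical.

```r
# Hypothetical helper mirroring what caret::postResample() reports
# for numeric outcomes (assumed: Rsquared = squared Pearson correlation).
regression_stats <- function(pred, obs) {
  err <- pred - obs
  c(
    RMSE     = sqrt(mean(err^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(err))
  )
}

# Toy values for illustration only (not from the EPA data).
obs  <- c(20, 25, 30, 35)
pred <- c(21, 24, 31, 33)
stats <- regression_stats(pred, obs)
stats
```

Because RMSE squares the errors before averaging, it is pulled up by a few large misses more than MAE is, which is why the two are worth reading together.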

These results are similar to those of the untuned gradient boosted model. Next, I am going to run a grid search to tune the support vector machine.

    hyper_grid <- expand.grid(
      cost  = 2^seq(-5, 5, 1),
      gamma = 2^seq(-5, 5, 1)
    )
    e <- numeric(nrow(hyper_grid))  # test-set RMSE for each combination

    for (j in seq_len(nrow(hyper_grid))) {
      set.seed(123)
      m_svm_grid <- svm(
        formula = fuel_economy_combined ~ .,
        data    = train,
        gamma   = hyper_grid$gamma[j],
        cost    = hyper_grid$cost[j]
      )

      pred_svm_grid <- predict(
        m_svm_grid,
        newdata = test
      )

      yhat <- pred_svm_grid
      y <- test$fuel_economy_combined
      e[j] <- postResample(yhat, y)[1]  # first element is RMSE
      cat(j, "\n")
    }

    which.min(e)                # index of the minimum RMSE
    hyper_grid[which.min(e), ]  # corresponding cost and gamma
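
The loop above scores every combination on `test`, so the winning combination is selected using the same data that later measures its accuracy. One alternative is to pick `cost` and `gamma` by cross-validation on the training data only. The following is a minimal sketch using `tune.svm()` from e1071, with the built-in `mtcars` data as a stand-in for the vehicle data and a coarser grid to keep it quick. The 5-fold setting and the choice of predictors are assumptions, not the author's procedure.

```r
library(e1071)

set.seed(123)
# Stand-in data: mtcars ships with R; mpg plays the role of
# fuel_economy_combined. Assumed here: 5-fold cross-validation.
tuned <- tune.svm(
  mpg ~ disp + cyl + hp + wt,
  data        = mtcars,
  gamma       = 2^seq(-5, 5, 2),
  cost        = 2^seq(-5, 5, 2),
  tunecontrol = tune.control(sampling = "cross", cross = 5)
)

tuned$best.parameters   # gamma and cost with the lowest cross-validated MSE
tuned$best.performance  # that cross-validated MSE
```

The selected pair can then be refit on the full training set and scored once on the held-out test set.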

The best tuned support vector machine has a cost of 32 and a gamma of 0.25.
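
To connect the index returned by `which.min(e)` to these values: `expand.grid()` varies its first argument fastest, so each row of `hyper_grid` is one (cost, gamma) pair out of 11 x 11 = 121. The base R sketch below rebuilds the grid and locates the reported combination. The name `best_row` is introduced here for illustration.

```r
hyper_grid <- expand.grid(
  cost  = 2^seq(-5, 5, 1),
  gamma = 2^seq(-5, 5, 1)
)

nrow(hyper_grid)  # number of (cost, gamma) combinations

# Row holding the reported best combination (powers of two compare exactly).
best_row <- which(hyper_grid$cost == 32 & hyper_grid$gamma == 0.25)
hyper_grid[best_row, ]

# Is the best cost on the upper edge of the search range?
max(hyper_grid$cost) == 32
```

Since 32 = 2^5 is the largest cost in the grid, the optimum sits on the boundary of the search range, so a wider cost range might be worth checking.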

I am going to run this combination:

    set.seed(123)
    # Note: the original run fits on `test`, so the statistics that follow
    # are in-sample.
    m_svm_tuned <- svm(
      formula = fuel_economy_combined ~ .,
      data    = test,
      gamma   = 0.25,
      cost    = 32,
      scale   = TRUE
    )

    pred_svm_tuned <- predict(m_svm_tuned, test)

    yhat <- pred_svm_tuned
    y <- test$fuel_economy_combined
    svm_stats <- postResample(yhat, y)

    svm_stats
         RMSE  Rsquared       MAE
    0.9331948 0.9712492 0.7133039

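The tuned support vector machine substantially outperforms the gradient boosted model, with an MSE of 0.87 versus 3.25 for the gradient boosted model and 3.67 for the random forest. Note, however, that the tuned SVM above was fit and scored on the same test data frame, so its MSE is an in-sample figure.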
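
`postResample()` reports RMSE rather than MSE, so the MSE quoted for the SVM is the square of its RMSE. A quick check of the arithmetic, using only numbers from this post (the variable names are introduced here for illustration):

```r
rmse_tuned   <- 0.9331948  # tuned SVM, from svm_stats
rmse_untuned <- 2.3296249  # untuned SVM, from svm_stats_untuned

mse_tuned   <- rmse_tuned^2
mse_untuned <- rmse_untuned^2
round(c(tuned = mse_tuned, untuned = mse_untuned), 2)

# The quoted MSEs for the other models, put on the RMSE scale for comparison.
round(sqrt(c(gbm = 3.25, random_forest = 3.67)), 2)
```

This gives an MSE of about 0.87 for the tuned SVM and about 5.43 for the untuned one. On the RMSE scale, the gradient boosted model and random forest come out at roughly 1.80 and 1.92 MPG.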

    summary(m_svm_tuned)

    Call:
    svm(formula = fuel_economy_combined ~ ., data = test, gamma = 0.25, cost = 32, scale = TRUE)


    Parameters:
       SVM-Type:  eps-regression
     SVM-Kernel:  radial
           cost:  32
          gamma:  0.25
        epsilon:  0.1


    Number of Support Vectors:  232
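
The summary reports an eps-regression SVM with a radial kernel. As a sketch of what `gamma` and `epsilon` control, assuming the radial basis form documented for e1071, exp(-gamma * |u - v|^2), and the usual epsilon-insensitive loss, the two pieces can be written in base R. The function names and input vectors are hypothetical.

```r
# Radial basis kernel: similarity decays with squared distance, scaled by gamma.
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Epsilon-insensitive loss: errors no larger than epsilon cost nothing.
eps_loss <- function(residual, epsilon = 0.1) {
  pmax(abs(residual) - epsilon, 0)
}

u <- c(0, 0)
v <- c(1, 1)
rbf_kernel(u, v, gamma = 0.25)  # exp(-0.5)
rbf_kernel(u, v, gamma = 4)     # larger gamma: similarity falls off faster

eps_loss(c(-0.05, 0.1, 0.5))    # only the last error is penalized
```

A larger `gamma` makes each training point's influence more local, while a larger `cost` penalizes errors outside the epsilon tube more heavily. Both push the model toward fitting the training data more closely.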

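    # res holds the tuned model's residuals on the test set; its definition is
    # not shown in the original post (presumably y - yhat, with 314 test rows).
    sum(abs(res) <= 1) / 314
    [1] 0.8503185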

The model predicts 85% of vehicles to within 1 MPG of the EPA estimate. Considering that I am not rounding the predictions, this is a great result.
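
The share within 1 MPG is the mean of a logical vector, which avoids hard-coding the test set size. A minimal sketch with a hypothetical helper `share_within` and toy values follows. The last line checks that the reported 0.8503185 corresponds to 267 of 314 vehicles.

```r
# Hypothetical helper: fraction of predictions within `tol` of the observed value.
share_within <- function(obs, pred, tol = 1) {
  mean(abs(obs - pred) <= tol)
}

# Toy values for illustration only (not from the EPA data).
obs  <- c(20, 25, 30, 35)
pred <- c(20.4, 26.5, 29.2, 35.9)
share_within(obs, pred)  # 3 of the 4 toy predictions are within 1

# Consistency check on the figure reported in the post.
267 / 314
```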

The model also does a much better job with outliers; none of the earlier models predicted the Hyundai Ioniq well.

    # Which cars have residuals larger than 3 times the RMSE?
    # (tmp is not defined in the original post; presumably the test rows
    # with the predictions attached.)
    tmp[which(abs(res) > svm_stats[1] * 3), ]
                     Division        Carline fuel_economy_combined pred_svm_tuned
    641 HYUNDAI MOTOR COMPANY          Ioniq                    55       49.01012
    568                TOYOTA      CAMRY XSE                    26       22.53976
    692            Volkswagen Arteon 4Motion                    23       26.45806
    984            Volkswagen          Atlas                    19       22.23552
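
The filter keeps rows whose absolute residual exceeds three times the RMSE (3 x 0.9331948, about 2.80 MPG). The sketch below re-checks the four listed vehicles using the values printed above. The data frame `flagged` and the sign convention for the residual are assumptions made for illustration.

```r
flagged <- data.frame(
  carline = c("Ioniq", "CAMRY XSE", "Arteon 4Motion", "Atlas"),
  actual  = c(55, 26, 23, 19),
  pred    = c(49.01012, 22.53976, 26.45806, 22.23552)
)

threshold <- 3 * 0.9331948                    # three times the tuned model's RMSE
flagged$res <- flagged$actual - flagged$pred  # sign convention assumed
flagged$exceeds <- abs(flagged$res) > threshold
flagged
```

All four exceed the threshold. The Ioniq is under-predicted by about 6 MPG and the Camry XSE by about 3.5 MPG, while the two Volkswagens are over-predicted by a little over 3 MPG.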
