Does money buy happiness after all? Machine Learning with One Rule
This week, I am exploring Holger K. von Jouanne-Diedrich’s OneR package for machine learning. I am running an example analysis on world happiness data and comparing the results with other machine learning models (decision trees, random forest, gradient boosting trees and neural nets).
Conclusions
All in all, based on this example, I would confirm that OneR does indeed produce models that are accurate enough to set a good baseline. OneR was definitely faster than random forest, gradient boosting and neural nets. That comparison is not entirely fair, however: the latter are more complex models and their training included cross-validation.
If you prefer a very simple model that is easy to understand, OneR is a very good way to go. You could also use it as a starting point for developing more complex models with improved accuracy.
When looking at feature importance across models, the feature OneR chose – Economy/GDP per capita – was confirmed by random forest, gradient boosting trees and neural networks as being the most important feature. This is in itself an interesting conclusion! Of course, this correlation does not tell us that there is a direct causal relationship between money and happiness, but we can say that a country’s economy is the best individual predictor for how happy people tend to be.
OneR
OneR has been developed for the purpose of creating machine learning models that are easy to interpret and understand, while still being as accurate as possible. It is based on the one rule classification algorithm from Holte (1993), which is basically a decision tree cut at the first level.
While the original algorithm has difficulties in handling missing values and numeric data, the package provides enhanced functionality to handle those cases better, e.g. by introducing a separate class for NA values and the optbin() function to find optimal splitting points for each feature. The main function of the package is OneR, which finds an optimal split for each feature and uses only the single feature with the highest training accuracy for classification.
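To make the one-rule idea concrete, here is a minimal sketch in base R. It is not the code of the OneR package: the function `one_rule()` and the toy data are hypothetical, and the sketch assumes that all predictors are already categorical. For every feature it maps each level to its majority class, scores that rule on the training data, and keeps the feature with the highest accuracy.

```r
# Minimal one-rule (1R) sketch in base R; this is NOT the OneR package's code.
# Assumes all predictors are already categorical (factors or characters).
one_rule <- function(data, target) {
  features <- setdiff(names(data), target)
  # For each feature: map every level to its majority class, then score it.
  candidates <- lapply(features, function(f) {
    counts <- table(data[[f]], data[[target]])
    rule <- colnames(counts)[apply(counts, 1, which.max)]
    names(rule) <- rownames(counts)
    predicted <- rule[as.character(data[[f]])]
    list(feature = f, rule = rule,
         accuracy = mean(predicted == as.character(data[[target]])))
  })
  # Keep the single feature with the highest training accuracy.
  accuracies <- vapply(candidates, function(m) m$accuracy, numeric(1))
  candidates[[which.max(accuracies)]]
}

# Hypothetical toy data, for illustration only.
toy <- data.frame(
  gdp     = c("low", "low", "high", "high", "high", "low"),
  climate = c("warm", "cold", "warm", "cold", "warm", "cold"),
  happy   = c("no", "no", "yes", "yes", "yes", "yes"),
  stringsAsFactors = TRUE
)
model <- one_rule(toy, "happy")
model$feature    # feature chosen by the rule
model$rule       # level -> predicted class
model$accuracy   # training accuracy of that rule
```

In this toy example the rule on `gdp` classifies 5 of 6 rows correctly, while the rule on `climate` only reaches 4 of 6, so `gdp` is chosen.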
I installed the latest stable version of the OneR package from CRAN.
```r
library(OneR)
```
The dataset
I am using the World Happiness Report 2016 from Kaggle.
```r
library(tidyverse)
data_16 <- read.table("world-happiness/2016.csv", sep = ",", header = TRUE)
data_15 <- read.table("world-happiness/2015.csv", sep = ",", header = TRUE)
```
The 2016 data give upper and lower confidence intervals for the happiness score, while the 2015 data give standard errors. Because I want to combine data from the two years, I am using only columns that are in both datasets.
```r
common_feats <- colnames(data_16)[which(colnames(data_16) %in% colnames(data_15))]
# features and response variable for modeling
feats <- setdiff(common_feats, c("Country", "Happiness.Rank", "Happiness.Score"))
response <- "Happiness.Score"
# combine data from 2015 and 2016
data_15_16 <- rbind(select(data_15, one_of(c(feats, response))),
                    select(data_16, one_of(c(feats, response))))
```
The response variable happiness score is on a numeric scale. OneR could also perform regression, but here I want to compare classification tasks. For classifying happiness, I create three bins for low, medium and high values of the happiness score. To avoid having to deal with unbalanced data, I am using the bin() function from OneR with method = "content". For plotting the cut-points, I am extracting the numbers from the default level names.
```r
data_15_16$Happiness.Score.l <- bin(data_15_16$Happiness.Score, nbins = 3, method = "content")
intervals <- paste(levels(data_15_16$Happiness.Score.l), collapse = " ")
intervals <- gsub("\\(|]", "", intervals)
intervals <- gsub(",", " ", intervals)
intervals <- as.numeric(unique(strsplit(intervals, " ")[[1]]))

data_15_16 %>%
  ggplot() +
    geom_density(aes(x = Happiness.Score), color = "blue", fill = "blue", alpha = 0.4) +
    geom_vline(xintercept = intervals[2]) +
    geom_vline(xintercept = intervals[3])
```
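The idea behind binning by content (equal numbers of observations per bin) can be sketched with base R quantiles. This is only an illustration of the principle: the `scores` vector is randomly generated stand-in data, and OneR's bin() may pick its cut-points differently.

```r
# Equal-frequency ("content") binning sketched with base R quantiles.
# Illustrates the idea only; OneR's bin() may choose cut points differently.
set.seed(1)
scores <- runif(300, min = 2.8, max = 7.6)  # hypothetical stand-in for the happiness score

# Three bins need four break points: minimum, two inner quantiles, maximum.
breaks <- quantile(scores, probs = seq(0, 1, length.out = 4))
binned <- cut(scores, breaks = breaks, include.lowest = TRUE)

table(binned)   # each of the three bins holds about a third of the values
```

Because the breaks follow the data's own quantiles, the classes come out balanced even when the distribution of the score is skewed.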
Now I am removing the original happiness score column from the data for modeling and renaming the factor levels of the response variable.
```r
data_15_16 <- select(data_15_16, -Happiness.Score) %>%
  mutate(Happiness.Score.l = plyr::revalue(Happiness.Score.l, c("(2.83,4.79]" = "low", "(4.79,5.89]" = "medium", "(5.89,7.59]" = "high")))
```
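Renaming by the exact interval labels breaks as soon as the cut-points change, e.g. with another year of data. A possible alternative is to rename the levels by position. The sketch below uses base R's cut() on made-up values and assumes that the levels are stored in ascending order; whether that also holds for the factor returned by bin() should be checked with levels() first.

```r
# Rename interval levels by position rather than by their exact labels.
# Assumes the levels are in ascending order, as they are for cut().
scores <- c(3.1, 4.5, 5.2, 6.0, 7.3, 5.7)            # hypothetical values
score_bins <- cut(scores, breaks = c(2.8, 4.8, 5.9, 7.6))
levels(score_bins)                                    # interval labels before renaming
levels(score_bins) <- c("low", "medium", "high")
as.character(score_bins)
```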
Because there are only 9 features in this small dataset, I want to explore them all individually before modeling. First, I am plotting the only categorical variable: Region.
This plot shows that a few regions have very strong biases in happiness: people in Western Europe, Australia, New Zealand, North America, Latin America and the Caribbean tend to be in the high happiness group, while people in Sub-Saharan Africa and Southern Asia tend to be the least happy.
```r
data_15_16 %>%
  ggplot(aes(x = Region, fill = Happiness.Score.l)) +
    geom_bar(position = "dodge", alpha = 0.7) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0, 0, 0, 1.5), "cm")) +
    scale_fill_brewer(palette = "Set1")
```
The remaining quantitative variables show happiness biases to varying degrees. For example, low health and life expectancy is strongly biased towards low happiness; economic factors, family and freedom show a bias in the same direction, albeit not as strong.
```r
data_15_16 %>%
  gather(x, y, Economy..GDP.per.Capita.:Dystopia.Residual) %>%
  ggplot(aes(x = y, fill = Happiness.Score.l)) +
    geom_histogram(alpha = 0.7) +
    facet_wrap(~ x, scales = "free", ncol = 4) +
    scale_fill_brewer(palette = "Set1")
```
While OneR could also handle categorical data, in this example, I only want to consider the quantitative features to show the differences between OneR and other machine learning algorithms.
```r
data_15_16 <- select(data_15_16, -Region)
```
Modeling
The algorithms I will compare to OneR will be run via the caret package.
```r
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
library(caret)
```
I will also use caret’s createDataPartition() function to partition the data into training (70%) and test sets (30%).
```r
set.seed(42)
index <- createDataPartition(data_15_16$Happiness.Score.l, p = 0.7, list = FALSE)
train_data <- data_15_16[index, ]
test_data <- data_15_16[-index, ]
```
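The point of partitioning on the response variable is that both sets keep the same class balance. As a rough base R sketch of such a stratified split (not caret's implementation; the `labels` vector is hypothetical, and createDataPartition() has its own sampling and rounding rules):

```r
# Stratified 70/30 split sketched in base R.
# Illustrative only; createDataPartition() has its own sampling rules.
set.seed(42)
labels <- factor(rep(c("low", "medium", "high"), each = 100))  # hypothetical classes

# Sample 70% of the row indices within each class, then combine them.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

table(labels[train_idx])    # class balance in the training set
table(labels[-train_idx])   # class balance in the test set
```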
OneR
OneR only accepts categorical features. Because we have numerical features, we need to convert them to factors by splitting them into appropriate bins. While the original OneR algorithm splits the values into ever smaller factors, this has been changed in this R implementation with the aim of preventing overfitting. We can either split the data into a pre-defined number of buckets (by length, content or cluster) or we can use the optbin() function to obtain the optimal number of factors from pairwise logistic regression or information gain.
```r
# default method length
data_1 <- bin(train_data, nbins = 5, method = "length")
# method content
data_2 <- bin(train_data, nbins = 5, method = "content")
# method cluster
data_3 <- bin(train_data, nbins = 3, method = "cluster")
# optimal bin number logistic regression
data_4 <- optbin(formula = Happiness.Score.l ~ ., data = train_data, method = "logreg")
# optimal bin number information gain
data_5 <- optbin(formula = Happiness.Score.l ~ ., data = train_data, method = "infogain")
```
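For readers unfamiliar with information gain: it measures how much the uncertainty (entropy) about the class label drops once the binned feature is known. The following is a generic textbook computation in base R, not the internals of optbin(); the helper functions `entropy()` and `info_gain()` and the toy vectors are hypothetical.

```r
# Information gain of a binned feature with respect to a class label.
# A generic textbook computation; optbin()'s internals may differ.
entropy <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

info_gain <- function(bins, labels) {
  weights <- prop.table(table(bins))            # share of rows per bin
  conditional <- tapply(labels, bins, entropy)  # entropy within each bin
  entropy(labels) - sum(weights * conditional[names(weights)])
}

# Hypothetical toy example: bin "a" is pure, bin "b" is mixed.
bins   <- c("a", "a", "a", "b", "b", "b")
labels <- c("low", "low", "low", "low", "high", "high")
info_gain(bins, labels)   # about 0.459 bits
```

A split that separates the classes better yields purer bins and therefore a higher information gain.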
This is what the data look like after discretization:
- Default method (5 bins with method = "length")
- 5 bins with method = "content"
- 3 bins with method = "cluster"
- optimal bin number according to logistic regression
- optimal bin number according to information gain
Model building
Now I am running the OneR models. During model building, the chosen attribute/feature with the highest accuracy is printed, along with the accuracies of the top 7 features and the decision rules. Unfortunately, this information is not saved in the model object; this would have been nice in order to compare the importance of features across models later on.
Here, all five models achieved highest prediction accuracy with the feature Economy GDP per capita.
```r
for (i in 1:5) {
  data <- get(paste0("data_", i))
  print(model <- OneR(formula = Happiness.Score.l ~ ., data = data, verbose = TRUE))
  assign(paste0("model_", i), model)
}
```
##
## Attribute Accuracy
## 1 * Economy..GDP.per.Capita. 63.96%
## 2 Health..Life.Expectancy. 59.91%
## 3 Family 57.21%
## 4 Dystopia.Residual 51.8%
## 5 Freedom 49.55%
## 6 Trust..Government.Corruption. 45.5%
## 7 Generosity 41.89%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
##
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
##
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.365] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.365,0.73] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.73,1.09] then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.09,1.46] then Happiness.Score.l = high
## If Economy..GDP.per.Capita. = (1.46,1.83] then Happiness.Score.l = high
##
## Accuracy:
## 142 of 222 instances classified correctly (63.96%)
##
##
## Attribute Accuracy
## 1 * Economy..GDP.per.Capita. 64.41%
## 2 Health..Life.Expectancy. 60.81%
## 3 Family 59.91%
## 4 Trust..Government.Corruption. 55.41%
## 5 Freedom 53.15%
## 5 Dystopia.Residual 53.15%
## 7 Generosity 41.44%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
##
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
##
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.548] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.548,0.877] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.877,1.06] then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.06,1.28] then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.28,1.83] then Happiness.Score.l = high
##
## Accuracy:
## 143 of 222 instances classified correctly (64.41%)
##
##
## Attribute Accuracy
## 1 * Economy..GDP.per.Capita. 63.51%
## 2 Health..Life.Expectancy. 62.16%
## 3 Family 54.5%
## 4 Freedom 50.45%
## 4 Dystopia.Residual 50.45%
## 6 Trust..Government.Corruption. 43.24%
## 7 Generosity 36.49%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
##
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
##
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.602] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.602,1.1] then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.1,1.83] then Happiness.Score.l = high
##
## Accuracy:
## 141 of 222 instances classified correctly (63.51%)
##
##
## Attribute Accuracy
## 1 * Economy..GDP.per.Capita. 63.96%
## 2 Health..Life.Expectancy. 62.16%
## 3 Family 58.56%
## 4 Freedom 51.35%
## 5 Dystopia.Residual 50.9%
## 6 Trust..Government.Corruption. 46.4%
## 7 Generosity 40.09%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
##
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
##
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.754] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.754,1.12] then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.12,1.83] then Happiness.Score.l = high
##
## Accuracy:
## 142 of 222 instances classified correctly (63.96%)
##
##
## Attribute Accuracy
## 1 * Economy..GDP.per.Capita. 67.12%
## 2 Health..Life.Expectancy. 65.77%
## 3 Family 61.71%
## 4 Trust..Government.Corruption. 56.31%
## 5 Dystopia.Residual 55.41%
## 6 Freedom 50.9%
## 7 Generosity 43.69%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
##
##
## Call:
## OneR(data = data, formula = Happiness.Score.l ~ ., verbose = TRUE)
##
## Rules:
## If Economy..GDP.per.Capita. = (-0.00182,0.68] then Happiness.Score.l = low
## If Economy..GDP.per.Capita. = (0.68,1.24] then Happiness.Score.l = medium
## If Economy..GDP.per.Capita. = (1.24,1.83] then Happiness.Score.l = high
##
## Accuracy:
## 149 of 222 instances classified correctly (67.12%)
Model evaluation
The function eval_model() prints confusion matrices for absolute and relative predictions, as well as accuracy, error and error rate reduction. For comparison with other models, it would have been convenient to be able to extract these performance metrics directly from the eval_model object. Instead, it only provides the confusion matrix and the numbers of correct and all instances, so the performance metrics have to be re-calculated manually.
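Re-calculating the metrics from a confusion matrix takes only a few lines of base R. The sketch below types in the counts of the first confusion matrix from the output in this section. The formula for the error rate reduction (relative improvement over always predicting the most frequent class) is my assumption about how it is defined; it does reproduce the printed value of 0.4516.

```r
# Recompute accuracy, error rate and error rate reduction from a confusion matrix.
# The counts are those of the first confusion matrix in the output of this section.
conf <- matrix(c(23,  0, 11,
                  1, 26, 10,
                  7,  5, 10),
               nrow = 3, byrow = TRUE,
               dimnames = list(Prediction = c("high", "low", "medium"),
                               Actual     = c("high", "low", "medium")))

accuracy   <- sum(diag(conf)) / sum(conf)
error_rate <- 1 - accuracy
# Base rate: always predict the most frequent actual class (assumed definition).
base_error <- 1 - max(colSums(conf)) / sum(conf)
error_rate_reduction <- (base_error - error_rate) / base_error

round(c(accuracy = accuracy, error_rate = error_rate,
        error_rate_reduction = error_rate_reduction), 4)
```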
for (i in 1:5) {
  model <- get(paste0("model_", i))
  eval_model(predict(model, test_data), test_data$Happiness.Score.l)
}
##
## Confusion matrix (absolute):
## Actual
## Prediction high low medium Sum
## high 23 0 11 34
## low 1 26 10 37
## medium 7 5 10 22
## Sum 31 31 31 93
##
## Confusion matrix (relative):
## Actual
## Prediction high low medium Sum
## high 0.25 0.00 0.12 0.37
## low 0.01 0.28 0.11 0.40
## medium 0.08 0.05 0.11 0.24
## Sum 0.33 0.33 0.33 1.00
##
## Accuracy:
## 0.6344 (59/93)
##
## Error rate:
## 0.3656 (34/93)
##
## Error rate reduction (vs. base rate):
## 0.4516 (p-value = 2.855e-09)
##
##
## Confusion matrix (absolute):
## Actual
## Prediction high low medium Sum
## high 19 0 1 20
## low 3 28 14 45
## medium 9 3 16 28
## Sum 31 31 31 93
##
## Confusion matrix (relative):
## Actual
## Prediction high low medium Sum
## high 0.20 0.00 0.01 0.22
## low 0.03 0.30 0.15 0.48
## medium 0.10 0.03 0.17 0.30
## Sum 0.33 0.33 0.33 1.00
##
## Accuracy:
## 0.6774 (63/93)
##
## Error rate:
## 0.3226 (30/93)
##
## Error rate reduction (vs. base rate):
## 0.5161 (p-value = 1.303e-11)
##
##
## Confusion matrix (absolute):
## Actual
## Prediction high low medium Sum
## high 23 0 11 34
## low 0 25 7 32
## medium 8 6 13 27
## Sum 31 31 31 93
##
## Confusion matrix (relative):
## Actual
## Prediction high low medium Sum
## high 0.25 0.00 0.12 0.37
## low 0.00 0.27 0.08 0.34
## medium 0.09 0.06 0.14 0.29
## Sum 0.33 0.33 0.33 1.00
##
## Accuracy:
## 0.6559 (61/93)
##
## Error rate:
## 0.3441 (32/93)
##
## Error rate reduction (vs. base rate):
## 0.4839 (p-value = 2.116e-10)
##
##
## Confusion matrix (absolute):
## Actual
## Prediction high low medium Sum
## high 23 0 11 34
## low 2 26 11 39
## medium 6 5 9 20
## Sum 31 31 31 93
##
## Confusion matrix (relative):
## Actual
## Prediction high low medium Sum
## high 0.25 0.00 0.12 0.37
## low 0.02 0.28 0.12 0.42
## medium 0.06 0.05 0.10 0.22
## Sum 0.33 0.33 0.33 1.00
##
## Accuracy:
## 0.6237 (58/93)
##
## Error rate:
## 0.3763 (35/93)
##
## Error rate reduction (vs. base rate):
## 0.4355 (p-value = 9.799e-09)
##
##
## Confusion matrix (absolute):
## Actual
## Prediction high low medium Sum
## high 21 0 3 24
## low 0 26 8 34
## medium 10 5 20 35
## Sum 31 31 31 93
##
## Confusion matrix (relative):
## Actual
## Prediction high low medium Sum
## high 0.23 0.00 0.03 0.26
## low 0.00 0.28 0.09 0.37
## medium 0.11 0.05 0.22 0.38
## Sum 0.33 0.33 0.33 1.00
##
## Accuracy:
## 0.7204 (67/93)
##
## Error rate:
## 0.2796 (26/93)
##
## Error rate reduction (vs. base rate):
## 0.5806 (p-value = 2.761e-14)
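Re-calculating these metrics by hand is straightforward once the confusion matrix is available. The sketch below uses the first confusion matrix printed above and assumes that error rate reduction is measured against always predicting the most frequent actual class; under that assumption it reproduces the printed values 0.6344, 0.3656 and 0.4516.

```r
# Confusion matrix of the first model, copied from the output above
# (rows = prediction, columns = actual)
conf <- matrix(c(23,  0, 11,
                  1, 26, 10,
                  7,  5, 10),
               nrow = 3, byrow = TRUE,
               dimnames = list(Prediction = c("high", "low", "medium"),
                               Actual     = c("high", "low", "medium")))

n          <- sum(conf)
accuracy   <- sum(diag(conf)) / n
error_rate <- 1 - accuracy

# Base rate (assumption): always predict the most frequent actual class
base_error      <- 1 - max(colSums(conf)) / n
error_reduction <- (base_error - error_rate) / base_error

round(c(accuracy = accuracy,
        error_rate = error_rate,
        error_reduction = error_reduction), 4)
```

The p-value that eval_model() reports is not reproduced here.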
Because I want to calculate performance measures for the different classes separately and have a more detailed look at the prediction probabilities I get from the models, I prefer to obtain predictions with type = "prob". While I am not looking at it here, this would also allow me to test different prediction thresholds.
for (i in 1:5) {
  model <- get(paste0("model_", i))
  pred <- data.frame(model = paste0("model_", i),
                     sample_id = 1:nrow(test_data),
                     predict(model, test_data, type = "prob"),
                     actual = test_data$Happiness.Score.l)
  pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- NA
  for (j in 1:nrow(pred)) {
    pred[j, "pred_prob"] <- max(pred[j, 3:5])
  }
  if (i == 1) {
    pred_df <- pred
  } else {
    pred_df <- rbind(pred_df, pred)
  }
}
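With a table of actual and predicted classes, per-class measures follow directly from a cross-tabulation. The following is a minimal sketch on made-up values; `toy` stands in for a table like pred_df and is not part of the original analysis.

```r
# Toy stand-in for a prediction table (made-up values)
toy <- data.frame(
  actual     = c("high", "high", "low", "low", "medium", "medium", "medium", "high"),
  prediction = c("high", "medium", "low", "low", "medium", "low", "medium", "high"),
  stringsAsFactors = FALSE
)

classes <- c("high", "low", "medium")
conf <- table(Prediction = factor(toy$prediction, levels = classes),
              Actual     = factor(toy$actual, levels = classes))

# Precision: correct predictions of a class / all predictions of that class
# Recall:    correct predictions of a class / all actual cases of that class
per_class <- data.frame(
  class     = classes,
  precision = as.numeric(diag(conf) / rowSums(conf)),
  recall    = as.numeric(diag(conf) / colSums(conf))
)
print(per_class)
```

Fixing the factor levels up front keeps the table square even if a model never predicts one of the classes.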
Comparing other algorithms
Decision trees
First, I am building a decision tree with the rpart() function from the rpart package, which we can plot with rpart.plot().
Economy GDP per capita is the second highest node here; the best predictor in this tree is health and life expectancy.
library(rpart)
library(rpart.plot)
set.seed(42)
fit <- rpart(Happiness.Score.l ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10),
             parms = list(split = "information"))
rpart.plot(fit, extra = 100)
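Which feature sits at the root of the tree can also be read from the fitted object instead of the plot. Since train_data is not reproduced here, this minimal sketch fits a tree on the built-in iris data as a stand-in; `toy_fit` and `root_var` are names chosen for illustration.

```r
library(rpart)

# Built-in iris data as a stand-in for train_data (assumption for illustration)
set.seed(42)
toy_fit <- rpart(Species ~ .,
                 data = iris,
                 method = "class",
                 parms = list(split = "information"))

# The first row of the frame describes the root node and its split variable
root_var <- as.character(toy_fit$frame$var[1])
root_var

# Importance scores stored with the fitted tree
toy_fit$variable.importance
```

The root split and the top-ranked variable do not have to agree, because the importance scores also credit surrogate splits.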
In order to compare the models, I am producing the same output table for predictions from this model and combining it with the table from the OneR models.
pred <- data.frame(model = "rpart",
                   sample_id = 1:nrow(test_data),
                   predict(fit, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
pred$pred_prob <- NA
for (j in 1:nrow(pred)) {
  pred[j, "pred_prob"] <- max(pred[j, 3:5])
}

pred_df_final <- rbind(pred_df,
                       pred)
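The same table-building steps are repeated for every model, so a small helper could wrap them. The sketch below is hypothetical: make_pred_table() is not part of the original analysis, and it uses max.col() in place of the row-wise loop. The probabilities are made up.

```r
# Hypothetical helper: builds the prediction table from class probabilities
make_pred_table <- function(model_name, probs, actual) {
  probs <- as.matrix(probs)
  # Column index of the largest probability in each row (first one on ties)
  best <- max.col(probs, ties.method = "first")
  pred <- data.frame(model = model_name,
                     sample_id = seq_len(nrow(probs)),
                     probs,
                     actual = actual,
                     stringsAsFactors = FALSE)
  pred$prediction <- colnames(probs)[best]
  pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
  pred$pred_prob <- probs[cbind(seq_len(nrow(probs)), best)]
  pred
}

# Made-up probabilities for three samples
toy_probs <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.3, 0.6,
                      0.2, 0.5, 0.3),
                    ncol = 3, byrow = TRUE,
                    dimnames = list(NULL, c("high", "low", "medium")))
toy_tab <- make_pred_table("toy", toy_probs, c("high", "medium", "medium"))
toy_tab
```

Under these assumptions, a call such as make_pred_table("rpart", predict(fit, test_data, type = "prob"), test_data$Happiness.Score.l) would yield a table with the same columns as above.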
Random Forest
Next, I am training a Random Forest model. For more details on Random Forest, check out my post “Can we predict flu deaths with Machine Learning and R?”.
set.seed(42)
model_rf <- caret::train(Happiness.Score.l ~ .,
                         data = train_data,
                         method = "rf",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 5,
                                                  verboseIter = FALSE))
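The repeatedcv setting with number = 10 and repeats = 5 means that the data is split into 10 folds five times over, which gives 50 resamples per candidate model. The sketch below only illustrates that bookkeeping with plain random folds in base R; it is not caret's own resampling code, make_repeated_folds() is a made-up name, and n = 222 is assumed from the instance count in the OneR output.

```r
# Assign each of n rows to one of k folds, repeated `repeats` times
make_repeated_folds <- function(n, k = 10, repeats = 5) {
  lapply(seq_len(repeats), function(r) {
    # Balanced fold labels 1..k, shuffled into random order
    sample(rep(seq_len(k), length.out = n))
  })
}

set.seed(42)
folds <- make_repeated_folds(n = 222, k = 10, repeats = 5)

# Each repeat yields k hold-out sets: 10 x 5 = 50 resamples in total
n_resamples <- length(folds) * length(unique(folds[[1]]))
n_resamples

# Fold sizes within the first repeat
table(folds[[1]])
```

This resampling is a large part of why the caret models take longer to train than OneR.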
The varImp() function from caret shows us which feature was of highest importance for the model and its predictions.
Here, we again find Economy GDP per capita on top.
varImp(model_rf)
## rf variable importance
##
## Overall
## Economy..GDP.per.Capita. 100.00
## Dystopia.Residual 97.89
## Health..Life.Expectancy. 77.10
## Family 47.17
## Trust..Government.Corruption. 29.89
## Freedom 19.29
## Generosity 0.00
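Note that the values above are on a relative scale: by default, varImp() rescales importance so that the top feature gets 100 and the bottom one 0. A score of 0.00 for Generosity therefore marks it as the least important feature, not necessarily as having no influence. The following minimal sketch shows this kind of min-max rescaling on made-up raw scores; the numbers are not taken from model_rf.

```r
# Hypothetical raw importance scores (made-up numbers)
raw_imp <- c(Economy = 30.2, Dystopia = 29.7, Health = 24.9, Generosity = 6.8)

# Min-max rescaling to the range 0 to 100
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100
round(scaled_imp, 2)
```

Calling varImp(model_rf, scale = FALSE) should return the unscaled scores instead.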
pred <- data.frame(model = "rf",
                   sample_id = 1:nrow(test_data),
                   predict(model_rf, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
pred$prediction <- colnames(pred)[3:5][apply(pred[, 3:5], 1, which.max)]
pred$correct <- ifelse(pred$actual == pred$prediction, "correct", "wrong")
pred$pred_prob <- NA
for (j in 1:nrow(pred)) {
  pred[j, "pred_prob"] <- max(pred[j, 3:5])
}

pred_df_final <- rbind(pred_df_final,
                       pred)
Extreme gradient boosting trees
Gradient boosting is another decision tree-based algorithm, explained in more detail in my post “Extreme Gradient Boosting and Preprocessing in Machine Learning”.
set.seed(42)
model_xgb <- caret::train(Happiness.Score.l ~ .,
                          data = train_data,
                          method = "xgbTree",
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 10,
                                                   repeats = 5,
                                                   verboseIter = FALSE))
As before, we again find Economy GDP per capita as the most important feature.
varImp(model_xgb)
## xgbTree variable importance
##
## Overall
## Economy..GDP.per.Capita. 100.00
## Health..Life.Expectancy. 67.43
## Family 46.59
## Freedom 0.00
pred <- data.frame(model = "xgb",
                   sample_id = 1:nrow(test_data),
                   predict(model_xgb, test_data, type = "prob"),
                   actual = test_data$Happiness.Score.l)
pred$prediction <- colnames...