# Ordinal data models

**Modeling with R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Introduction

This tutorial aims to explore the most popular models used to predict an ordered response variable. We will use the **heart disease** data uploaded from kaggle website, where our response will be the chest pain **cp** variable instead of the **target** variable used usually.

## Data preparation

First, we call the data and the libraries that we need along with this illustration as follows.

```
library(tidyverse)
library(caret)
library(tidymodels)
mydata<-read.csv("../heart.csv",header = TRUE)
names(mydata)[1]<-"age"
```

The data at hand has the following features:

- age.
- sex: 1=male,0=female
- cp : chest pain type.
- trestbps : resting blood pressure.
- chol: serum cholestoral.
- fbs : fasting blood sugar.
- restecg : resting electrocardiographic results.
- thalach : maximum heart rate achieved
- exang : exercise induced angina.
- oldpeak : ST depression induced by exercise relative to rest.
- slope : the slope of the peak exercise ST segment.
- ca : number of major vessels colored by flourosopy.
- thal : it is not well defined from the data source.
- target: have heart disease or not.

I think the best start to explore the summary of all predictors and missing values is by using the powerful function **skim** from **skimr** package.

`skimr::skim(mydata)`

Name | mydata |

Number of rows | 303 |

Number of columns | 14 |

_______________________ | |

Column type frequency: | |

numeric | 14 |

________________________ | |

Group variables | None |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

age | 0 | 1 | 54.37 | 9.08 | 29 | 47.5 | 55.0 | 61.0 | 77.0 | ▁▆▇▇▁ |

sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇ |

cp | 0 | 1 | 0.97 | 1.03 | 0 | 0.0 | 1.0 | 2.0 | 3.0 | ▇▃▁▅▁ |

trestbps | 0 | 1 | 131.62 | 17.54 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |

chol | 0 | 1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁ |

fbs | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |

restecg | 0 | 1 | 0.53 | 0.53 | 0 | 0.0 | 1.0 | 1.0 | 2.0 | ▇▁▇▁▁ |

thalach | 0 | 1 | 149.65 | 22.91 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |

exang | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |

oldpeak | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |

slope | 0 | 1 | 1.40 | 0.62 | 0 | 1.0 | 1.0 | 2.0 | 2.0 | ▁▁▇▁▇ |

ca | 0 | 1 | 0.73 | 1.02 | 0 | 0.0 | 0.0 | 1.0 | 4.0 | ▇▃▂▁▁ |

thal | 0 | 1 | 2.31 | 0.61 | 0 | 2.0 | 2.0 | 3.0 | 3.0 | ▁▁▁▇▆ |

target | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇ |

For our case we will use the chest pain type **cp** variable as our target variable since it is a categorical variable. However, for pedagogic purposes, we will manipulate it so that it will be an ordered factor with only three levels **no pain**,**moderate pain**, **severe pain** (instead of 4 levels now).

Looking at the above output, we convert the variables that should be of factor type, which are: **sex**, **target**, **fbs**, **resecg**, **exang**, **slope**, **ca**, **thal**. For the response variable **cp**, we drop its less frequently level with all its related rows, then we rename the remaining ones as **no** pain for the most frequently one, **severe** pain for the less frequently one, and **moderate** pain for the last one.

`table(mydata$cp)`

```
0 1 2 3
143 50 87 23
```

we see the level **3** is the less frequently one.

```
mydata<-mydata %>%
modify_at(c("cp", "sex", "target", "fbs", "resecg", "exang", "slope", "ca", "thal"),
as.factor)
mydata<-mydata[mydata$cp!=3,]
mydata$cp<-fct_drop(mydata$cp,only=levels(mydata$cp))
table(mydata$cp)
```

```
0 1 2
143 50 87
```

According to these frequencies we rename and we order the levels as follows.

```
mydata$cp<-fct_recode(mydata$cp,no="0",sev="1",mod="2")
mydata$cp<-factor(mydata$cp,ordered = TRUE)
mydata$cp<-fct_infreq(mydata$cp)
mydata$cp[1:5]
```

```
[1] mod sev sev no no
Levels: no < mod < sev
```

Similar to the logistic regression, the number of cases in each cell from each cross table between the outcome and each factor should be above the threshold of 5 applied in practice.

`xtabs(~cp+sex,data=mydata)`

```
sex
cp 0 1
no 39 104
mod 35 52
sev 18 32
```

`xtabs(~cp+target,data=mydata)`

```
target
cp 0 1
no 104 39
mod 18 69
sev 9 41
```

`xtabs(~cp+fbs,data=mydata)`

```
fbs
cp 0 1
no 125 18
mod 70 17
sev 45 5
```

`xtabs(~cp+restecg,data=mydata)`

```
restecg
cp 0 1 2
no 78 62 3
mod 36 50 1
sev 19 31 0
```

`xtabs(~cp+exang,data=mydata)`

```
exang
cp 0 1
no 63 80
mod 76 11
sev 46 4
```

`xtabs(~cp+slope,data=mydata)`

```
slope
cp 0 1 2
no 11 84 48
mod 5 33 49
sev 2 12 36
```

`xtabs(~cp+ca,data=mydata)`

```
ca
cp 0 1 2 3 4
no 65 34 29 14 1
mod 57 20 2 5 3
sev 37 8 3 1 1
```

`xtabs(~cp+thal,data=mydata)`

```
thal
cp 0 1 2 3
no 1 12 52 78
mod 1 2 62 22
sev 0 2 39 9
```

The following variables do not respect this threshold and hence they will be removed from the predictors set: **restecg**, **exang**, **slope**, **ca**, and **thal**.

`mydata<-mydata[,setdiff(names(mydata), c("restecg", "exang", "slope", "ca", "thal"))]`

The data is ready and we can now split the data between training and testing set.

```
set.seed(1122)
parts <- initial_split(mydata, prop=0.8, strata = cp)
train <- training(parts)
test <- testing(parts)
```

The most popular models that we will use are: ordinal logistic model, cart model, ordinal random forest model, Continuation ratio model.

## ordered logistic regression (logit)

Before training this type of model let’s show how it works. For simplicity suppose we have data that has an ordered outcome \(y\) with three class labels (“1”,“2”,“3”) and only two features \(x_1\) and \(x_2\).

First we define a latent variable as a linear combination of the features:

\[y_i^*=\beta_1 X_{i1}+\beta_2 X_{i2}\]

Then since we have three classes we define two thresholds for this latent variable \(\alpha_1\) and \(\alpha_2\) such that a particular observation \(y_i\) will be classified as follows:

\[\begin{cases} y_i=1 & \text{if $y_i^* \leq \alpha_1$} \\

y_i=2 & \text{if $\alpha_1 < y_i^* \leq \alpha_2$} \\

y_i=3 & \text{if $y_i^* > \alpha_2$}\end{cases}\]

Now we can obtain the probability of a particular observation to fall into a specific class as follows:

\[\begin{cases} p(y_i=1)=p(y_i^* \leq \alpha_1)=F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\

p(y_i=2)=p(\alpha_1 < y_i^* \leq \alpha_2)=F(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})-F(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2}) \\

p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]

It remains now to define the suitable distribution function F. There are two commonly used ones for this type of data, the **logit** function \(F(x)=\frac{1}{1+exp^{-x}}\) and the normal distribution function aka **probit**.

**Note**: there exist other functions like **loglog**, **cloglog**, and **cauchit**.

Using the **logit** function the probabilities will be.

\[\begin{cases} p(y_i=1)=\frac{1}{1+exp^{-(\alpha_1-\beta_1 X_{i1}-\beta_2 X_{i2})}} \\

p(y_i=2)=\frac{1}{1+exp^{-(\alpha_2-\beta_1 X_{i1}-\beta_2 X_{i2})}}-p(y_i=1) \\

p(y_i=3)=1-p(y_i=2)-p(y_i=1)\end{cases}\]

The **MASS** package provides the method **polr** to perform an ordinal logistic regression.

```
library(MASS)
set.seed(1234)
model_logistic<-train(cp~., data=train,
method="polr",
tuneGrid=expand.grid(method="logistic"))
summary(model_logistic)
```

```
Coefficients:
Value Std. Error t value
age 0.0112236 0.018219 0.61605
sex1 0.2593720 0.316333 0.81993
trestbps -0.0002329 0.009090 -0.02562
chol -0.0013238 0.002697 -0.49082
fbs1 0.3188826 0.401836 0.79356
thalach 0.0226246 0.008199 2.75933
oldpeak -0.3360326 0.163547 -2.05465
target1 1.7234740 0.376279 4.58031
Intercepts:
Value Std. Error t value
no|mod 4.5786 1.9271 2.3759
mod|sev 6.5004 1.9527 3.3289
Residual Deviance: 376.4697
AIC: 396.4697
```

This table does not provide the p-values. However, it is not a big problem since we can add the p_values by the following script.

```
prob <- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)
```

```
Value Std. Error t value prob
age 0.0112236479 0.018218848 0.61604597 5.378642e-01
sex1 0.2593719567 0.316332564 0.81993442 4.122535e-01
trestbps -0.0002329023 0.009090066 -0.02562163 9.795591e-01
chol -0.0013237835 0.002697079 -0.49082122 6.235529e-01
fbs1 0.3188825831 0.401836034 0.79356393 4.274493e-01
thalach 0.0226246089 0.008199317 2.75932853 5.792027e-03
oldpeak -0.3360326371 0.163547467 -2.05464899 3.991292e-02
target1 1.7234739863 0.376278770 4.58031152 4.642839e-06
no|mod 4.5785821473 1.927119568 2.37586822 1.750771e-02
mod|sev 6.5003986218 1.952726089 3.32888399 8.719471e-04
```

Using the threshold p-value 0.05, we remove the non significant variables. **age**, **trestbps**, **chol**.

```
set.seed(1234)
model_logistic<-train(cp~.-age-trestbps-chol, data=train,
method="polr",tuneGrid=expand.grid(method="logistic"))
prob <- pnorm(abs(summary(model_logistic)$coefficients[,3]),lower.tail = FALSE)*2
cbind(summary(model_logistic)$coefficients,prob)
```

```
Value Std. Error t value prob
sex1 0.25427581 0.308143065 0.8251875 4.092651e-01
fbs1 0.37177505 0.384667871 0.9664832 3.338024e-01
thalach 0.02050951 0.007487511 2.7391620 6.159602e-03
oldpeak -0.33669473 0.161699555 -2.0822242 3.732199e-02
target1 1.71338020 0.369558584 4.6362885 3.547208e-06
no|mod 4.00836398 1.143111953 3.5065367 4.539789e-04
mod|sev 5.92987585 1.185074388 5.0038005 5.621092e-07
```

Notice that we do not remove the factors **sex** and **fbs** even they are not significant due to the significance of the intercepts.

To well understand these coefficients lets restrict the model with only two predictors.

```
set.seed(1234)
model1<-train(cp~target+thalach,
data=train,
method = "polr",
tuneGrid=expand.grid(method="logistic"))
summary(model1)
```

```
Coefficients:
Value Std. Error t value
target1 1.87953 0.333153 5.642
thalach 0.02347 0.007372 3.184
Intercepts:
Value Std. Error t value
no|mod 4.6457 1.0799 4.3018
mod|sev 6.5325 1.1271 5.7959
Residual Deviance: 383.3144
AIC: 391.3144
```

Let’s plug in these coefficients in the above equations we obtain the probability of each class as follows:

\[\begin{cases} p(no)=\frac{1}{1+exp^{-(4.6457-1.87953X_{i1}-0.02347X_{i2})}} \\

p(mod)=\frac{1}{1+exp^{-(6.5325-1.87953X_{i1}-0.02347X_{i2})}}-p(no) \\

p(sev)=1-p(mod)-p(no)\end{cases}\]

Let’s now predict a particular patient, say the third one.

`train[3,c("cp","thalach","target")]`

```
cp thalach target
4 sev 178 1
```

We plug in the predictor values as follows:

\[\begin{cases} p(no)=\frac{1}{1+exp^{-(4.6457-1.87953*1-0.02347*178)}} \\

p(mod)=\frac{1}{1+exp^{-(6.5325-1.87953*1-0.02347*178)}}-p(no) \\

p(sev)=1-p(mod)-p(no)\end{cases}=\begin{cases} p(no)=0.1959992 \\

p(mod)=0.6166398-0.1959992=0.4206406 \\

p(sev)=1-0.4206406-0.1959992=0.3833602\end{cases}\]

Using the highest probability this patient will be predicted to have **mod** pain.

Now let’s compare these probabilities with those obtained from function **predict**.

`predict(model1, train[1:3,], type = "prob") %>% tail(1)`

```
no mod sev
4 0.1958709 0.4205981 0.383531
```

Now we go back to our original model and compute the accuracy rate for the training data.

```
predict(model_logistic, train) %>%
bind_cols(train) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.611

with the logistic regression model we get 61% accuracy for the training set, which is quite bad. So let’s test the model using the testing set now.

```
predict(model_logistic, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.648

Surprisingly, the accuracy rate for the testing set is about 65%, which is larger than that computed from the training data (61%). This is an indication of an underfitting problem (The opposite effect of overfitting problem). Is there any way to improve the model performance? Maybe yes, by going back and tune some hyperparameters, but since we have an underfitting problem we do not have much hyperparameters for this model except the type of function used which is by default the **logistic** function, but there exist as well other functions like **probit**, **loglog**, …ect.

For our case let’s try this model with the probit function

## Ordinal logistic rgeression (probit)

```
set.seed(1234)
model_probit<-train(cp~.-age-trestbps-chol, data=train, method="polr",
tuneGrid=expand.grid(method="probit"))
predict(model_probit, train) %>%
bind_cols(train) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.606

This rate is slightly worse than that from the previous model. But what about the testing set.

```
predict(model_probit, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.593

This one also is worse than the previous model. So this means that the logistic function for this data performs better than the probit one.

When we try many things to improve the model performance and we do not gain much, it will be better to think to try different types of models.

## CART model

This is a tree-based model. To train this model we make use of **rpartScore** package, and for simplification, we will include only the significant predictors from the previous model.

```
library(rpartScore)
set.seed(1234)
model_cart<-train(cp~.-age-trestbps-chol, data=train,
method="rpartScore")
model_cart
```

```
CART or Ordinal Responses
226 samples
8 predictor
3 classes: 'no', 'mod', 'sev'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ...
Resampling results across tuning parameters:
cp split prune Accuracy Kappa
0.02702703 abs mr 0.5748197 0.2845545
0.02702703 abs mc 0.5796085 0.3011122
0.02702703 quad mr 0.5711605 0.2764466
0.02702703 quad mc 0.5805216 0.3020125
0.04504505 abs mr 0.5620975 0.2719646
0.04504505 abs mc 0.5966801 0.3274893
0.04504505 quad mr 0.5592845 0.2608402
0.04504505 quad mc 0.5930817 0.3208220
0.21621622 abs mr 0.5303342 0.1266324
0.21621622 abs mc 0.6004116 0.3343997
0.21621622 quad mr 0.5290009 0.1143360
0.21621622 quad mc 0.5928132 0.3225686
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cp = 0.2162162, split = abs and
prune = mc.
```

The caret model uses the bootstrapping technique for hyperparameters tuning. In our case, the largest accuracy rate is about 59.59%, with the complexity parameter `**cp**=0.2162162`

, the `**split**=abs`

, and `**prune**= **mc**`

.

The argument **split** controls the splitting function used to grow the tree by setting the misclassification costs in the generalized **Gini** impurity function to the absolute **abs** or squared **quad**.

The argument **prune** is used to select the performance measure to prune the tree between total misclassification rate **mr** or misclassification cost **mc**.

```
predict(model_cart, train) %>%
bind_cols(train) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.615

Surprisingly, we get approximately the same accuracy rate as the logit model. Let’s check the testing set.

```
predict(model_cart, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.630

Now wit this model we get a lower accuracy rate than that of the logistic model.

## Ordinal Random forst model.

This model is a corrected version of random forest model that takes into account the ordinal nature of the response variable. For more detail about this model read this great paper.

To train ordinal random forest model, we need to call the following packages:

**e1071**, **ranger**, **ordinalForest**.

```
library(ordinalForest)
library(ranger)
library(e1071)
```

Since the create function **train** use bootstrapping method to perform hyperparameters tuning to choose the best values, this makes the training process very slow, that is why i save the resulted output and load it again

```
# set.seed(1234)
# model_forest<-train(cp~.-age-trestbps-chol, data=train,
# method='ordinalRF')
# saveRDS(model_forest, "C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds")
model_forest <- readRDS("C://Users/dell/Documents/new-blog/content/post/ordinal/model_forest.rds")
model_forest
```

```
Random Forest
226 samples
8 predictor
3 classes: 'no', 'mod', 'sev'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ...
Resampling results across tuning parameters:
nsets ntreeperdiv ntreefinal Accuracy Kappa
50 50 200 0.5878619 0.3173073
50 50 400 0.5862998 0.3174139
50 50 600 0.5793897 0.3052328
50 100 200 0.5856573 0.3157938
50 100 400 0.5865246 0.3155563
50 100 600 0.5908431 0.3248941
50 150 200 0.5929076 0.3277110
50 150 400 0.5822072 0.3092856
50 150 600 0.5878854 0.3209507
100 50 200 0.5903104 0.3277174
100 50 400 0.5817918 0.3127438
100 50 600 0.5863685 0.3197145
100 100 200 0.5841147 0.3136250
100 100 400 0.5830929 0.3115348
100 100 600 0.5875075 0.3228354
100 150 200 0.5837402 0.3167977
100 150 400 0.5841116 0.3187849
100 150 600 0.5888038 0.3250622
150 50 200 0.5872276 0.3193143
150 50 400 0.5873943 0.3198634
150 50 600 0.5874236 0.3224713
150 100 200 0.5870714 0.3218660
150 100 400 0.5805275 0.3107488
150 100 600 0.5816353 0.3135750
150 150 200 0.5884693 0.3247512
150 150 400 0.5830154 0.3154980
150 150 600 0.5885156 0.3253794
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nsets = 50, ntreeperdiv = 150
and ntreefinal = 200.
```

We can plot the important predictors as follows.

`plot(varImp(model_forest))`

Now we can obtain the accuracy rate for the training rate as follows.

```
predict(model_forest, train) %>%
bind_cols(train) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.845

Great!, with this model, the accuracy rate has largely improved to roughly 84%. But wait, what matters is the accuracy of the testing set.

```
predict(model_forest, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.519

This is exactly what is called the overfitting problem. The model generalizes poorly to new unseen data. We can go back and tune some other hyperparameters like increasing the minimum size of nodes (default is 5) to fight the overfitting problem. we do not, however, do that here since it is not the purpose of this tutorial.

## Continuation Ratio Model

This model uses The vector generalized additive models which are available in the **VGAM** package. for more detail about these models click here.

```
library(VGAM)
set.seed(1234)
model_vgam<-train(cp~.-age-trestbps-chol, data=train,
method="vglmContRatio", trace=FALSE)
```

`model_vgam`

```
Continuation Ratio Model for Ordinal Data
226 samples
8 predictor
3 classes: 'no', 'mod', 'sev'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 226, 226, 226, 226, 226, 226, ...
Resampling results across tuning parameters:
parallel link Accuracy Kappa
FALSE logit 0.5962581 0.3323075
FALSE probit 0.5942637 0.3302998
FALSE cloglog 0.5973844 0.3293056
FALSE cauchit 0.5967368 0.3316896
FALSE logc 0.5945121 0.3152759
TRUE logit 0.5758330 0.2961673
TRUE probit 0.5738297 0.2924747
TRUE cloglog 0.5838764 0.3014038
TRUE cauchit 0.5810184 0.3067004
TRUE logc 0.5302522 0.1031624
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were parallel = FALSE and link = cloglog.
```

the best model is obtained when the argument **parallel** is FALSE and **link** is **cauchit** which is the tangent function.

The accuracy rate of the training data is:

```
predict(model_vgam, train) %>%
bind_cols(train) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.659

And the accuracy of the testing set is:

```
predict(model_vgam, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
```
1 accuracy multiclass 0.630

This the best accuracy rate compared to the other models.

## Compare models

We can compare between the above models using **resample** caret function.

```
models_eval<-resamples(list(logit=model_logistic,
cart=model_cart,
forest=model_forest,
vgam=model_vgam))
summary(models_eval)
```

```
Call:
summary.resamples(object = models_eval)
Models: logit, cart, forest, vgam
Number of resamples: 25
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.5060241 0.5731707 0.5822785 0.5871083 0.6097561 0.6627907 0
cart 0.3734940 0.5824176 0.6097561 0.6004116 0.6279070 0.6746988 0
forest 0.5357143 0.5569620 0.5853659 0.5929076 0.6117647 0.6860465 0
vgam 0.4936709 0.5760870 0.6046512 0.5973844 0.6202532 0.6626506 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.189086980 0.2792369 0.3144822 0.3100458 0.3437500 0.4512651 0
cart -0.004889406 0.3185420 0.3474144 0.3343997 0.3775576 0.4526136 0
forest 0.233434988 0.2852665 0.3289346 0.3277110 0.3554862 0.4618772 0
vgam 0.144558744 0.2993406 0.3367647 0.3293056 0.3690791 0.4142980 0
```

Based on the training set and using the mean of the accuracy rate we can say that **cart** model is the best model for this data with 60.97% accuracy for the training set. However, things are different when it comes to use the testing set instead.

```
tibble(models=c("logit", "cart", "forest", "vgam"),
accuracy=c(
predict(model_logistic, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth) %>%
.[, ".estimate"],
predict(model_cart, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth) %>%
.[, ".estimate"],
predict(model_forest, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth) %>%
.[, ".estimate"],
predict(model_vgam, test) %>%
bind_cols(test) %>%
rename(pred="...1", truth=cp) %>%
accuracy(pred, truth) %>%
.[, ".estimate"])) %>%
unnest()
```

```
# A tibble: 4 x 2
models accuracy
```
1 logit 0.648
2 cart 0.630
3 forest 0.519
4 vgam 0.630

Using the testing set, the logistic model with the link **logit** is the best model to predict this data.

## Conclusion

We have seen so far how to model ordinal data by exploring several models, and it happened that the logistic model is the best on for our data. However, in general the best model depends strongly on the data at hand.

## Session information

`sessionInfo()`

```
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] splines stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] VGAM_1.1-3 e1071_1.7-3 ranger_0.12.1
[4] ordinalForest_2.3-1 rpartScore_1.0-1 rpart_4.1-15
[7] MASS_7.3-51.6 yardstick_0.0.6 workflows_0.1.1
[10] tune_0.1.0 rsample_0.0.7 recipes_0.1.12
[13] parsnip_0.1.1 infer_0.5.2 dials_0.0.7
[16] scales_1.1.1 broom_0.5.6 tidymodels_0.1.0
[19] caret_6.0-86 lattice_0.20-41 forcats_0.5.0
[22] stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
[25] readr_1.3.1 tidyr_1.1.0 tibble_3.0.1
[28] ggplot2_3.3.1 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.1.7 tidytext_0.2.4
[4] plyr_1.8.6 igraph_1.2.5 repr_1.1.0
[7] crosstalk_1.1.0.1 listenv_0.8.0 SnowballC_0.7.0
[10] rstantools_2.0.0 inline_0.3.15 digest_0.6.25
[13] foreach_1.5.0 htmltools_0.4.0 rsconnect_0.8.16
[16] fansi_0.4.1 magrittr_1.5 globals_0.12.5
[19] modelr_0.1.8 gower_0.2.1 matrixStats_0.56.0
[22] RcppParallel_5.0.1 xts_0.12-0 prettyunits_1.1.1
[25] colorspace_1.4-1 skimr_2.1.1 blob_1.2.1
[28] rvest_0.3.5 haven_2.3.1 xfun_0.14
[31] callr_3.4.3 crayon_1.3.4 jsonlite_1.6.1
[34] lme4_1.1-23 survival_3.1-12 zoo_1.8-8
[37] iterators_1.0.12 glue_1.4.1 gtable_0.3.0
[40] ipred_0.9-9 pkgbuild_1.0.8 rstan_2.19.3
[43] DBI_1.1.0 miniUI_0.1.1.1 Rcpp_1.0.4.6
[46] xtable_1.8-4 GPfit_1.0-8 StanHeaders_2.21.0-5
[49] lava_1.6.7 prodlim_2019.11.13 DT_0.13
[52] htmlwidgets_1.5.1 httr_1.4.1 threejs_0.3.3
[55] ellipsis_0.3.1 loo_2.2.0 pkgconfig_2.0.3
[58] nnet_7.3-14 dbplyr_1.4.4 utf8_1.1.4
[61] tidyselect_1.1.0 rlang_0.4.6 DiceDesign_1.8-1
[64] reshape2_1.4.4 later_1.1.0.1 munsell_0.5.0
[67] cellranger_1.1.0 tools_4.0.1 cli_2.0.2
[70] generics_0.0.2 ggridges_0.5.2 evaluate_0.14
[73] fastmap_1.0.1 yaml_2.2.1 ModelMetrics_1.2.2.2
[76] processx_3.4.2 knitr_1.28 fs_1.4.1
[79] future_1.17.0 nlme_3.1-148 mime_0.9
[82] rstanarm_2.19.3 xml2_1.3.2 tokenizers_0.2.1
[85] compiler_4.0.1 bayesplot_1.7.2 shinythemes_1.1.2
[88] rstudioapi_0.11 reprex_0.3.0 tidyposterior_0.0.3
[91] lhs_1.0.2 statmod_1.4.34 stringi_1.4.6
[94] highr_0.8 ps_1.3.3 blogdown_0.19
[97] Matrix_1.2-18 nloptr_1.2.2.1 markdown_1.1
[100] shinyjs_1.1 vctrs_0.3.1 pillar_1.4.4
[103] lifecycle_0.2.0 furrr_0.1.0 data.table_1.12.8
[106] httpuv_1.5.4 R6_2.4.1 bookdown_0.19
[109] promises_1.1.1 gridExtra_2.3 janeaustenr_0.1.5
[112] codetools_0.2-16 boot_1.3-25 colourpicker_1.0
[115] gtools_3.8.2 assertthat_0.2.1 withr_2.2.0
[118] shinystan_2.5.0 parallel_4.0.1 hms_0.5.3
[121] grid_4.0.1 timeDate_3043.102 minqa_1.2.4
[124] class_7.3-17 rmarkdown_2.2 pROC_1.16.2
[127] tidypredict_0.4.5 shiny_1.4.0.2 lubridate_1.7.9
[130] base64enc_0.1-3 dygraphs_1.1.1.6
```

**leave a comment**for the author, please follow the link and comment on their blog:

**Modeling with R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.