[This article was first published on R on notast, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

## The rise of machine learning

In this current 4th industrial revolution, data science has penetrated all industries and healthcare is no exception. There has been an exponential use of machine learning in clinical research in the past decade and it is expected to continue to grow at an even faster rate in the following decade. Many machine learning techniques are considered as black box algorithms as the intrinsic workings of the models are too complex in justifying the reasons for the predictions. However, the adoption of machine learning techniques is inevitable. Thus, instead of avoiding them, clinicians need to learn how to “understand and explain the predictions of these models” to patients for “moral, legal and scientific reasons”.

## Explaining models

There are 2 approaches to explaining models

1. Use non-complex models. These algorithms are associated with traditional statistical models. The simple structure allows us to understand the intrinsic workings of the models.
2. Conduct post-hoc interpretation on models. Post-hoc interpretation can be applied on both simple and black box models. These analyses done after model training can be further broken down into model specific and model agnostic approaches.

## Direction of the post

In this post we will explore the first approach of explaining models, using interpretable models such as logistic regression and decision trees (decision trees will be covered in another post) . I will be using the tidymodels approach to create these algorithms. The dataset used is the Cleveland heart dataset which is a binary classification problem if heart disease is present or absent for a patient.

# Model production pipeline

## Glossary

There are 3 columns in the glossary below:

1. Original: The original variable names
2. Definition: The expansion of the abbreviated original variable names and the details for each factor level for categorical variables.
3. Rename: The renamed variable names which is meant to less jargonistic than the original variable names.
#library
library(tidyverse)
library(tidymodels)
#import
heart<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", col_names = F)

#glossary
tribble(~Original, ~Definition, ~Rename,
"age", "" , "age",
"sex", "1= Male, 2=Female", "sex",
"cp", "Chest pain at rest, there is another variable related to chest pain during the exercise stress test. 1=typical angina, 2= atypical angina, 3= non-chestpain pain, 4=asymtomatic" ,"rest_cp",
"tresbps", "Resting blood pressure (in mm Hg on admission to the hospital)", "rest_bp",
"chol", "Serum cholesterol", "chol",
"fbs", "Fasting blood sugar. 1= >120 mg/dl, 0 = <120 mg/dl","fast_bloodsugar",
"restecg", "Resting electrocardiographic results. 0=normal, 1=ST-T wave abnormality, 2=left ventricular hypertrophy", "rest_ecg",
"thalach", "Maximum heart rate achieved. based on values, likely taken during exercise stress test", "ex_maxHR",
"exang", "Exercise induced angina (chest pain). 1=yes, 0=no", "ex_cp",
"oldpeak", "ST depression induced by exercise relative to rest", "ex_STdepression_dur",
"slope", "Slope of the peak exercise ST segment. 1=upsloping/normal, 2=flat, 3=downsloping", "ex_STpeak",
"ca", "Number of major vessels colored by flourosopy", "coloured_vessels",
"thal", "Thalassemia. 3=normal, 6=fixed defect, 7=reversable defect", "thalassemia",
"num", "Angiographic status of heart disease. 0= <50% diameter narrowing (no heart disease), >1= >50% diameter narrowing (heart disease present)", "heart_disease") %>% DT::datatable(rownames = F, options = list(paging= T))

## Data wrangling

### Variable and value names

We will convert the numeric encoding of categorical variables to their intended meaning. Using the intended meaning will facilitate the interpretation of models. For instance, saying “typical resting chest pain is the most influential variable” is more comprehensible than “resting chest pain =1 is the most influential variable”.

# Renaming var
colnames(heart)<- c("age", "sex", "rest_cp", "rest_bp",
"chol", "fast_bloodsugar","rest_ecg","ex_maxHR","ex_cp",
"ex_STdepression_dur", "ex_STpeak","coloured_vessels", "thalassemia","heart_disease")

#elaborating cat var
##simple ifelse conversion
heart<-heart %>% mutate(
sex= ifelse(sex=="1", "male", "female"),
fast_bloodsugar= ifelse(fast_bloodsugar=="1", ">120", "<120"), ex_cp=ifelse(ex_cp=="1", "yes", "no"),
heart_disease=ifelse(heart_disease=="0", "no", "yes"))

## complex ifelse conversion using case_when
heart<-heart %>% mutate(
rest_cp=case_when(rest_cp== "1" ~ "typical",
rest_cp=="2" ~ "atypical",
rest_cp== "3" ~ "non-CP pain",
rest_cp== "4" ~ "asymptomatic"),
rest_ecg=case_when(rest_ecg=="0" ~ "normal",
rest_ecg=="1" ~ "ST-T abnorm",
rest_ecg=="2" ~ "LV hyperthrophy"),
ex_STpeak=case_when(ex_STpeak=="1" ~ "up/norm",
ex_STpeak== "2" ~ "flat",
ex_STpeak== "3" ~ "down"),
thalassemia=case_when(thalassemia=="3.0" ~ "norm",
thalassemia== "6.0" ~ "fixed",
thalassemia== "7.0" ~ "reversable"))

### Missing data

Missing values often reflected as NA or “?”. First, let’s identify which variables have missing values recorded as “?”

heart %>% map_df(~sum((.x)=="?")) %>% gather(key="var", value = "value") %>% filter(value >0)
## # A tibble: 1 x 2
##   var              value
##
## 1 coloured_vessels     4

We will convert the 4 observations in “coloured_vessels” with missing values recorded as “?” into true NA.

heart<-heart%>% mutate_if(is.character, funs(replace(., .=="?", NA)))

Now let’s tally the total number of missing values reported as NA for each variable. We will impute these missing values later in the pre-processing stage with recipe::step_knnimpute()

heart %>% map_df(~sum(is.na(.x)))
## # A tibble: 1 x 14
##     age   sex rest_cp rest_bp  chol fast_bloodsugar rest_ecg ex_maxHR ex_cp
##
## 1     0     0       0       0     0               0        0        0     0
## # ... with 5 more variables: ex_STdepression_dur , ex_STpeak ,
## #   coloured_vessels , thalassemia , heart_disease 

### Variable class

Currently, categorical variables are treated as characters. We will need to convert them to factors before feeding them into the model.

heart %>% map_df(~class(.x)) %>%  gather(key="var", value = "class")
## # A tibble: 14 x 2
##    var                 class
##
##  1 age                 numeric
##  2 sex                 character
##  3 rest_cp             character
##  4 rest_bp             numeric
##  5 chol                numeric
##  6 fast_bloodsugar     character
##  7 rest_ecg            character
##  8 ex_maxHR            numeric
##  9 ex_cp               character
## 10 ex_STdepression_dur numeric
## 11 ex_STpeak           character
## 12 coloured_vessels    character
## 13 thalassemia         character
## 14 heart_disease       character

I prefer to convert the categorical variables during the data wrangling stage rather than during the model pre-processing stage with recipes::step_string2factors. Having the dataset in the right form during the data wrangling phase helps to prevent any errors further upstream during pre-processing and model production.

heart<-heart %>% mutate_if(is.character, as.factor)

## Models

### Training/test set

We will use functions from Rsample to create the training and test set.

set.seed(4595)
data_split <- initial_split(heart, prop=0.75, strata = "heart_disease")
heart_train <- training(data_split)
heart_test <- testing(data_split)

### Pre-processing

The training and test sets are pre-processed using functions from recipes. We will not explicitly create one hot encoding for categorical variables as both logistic regressions and decision trees are able to accommodate them. I kept the feature engineering brief as I wanted the explanation of the models to be succinct.

# create recipe object
heart_recipe<-recipe(heart_disease ~., data= heart_train) %>%
step_knnimpute(all_predictors())

# process the training set/ prepare recipe(non-cv)
heart_prep <-heart_recipe %>% prep(training = heart_train, retain = TRUE)

### Model creation

The model will be created using functions from parsnip

lr_model <- logistic_reg(penalty = 10, mixture = 0.1, mode = "classification") %>% set_engine("glm") %>%
fit(heart_disease ~ ., data = juice(heart_prep))

Now that we’ve built the model, let’s interpret the white box model.

# Logistic regression

Logistic regression is one of the classic models use in medical research to solve classification problems. Logistic regression provides us with coefficient estimates but most often we use a derivate of the coefficient estimate, odds ratio, in comprehending the model.

## Coefficient estimate

Before elaborating about odds ratio, let me quickly define coefficient estimate. Coefficient estimate from logistic regression characterize the relationship between the predictor and the outcome on a log-odds scale. One-unit increase in a predictor (e.g. resting blood pressure rest_bp) is associated with an increase in the log odds of the outcome (e.g. heart disease heart_disease) by the value of the coefficient estimate (e.g. 0.0453).

broom::tidy(lr_model$fit) %>% filter(term=="rest_bp") ## # A tibble: 1 x 5 ## term estimate std.error statistic p.value ## ## 1 rest_bp 0.0453 0.0161 2.81 0.00492 ## Odds ratio The odds ratio represents the odds that an outcome will occur given the presence of a specific predictor, compared to the odds of the outcome occurring in the absence of that predictor, assuming all other predictors remain constant. The odds ratio is calculated from the exponential function of the coefficient estimate based on a unit increase in the predictor (on a side note, coefficient estimates are unbiased but odds ratio are biases due to transformation). An example with a continuous variable, will be for 1 mm Hg increased in resting blood pressure rest_bp, the odds of having heart disease increases by a factor of 1.05. broom::tidy(lr_model$fit) %>% filter(term=="rest_bp") %>% mutate(odds_ratio= exp(estimate)) %>% select(term, estimate, odds_ratio)
## # A tibble: 1 x 3
##   term    estimate odds_ratio
##
## 1 rest_bp   0.0453       1.05

An example with a categorical variable will be chest pain during exercise stress test ex_cpyes. If chest pain is present, the odds of having heart disease increases by a factor of 1.52. In percentage change, the odds for heart disease is 52.6% higher for individuals with chest pain during the exercise stress test than individuals with no chest pain.

broom::tidy(lr_model$fit) %>% filter(term=="ex_cpyes") %>% mutate(odds_ratio= exp(estimate), percentage= (odds_ratio-1)*100) %>% select(term, estimate, odds_ratio, percentage) ## # A tibble: 1 x 4 ## term estimate odds_ratio percentage ## ## 1 ex_cpyes 0.437 1.55 54.8 Researchers like odds ratio as it can be used to determine whether a variable is a risk factor for a particular outcome, and compare the magnitude of various risk factors for that outcome. Variables such as cholesterol chol where the odds ratio is 1 means that cholesterol does not affect odds of having heart disease. broom::tidy(lr_model$fit)  %>% mutate(odds_ratio= round(exp(estimate),2)) %>% select(term, estimate, odds_ratio) %>% filter(odds_ratio==1) #https://stackoverflow.com/questions/21509346/r-displays-numbers-in-scientific-notation
## # A tibble: 1 x 3
##   term  estimate odds_ratio
##
## 1 chol   0.00447          1

Variables such as age, normal resting ECG rest_ecgnormal have odds ratio less than 1 which suggests that exposure with these variables are associated with lower odds of having heart disease. We tend to ignore the intercept when talking about odds ratio.

broom::tidy(lr_model$fit) %>% mutate(odds_ratio= round(exp(estimate),2)) %>% select(term, estimate, odds_ratio) %>% filter(odds_ratio<1) ## # A tibble: 10 x 3 ## term estimate odds_ratio ## ## 1 (Intercept) -0.750 0.47 ## 2 age -0.0226 0.98 ## 3 rest_cpatypical -2.13 0.12 ## 4 rest_cpnon-CP pain -1.86 0.16 ## 5 rest_cptypical -4.18 0.02 ## 6 fast_bloodsugar>120 -0.667 0.51 ## 7 rest_ecgnormal -0.779 0.46 ## 8 ex_maxHR -0.0471 0.95 ## 9 ex_STpeakup/norm -0.183 0.83 ## 10 thalassemianorm -0.201 0.82 Variables such as reversable thalassemia thalassemiareversable and resting blood pressure rest_bp have odds ratio greater than 1 which suggest exposure to these variables are associated with higher odds of heart disease. ST-T abnormal during resting ECG rest_ecgST-T abnorm, 2 vessels coloured during angiogram coloured_vessels2.0 and males sexmale have unusually large odds ratio. More thorough data exploration and feature engineering will be needed to address them. Our pre-processing was rather brief, only imputation of missing values was done. broom::tidy(lr_model$fit)  %>% mutate(odds_ratio= round(exp(estimate),2)) %>% select(term, estimate, odds_ratio) %>% filter(odds_ratio>1) %>% arrange(desc(odds_ratio))
## # A tibble: 10 x 3
##    term                  estimate odds_ratio
##
##  1 rest_ecgST-T abnorm    13.1     465136.
##  2 coloured_vessels2.0     4.68       107.
##  3 sexmale                 2.48        12.0
##  4 coloured_vessels3.0     1.62         5.04
##  5 coloured_vessels1.0     1.52         4.59
##  6 thalassemiareversable   1.26         3.53
##  7 ex_cpyes                0.437        1.55
##  8 ex_STdepression_dur     0.296        1.34
##  9 rest_bp                 0.0453       1.05
## 10 ex_STpeakflat           0.0512       1.05

# Conclusion

In this post we looked at explaining models using white box models such as logistic regression. In the next post, we will learn about another white box model, decision tree.