logistic regression

Posted on December 18, 2019 by Modeling with R in R bloggers | 0 Comments

[This article was first published on Modeling with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

In this paper we will fit a logistic regression model to the heart disease data uploaded from kaggle website.

For the data preparation we will follow the same steps as we did in my previous paper about naive bayes model, for more detail thus click here to get access to that paper.

Data preparation

First we call our data with the required packages

library(tidyverse, warn.conflicts = FALSE)
library(caret, warn.conflicts = FALSE)
mydata<-read.csv("heart.csv",header = TRUE)
names(mydata)[1]<-"age"
glimpse(mydata)
## Observations: 303
## Variables: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

The data at hand has the following features:

age.
sex: 1=male,0=female
cp : chest pain type.
trestbps : resting blood pressure.
chol: serum cholestoral.
fbs : fasting blood sugar.
restecg : resting electrocardiographic results.
thalach : maximum heart rate achieved
exang : exercise induced angina.
oldpeak : ST depression induced by exercise relative to rest.
slope : the slope of the peak exercise ST segment.
ca : number of major vessels colored by flourosopy.
thal : it is not well defined from the data source.
target: have heart disease or not.

We see that some features should be converted to factor type as follows:m

mydata<-mydata %>%
  modify_at(c(2,3,6,7,9,11,12,13,14),as.factor)
glimpse(mydata)
## Observations: 303
## Variables: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Before going head we should check the relationships between the target variable and the remaining factors

xtabs(~target+sex,data=mydata)
##       sex
## target   0   1
##      0  24 114
##      1  72  93
xtabs(~target+cp,data=mydata)
##       cp
## target   0   1   2   3
##      0 104   9  18   7
##      1  39  41  69  16
xtabs(~target+fbs,data=mydata)
##       fbs
## target   0   1
##      0 116  22
##      1 142  23
xtabs(~target+restecg,data=mydata)
##       restecg
## target  0  1  2
##      0 79 56  3
##      1 68 96  1
xtabs(~target+exang,data=mydata)
##       exang
## target   0   1
##      0  62  76
##      1 142  23
xtabs(~target+slope,data=mydata)
##       slope
## target   0   1   2
##      0  12  91  35
##      1   9  49 107
xtabs(~target+ca,data=mydata)
##       ca
## target   0   1   2   3   4
##      0  45  44  31  17   1
##      1 130  21   7   3   4
xtabs(~target+thal,data=mydata)
##       thal
## target   0   1   2   3
##      0   1  12  36  89
##      1   1   6 130  28

As we see the restecg,ca and thal variables have values less than the threshold of 5 casses required for logistic regression. In addition if we split the data between training set and test set the level 2 of the restecg variable will not be found in one of the sets since we have only one case. Therfore we should remove these variables from the model.

mydata<-mydata[,-c(7,12,13)]
glimpse(mydata)
## Observations: 303
## Variables: 11
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ target   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Before training our model, we can get a vague insight about the predictors that have some importance for the prediction of the dependent variable.

Let’s plot the relationships between the target variabl and the other features.

ggplot(mydata,aes(sex,target,color=target))+
  geom_jitter()

If we look only at the red points (healthy patients) we can wrongly interpret that females are less healthy than males. This is because we do not take into account that we have imbalanced number of each sex level (96 females , 207 males). in contrast, if we look only at females we can say that a particular female are more likely to have the disease than not.

ggplot(mydata,aes(cp, fill = target))+
  geom_bar(stat = "count", position = "dodge")

From this plot we can conclude that if the patient does not have any chest pain he/she will be highly unlikely to get the disease, otherwise for any chest type the patient will be more likely to be pathologique by this disease. we can expect therfore that this predictor will have a significant importance on the training model.

ggplot(mydata, aes(age,fill=target))+
  geom_density(alpha=.5)

Data partition

we take out 80% of the data to use as training set and the rest will be put aside to evaluate the model performance.

set.seed(1234)
index<-createDataPartition(mydata$target, p=.8,list=FALSE)
train<-mydata[index,]
test<-mydata[-index,]

train the model

We are now ready to train our model.

model <- glm(target~., data=train,family = "binomial")
summary(model)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5855  -0.5294   0.1990   0.6120   2.4022  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.715274   2.883238   1.289 0.197545    
## age         -0.014712   0.023285  -0.632 0.527502    
## sex1        -1.686359   0.479254  -3.519 0.000434 ***
## cp1          1.212919   0.549670   2.207 0.027340 *  
## cp2          2.010255   0.486638   4.131 3.61e-05 ***
## cp3          2.139066   0.682727   3.133 0.001730 ** 
## trestbps    -0.020471   0.012195  -1.679 0.093220 .  
## chol        -0.005840   0.003776  -1.547 0.121959    
## fbs1        -0.200690   0.519116  -0.387 0.699053    
## thalach      0.024461   0.010928   2.238 0.025196 *  
## exang1      -0.792717   0.431434  -1.837 0.066151 .  
## oldpeak     -0.820508   0.231100  -3.550 0.000385 ***
## slope1      -0.999768   1.015514  -0.984 0.324872    
## slope2      -0.767247   1.097448  -0.699 0.484477    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 191.33  on 229  degrees of freedom
## AIC: 219.33
## 
## Number of Fisher Scoring iterations: 5

we see that some variables are not significant using p-value such as age, chol,fbs,slope, and also the intercept. First let’s remove the insignificant factor variables fbs and slope.

model <- glm(target~.-fbs-slope, data=train,family = "binomial")
summary(model)
## 
## Call:
## glm(formula = target ~ . - fbs - slope, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6702  -0.5505   0.1993   0.6344   2.4495  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.826395   2.695175   1.049 0.294322    
## age         -0.016677   0.023157  -0.720 0.471420    
## sex1        -1.729320   0.470656  -3.674 0.000239 ***
## cp1          1.243879   0.548288   2.269 0.023289 *  
## cp2          1.987151   0.472994   4.201 2.65e-05 ***
## cp3          2.125766   0.677257   3.139 0.001696 ** 
## trestbps    -0.020672   0.012005  -1.722 0.085084 .  
## chol        -0.006434   0.003721  -1.729 0.083816 .  
## thalach      0.026567   0.010432   2.547 0.010873 *  
## exang1      -0.848162   0.423189  -2.004 0.045047 *  
## oldpeak     -0.798699   0.198597  -4.022 5.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 192.66  on 232  degrees of freedom
## AIC: 214.66
## 
## Number of Fisher Scoring iterations: 5

Now we remove the age variable since it is the least significance.

model <- glm(target~.-fbs-slope-age, data=train,family = "binomial")
summary(model)
## 
## Call:
## glm(formula = target ~ . - fbs - slope - age, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6925  -0.5397   0.2032   0.6345   2.4032  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.703126   2.188741   0.778 0.436492    
## sex1        -1.677986   0.463447  -3.621 0.000294 ***
## cp1          1.221925   0.545175   2.241 0.025004 *  
## cp2          1.961200   0.468443   4.187 2.83e-05 ***
## cp3          2.085409   0.676469   3.083 0.002051 ** 
## trestbps    -0.022133   0.011872  -1.864 0.062273 .  
## chol        -0.006900   0.003675  -1.878 0.060443 .  
## thalach      0.029761   0.009471   3.142 0.001676 ** 
## exang1      -0.820113   0.420434  -1.951 0.051101 .  
## oldpeak     -0.803423   0.198400  -4.050 5.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 193.19  on 233  degrees of freedom
## AIC: 213.19
## 
## Number of Fisher Scoring iterations: 5

we remove now the variables exang.

model <- glm(target~.-fbs-slope-age-exang, data=train,family = "binomial")
summary(model)
## 
## Call:
## glm(formula = target ~ . - fbs - slope - age - exang, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7030  -0.5643   0.2004   0.6510   2.5728  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.832691   2.105139   0.396 0.692436    
## sex1        -1.713577   0.459659  -3.728 0.000193 ***
## cp1          1.494091   0.528172   2.829 0.004672 ** 
## cp2          2.205121   0.454341   4.853 1.21e-06 ***
## cp3          2.220423   0.668760   3.320 0.000899 ***
## trestbps    -0.021812   0.011704  -1.864 0.062375 .  
## chol        -0.007110   0.003597  -1.977 0.048054 *  
## thalach      0.033412   0.009291   3.596 0.000323 ***
## oldpeak     -0.822277   0.195993  -4.195 2.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 196.98  on 234  degrees of freedom
## AIC: 214.98
## 
## Number of Fisher Scoring iterations: 5

Notice that we can not remove intercept even it is not significant because it contains the first level of “0” of the factor cp which is significant. This is hence our final model.

prediction and confusion matrix

we will use this model to predict the training set.

pred <- predict(model,train, type="response")
head(pred)
##         2         3         4         6         7         8 
## 0.5202639 0.9331630 0.8330192 0.3354247 0.7730621 0.8705651

using the confusion matrix we get the accuracy rate in the training set.

pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),train$target, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  87  17
##          1  24 115
##                                           
##                Accuracy : 0.8313          
##                  95% CI : (0.7781, 0.8761)
##     No Information Rate : 0.5432          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6583          
##                                           
##  Mcnemar's Test P-Value : 0.3487          
##                                           
##             Sensitivity : 0.8712          
##             Specificity : 0.7838          
##          Pos Pred Value : 0.8273          
##          Neg Pred Value : 0.8365          
##              Prevalence : 0.5432          
##          Detection Rate : 0.4733          
##    Detection Prevalence : 0.5720          
##       Balanced Accuracy : 0.8275          
##                                           
##        'Positive' Class : 1               
##

In the training set the accuracy rate is about 83,13% . But we are more intrested in the accuracy of the test set.

pred <- predict(model,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar's Test P-Value : 0.0613688       
##                                           
##             Sensitivity : 0.5926          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.4500          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.7508          
##                                           
##        'Positive' Class : 0               
##

With the test set we have lower accuracy rate about 76.67%.

altering the link function

By default the link function is logit from the sigmoid distribution, we can make use of the link function probit instead, which stands for the normal distribution.

model1 <- glm(target~.-fbs-slope-exang-age, data=train,
             family = binomial(link = "probit"))
summary(model1)
## 
## Call:
## glm(formula = target ~ . - fbs - slope - exang - age, family = binomial(link = "probit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7779  -0.5883   0.1666   0.6670   2.5989  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.373007   1.199910   0.311 0.755905    
## sex1        -0.940784   0.252631  -3.724 0.000196 ***
## cp1          0.830588   0.299919   2.769 0.005616 ** 
## cp2          1.275100   0.253681   5.026 5.00e-07 ***
## cp3          1.262407   0.387479   3.258 0.001122 ** 
## trestbps    -0.011677   0.006660  -1.753 0.079549 .  
## chol        -0.004068   0.002047  -1.987 0.046870 *  
## thalach      0.018999   0.005163   3.680 0.000233 ***
## oldpeak     -0.470191   0.108935  -4.316 1.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 335.05  on 242  degrees of freedom
## Residual deviance: 197.23  on 234  degrees of freedom
## AIC: 215.23
## 
## Number of Fisher Scoring iterations: 6
pred <- predict(model,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 16  3
##          1 11 30
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.6396, 0.8662)
##     No Information Rate : 0.55            
##     P-Value [Acc > NIR] : 0.0004231       
##                                           
##                   Kappa : 0.5156          
##                                           
##  Mcnemar's Test P-Value : 0.0613688       
##                                           
##             Sensitivity : 0.5926          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7317          
##              Prevalence : 0.4500          
##          Detection Rate : 0.2667          
##    Detection Prevalence : 0.3167          
##       Balanced Accuracy : 0.7508          
##                                           
##        'Positive' Class : 0               
##

As we see we get the same results with a slight difference between the AIC criterion 215.54 for probit link and 214.98 for logit link.

To leave a comment for the author, please follow the link and comment on their blog: Modeling with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

logistic regression

Introduction

Data preparation

Data partition

train the model

prediction and confusion matrix

altering the link function

Related

Introduction

Data preparation

Data partition

train the model

prediction and confusion matrix

altering the link function

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)