Logistic Regression in R with Healthcare data: Vitamin D and Osteoporosis

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Category

Tags

In my previous post, I showed how to run a linear regression model with medical data. In this post, I will show how to conduct a logistic regression model. The major difference between linear and logistic regression is that the latter needs a dichotomous (0/1) dependent (outcome) variable, whereas the first, work with a continuous outcome. I will run a logistic regression to evaluate the effect of calcium and vitD on the osteoporosis.

Let's start loading the packages:

library(tidyverse)
library(RNHANES)
library(ggplot2)
library(pROC)

Prepare the dataset

Variables included for this analysis are:

  • age (years)
  • sex (women, men)
  • serum levels of vitamin D (mg/ml)
  • serum levels of calcium (mg/ml)
  • osteoporosis (yes/no, 1/0).

All variables are assessed from NHANES in the cycles 2007-2008 and 2009-2010.

d07 = nhanes_load_data("DEMO_E", "2007-2008") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR) %>% 
  left_join(nhanes_load_data("VID_E", "2007-2008"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, LBXVIDMS) %>% 
  transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD=LBXVIDMS) %>% 
  left_join(nhanes_load_data("BIOPRO_E", "2007-2008"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, LBXSCA) %>% 
  transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium = LBXSCA) %>% 
  left_join(nhanes_load_data("OSQ_E", "2007-2008"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, OSQ060) %>% 
  transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, Osteop = OSQ060)

d09 = nhanes_load_data("DEMO_F", "2009-2010") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR) %>% 
  left_join(nhanes_load_data("VID_F", "2009-2010"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, LBXVIDMS) %>% 
  transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD=LBXVIDMS) %>% 
  left_join(nhanes_load_data("BIOPRO_F", "2009-2010"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD,  LBXSCA) %>% 
  transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium = LBXSCA) %>% 
  left_join(nhanes_load_data("OSQ_F", "2009-2010"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, OSQ060) %>% 
  transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, Osteop = OSQ060)

dat = bind_rows(d07, d09) %>% as.data.frame()

Create categories of Vitamin D

Institute of Medicine cutoffs for Vitamin D

  • Vitamin D deficiency: Serum 25OHD less than 30 nmol/L (12 ng/mL)
  • Vitamin D inadequacy: Serum 25OHD 30-49 nmol/L (12-19 ng/mL)
  • Vitamin D sufficiency: Serum 25OHD 50-125 nmol/L (20-50 ng/mL)

dat1 = dat %>% 
  mutate(
    vitD_group = case_when(
      vitD < 30 ~ "Deficiency",
      vitD >= 30 & vitD < 50 ~ "Inadequacy",
      vitD >= 50 & vitD <= 125 ~ "Sufficiency"))

Exclude missings

dat2 = dat1 %>% 
  filter(!is.na(vitD_group), !is.na(Calcium), !is.na(Osteop), Osteop!=9) %>% 
  mutate(Gender = recode_factor(RIAGENDR, 
                           `1` = "Men", 
                           `2` = "Women"),
         Osteop = recode_factor(Osteop, 
                           `1` = 1, 
                           `2` = 0))

head(dat2)</code>
<em>##    SEQN      wave RIAGENDR RIDAGEYR vitD Calcium Osteop  vitD_group Gender
## 1 41475 2007-2008        2       62 58.8     9.5      0 Sufficiency  Women
## 2 41477 2007-2008        1       71 81.8    10.0      0 Sufficiency    Men
## 3 41479 2007-2008        1       52 78.4     9.0      0 Sufficiency    Men
## 4 41482 2007-2008        1       64 61.9     9.1      0 Sufficiency    Men
## 5 41483 2007-2008        1       66 53.3     8.9      0 Sufficiency    Men
## 6 41485 2007-2008        2       30 39.1     9.3      0  Inadequacy  Women
</em></pre>
<h2>Logit regression model</h2>
<p>I will use the <code>glm()</code> function to run the logistic regression and then <code>summary()</code> command to get the results.</p>
<pre><code class="r">fit <- glm(Osteop ~ vitD_group + Calcium + Gender + RIDAGEYR, 
           data = dat2, 
           family = "binomial")
summary(fit)</code>
<em>## 
## Call:
## glm(formula = Osteop ~ vitD_group + Calcium + Gender + RIDAGEYR, 
##     family = "binomial", data = dat2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4265   0.1009   0.1894   0.3315   1.0305  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            7.81969    1.08054   7.237 4.59e-13 ***
## vitD_groupInadequacy  -0.17444    0.20124  -0.867  0.38603    
## vitD_groupSufficiency -0.53068    0.18159  -2.922  0.00347 ** 
## Calcium                0.10330    0.11404   0.906  0.36506    
## GenderWomen           -2.08873    0.12298 -16.984  < 2e-16 ***
## RIDAGEYR              -0.07127    0.00330 -21.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4591.4  on 10064  degrees of freedom
## Residual deviance: 3553.1  on 10059  degrees of freedom
## AIC: 3565.1
## 
## Number of Fisher Scoring iterations: 7
</em></pre>
<h3>Transforms beta's to the odds ratio</h3>
<p>The output of <code>summary()</code> does not provide the odds ratio which are often presented in research papers. The <code>exp</code> of beta's give the odds.</p>
<pre><code class="r">round(exp(coef(fit)), 2)</code>
<em>##           (Intercept)  vitD_groupInadequacy vitD_groupSufficiency 
##               2489.14                  0.84                  0.59 
##               Calcium           GenderWomen              RIDAGEYR 
##                  1.11                  0.12                  0.93
</em></pre>
<h2>Interpreting results</h2>
<p>From the output, I see that there is a significant association between vitamin D and osteoporosis. Compared to individuals with deficiency levels of vitamin D, those with sufficient levels of vitamin D in the blood have 41% (odds ratio: 0.59) lower risk of having osteoporosis. Inadequacy of vitamin D is not significantly (p=0.38) associated with osteoporosis.</p>
<p>To get the 95% confidence interval, I use <code>confit()</code> for confidence intervals of each variable.</p>
<pre><code class="r">round(exp(confint(fit)), 2)</code>
<em>##                        2.5 %   97.5 %
## (Intercept)           298.55 20650.10
## vitD_groupInadequacy    0.56     1.24
## vitD_groupSufficiency   0.41     0.83
## Calcium                 0.89     1.39
## GenderWomen             0.10     0.16
## RIDAGEYR                0.93     0.94
</em></pre>
<h2>Assessing discrimination of the model with ROC curve</h2>
<p>When studying a new biomarker, it is essential to illustrate the discrimination ability of the model, in addition to the association with the outcome, osteoporosis in our example. Levels of vitamin D in the blood are known to be related to osteoporosis, but here will show how much discrimination adds to the model.</p>
<p>First, I will run a model without vitamin D and assess the discrimination and after adding vitamin D in the model and see the differences in the ROC curve.</p>
<pre><code class="r"># model without vitamin D
fit1 <- glm(Osteop ~ Calcium + Gender + RIDAGEYR, 
           data = dat2, 
           family = "binomial")
# model with vitamin D
fit2 <- glm(Osteop ~ vitD_group + Calcium + Gender + RIDAGEYR, 
           data = dat2, 
           family = "binomial")

dat2$prob1=predict(fit1,type=c("response"))
dat2$prob2=predict(fit2,type=c("response"))

roc(Osteop ~ prob1, data = dat2)
## 
## Call:
## roc.formula(formula = Osteop ~ prob1, data = dat2)
## 
## Data: prob1 in 608 controls (Osteop 1) < 9457 cases (Osteop 0).
## Area under the curve: 0.8496

roc(Osteop ~ prob2, data = dat2)
## 
## Call:
## roc.formula(formula = Osteop ~ prob2, data = dat2)
## 
## Data: prob2 in 608 controls (Osteop 1) < 9457 cases (Osteop 0).
## Area under the curve: 0.8508

There is a slight improvement in discrimination with including vitamin D in the model from 0.8496 to 0.8508.

To learn more about AUC read this post: Interpretation of the AUC.

Related Post

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)