**R Programming – DataScience+**, and kindly contributed to R-bloggers)

## Category

## Tags

In my previous post, I showed how to run a linear regression model with medical data. In this post, I will show how to conduct a logistic regression model. The major difference between linear and logistic regression is that the latter needs a dichotomous (0/1) dependent (outcome) variable, whereas the first, work with a continuous outcome. I will run a logistic regression to evaluate the effect of calcium and vitD on the osteoporosis.

## Let's start loading the packages:

```
library(tidyverse)
library(RNHANES)
library(ggplot2)
library(pROC)
```

## Prepare the dataset

Variables included for this analysis are:

- age (years)
- sex (women, men)
- serum levels of vitamin D (mg/ml)
- serum levels of calcium (mg/ml)
- osteoporosis (yes/no, 1/0).

All variables are assessed from NHANES in the cycles 2007-2008 and 2009-2010.

```
d07 = nhanes_load_data("DEMO_E", "2007-2008") %>%
select(SEQN, cycle, RIAGENDR, RIDAGEYR) %>%
transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR) %>%
left_join(nhanes_load_data("VID_E", "2007-2008"), by="SEQN") %>%
select(SEQN, wave, RIAGENDR, RIDAGEYR, LBXVIDMS) %>%
transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD=LBXVIDMS) %>%
left_join(nhanes_load_data("BIOPRO_E", "2007-2008"), by="SEQN") %>%
select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, LBXSCA) %>%
transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium = LBXSCA) %>%
left_join(nhanes_load_data("OSQ_E", "2007-2008"), by="SEQN") %>%
select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, OSQ060) %>%
transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, Osteop = OSQ060)
d09 = nhanes_load_data("DEMO_F", "2009-2010") %>%
select(SEQN, cycle, RIAGENDR, RIDAGEYR) %>%
transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR) %>%
left_join(nhanes_load_data("VID_F", "2009-2010"), by="SEQN") %>%
select(SEQN, wave, RIAGENDR, RIDAGEYR, LBXVIDMS) %>%
transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD=LBXVIDMS) %>%
left_join(nhanes_load_data("BIOPRO_F", "2009-2010"), by="SEQN") %>%
select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, LBXSCA) %>%
transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium = LBXSCA) %>%
left_join(nhanes_load_data("OSQ_F", "2009-2010"), by="SEQN") %>%
select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, OSQ060) %>%
transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium, Osteop = OSQ060)
dat = bind_rows(d07, d09) %>% as.data.frame()
```

### Create categories of Vitamin D

Institute of Medicine cutoffs for Vitamin D

- Vitamin D deficiency: Serum 25OHD less than 30 nmol/L (12 ng/mL)
- Vitamin D inadequacy: Serum 25OHD 30-49 nmol/L (12-19 ng/mL)
- Vitamin D sufficiency: Serum 25OHD 50-125 nmol/L (20-50 ng/mL)

```
dat1 = dat %>%
mutate(
vitD_group = case_when(
vitD < 30 ~ "Deficiency",
vitD >= 30 & vitD < 50 ~ "Inadequacy",
vitD >= 50 & vitD <= 125 ~ "Sufficiency"))
```

### Exclude missings

`dat2 = dat1 %>% filter(!is.na(vitD_group), !is.na(Calcium), !is.na(Osteop), Osteop!=9) %>% mutate(Gender = recode_factor(RIAGENDR, `1` = "Men", `2` = "Women"), Osteop = recode_factor(Osteop, `1` = 1, `2` = 0)) head(dat2)`

## SEQN wave RIAGENDR RIDAGEYR vitD Calcium Osteop vitD_group Gender ## 1 41475 2007-2008 2 62 58.8 9.5 0 Sufficiency Women ## 2 41477 2007-2008 1 71 81.8 10.0 0 Sufficiency Men ## 3 41479 2007-2008 1 52 78.4 9.0 0 Sufficiency Men ## 4 41482 2007-2008 1 64 61.9 9.1 0 Sufficiency Men ## 5 41483 2007-2008 1 66 53.3 8.9 0 Sufficiency Men ## 6 41485 2007-2008 2 30 39.1 9.3 0 Inadequacy Women

## Logit regression model

I will use the `glm()`

function to run the logistic regression and then `summary()`

command to get the results.

`fit <- glm(Osteop ~ vitD_group + Calcium + Gender + RIDAGEYR, data = dat2, family = "binomial") summary(fit)`

## ## Call: ## glm(formula = Osteop ~ vitD_group + Calcium + Gender + RIDAGEYR, ## family = "binomial", data = dat2) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.4265 0.1009 0.1894 0.3315 1.0305 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 7.81969 1.08054 7.237 4.59e-13 *** ## vitD_groupInadequacy -0.17444 0.20124 -0.867 0.38603 ## vitD_groupSufficiency -0.53068 0.18159 -2.922 0.00347 ** ## Calcium 0.10330 0.11404 0.906 0.36506 ## GenderWomen -2.08873 0.12298 -16.984 < 2e-16 *** ## RIDAGEYR -0.07127 0.00330 -21.599 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 4591.4 on 10064 degrees of freedom ## Residual deviance: 3553.1 on 10059 degrees of freedom ## AIC: 3565.1 ## ## Number of Fisher Scoring iterations: 7

### Transforms beta's to the odds ratio

The output of `summary()`

does not provide the odds ratio which are often presented in research papers. The `exp`

of beta's give the odds.

`round(exp(coef(fit)), 2)`

## (Intercept) vitD_groupInadequacy vitD_groupSufficiency ## 2489.14 0.84 0.59 ## Calcium GenderWomen RIDAGEYR ## 1.11 0.12 0.93

## Interpreting results

From the output, I see that there is a significant association between vitamin D and osteoporosis. Compared to individuals with deficiency levels of vitamin D, those with sufficient levels of vitamin D in the blood have 41% (odds ratio: 0.59) lower risk of having osteoporosis. Inadequacy of vitamin D is not significantly (p=0.38) associated with osteoporosis.

To get the 95% confidence interval, I use `confit()`

for confidence intervals of each variable.

`round(exp(confint(fit)), 2)`

## 2.5 % 97.5 % ## (Intercept) 298.55 20650.10 ## vitD_groupInadequacy 0.56 1.24 ## vitD_groupSufficiency 0.41 0.83 ## Calcium 0.89 1.39 ## GenderWomen 0.10 0.16 ## RIDAGEYR 0.93 0.94

## Assessing discrimination of the model with ROC curve

When studying a new biomarker, it is essential to illustrate the discrimination ability of the model, in addition to the association with the outcome, osteoporosis in our example. Levels of vitamin D in the blood are known to be related to osteoporosis, but here will show how much discrimination adds to the model.

First, I will run a model without vitamin D and assess the discrimination and after adding vitamin D in the model and see the differences in the ROC curve.

```
# model without vitamin D
fit1 <- glm(Osteop ~ Calcium + Gender + RIDAGEYR,
data = dat2,
family = "binomial")
# model with vitamin D
fit2 <- glm(Osteop ~ vitD_group + Calcium + Gender + RIDAGEYR,
data = dat2,
family = "binomial")
dat2$prob1=predict(fit1,type=c("response"))
dat2$prob2=predict(fit2,type=c("response"))
```

`roc(Osteop ~ prob1, data = dat2)`

## ## Call: ## roc.formula(formula = Osteop ~ prob1, data = dat2) ## ## Data: prob1 in 608 controls (Osteop 1) < 9457 cases (Osteop 0). ## Area under the curve: 0.8496

`roc(Osteop ~ prob2, data = dat2)`

## ## Call: ## roc.formula(formula = Osteop ~ prob2, data = dat2) ## ## Data: prob2 in 608 controls (Osteop 1) < 9457 cases (Osteop 0). ## Area under the curve: 0.8508

There is a slight improvement in discrimination with including vitamin D in the model from 0.8496 to 0.8508.

To learn more about AUC read this post: Interpretation of the AUC.

Related Post

- Linear Regression with Healthcare Data for Beginners in R
- Multiple Linear Regression in Python
- Linear regression in Python
- Logistic Regression with Python
- Linear Regression with Python

**leave a comment**for the author, please follow the link and comment on their blog:

**R Programming – DataScience+**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...