Simple logistic regression on qualitative dichotomic variables

[This article was first published on Statistic on aiR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post we will see briefly how to implement a logistic regression model if you have categorical variables, or qualitative, organized in double entry contingency tables. In this model the dependent variable (Y) and independent variable (X) are both dichotomies (or Bernoullian).

In general, the probability that Y = 1 as a function of predictors is:

$$P(Y=1|X=x)=\pi(x)=\frac{exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}{1+exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}$$

Our goal is to estimate the value of the beta parameters (regressors).

We begin to examine a model of simple logistic regression (with only one predictor).

Consider the following example. The table below shows the results of a study on gastroesophageal reflux. You want to evaluate how the presence of a stress factor can influence the onset of this disease.



First we import the values in R. We must create a table with double entry; proceed as follows:

reflux <- matrix(c(251,131,4,33), nrow=2)
colnames(reflux) <- c("reflNO", "reflYES")
rownames(reflux) <- c("stressNO", "stressYES")
table <- as.table(reflux)

table

          reflNO reflYES
stressNO     251       4
stressYES    131      33


Now adjust the data for the logistic regression. We must create a data frame:

dft <- as.data.frame(table)
dft

       Var1    Var2 Freq
1  stressNO  reflNO  251
2 stressYES  reflNO  131
3  stressNO reflYES    4
4 stressYES reflYES   33



We can now fit the model, and then perform the logistic regression in R:

fit <- glm(Var2 ~ Var1, weights = Freq, data = dft, family = binomial(logit))
summary(fit)


Call:
glm(formula = Var2 ~ Var1, family = binomial(logit), data = dft, 
    weights = Freq)

Deviance Residuals: 
     1       2       3       4  
-2.817  -7.672   5.765  10.287  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -4.1392     0.5040  -8.213  < 2e-16 ***
Var1stressYES   2.7605     0.5403   5.109 3.23e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 250.23  on 3  degrees of freedom
Residual deviance: 205.86  on 2  degrees of freedom
AIC: 209.86

Number of Fisher Scoring iterations: 6



First we comment the code to perform the regression. The logistic regression is called imposing the family: family = binomial(logit). The code Var2 ~ Var1 means that we want to create a model that will explain the variable var2 (presence or absence of reflux) as a function of the variable var1 (presence or absence of stressful events). In practice var2 is the independent variable Y, and Var1 is the dependent variable X (the regressors). Provided the formula to be analyzed, you specify the weight of each variable, data in column Freq of the dataframe dft (so we write weights = Freq and data = dft to specify the location where the values are contained).

The values of the parameters $\beta_0$ and $\beta_1$ are respectively the values (intercept) and Var1stress1. We can then write our empirical model:

$$\pi(x)=\frac{exp(-4.139+2.760x)}{1+exp(-4.139+2.760x)}$$

The independent variable x can be zero or one. If you assume value 0 (ie in the absence of stressful events), then the probability of having reflux is:

$$\pi(x=0)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=0.016=1.6\%$$

If there are stressful events (x = 1), the probability of having reflux is:

$$\pi(x=1)=\frac{exp(\beta_0+\beta_1)}{1+exp(\beta_0+\beta_1)}=0.20=20\%$$

The odds are:

$$odds(x=1)=\frac{\pi(1)}{1-\pi(1)}=exp(\beta_0+\beta_1)$$

$$odds(x=0)=\frac{\pi(0)}{1-\pi(0)}=exp(\beta_0)$$

We can finally calculate the odd ratio OR:

$$OR=\frac{odds(x=1)}{odds(x=0)}=15.807$$

A person who has experienced a stressful event has a propensity to develop gastroesophageal reflux 15.807 times larger than the person who has not undergone stressful events.

The probabilities and the odds can be readily calculated in R recalling that:

fit$coefficient[1] = $\beta_0$ (intercept)
fit$coefficient[2] = $\beta_1$

Furthermore:

summary(fit)$coefficient[1,2] = standard error of $\beta_0$
summary(fit)$coefficient[2,2] = standard error of $\beta_1$

And so we have:

pi0 <- exp(fit$coefficient[1]) / (1 + exp(fit$coefficient[1]))
pi1 <- exp(fit$coefficient[1] + fit$coefficient[2]) / (1 + exp(fit$coefficient[1]+fit$coefficient[2]))

odds0 <- pi0 / (1 - pi0)
odds1 <- pi1 / (1 - pi1)

OR <- odds1 / odds0

#the same result with:
OR <- exp(fit$coefficient[2])

#the confidence interval for OR is:
ORmin <- exp( fit$coefficient[2] - qnorm(.975) * summary(fit)$coefficient[2,2] )

ORmax <- exp( fit$coefficient[2] + qnorm(.975) * summary(fit)$coefficient[2,2] )


We can obtain the same result for the odd-ratio, using the simplify formula:

$$OR=\frac{ad}{bc}=\frac{251\cdot33}{4\cdot131}=15.807$$

that in R is:

OR <- (table[1,1]*table[2,2]) / (table[1,2]*table[2,1])


The acronym AIC stands for Akaike's information criterion. This parameter does not provide any data on the model just created. It
may be useful in comparing this model with other possibly taken into account (the model with lowest AIC is the better).

To leave a comment for the author, please follow the link and comment on their blog: Statistic on aiR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)