**Statistic on aiR**, and kindly contributed to R-bloggers)

In this post we will see briefly how to implement a logistic regression model if you have categorical variables, or qualitative, organized in double entry contingency tables. In this model the dependent variable (Y) and independent variable (X) are both dichotomies (or Bernoullian).

In general, the probability that Y = 1 as a function of **predictors** is:

$$P(Y=1|X=x)=\pi(x)=\frac{exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}{1+exp(\beta_0+\beta_1x_1+\cdots +\beta_kx_k)}$$

Our goal is to estimate the value of the beta parameters (**regressors**).

We begin to examine a **model of simple logistic regression** (with *only one predictor*).

Consider the following example. The table below shows the results of a study on gastroesophageal reflux. You want to evaluate how the presence of a stress factor can influence the onset of this disease.

First we import the values in R. We must create a table with double entry; proceed as follows:

reflux <- matrix(c(251,131,4,33), nrow=2)

colnames(reflux) <- c("reflNO", "reflYES")

rownames(reflux) <- c("stressNO", "stressYES")

table <- as.table(reflux)

table

reflNO reflYES

stressNO 251 4

stressYES 131 33

Now adjust the data for the logistic regression. We must create a data frame:

dft <- as.data.frame(table)

dft

Var1 Var2 Freq

1 stressNO reflNO 251

2 stressYES reflNO 131

3 stressNO reflYES 4

4 stressYES reflYES 33

We can now fit the model, and then perform the logistic regression in R:

fit <- glm(Var2 ~ Var1, weights = Freq, data = dft, family = binomial(logit))

summary(fit)

Call:

glm(formula = Var2 ~ Var1, family = binomial(logit), data = dft,

weights = Freq)

Deviance Residuals:

1 2 3 4

-2.817 -7.672 5.765 10.287

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.1392 0.5040 -8.213 < 2e-16 ***

Var1stressYES 2.7605 0.5403 5.109 3.23e-07 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 250.23 on 3 degrees of freedom

Residual deviance: 205.86 on 2 degrees of freedom

AIC: 209.86

Number of Fisher Scoring iterations: 6

First we comment the code to perform the regression. The logistic regression is called imposing the family: `family = binomial(logit)`

. The code `Var2 ~ Var1`

means that we want to create a model that will explain the variable var2 (presence or absence of reflux) as a function of the variable var1 (presence or absence of stressful events). In practice var2 is the independent variable Y, and Var1 is the dependent variable X (the regressors). Provided the formula to be analyzed, you specify the weight of each variable, data in column `Freq`

of the dataframe `dft`

(so we write `weights = Freq`

and `data = dft`

to specify the location where the values are contained).

The values of the parameters $\beta_0$ and $\beta_1$ are respectively the values `(intercept)`

and `Var1stress1`

. We can then write our empirical model:

$$\pi(x)=\frac{exp(-4.139+2.760x)}{1+exp(-4.139+2.760x)}$$

The independent variable x can be zero or one. If you assume value 0 (ie in the absence of stressful events), then the probability of having reflux is:

$$\pi(x=0)=\frac{exp(\beta_0)}{1+exp(\beta_0)}=0.016=1.6\%$$

If there are stressful events (x = 1), the probability of having reflux is:

$$\pi(x=1)=\frac{exp(\beta_0+\beta_1)}{1+exp(\beta_0+\beta_1)}=0.20=20\%$$

The odds are:

$$odds(x=1)=\frac{\pi(1)}{1-\pi(1)}=exp(\beta_0+\beta_1)$$

$$odds(x=0)=\frac{\pi(0)}{1-\pi(0)}=exp(\beta_0)$$

We can finally calculate the odd ratio OR:

$$OR=\frac{odds(x=1)}{odds(x=0)}=15.807$$

A person who has experienced a stressful event has a propensity to develop gastroesophageal reflux 15.807 times larger than the person who has not undergone stressful events.

The probabilities and the odds can be readily calculated in R recalling that:

`fit$coefficient[1]`

= $\beta_0$ (intercept)`fit$coefficient[2]`

= $\beta_1$

Furthermore:

`summary(fit)$coefficient[1,2]`

= standard error of $\beta_0$`summary(fit)$coefficient[2,2]`

= standard error of $\beta_1$

And so we have:

pi0 <- exp(fit$coefficient[1]) / (1 + exp(fit$coefficient[1]))

pi1 <- exp(fit$coefficient[1] + fit$coefficient[2]) / (1 + exp(fit$coefficient[1]+fit$coefficient[2]))

odds0 <- pi0 / (1 - pi0)

odds1 <- pi1 / (1 - pi1)

OR <- odds1 / odds0

#the same result with:

OR <- exp(fit$coefficient[2])

#the confidence interval for OR is:

ORmin <- exp( fit$coefficient[2] - qnorm(.975) * summary(fit)$coefficient[2,2] )

ORmax <- exp( fit$coefficient[2] + qnorm(.975) * summary(fit)$coefficient[2,2] )

We can obtain the same result for the odd-ratio, using the simplify formula:

$$OR=\frac{ad}{bc}=\frac{251\cdot33}{4\cdot131}=15.807$$

that in R is:

OR <- (table[1,1]*table[2,2]) / (table[1,2]*table[2,1])

The acronym AIC stands for **Akaike's information criterion**. This parameter does not provide any data on the model just created. It

may be useful in comparing this model with other possibly taken into account (the model with lowest AIC is the better).

**leave a comment**for the author, please follow the link and comment on their blog:

**Statistic on aiR**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...