In my first blog post, I explained what regression is and how a linear regression model is generated in R. In this post, I will explain what logistic regression is and how a logistic regression model is generated in R.
Let’s first understand logistic regression. Logistic regression is a type of regression used to predict the outcome of a categorical dependent variable (i.e., a variable that takes a limited number of categorical values) based on one or more independent variables. For example, suppose you would like to predict who will win the next T20 World Cup based on the players’ strengths and other details. The outcome being predicted is a categorical variable. Logistic regression can be binomial or multinomial.
In binomial (or binary) logistic regression, the outcome can take only two possible values (e.g., “Yes” or “No”, “Success” or “Failure”). Multinomial logistic regression refers to cases where the outcome can take three or more possible values (e.g., “good” vs. “very good” vs. “best”). In binary logistic regression, the outcome is generally coded as “0” and “1”. We will use binary logistic regression in the rest of this post. Now, let’s look at how a logistic regression model is generated in R.
Logistic regression in R
glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = mydata)
Here, Y is the dependent variable and X1, X2 and X3 are the independent variables. The function takes an additional parameter, family, set to binomial(link = "logit"), which means that the probability distribution of the regression model is binomial and the link function is logit (refer to the book R in Action for more information). Let’s generate a simple model. Suppose we want to predict whether a student will be admitted based on two exam scores. For this problem we have historical data from previous applicants, which can be used as the training data set to build the model. The data set contains the following variables.
- exam_1: Exam 1 score
- exam_2: Exam 2 score
- admitted: 1 if admitted, 0 if not admitted
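The original data file is not reproduced in this post, so the snippet below is only a sketch of what such a data set might look like. It simulates a stand-in with the same three columns. The sample size, the score range and the rule linking scores to admission are all assumptions made for illustration, not properties of the real data.

```r
# Synthetic stand-in for the applicants data (NOT the original data set).
set.seed(42)   # fixed seed so the sketch is reproducible
n <- 100       # hypothetical number of past applicants

# Hypothetical exam scores, drawn uniformly between 30 and 100.
exam_1 <- round(runif(n, min = 30, max = 100), 1)
exam_2 <- round(runif(n, min = 30, max = 100), 1)

# Assumed rule, for illustration only: higher scores raise the chance of admission.
p_admit  <- plogis(-12 + 0.1 * exam_1 + 0.1 * exam_2)
admitted <- rbinom(n, size = 1, prob = p_admit)

data <- data.frame(exam_1, exam_2, admitted)

str(data)             # three columns: exam_1, exam_2, admitted
table(data$admitted)  # how many 0s (not admitted) and 1s (admitted)
```

If you have the real file, you would load it (for example with read.csv()) instead of simulating it.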
Among these variables, admitted takes the value 1 or 0 for each observation. We will now build a model that predicts whether a student will be admitted based on the two exam scores. For this problem, admitted is the dependent variable, and exam_1 and exam_2 are the independent variables. The R code for the model is given below.
>Model_1<-glm(admitted ~ exam_1 +exam_2, family = binomial("logit"), data=data)
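To see what the logit link means in practice, here is a small sketch that fits the same formula to simulated stand-in data (an assumption, since the real data file is not used here). It then checks that the fitted probabilities are the inverse logit of the model's linear predictor. The simulation settings are hypothetical. Only glm(), coef(), predict(), fitted() and plogis() from base R are used.

```r
# Synthetic stand-in data (assumption: the original data file is not used here).
set.seed(42)
n <- 100
data <- data.frame(exam_1 = runif(n, 30, 100), exam_2 = runif(n, 30, 100))
data$admitted <- rbinom(n, 1, plogis(-12 + 0.1 * data$exam_1 + 0.1 * data$exam_2))

# Same model formula and family as in the post.
Model_1 <- glm(admitted ~ exam_1 + exam_2, family = binomial("logit"), data = data)

coef(Model_1)  # intercept plus one slope per exam score, on the log-odds scale

# With the logit link, the linear predictor is the log-odds of admission.
log_odds <- predict(Model_1, type = "link")
prob     <- fitted(Model_1)   # fitted probabilities of admission

# plogis(x) is 1 / (1 + exp(-x)), the inverse of the logit.
# Applying it to the log-odds should reproduce the fitted probabilities.
all.equal(unname(plogis(log_odds)), unname(prob))
```

In other words, the model is linear in the exam scores on the log-odds scale, and the probabilities are obtained by passing that linear combination through the inverse logit.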
After generating the model, let’s use it to make a prediction. Suppose a student scored 60 on exam_1 and 86 on exam_2, and we want to predict whether this student will be admitted. The following R code predicts the probability that the student is admitted.
>in_frame<-data.frame(exam_1=60,exam_2=86)
>predict(Model_1,in_frame, type="response")
The output of predict() with type="response" is a probability score between 0 and 1. If the probability score is greater than 0.5, the prediction is treated as TRUE; if it is less than or equal to 0.5, it is treated as FALSE. In our case, the final output should be 1 or 0 to indicate whether the student will be admitted: 1 means the student will be admitted, and 0 means not. So I have used the round() function to convert the probability score to 0 or 1, as shown below.
>round(predict(Model_1, in_frame, type="response"))
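The same call works for several students at once. The sketch below again uses simulated stand-in data and three hypothetical students (none of these numbers come from the original data set). It compares round() with an explicit comparison against 0.5, to show that both express the same cut-off rule.

```r
# Synthetic stand-in data and model (assumption: not the original data set).
set.seed(42)
n <- 100
data <- data.frame(exam_1 = runif(n, 30, 100), exam_2 = runif(n, 30, 100))
data$admitted <- rbinom(n, 1, plogis(-12 + 0.1 * data$exam_1 + 0.1 * data$exam_2))
Model_1 <- glm(admitted ~ exam_1 + exam_2, family = binomial("logit"), data = data)

# Three hypothetical students, one row each.
new_students <- data.frame(exam_1 = c(35, 60, 90),
                           exam_2 = c(40, 86, 95))

# One predicted probability of admission per row.
prob <- predict(Model_1, new_students, type = "response")

by_round     <- round(prob)             # the approach used in this post
by_threshold <- as.numeric(prob > 0.5)  # the same rule written as an explicit cut-off

cbind(new_students, prob, by_round, by_threshold)
```

The two columns agree because R rounds exactly 0.5 down to 0, so a probability of 0.5 or less gives 0 either way. Writing the comparison out makes it easy to try a different cut-off later.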
An output of 1 means the student will be admitted. We can predict for other observations in the same manner. We have now seen what logistic regression is and how it works in R. If you want to try the same exercise, click here for the R code and sample data set used in the above example. In the next post, we will discuss a specific problem with Google Analytics data and see how to apply logistic regression to it.