In my earlier post, I discussed what logistic regression is and how a logistic model is generated in R. Now we will apply that learning to a specific prediction problem. In this post, I will create a basic model to predict whether a user will return to a website within the next 24 hours. The outcome depends on user characteristics as well as website characteristics, but here we will predict it from a few measures (i.e. the user’s visits, the user’s landing page path, the user’s exit page path, etc.). Our predicted outcome is 1 or 0, where 1 stands for “Yes” and 0 stands for “No”. Let’s discuss a possible data set for building a logistic regression model.
For this problem, we have collected the data of a website from Google Analytics. The data set contains the following parameters.
Let’s understand the parameters first. The first parameter is visitor_ID, which is the id of the visitor. The second parameter is visitCount, which contains values in increasing order (i.e. for a particular visitor, visitCount is 1 on the first visit to the site, 2 on the second visit, and so on). The third parameter is daySinceLastVisit, which contains the difference in days between two consecutive visits. The fourth one is medium, which contains categorical values (i.e. organic, referral, etc.). The fifth parameter is landingPagePath, which contains a string value representing the entrance page of the user for each visit. The sixth parameter is exitPagePath, which contains a string value representing the exit page of the user for each visit. The last parameter is pageDepth, which contains the number of pages a user has visited during a single visit.
Here our goal is to predict whether a user will return to the website within the next 24 hours. From the collected data, we can say that a user has come back if his visitCount is more than 1 and his daySinceLastVisit is less than or equal to 1. Based on this criterion, we have generated a new variable named revisit, which contains the value “1” or “0” for each user. “1” indicates that the user has come back and “0” indicates that the user has not come back. This variable (revisit) is considered the dependent variable, and visitCount, daySinceLastVisit, medium, landingPagePath, exitPagePath and pageDepth are considered the independent variables. Let’s generate the model.
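The rule above can be sketched in R as follows. This is only a minimal illustration: the data frame name visits and its rows are made up for the example, and only the column names visitCount and daySinceLastVisit come from the description of the data set.

```r
# Hypothetical sample rows; the values are invented for illustration only.
visits <- data.frame(
  visitCount        = c(1, 2, 3, 2, 5),
  daySinceLastVisit = c(0, 1, 4, 0, 1)
)

# revisit is 1 when the visitor has been on the site before (visitCount > 1)
# and the gap since the previous visit is at most one day; otherwise it is 0.
visits$revisit <- ifelse(visits$visitCount > 1 & visits$daySinceLastVisit <= 1, 1, 0)

print(visits)
```

Only the second, fourth and fifth rows satisfy both conditions, so only those rows get revisit = 1.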
Before generating a model, let us discuss one issue: the data set contains categorical variables, so how do we deal with them? In my post Linear Regression using R, only numeric values were considered, but in this logistic regression we also need to consider categorical values. There are many solutions for this issue, but I have used dummy variable coding.
In dummy coding, a variable with K categories is usually entered into a regression as a sequence of K-1 dummy variables. Our data set has three categorical variables: medium, landingPagePath and exitPagePath, which contain 14, 102 and 167 categories respectively. Generally we do not append the dummy variables to the data set by hand; instead, we create a contrast matrix for each categorical variable. I have done dummy coding for our categorical variables. We will not go into the details of the coding scheme, because one post is not enough to explain dummy coding. We will deal only with our actual prediction problem.
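To make the K-1 idea concrete, here is a small sketch using R's default treatment contrasts. The factor below is a made-up stand-in with only three categories (the real medium variable has 14), and this is not necessarily the exact coding scheme I used for the data set.

```r
# Hypothetical factor with K = 3 categories, invented for illustration.
medium <- factor(c("organic", "referral", "cpc", "organic", "cpc"))

# Default treatment contrasts give a K x (K-1) matrix; the first level
# (alphabetically "cpc") is the reference category and is coded as all zeros.
contrast_matrix <- contrasts(medium)
print(contrast_matrix)

# model.matrix() expands the factor into an intercept plus K-1 dummy columns,
# the same expansion that glm() applies to factor predictors.
design <- model.matrix(~ medium)
print(colnames(design))
```

With three categories we get two dummy columns, one for organic and one for referral, while cpc is absorbed into the intercept.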
Let’s generate the regression model based on the data set. The R code for our model is as below.
> Model_1 <- glm(revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + f.exitPagepath + pageDepth, data = data, family = binomial("logit"))
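Once a model like this is fitted, predicted classes can be derived from its predicted probabilities. The sketch below shows one possible way to do that on simulated data. The data frame sim, the object fit, the simulated values and the 0.5 cutoff are all assumptions made for the example; they are not taken from the actual data set.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for the real data set (all values are invented).
n <- 500
sim <- data.frame(
  visitCount = sample(1:10, n, replace = TRUE),
  pageDepth  = sample(1:15, n, replace = TRUE),
  f.medium   = factor(sample(c("organic", "referral", "direct"), n, replace = TRUE))
)
# Simulated outcome that depends on visitCount, so the model has some signal.
sim$revisit <- rbinom(n, 1, plogis(-2 + 0.4 * sim$visitCount))

# Same kind of model as above, with fewer predictors.
fit <- glm(revisit ~ visitCount + pageDepth + f.medium,
           data = sim, family = binomial("logit"))

# type = "response" returns probabilities between 0 and 1.
predicted_prob <- predict(fit, type = "response")

# Assumed cutoff: a probability above 0.5 is treated as a predicted revisit.
predicted_revisit <- as.integer(predicted_prob > 0.5)
actual_revisit    <- sim$revisit

print(head(data.frame(actual_revisit, predicted_prob, predicted_revisit)))
```

A different cutoff than 0.5 may suit a data set where one class is much rarer than the other.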
I am not going to show the summary of the model here, because it is too large to view. The question then arises: how do we decide the effectiveness of the model without the summary? I have chosen an alternative measure, the accuracy of the model. Accuracy is the percentage of observations for which the model’s prediction matches the actual value. Through the accuracy, we can decide the effectiveness of the model. Following is the R code snippet to calculate the accuracy of our model.
> confusion_matrix <- ftable(actual_revisit, predicted_revisit)
> accuracy <- sum(diag(confusion_matrix)) / 2555 * 100
In the above R code, I have used the ftable() function to generate the confusion matrix, which is used in calculating the accuracy of the model. The sum of its diagonal is the number of correct predictions, and 2555 is the total number of observations. We will not discuss the confusion matrix in detail here, because it is out of the scope of this post. For more detail, refer to the wiki page of the confusion matrix. From the output, we can see that the accuracy of our model is 88.22%, which is good for us. But we can increase the accuracy if we improve the model. If the accuracy of the model is above 95%, then we can rely on its predictions with more confidence. Before we generate predictions, we will improve the model first and then predict using the improved model. If you want to do the same exercise, click here for the R code and sample data set. In the next post, we will discuss model improvement and check the accuracy of the improved model.