Predict User’s Return Visit within a day part-2

October 22, 2012

(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

Welcome to the second part of the series on predicting a user's revisit to the website. In my earlier blog post, Logistic Regression with R, I discussed what logistic regression is. In the first part of this series, we applied logistic regression to the available data set; the problem statement there was whether a user will return within the next 24 hours. The model we built has so far shown 88% accuracy in predicting a user's revisit.

In this post, I'll showcase ways to improve this accuracy and take it to the next level. This is more about technical optimization, so if you are a business reader you may want to skip ahead and check how you can use this for your benefit. But if you are a tech whiz or a data modeling guy like me, let's get rolling.

As I discussed in the blog post Improving Bounce Rate Prediction Model for Google Analytics Data, the first step of model improvement is variable selection and the second step is outlier detection (if you want more detail on these steps, refer to that post). Let's apply these steps one by one.

Variable selection

I have used the stepwise backward selection method for variable selection. The R code for it is as below.

> library(MASS)   # provides stepAIC()
> Model_1 <- glm(revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + f.exitPagepath + pageDepth, data = data, family = binomial("logit"))
> stepAIC(Model_1, direction = "backward")
Start:  AIC=2119.37
revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + f.exitPagepath + pageDepth

                     Df Deviance    AIC
- f.exitPagepath    152   1732.4 1966.4
- f.landingPagePath  87   1751.0 2115.0
<none>                    1581.4 2119.4
- pageDepth           1   1583.4 2119.4
- f.medium           11   1656.5 2172.5
- visitCount          1   1740.1 2276.1
- DaySinceLastVisit   1   1826.4 2362.4
Step:  AIC=1966.42
revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + pageDepth

                     Df Deviance    AIC
<none>                    1732.4 1966.4
- pageDepth           1   1738.9 1970.9
- f.landingPagePath 101   1987.5 2019.5
- f.medium           12   1821.2 2031.2
- visitCount          1   1929.3 2161.3
- DaySinceLastVisit   1   1978.4 2210.4
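For readers who want to try the same selection loop on data they have at hand, here is a minimal, self-contained sketch on a built-in data set (mtcars); the model and variables are illustrative assumptions, not the revisit data used in this post.

```r
# Minimal sketch of stepwise backward selection on a built-in data set.
# The response (am) and predictors are illustrative only.
library(MASS)  # provides stepAIC()

full    <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial("logit"))
reduced <- stepAIC(full, direction = "backward", trace = FALSE)

# Backward selection can only keep or lower the AIC of the start model
AIC(reduced) <= AIC(full)
```

With `trace = FALSE` the step-by-step tables shown above are suppressed; leave it at the default to see them.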
Before we interpret the output, let me explain how variables are selected in stepwise backward selection. The method uses AIC as the selection criterion, and the general rule is: the lower the AIC, the better the model. At each step, if removing a variable from the current set decreases the AIC, that variable is dropped; the process repeats until the AIC stops decreasing. From the output, we can see that the AIC decreased when f.exitPagepath was removed, so that variable is excluded from the model. Now we will create a new model (Model_2) that does not include f.exitPagepath. The R code for the new model is as below.

> Model_2 <- glm(revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + pageDepth, data = data, family = binomial("logit"))
After generating the new model, let's check its accuracy. The code is as below.

> predicted_revisit <- round(predict(Model_2, in_d, type = "response"))
> confusion_matrix <- ftable(revisit, predicted_revisit)
> accuracy <- sum(diag(confusion_matrix)) / 2555 * 100
From the output, we can see that the accuracy of the new model has actually decreased, which is not what we hoped for. Variable selection did not help us improve the model, so let's try the second step of model improvement: outlier detection.
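As a side note, the accuracy formula used above (the sum of the confusion matrix diagonal over the total count) can be sanity-checked on simulated labels; everything below, including the 12 flipped labels, is made up for illustration.

```r
# Sketch: accuracy as the diagonal of a confusion matrix, checked on
# simulated labels (our actual revisit data is not reproduced here).
set.seed(1)
actual    <- rbinom(100, 1, 0.5)
predicted <- actual
predicted[1:12] <- 1 - predicted[1:12]  # flip 12 labels -> 88 correct

cm <- table(actual, predicted)
accuracy <- sum(diag(cm)) / sum(cm) * 100  # diagonal = correct predictions
accuracy  # 88
```

Dividing by `sum(cm)` instead of a hard-coded observation count keeps the formula correct when the size of the test set changes.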

Outlier detection

As we know, the data set contains some unreliable observations that degrade the model's quality, so we need to detect outliers and remove them. For numerical variables, outliers can be spotted by examining the histogram of the frequency distribution of each variable's values (the process is described in the blog post Improving Bounce Rate Prediction Model for Google Analytics Data). In our data set there are three numerical variables: visitCount, daySinceLastVisit and pageDepth. I generated a new data set after removing the outliers. Let's create a new model based on this data set and check its accuracy. The R code for the new model is as below.

> Model_3 <- glm(revisit ~ DaySinceLastVisit + visitCount + f.medium + f.landingPagePath + f.exitPagepath + pageDepth, data = data_outlier_removed, family = binomial("logit"))
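This post removes outliers by inspecting histograms by eye; as a programmatic alternative, a common approach is the 1.5 × IQR rule. The sketch below applies it to simulated data, not to our actual variables.

```r
# Sketch: drop values outside 1.5 * IQR of the quartiles.
# Simulated data with two planted outliers, not our data set.
set.seed(2)
x <- c(rnorm(98, mean = 10, sd = 2), 100, -50)

q    <- quantile(x, c(0.25, 0.75))
iqr  <- q[2] - q[1]
keep <- x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr

x_clean <- x[keep]  # the planted outliers (100 and -50) are dropped
```

Applying such a filter per numeric variable, and then subsetting the data frame to the rows that survive all filters, yields a cleaned set analogous to data_outlier_removed.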
Now let's check the accuracy of the new model. The code is as below.

> predicted_revisit <- round(predict(Model_3, in_d, type = "response"))
> confusion_matrix <- ftable(revisit, predicted_revisit)
> accuracy <- sum(diag(confusion_matrix)) / 2292 * 100
From the result, we can see that this model is more accurate than the previous models (Model_1 and Model_2), which is good news. So, by removing the outliers from the data set, the model gained prediction accuracy. For now, we can conclude that with this model (Model_3) we can predict more accurately whether a user will return to the website within the next 24 hours. If you want to try the exercise yourself, click here for the R code and sample data set. In the next blog post, we will discuss logistic regression with the Google Prediction API, check its accuracy on our data set, and try to predict whether a user will return to the website in the next 24 hours.

Would you like to understand the value of predictive analysis applied to web analytics data, and how it can improve your understanding of the relationships between different variables? You may like to watch our webinar, How to perform predictive analysis on your web analytics tool data. Watch the replay now!

Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He is focused on building predictive models based on available data using R, Hadoop and the Google Prediction API.

