Welcome to the third part of this series of posts. In the previous post, I discussed the data points we need to perform predictive analysis. In this post I will discuss the solution approach, the methodology behind it, and its implementation in R. Before we move ahead, let us recall the prediction objective set in the first part and the data points required for the analysis.
In the first part, we discussed two major concerns of eCommerce:
- Who will revisit the site in the next couple of days?
- Who is likely to buy?
We then started addressing the first problem in the second part and tried to identify data points that could help our analysis. We listed the following data points for building the predictive model.
- Visitor Id
- Visitor Type
- Landing Page
- Exit Page
- Page depth
- Average time on site
- Page views
- Unique page views
- Visit count
- Days since last visit
To this list I would like to add one more data point: medium (organic or direct visit). You can extract these data points from Google Analytics for your web property.
Now let us move on to the predictive analysis itself and set the prediction objective according to our requirement. We will also see how the data points above are useful in the analysis.
In predictive analysis it is essential to set the prediction objective properly. In our case we want to identify whether a user will revisit in the next couple of days, so our prediction objective is revisit prediction for users.
Once the prediction objective is defined, the next step in the predictive analysis life cycle is gathering the data points that serve it. We have already discussed the data points that can be extracted from Google Analytics, so let me show how you can extract them using R. There is a package on CRAN called RGoogleAnalytics, and a very nice post on how to extract data from Google Analytics with it. Here I will show the code that I used for extracting data from Google Analytics for my web property pingax.com. I have used an earlier version of the RGoogleAnalytics package developed by Vignesh Prajapati.
# Load the package (an earlier release of RGoogleAnalytics is assumed here)
library(RGoogleAnalytics)

# Authorize your account and paste the access token
query <- QueryBuilder()
access_token <- query$authorize()

# Create a new Google Analytics API object
ga <- RGoogleAnalytics()
ga.profiles <- ga$GetProfileData(access_token)

# List the GA profiles
ga.profiles

# Build the query string
query$Init(start.date = "2014-04-09",
           end.date = "2014-12-09",
           dimensions = "ga:dimension1,ga:medium,ga:landingPagePath,ga:exitPagePath,ga:userType,ga:sessionCount,ga:daysSinceLastSession",
           metrics = "ga:sessions,ga:pageviews,ga:uniquePageviews,ga:sessionDuration",
           #sort = "ga:visits",
           max.results = 11000,
           table.id = paste("ga:", ga.profiles$id, sep = "", collapse = ","),
           access_token = access_token)

# Make a request to get the data from the API
ga.data <- ga$GetReportData(query)

# Look at the returned data
head(ga.data)

# Save extracted data points
write.csv(ga.data, "data.csv", row.names = FALSE)
You can use this code to extract the data points above. Let me know if you need help extracting them from Google Analytics for your web property; I would be glad to help. Next, I want to discuss how we identified which data points to extract. Since we are interested in a user's revisit, we have to look at the behaviour of his/her previous visit. That means we have to extract session-level data for the previous visit. All the data points listed above are session-level information, and they are self-explanatory.
Based on the information from a user's previous visit, we want to predict the probability score of a revisit in the next few days. So session-level information about the previous visit is useful in our analysis.
Now let us move on to the machine learning technique that can be used in our analysis. The prediction objective clearly suggests a classification task: we want to classify users by whether they will revisit or not, along with a probability score. So in our case we will use logistic regression. We will not go deeper into logistic regression here, otherwise this whole post would be spent explaining it ☺ (however, if you want to read more about it, you can visit this post).
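To make the "probability score" a little more concrete: logistic regression computes a weighted sum of the predictors and passes it through the logistic function, so the result always lies between 0 and 1. The sketch below uses base R's plogis() for that step. The intercept, the coefficients and the two example sessions are invented for illustration and are not estimated from the pingax.com data.

```r
# Illustrative only: the intercept and coefficients are invented,
# not estimated from the pingax.com data.
intercept <- -2.0
coef_pageviews <- 0.4
coef_new_visitor <- -0.8

# Two hypothetical sessions: a returning visitor with 6 page views
# and a new visitor with 1 page view.
pageviews <- c(6, 1)
new_visitor <- c(0, 1)

# Linear predictor (log-odds) for each session
log_odds <- intercept + coef_pageviews * pageviews + coef_new_visitor * new_visitor

# plogis() is the logistic function 1 / (1 + exp(-x));
# it maps log-odds to a probability between 0 and 1
revisit_prob <- plogis(log_odds)
print(round(revisit_prob, 3))
```

With these made-up numbers the first session gets the higher score, simply because its log-odds are larger.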
Now we are in a good position to start with the implementation. We will use R for our analysis and develop a logistic regression model. I have developed this R code for proof-of-concept (POC) purposes only; an actual implementation might differ from it. Here I have used actual data points from pingax.com to create the predictive model.
Before we create the model, we have to pre-process the data. My data is in the data.csv file, so it can be read with the following R code. (You can download the dataset, which is provided at the end of the post.)
#Read dataset
data <- read.csv("data.csv", stringsAsFactors = FALSE)
I have stored my visitor Id in the dimension1 variable, so I have used the following code to rename it.
#Rename column name
names(data)[names(data) == 'dimension1'] <- 'visitorId'
My dataset has two variables, landing page path and exit page path. These variables contain URLs, and we need to remove the query parameters from them because we need only the page path itself. It can be done using the following R code.
#Remove query parameters from the url
#("\\?" matches a literal question mark; the backslash must be doubled inside an R string)
data$landingPagePath <- gsub("\\?.*", "", data$landingPagePath)
data$landingPagePath <- gsub("http://pingax.com/", "", data$landingPagePath)
data$exitPagePath <- gsub("\\?.*", "", data$exitPagePath)
data$exitPagePath <- gsub("http://pingax.com/", "", data$exitPagePath)
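Here is a small, self-contained sketch of the same two cleaning steps on a handful of made-up paths, so you can see what each gsub() call removes. The example URLs are hypothetical and are not taken from the real dataset.

```r
# Hypothetical page paths; the slugs and query strings are invented for illustration
paths <- c("http://pingax.com/some-post/?utm_source=feed",
           "http://pingax.com/about/",
           "another-post/?page=2")

# Drop everything from the first "?" onward (the query string)
clean <- gsub("\\?.*", "", paths)

# Drop the site prefix; fixed = TRUE treats the pattern as a literal string
clean <- gsub("http://pingax.com/", "", clean, fixed = TRUE)

print(clean)
```

Paths that have neither a query string nor the site prefix pass through unchanged, so it is safe to run the same two calls over a whole column.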
We are ready with the dataset for model creation, but wait, we have not yet set our predictor and response variables. So it is time to set the response variable (the variable we want to predict). Our raw dataset has no response variable, so we have to create it. The basic idea is to consider a visit a revisit if the visit count is greater than one and the days since last session is less than 10. This can be achieved with the following R code.
#Response variable
data$revisit <- as.numeric(data$sessionCount > 1 & data$daysSinceLastSession < 10)
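To see how this rule labels individual sessions, here is a minimal sketch on a toy data frame. The four rows are invented values; only the column names match the Google Analytics export.

```r
# Toy sessions with invented values, using the same column names as the GA export
sessions <- data.frame(
  sessionCount = c(1, 3, 5, 2),
  daysSinceLastSession = c(0, 2, 30, 9)
)

# A session counts as a revisit when the visitor has been here before
# (sessionCount > 1) and came back within 10 days
sessions$revisit <- as.numeric(sessions$sessionCount > 1 &
                               sessions$daysSinceLastSession < 10)

print(sessions)
```

The first row is a first-time visit and the third row is a return after 30 days, so both get 0; the other two rows get 1.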
Now our response variable is created and stored in the data frame under the name revisit. The variable 'revisit' holds two values, 0 and 1. Generally we call this the class (i.e. we are performing a classification task). Next, let us remove the visitor ID, because it only holds the unique identity of the visitor and will not help in model creation.
#Remove visitor id (dropped by name, so the column position does not matter)
data$visitorId <- NULL
Now the dataset is ready to be plugged into the algorithm to create the model. Here we will do one more thing: we will split the data into two parts, 1) train_data and 2) test_data. We will create the model using the train set and apply it to the test set to measure its accuracy. The reason for this split is to cross-check how well the model performs on unseen data (i.e. the test set).
We will keep 80% of the original dataset in train_data and the remaining 20% in test_data. Rows are selected by random sampling, as shown in the code below.
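#Training and testing data
set.seed(123)  # any fixed seed; added only so that the random split is repeatable
index <- sample(seq_len(nrow(data)), size = floor(nrow(data) * 0.8))
train_data <- data[index, ]
test_data <- data[-index, ]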
To see the distribution of the classes in both sets, the following code can be used.
#Distribution
table(train_data$revisit)
table(test_data$revisit)
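Raw counts are hard to compare between an 80% set and a 20% set, so it can help to look at class shares instead. The sketch below repeats the split on a synthetic stand-in dataset and uses prop.table() for that. The toy data, the seed and the names toy, train_toy and test_toy are assumptions made only for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Synthetic stand-in for the real dataset: 100 rows, roughly 30% revisits
toy <- data.frame(pageviews = rpois(100, 3),
                  revisit = rbinom(100, 1, 0.3))

# 80/20 random split
index <- sample(seq_len(nrow(toy)), size = floor(nrow(toy) * 0.8))
train_toy <- toy[index, ]
test_toy <- toy[-index, ]

# Class shares rather than raw counts, so the two sets are directly comparable
print(prop.table(table(train_toy$revisit)))
print(prop.table(table(test_toy$revisit)))
```

If the shares in the two sets differ a lot, the random split may have been unlucky, and it can be worth drawing it again or sampling within each class.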
Now it’s time to create the model. As discussed earlier in this post, we will use logistic regression. In R, a logistic regression model is fitted with the glm() function. It takes several arguments; here we use formula, data and family.
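#Logistic regression model
#(bare column name on the left of ~, so that "." expands to every other column and not to revisit itself)
logit.model <- glm(revisit ~ ., data = train_data, family = binomial("logit"))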
In glm(), we specify the response and predictor variables separated by a tilde sign (~). In our case the response variable is revisit and the predictors are all the other variables in the data, which the dot (.) stands for. Because we fit the model on the training data, we set the data parameter to train_data.
This creates the model and stores it in the logit.model variable, which contains the model parameters and coefficients along with several other measures.
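If you want to look inside a fitted model object, coef() and summary() are the usual starting points. The sketch below fits a glm() on synthetic data so that it runs on its own. The toy data frame, its "true" relationship and the name toy.model are invented for illustration and say nothing about the real pingax.com model.

```r
set.seed(1)  # arbitrary seed for repeatability

# Synthetic data: the relationship below is invented for illustration
n <- 500
toy <- data.frame(pageviews = rpois(n, 3),
                  sessionDuration = rexp(n, rate = 1 / 120))
toy$revisit <- rbinom(n, 1, plogis(-1.5 + 0.4 * toy$pageviews))

# Response on the left of ~, "." for all remaining columns
toy.model <- glm(revisit ~ ., data = toy, family = binomial("logit"))

# Estimated coefficients on the log-odds scale
print(coef(toy.model))

# Full summary: coefficients with standard errors, deviance and AIC
print(summary(toy.model))
```

A positive coefficient means that larger values of that predictor push the predicted revisit probability up, and a negative one pushes it down.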
Once the model is generated, we apply it to the test data to measure the prediction accuracy. The test data set already has the variable revisit, but we will predict it for the data points in the test set using the model trained on the train data.
Here is the code for generating predictions for the test data.
#Apply model on testing data (the last column, revisit, is left out of newdata)
test.predicted.prob <- predict(logit.model, newdata = test_data[, -ncol(test_data)], type = "response")
test.predicted <- round(test.predicted.prob)
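Rounding the predicted probability amounts to using a cutoff of 0.5, with one subtlety: R's round() sends a value of exactly 0.5 to 0. An explicit cutoff makes the decision rule visible and easy to change. This is a minimal sketch with made-up probabilities; the names prob, cutoff and thresholded are hypothetical.

```r
# Hypothetical predicted probabilities
prob <- c(0.10, 0.49, 0.50, 0.51, 0.93)

# round() rounds halves to the even digit (IEC 60559), so 0.5 becomes 0
rounded <- round(prob)

# An explicit cutoff states the decision rule directly and is easy to change
cutoff <- 0.5
thresholded <- as.numeric(prob >= cutoff)

print(data.frame(prob, rounded, thresholded))
```

The two columns differ only for the probability that sits exactly on the cutoff. In practice the bigger advantage of the explicit form is that you can try a different cutoff if the business need calls for it.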
Now we will create the confusion matrix of actual versus predicted values to see how well the prediction model has done, and calculate the accuracy from it.
#Confusion matrix (rows: actual class, columns: predicted class)
confusion_matrix <- table(test_data$revisit, test.predicted)

#Model accuracy (percentage of correctly classified test rows)
accuracy <- sum(diag(confusion_matrix)) * 100 / nrow(test_data)
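One thing to watch for: if the model never predicts one of the classes, table() returns a table with a single column, and diag() no longer picks out the correctly classified counts. A possible safeguard, sketched here on made-up labels, is to fix the factor levels so the table is always 2 x 2. The vectors actual and predicted are hypothetical.

```r
# Hypothetical actual and predicted classes for ten test sessions
actual <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
predicted <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)  # the model never predicts class 0

# Fixing the levels keeps the table 2 x 2 even when a class is never predicted,
# so diag() always picks the correctly classified counts
cm <- table(actual = factor(actual, levels = c(0, 1)),
            predicted = factor(predicted, levels = c(0, 1)))
print(cm)

# Accuracy = correctly classified sessions / all sessions, as a percentage
accuracy <- sum(diag(cm)) * 100 / sum(cm)
print(accuracy)
```

Here eight of the ten sessions are classified correctly, even though the model is not distinguishing the classes at all, which is one reason the class distribution is worth checking alongside accuracy.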
When I created this model I got 98% accuracy on the test data predictions. I encourage you to try this code and dataset (click here to download). Keep in mind that revisit was derived from sessionCount and daysSinceLastSession, which are also among the predictors, so part of this accuracy reflects the model recovering that rule. There are other measures for checking model performance; I will discuss them in another post. This model can now be used for several purposes, depending on the business need.
So, this is how predictive analysis is used to solve a business question. So far we have tried to solve the first question; in the next post we will discuss the second question and try to solve it using predictive analysis. If you want to implement a predictive model for your web property, contact us; we would love to create it for you ☺