Predictive analysis in eCommerce part-3

[This article was first published on Pingax » R, and kindly contributed to R-bloggers.]

Welcome to the third part of this series. In the previous post, I discussed the data points we need to perform predictive analysis. In this post I will discuss the solution approach, the methodology it requires, and its implementation in R. Before we move ahead, let us recall the prediction objective set in the first part and the data points required to perform the analysis.

In the first part, we discussed two major concerns of eCommerce:

  1. Who will revisit the sites in next couple of days?
  2. Who are likely to buy?

In the second part we started addressing the first problem and tried to identify data points that can help in our analysis. We listed the following data points for building the predictive model.

  1. Visitor Id
  2. Visitor Type
  3. Landing Page
  4. Exit Page
  5. Page depth
  6. Average time on site
  7. Page views
  8. Unique page views
  9. Visit count
  10. Days since last visit

To this list I would like to add one more data point: the medium (e.g. organic or direct visit). You can extract all of these data points from Google Analytics for your web property.

Now let us jump into the predictive analysis itself, set the prediction objective as per our requirement, and see how the above data points will be useful in our analysis.

In predictive analysis it is essential to set the prediction objective properly. In our case we want to know whether a user will revisit in the next couple of days, so we are interested in predicting user revisits. Our prediction objective is revisit prediction.

Once we have defined the prediction objective, the next step in the predictive analysis life cycle is gathering the data points that support it. We have already discussed the data points that can be extracted from Google Analytics; let me now show how you can extract them using R. There is a package on CRAN called RGoogleAnalytics, and a very nice post on how to extract data from Google Analytics with it. Here I will show the code I used to extract data from Google Analytics for my web property, pingax.com. I used an earlier version of the RGoogleAnalytics package developed by Vignesh Prajapati.

# Load the package
library(RGoogleAnalytics)

# Build the query object, authorize your account
# and paste the access token when prompted
query <- QueryBuilder()
access_token <- query$authorize()

# Create a new Google Analytics API object
ga <- RGoogleAnalytics()
ga.profiles <- ga$GetProfileData(access_token)

# List the GA profiles 
ga.profiles

#Build the query string 
query$Init(start.date = "2014-04-09",
          end.date = "2014-12-09",
          dimensions = "ga:dimension1,ga:medium,ga:landingPagePath,ga:exitPagePath,ga:userType,ga:sessionCount,ga:daysSinceLastSession",
          metrics = "ga:sessions,ga:pageviews,ga:uniquePageviews,ga:sessionDuration",
          #sort = "ga:visits",
          max.results = 11000,
          table.id = paste("ga:",ga.profiles$id[3],sep="",collapse=","),
          access_token=access_token)

# Make a request to get the data from the API
ga.data <- ga$GetReportData(query)

# Look at the returned data
head(ga.data)

#Save extracted data points
write.csv(ga.data,"data.csv",row.names=F)


You can use this code to extract the data points listed above. Let me know if you need help extracting them from Google Analytics for your web property; I would be glad to help. Here I want to discuss how we identified which data points to extract for our analysis. Since we are interested in a user's revisit, we have to examine the behaviour of his/her previous visit, which means we have to extract session-level data for that visit. All of the data points listed above are session-level information, and they are self-explanatory.

Based on the information about a user's previous visit, we want to predict a probability score for a revisit in the next few days, so the session-level information of the previous visit is exactly what our analysis needs.

Now let us move to the machine learning technique we will use. The prediction objective clearly suggests a classification task: we want to classify users by whether they will revisit or not, along with a probability score. We will therefore use logistic regression. We will not go deep into logistic regression here, otherwise this whole post would be spent explaining it ☺ (if you want to read more about it, you can visit this post).
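To give a feel for what logistic regression does, here is a minimal sketch: it combines the predictors into a linear score and passes that score through the logistic (sigmoid) function to get a probability between 0 and 1. The coefficients and predictor value below are made up purely for illustration, not fitted values.

```r
# Logistic regression maps a linear score to a probability in (0, 1).
# The intercept, coefficient and predictor value are illustrative only.
intercept      <- -1.5
coef_pageviews <- 0.8
pageviews      <- 3

linear_score <- intercept + coef_pageviews * pageviews  # 0.9
prob_revisit <- plogis(linear_score)  # 1 / (1 + exp(-0.9))
round(prob_revisit, 2)  # ~0.71: predicted probability of a revisit
```

A score of 0 maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0.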

Now we are in a good position to start with the implementation. We will use R to develop the logistic regression model. I have developed this R code for proof-of-concept purposes only; an actual implementation might differ. Here I have used actual data points from pingax.com to create the predictive model.

Before we create the model, we have to pre-process the data. My data is in the file data.csv, so it can be read with the following R code. (You can download the dataset provided at the end of the post.)

#read Dataset
data <- read.csv("data.csv",stringsAsFactors=F)

The visitor ID is stored in the dimension1 variable, so I used the following code to rename it.

#Rename column name
names(data)[names(data) == 'dimension1'] <- 'visitorId'

The dataset has two variables, landing page path and exit page path, which contain URLs. We need to strip the query parameters from these URLs because we only need the absolute page path. This can be done with the following R code.

#Remove query parameters and the domain from the URLs
data$landingPagePath <- gsub("\\?.*","",data$landingPagePath)
data$landingPagePath <- gsub("http://pingax.com/","",data$landingPagePath)
data$exitPagePath <- gsub("\\?.*","",data$exitPagePath)
data$exitPagePath <- gsub("http://pingax.com/","",data$exitPagePath)

We are almost ready to create the model, but wait: we have not yet set our predictor and response variables. So it is time to create the response variable (the variable we want to predict). Our raw dataset does not contain a response variable, so we have to create it. The basic idea is to label a session as a revisit if the visit count is greater than one and the days since the last session are fewer than 10. This is achieved by the following R code.

#Response variable
data$revisit <- as.numeric(data$sessionCount > 1 & data$daysSinceLastSession < 10)

Our response variable is now created and stored in the data frame under the name revisit. It holds two values, 0 and 1, which are the classes of our classification task. Next, let us remove the visitor ID, because it is a unique identifier of the visitor and will not help in model creation.

#Remove visitor id
data <- data[,-1]

Now the dataset is ready to be fed into the algorithm. Here we will do one more thing: split the data into two parts, 1) train_data and 2) test_data. We will create the model on the training set and apply it to the test set to measure its accuracy. The reason for this split is to crosscheck how well the model performs on unseen data (i.e. the test set).

We will keep 80% of the original dataset in train_data and the remaining 20% in test_data. Rows are selected by random sampling, as shown in the code below.

#Training and testing data
index <- sample(1:nrow(data),size=nrow(data)*0.8)
train_data <- data[index,]
test_data <- data[-index,]

To see the distribution of the classes in both sets, the following code can be used.

#Distribution
table(train_data$revisit)
table(test_data$revisit)

Now it is time to create the model. As discussed earlier in this post, we will use logistic regression. In R, a logistic regression model is fitted with the glm() function, which takes several arguments.

#Logistic regression model
logit.model <- glm(revisit ~ ., data = train_data, family = binomial("logit"))

In glm(), we specify the response and predictor variables separated by the tilde sign (~). In our case the response variable is revisit and the predictors are the rest of the variables in the data. Because we fit the model on the training data, we set the data argument to train_data.

This creates the model and stores it in the logit.model variable, which contains the fitted coefficients along with several other measures.
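Once fitted, the model object is worth inspecting before using it for prediction. Here is a minimal sketch on synthetic data (the data and column names below are invented for illustration, not the real Google Analytics export) showing the usual next steps:

```r
# Fit a small logistic regression on synthetic data and inspect it.
# The toy data frame and its columns are invented for illustration only.
set.seed(1)
toy <- data.frame(pageviews = rpois(200, 5),
                  sessionDuration = rexp(200, 1/120))
toy$revisit <- as.numeric(toy$pageviews + rnorm(200) > 5)

m <- glm(revisit ~ ., data = toy, family = binomial("logit"))
summary(m)     # coefficient estimates, standard errors, z-values
exp(coef(m))   # odds ratios: multiplicative change in odds per unit increase
```

An odds ratio above 1 means the predictor increases the odds of a revisit; below 1 means it decreases them.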

Once the model is fitted, we apply it to the test data to measure its prediction accuracy. The test set already contains the actual revisit values, so we can compare them against predictions made by the model trained on the training data.

Here is the code for generating predictions on the test data:

#Apply model on the testing data
test.predicted.prob <- predict(logit.model, newdata = test_data[,-ncol(test_data)], type = "response")
test.predicted <- round(test.predicted.prob)
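Note that round() on the probabilities is just an implicit 0.5 cutoff. If the business cares more about catching likely revisitors (or about avoiding false alarms), the cutoff can be made explicit and tuned. A small sketch, where both the probabilities and the 0.7 cutoff are illustrative values:

```r
# An explicit classification threshold instead of round()'s implicit 0.5.
# The probability vector and the 0.7 cutoff are illustrative only.
threshold <- 0.7
probs <- c(0.15, 0.64, 0.71, 0.92)           # example predicted probabilities
predicted_class <- as.numeric(probs > threshold)
predicted_class  # 0 0 1 1
```

Raising the threshold makes the model more conservative about predicting a revisit; lowering it does the opposite.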

Now we create a confusion matrix of actual versus predicted values to see how well the model has done, and calculate its accuracy.

#Confusion matrix
confusion_matrix <- table(test_data$revisit,test.predicted)

#Model Accuracy
accuracy <- sum(diag(confusion_matrix))*100/nrow(test_data)
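As a quick worked example of the accuracy formula above (the counts in this matrix are made up, not my actual results): correct predictions sit on the diagonal of the confusion matrix, so accuracy is the diagonal sum divided by the total.

```r
# Illustrative 2x2 confusion matrix with invented counts:
# rows = actual class, columns = predicted class
cm <- matrix(c(50, 3, 2, 45), nrow = 2,
             dimnames = list(actual = c(0, 1), predicted = c(0, 1)))
accuracy <- sum(diag(cm)) * 100 / sum(cm)
accuracy  # (50 + 45) / 100 * 100 = 95
```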

When I built this model I got 98% accuracy on the test data; I encourage you to try this code and dataset (click here to download). There are other measures of model performance, which I will discuss in another post. This model can now be used for several purposes, depending on the business need.

So, this is how predictive analysis is used to solve a business question. We have now addressed the first question; in the next post we will discuss the second question and try to solve it using predictive analysis. If you want a predictive model implemented for your web property, contact us, we would love to create it for you ☺

Happy reading!!!
