Predictive analysis on Web Analytics tool data

[This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers].

In our previous webinar, we discussed predictive analytics and the basics needed to perform predictive analysis. We also discussed an eCommerce problem and how it can be solved using predictive analysis. In this post, I will explain the R script that I used to perform predictive analysis during the webinar.

Before I explain the R script, let me recall the eCommerce problem that we discussed during the webinar, so that you get a better idea of the data and the script. For eCommerce retailers, product returns are a headache, and higher return rates hurt the bottom line of their business. Even a small reduction in the return rate therefore has an effect on total revenue. To reduce the return rate, we need to identify transactions where the probability of product return is high. If we are able to identify those transactions, we can take action before delivering the products and so reduce the return rate.

In the webinar, we discussed that this problem can be solved with predictive analytics using Google Analytics data. To perform predictive analysis we need to go through a modeling process, and the following are its major steps.

  1. Load input data
  2. Introduce model variables
  3. Create model
  4. Check model performance
  5. Apply model on test data

I have included these steps in the R script that we used in the webinar. The script is shown below.

# Step-1 : Read train dataset
train <- read.csv("train.csv")
# Remove TransactionID (assumed to be the first column) from train dataset
train <- train[, -1]
# Step-2 : Model variables are already present as columns of train
# Step-3 : Create model
# Use label ~ . rather than train$label ~ . so that the dot does not
# also pick up label itself as a predictor
model <- glm(label ~ ., family = binomial(), data = train)
# Step-4 : Calculate accuracy of model on the train data
predicted <- round(predict(model, newdata = train, type = "response"))
actual <- train$label
confusion_matrix <- ftable(actual, predicted)
accuracy <- sum(diag(confusion_matrix)) * 100 / length(actual)
# Step-5 : Apply model on test data
# Load test dataset
test <- read.csv("test.csv")
# Predict probability of return for test data
test_predict <- predict(model, newdata = test, type = "response")
# Create label for test dataset, initially 0 for every transaction
label <- rep(0, nrow(test))
# Set label equal to 1 where probability of return > 0.6
label[test_predict > 0.6] <- 1
# Attach label to test dataset
test$label <- label
# Identify TransactionIDs where label is 1
high_prob_transactionIds <- test$TransactionID[test$label == 1]

As you can see, the first step is to load the input data set. In our case the input data are the train data, which are loaded using the read.csv() function. The train data are transaction-level data and include a TransactionID column. TransactionID is not needed in the model, so it is removed from the train data.
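
As a minimal sketch, the identifier can also be dropped by name instead of by position, which does not depend on TransactionID being the first column. The small data frame below is made up for illustration: only TransactionID and label are names taken from this post, the other column names are hypothetical stand-ins for the contents of train.csv.

```r
# Hypothetical stand-in for read.csv("train.csv"); only TransactionID and
# label are names from the post, the other two columns are invented.
train <- data.frame(
  TransactionID = c(1001, 1002, 1003),
  visits_before_purchase = c(3, 1, 7),
  order_value = c(49.9, 120.0, 15.5),
  label = c(0, 1, 0)
)

# Drop the identifier by name so the result does not depend on column order
train <- train[, setdiff(names(train), "TransactionID"), drop = FALSE]

# Remaining columns: the two predictors and the label
names(train)
```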

We also discussed the variables during the webinar. The train data include pre-purchase, in-purchase and some general attributes. We can retrieve these data from Google Analytics.

Next, the model is created using the glm() function, which is given three arguments: formula, family and data. In the formula, the response variable and the predictor variables are separated by the ~ sign; a dot on the right-hand side stands for the columns of the data. The family argument is set to binomial, which makes glm() fit a logistic regression, and data is set to train. Once the model is created, its performance is checked by calculating its accuracy on the train data, as shown in the script.
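
The model and accuracy steps can be tried without the webinar files. The following is a sketch on simulated data, not the webinar dataset: the predictors order_value and items_in_cart, the simulated return probabilities and the 0.5 cut-off are all assumptions made for illustration. It names the response column directly in the formula and fixes the factor levels so that the confusion matrix stays 2 x 2.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for the train data; both predictors are hypothetical
n <- 200
train <- data.frame(
  order_value = runif(n, 10, 200),
  items_in_cart = sample(1:5, n, replace = TRUE)
)
# Simulate returns whose probability rises with order value
p_return <- plogis(-3 + 0.025 * train$order_value)
train$label <- rbinom(n, size = 1, prob = p_return)

# With label ~ . the dot expands to every column other than label
model <- glm(label ~ ., family = binomial(), data = train)

# Classify at 0.5 and compare with the known labels
predicted <- as.integer(predict(model, newdata = train, type = "response") > 0.5)
actual <- train$label

# Fixed factor levels keep the table 2 x 2 even if one class is never predicted
confusion_matrix <- table(
  actual = factor(actual, levels = 0:1),
  predicted = factor(predicted, levels = 0:1)
)
accuracy <- sum(diag(confusion_matrix)) * 100 / length(actual)

confusion_matrix
accuracy
```

Because the accuracy here is measured on the same rows the model was fitted to, it tends to be optimistic compared with accuracy on unseen transactions.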

Finally, the model is applied to the test dataset to predict the probability of product return for each transaction in it. In the script, you can see that I have performed several steps to identify the TransactionIDs in the test data that have a higher probability of product return. Let me explain them. First, the test data are loaded. Second, the predict() function is used to generate the probabilities of product return, which are stored in test_predict. Third, a new variable label is created, which initially contains 0 for all transactions; then, using test_predict, 0 is replaced with 1 wherever the probability of return is greater than 0.6, or 60%. This label is then attached to the test data. Finally, all the TransactionIDs where label is 1 are retrieved, which are the transactions whose probability of product return is greater than 60%.
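
The labelling steps can also be written as a single vectorized comparison. This is a small sketch with made-up probabilities rather than real model output; the TransactionIDs and the values in test_predict are hypothetical.

```r
# Hypothetical test transactions and predicted return probabilities
test <- data.frame(TransactionID = c(2001, 2002, 2003, 2004, 2005))
test_predict <- c(0.12, 0.75, 0.60, 0.91, 0.33)

threshold <- 0.6

# The comparison yields TRUE/FALSE; as.integer() maps these to 1/0
test$label <- as.integer(test_predict > threshold)

# Transactions flagged as likely returns (strictly above the threshold)
high_prob_transactionIds <- test$TransactionID[test$label == 1]
high_prob_transactionIds
```

With these values, transactions 2002 and 2004 are flagged. A probability of exactly 0.6 is not flagged, because the comparison is strict.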

So this is the script that I used during the webinar to perform the predictive analysis. I have created dummy datasets that you can use to perform these steps yourself. You can download the data and the R script from here.

One thing I want to point out: this is not an optimized model; it is a practice model. You can improve it by including other variables from Google Analytics or by performing some optimization tasks, so that you get better results. If you want to look at some other predictive models on web analytics tool data, click here.

Amar Gondaliya

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API.
