Product revenue prediction with R – part 2

October 8, 2012
By Vignesh Prajapati

(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

style="text-align: justify">After development of predictive model for class="GRcorrect">transactional product revenue -( href="http://www.tatvic.com/blog/product-revenue-prediction-with-r/" >Product revenue prediction with R – part 1), we can further improvise the model prediction by modifications in the model. In this post, we will see what are the steps required for model improvement. With the help of a set of model summary parameters, the data analyst can improve and evaluate the predictive model. Here, I have provided the information about how can we choose the best model or more fitted model for accurate prediction. We can do that by following ways using certain R functions.

  1. Choose effective variables for the model
  2. Model Comparisons
  3. Measure Prediction Accuracy
  4. Cross-validation
style="margin-bottom: .571em">1. Choose Effective variables class="GRcorrect">for the model:

With this technique, we can choose appropriate variables, as well as filter variables, to take into the development of the predictive model. One common and useful trick is to remove outliers from the dataset to make the prediction more accurate.

style="margin-bottom: .571em"> class="GRcorrect">Outliers Detection and removal:

style="text-align: justify">We can check data ranges or distribution with the help of histogram function, set subsets of our datasets to better fit and reduce the RSS (Residual Sum of Squares) of the model. That will increase the prediction accuracy of the model by removing class="GRcorrect">outliers. One easy way to detect class="GRcorrect">outliers from our class="GRcorrect">dataset is to use histogram function. With class="GRcorrect">hist class="GRcorrect">(), we can check frequency class="GRcorrect">vs data values for a single variable. We have displayed it here for only one  variable. The output of class="GRcorrect">hist class="GRcorrect">() on the variable class="GRcorrect">xproductviews  is given below

href="http://www.tatvic.com/blog/wp-content/uploads/2012/09/hist_pageviews.png"> class="wp-image-3083 alignleft" src="http://www.tatvic.com/blog/wp-content/uploads/2012/09/hist_pageviews-300x147.png" alt="" width="300" height="147" />

style="text-align: justify">It represents that there are about 4000 numbers of observations having value of class="GRcorrect">xproductviews less than 8000. Here, we can choose observations having class="GRcorrect">xproductviews less than 5000 for class="GRcorrect">filteration. We can also check the distribution of data with summary function upon data variable. The class="GRcorrect">dataset is stored in “data” Object, the summary of which is given below.

> summary(data)
output
Nofinstancesofcartadd     Nofuniqueinstancesofcartadd    cartaddTotalRsValue
Min.   :  0.000                Min.   :  0.000             Min.   :     0
1st Qu.:  0.000                1st Qu.:  0.000             1st Qu.:     0
Median :  0.000                Median :  0.000             Median :     0
Mean   :  3.638                Mean   :  2.668             Mean   :  4207
3rd Qu.:  0.000                3rd Qu.:  0.000             3rd Qu.:     0
Max.   :833.000                Max.   :622.000             Max.   :752186
Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews
Min.   : 0.0000               Min.   : 0.0000             Min.   :    0.00
1st Qu.: 0.0000               1st Qu.: 0.0000             1st Qu.:   14.75
Median : 0.0000               Median : 0.0000             Median :   44.00
Mean   : 0.2553               Mean   : 0.1283             Mean   :  161.52
3rd Qu.: 0.0000               3rd Qu.: 0.0000             3rd Qu.:  130.00
Max.   :36.0000               Max.   :29.0000             Max.   :24306.00
cartremoveTotalvalueinRs  uniqueproductviews       productviewRsvalue      ItemrevenuenRs
Min.   :    0.0              Min.   :    0         Min.   :       0        Min.  :   0.0
1st Qu.:    0.0              1st Qu.:   11         1st Qu.:   11883        1st Qu:   0.0
Median :    0.0              Median :   35         Median :   40194        Median:   0.0
Mean   :  301.3              Mean   :  130         Mean   :  252390        Mean  :  64.8
3rd Qu.:    0.0              3rd Qu.:  104         3rd Qu.:  180365        3rd Qu:   0.0
Max.   :29994.0              Max.   :20498         Max.   :29930894        Max.  :80380.0
style="text-align: justify">Here, we can see that every explanatory variable has Min., 1 class="GRcorrect">st class="GRcorrect">Qu., Median, Mean, 3 class="GRcorrect">rd class="GRcorrect">Qu. class="GRcorrect">and Max. All sequential values should be near to each other but they are very far. One possible solution for this is to filter data with such conditions that would give more related data. With subset function, we can get class="GRcorrect">subset of our class="GRcorrect">dataset with certain conditions like  class="GRcorrect">xcartadd<200,  class="GRcorrect">xcartuniqadd<100,  class="GRcorrect">xcartaddtotalrs<2e+05, class="GRcorrect">xcartremove<5, class="GRcorrect">xcardtremovetotal<5, class="GRcorrect">xcardtremovetotalrs<5000, class="GRcorrect">xproductviews <5000,  class="GRcorrect">xuniqprodview<2500 and  class="GRcorrect">xuniqprodview<2500 by considering class="GRcorrect">histogram graph of  these variables. We have class="GRcorrect">choosed above conditions for formatting our class="GRcorrect">dataset variables such that they might have class="GRcorrect">large fraction of class="GRcorrect">original data and nearly similar values of Min., 1 class="GRcorrect">st  class="GRcorrect">Qu., Median, Mean, 3 class="GRcorrect">rd  class="GRcorrect">Qu. class="GRcorrect">and Max.  It will remove the class="GRcorrect">outliers from the class="GRcorrect">dataset and then store the class="GRcorrect">dataset to class="GRcorrect">newdata.

> newdata <- subset(data,xcartadd<200 & xcartuniqadd<100 & xcartaddtotalrs<2e+05 & xcartremove<5 & xcardtremovetotal<5 & xcardtremovetotalrs<5000 & xproductviews <5000 & xuniqprodview<2500 )

After removing outliers from our dataset, the summary of newdata looks like this:

> summary(newdata)
output
Nofinstancesofcartadd        Nofuniqueinstancesofcartadd    cartaddTotalRsValue
Min.   : 0.0000                 Min.   : 0.0000             Min.   :    0.0
1st Qu.: 0.0000                 1st Qu.: 0.0000             1st Qu.:    0.0
Median : 0.0000                 Median : 0.0000             Median :    0.0
Mean   : 0.3275                 Mean   : 0.1857             Mean   :  295.4
3rd Qu.: 0.0000                 3rd Qu.: 0.0000             3rd Qu.:    0.0
Max.   :14.0000                 Max.   :10.0000             Max.   :48400.0
Nofinstancesofcartremoval NofUniqueinstancesofcartremoval   productviews
Min.   :0.0000                  Min.   :0.00000             Min.   : 0.00
1st Qu.:0.0000                  1st Qu.:0.00000             1st Qu.: 9.00
Median :0.0000                  Median :0.00000             Median :24.00
Mean   :0.0436                  Mean   :0.01666             Mean   :30.47
3rd Qu.:0.0000                  3rd Qu.:0.00000             3rd Qu.:47.00
Max.   :4.0000                  Max.   :2.00000             Max.   :99.00
cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue    ItemrevenuenRs
Min.   :   0.00           Min.   : 0.00     Min.   :     0        Min.   :  0.00
1st Qu.:   0.00           1st Qu.: 7.00     1st Qu.:  7077        1st Qu.:  0.00
Median :   0.00           Median :19.00     Median : 19383        Median :  0.00
Mean   :  24.22           Mean   :24.21     Mean   : 45150        Mean   : 33.42
3rd Qu.:   0.00           3rd Qu.:38.00     3rd Qu.: 47889        3rd Qu.:  0.00
Max.   :4190.00           Max.   :91.00     Max.   :942160        Max.   :989.44

Now we will develop our second model, model_out, with the newdata object.

model_out <- lm(formula=yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)

We have two models: one (Model 1) with outlier values and the other (Model 2) without outlier values.

  1. Model 1 – model (model with outliers)
  2. Model 2 – model_out (model without outliers)

In Model 2, after removing outliers from the explanatory variables, we have renamed the variables with the postfix _out. We can choose appropriate variables with two techniques:

  • Stepwise Regression
  • All Subsets Regression
style="margin-bottom: .571em">Stepwise Regression:

In stepwise regression, variables are added to or deleted from the model one at a time until a stopping criterion is reached. For example, in forward stepwise regression we add predictor variables to the model one at a time, stopping when adding variables would no longer improve the model. In backward stepwise regression, we start with a model that includes all predictor variables and then delete them one at a time until removing a variable would degrade the quality of the model. The model with the lower AIC value fits the data better, so it is the more appropriate model. Here, we have applied stepwise regression in the backward direction, using the MASS package (http://cran.r-project.org/web/packages/MASS/index.html) on model_out, which is without outliers.

> library(MASS)
> stepAIC(model_out,direction='backward')
output
Start:  AIC=27799.14
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
    xuniqprodview_out + xprodviewinrs_out
                      Df Sum of Sq      RSS   AIC
- xuniqprodview_out    1     25570 53512589 27799
<none>                           53487020 27799
- xcartaddtotalrs_out  1     47194 53534214 27800
- xcartremove_out      1     48485 53535505 27800
- xproductviews_out    1    185256 53672276 27807
- xprodviewinrs_out    1    871098 54358118 27843
Step:  AIC=27798.49
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
    xprodviewinrs_out
                      Df Sum of Sq      RSS   AIC
<none>                           53512589 27799
- xcartaddtotalrs_out  1     39230 53551819 27799
- xcartremove_out      1     50853 53563442 27799
- xprodviewinrs_out    1    940137 54452727 27846
- xproductviews_out    1   2039730 55552319 27902
Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
    xproductviews_out + xprodviewinrs_out)
Coefficients:
        (Intercept)  xcartaddtotalrs_out      xcartremove_out    xproductviews_out
          8.8942468           -0.0023806           11.9088716            1.2072294
  xprodviewinrs_out
         -0.0002675

class="GRcorrect">where RSS – Residual sum of square= Σ class="GRcorrect">(Actual-predicted)2

This method suggests considering four variables in the predictive model: xcartaddtotalrs_out, xcartremove_out, xproductviews_out and xprodviewinrs_out. This technique is controversial (see the criticism at http://en.wikipedia.org/wiki/Stepwise_regression#Criticism); there is no guarantee that it will find the best model. So we have another technique, all subsets regression, to cross-check this result.

style="margin-bottom: .571em">All Subsets Regression:

All subsets regression is implemented using the regsubsets() function from the leaps package (http://cran.r-project.org/web/packages/leaps/index.html). This regression suggests the best set of variables graphically, so an analyst may prefer this method for variable selection. It suggests the set of variables whose p-value is less than 0.05, where the p-value denotes the significance of a variable's presence in the model. With the following commands, we can get the subsets of variables.

> library(leaps)
> leaps <- regsubsets(yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)
> plot(leaps,scale="adjr2")

The resulting graph is:

[Figure: regsubsets plot, scale="adjr2" (http://www.tatvic.com/blog/wp-content/uploads/2012/09/final_leaps.png)]

From the above graph, we can distinguish which variables to include and which to leave out. You can see that the first row of the graph has a black strip over xcartaddtotalrs_out, xcartremove_out, xproductviews_out, xuniqprodview_out and xprodviewinrs_out, so these are the variables to consider in the model.

Now we will update the model_out variables with this output:

model_out <- lm(formula=yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = newdata)

2. Model Comparisons:

We can compare models with the AIC and anova functions.

  • AIC
  • class="GRcorrect">anova

AIC:

We can check the AIC value of both models (Model 1 and Model 2) with this function; the model with the smaller AIC value is the better fit. The command for AIC is given below.

> AIC(model,model_out)
output
         df      AIC
model     11 72204.46
model_out  7 58937.51

Here, class="GRcorrect">model is with class="GRcorrect">outliers data and model_out is without class="GRcorrect">outliers data. Here, we will choose model_out class="GRcorrect">having smaller AIC value as it is a better than class="GRcorrect">model for prediction.

class="GRcorrect">anova: /> We can choose better to fit class="GRcorrect">model class="GRcorrect">among nested models with this function. The probability value which is less than 0.05  or smaller is better model to fit the data values. We are having two models with class="GRcorrect">outliers and without class="GRcorrect">outliers which are not nested model, so it will not be applied in this case. This function is for comparing the two or three models, but for large numbers of model we can prefer stepwise selection or class="GRcorrect">subsets selection. 

style="text-align: justify">3. Measure Prediction Accuracy: /> For measuring the prediction accuracy of the model, we require model summary parameters to be checked. Like Residual standard error, Degrees of freedom, Multiple R squared and p-values. Model summary of model_out looks like below.

> summary(model_out)
output
Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
    xproductviews_out + xuniqprodview_out + xprodviewinrs_out,
    data = newdata)
Residuals:
    Min      1Q  Median      3Q     Max
-2671.1  -173.6   -83.4   -42.9 14288.6
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.992e+01  1.254e+01   3.183  0.00147 **
xcartaddtotalrs_out -7.888e-03  2.570e-03  -3.070  0.00216 **
xcartremove_out     -3.410e+01  2.431e+01  -1.403  0.16076
xproductviews_out    1.248e+01  1.222e+00  10.215  < 2e-16 ***
xuniqprodview_out   -1.350e+01  1.487e+00  -9.076  < 2e-16 ***
xprodviewinrs_out    3.705e-04  5.151e-05   7.193 7.62e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 656.4 on 3721 degrees of freedom
Multiple R-squared: 0.1398,	Adjusted R-squared: 0.1386
F-statistic: 120.9 on 5 and 3721 DF,  p-value: < 2.2e-16

We can check the model's prediction accuracy based on summary parameters such as the residual standard error, p-values and R-squared value. The theta values (coefficients) of the explanatory variables in a linear model describe a positive or negative relationship between the response variable and each explanatory variable. For example, here we are predicting product revenue, so a 1-unit increase in product page views explains a 12.48-unit increase in transactional product revenue (likewise for xprodviewinrs_out: a 1-unit increase in product view value in Rs explains a 0.0003705-unit increase in transactional product revenue). We can consider the following points when choosing the model (a short sketch of extracting these quantities follows the list):

  • RSS, residual standard error and R-squared. The RSS should be as small as possible; logically, a model with an RSS of 0 would predict the actual values exactly.
  • A variable with a low p-value (less than 0.05) is significant and belongs in the model.
  • The R-squared value describes how much of the variation in the response the model explains. On this basis, of the two given models, we choose the one with the lower residual standard error and higher R-squared.
  • The model with the lower AIC value is a better fit than the others.
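These summary quantities can also be extracted from the fitted model programmatically; a minimal sketch using model_out and standard accessors from base R:

> s <- summary(model_out)
> s$sigma          # residual standard error
> s$r.squared      # multiple R-squared
> coef(s)          # coefficient table with p-values
> AIC(model_out)   # AIC, for comparison with other models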

4. Cross-validation:

We can cross-validate our regression model in several ways, but here we do it with two methods:

  1. Shrinkage method
  2. 80/20 datasets training/testing
style="margin-bottom: 0.571em;text-align: justify">Shrinkage method:

style="text-align: justify">With shrinkage method, we can cross check values of R squared values of training datasets and testing class="GRcorrect">datasets. It first folds class="GRcorrect">dataset class="GRcorrect">in k subsets and then picks k-1 class="GRcorrect">for training and rest of them for testing phase. Then class="GRcorrect">calaulate R-squared for training and testing. We can choose the model based on lower Multiple R squared difference of training and testing class="GRcorrect">dataset.

Below is a snapshot of the cross-validation of the two models:

  • class="GRcorrect">model class="GRcorrect">(With class="GRcorrect">outliers)
  • model_out class="GRcorrect">(Without class="GRcorrect">outliers)
> shrinkage(model)
output
Original R-square = 0.7109059
10 Fold Cross-Validated R-square = 0.6222664
Change = 0.08863944
> shrinkage(model_out)
output
Original R-square = 0.1397824
10 Fold Cross-Validated R-square = 0.116201
Change = 0.02358148

Here we can see that the change value for model_out is lower than that of the other model. Therefore, we choose model_out because of its smaller variance in prediction.

80/20 dataset training/testing:

With this technique, we use 80% of our dataset for the training phase and 20% for the testing phase. That means we build the model on 80% of the dataset and then generate predictions using the remaining 20% as input. The output is compared with the actual values from that 20% of the historical dataset. On the basis of the ratio of correctly predicted values to the total observations (from the 20% of the dataset), we can measure the prediction accuracy of different models.
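A minimal sketch of such a split, assuming newdata and the variable names used earlier (the partition is drawn at random):

> set.seed(123)                                   # reproducible split
> idx <- sample(nrow(newdata), floor(0.8 * nrow(newdata)))
> train <- newdata[idx, ]                         # 80% for model building
> test  <- newdata[-idx, ]                        # 20% held out for testing
> fit80 <- lm(yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = train)
> pred <- predict(fit80, newdata = test)
> sqrt(mean((pred - test$yitemrevenue_out)^2))    # compare predictions with actual values (RMSE)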

In this blog, we have covered model development and evaluation in R. If you want to do it yourself in R, you can download the R code + sample dataset (http://www.tatvic.com/blog/downloads/product_revenue_2.rar). In my next post (Product revenue prediction with R – part 3: http://www.tatvic.com/blog/product-revenue-prediction-with-r-part-3/), I will explain how to generate predictions for transactional product revenue with our model from an input data object, and also compare it with a Google Prediction API model.

Want us to help you implement or analyze the data for your visitors? Contact us: http://www.tatvic.com/contact/

class="wp-about-author-containter-top" style="background-color:#FFEAA8;"> class="wp-about-author-pic"> src="http://www.tatvic.com/blog/wp-content/uploads/userphoto/15.jpg" alt="Vignesh Prajapati" width="60" class="photo" />
class="wp-about-author-text">

href='http://www.tatvic.com/blog/author/vignesh/' title='Vignesh Prajapati'>Vignesh Prajapati

Vignesh is Data Engineer at Tatvic. He loves to play with opensource playground to make predictive solution on Big data with R, Hadoop and Google Prediction API.
Google Plus profile: href="https://plus.google.com/118111147659915240293/posts">Vignesh Prajapati

align="right" style="float: right; clear:left; padding: 0px 5px 0px 7px;"> name="fb_share" type="box_count" share_url="http://www.tatvic.com/blog/product-revenue-prediction-with-r-part-2/">
