# Product revenue prediction with R – part 2

October 8, 2012
By

(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

After development of predictive model for transactional product revenue -(Product revenue prediction with R – part 1), we can further improvise the model prediction by modifications in the model. In this post, we will see what are the steps required for model improvement. With the help of a set of model summary parameters, the data analyst can improve and evaluate the predictive model. Here, I have provided the information about how can we choose the best model or more fitted model for accurate prediction. We can do that by following ways using certain R functions.

1. Choose Effective variables for the model
2. Model Comparisons
3. Measure Prediction Accuracy
4. Cross validation

1. Choose Effective variables for the model:

With this technique, we can choose appropriate variable as well as filter variables to take into the development of predictive model. One of the common useful trick is to remove outliers from dataset to make a more accurate prediction.

Outliers Detection and removal:

We can check data ranges or distribution with the help of histogram function, set subsets of our datasets to better fit and reduce the RSS (Residual Sum of Squares) of the model. That will increase the prediction accuracy of the model by removing outliers. One easy way to detect outliers from our dataset is to use histogram function. With hist(), we can check frequency vs data values for a single variable. We have displayed it here for only one  variable. The output of hist() on the variable xproductviews  is given below

It represents that there are about 4000 numbers of observations having value of xproductviews less than 8000. Here, we can choose observations having xproductviews less than 5000 for filteration. We can also check the distribution of data with summary function upon data variable. The dataset is stored in “data” Object, the summary of which is given below.

> summary(data)
output
Min.   :  0.000                Min.   :  0.000             Min.   :     0
1st Qu.:  0.000                1st Qu.:  0.000             1st Qu.:     0
Median :  0.000                Median :  0.000             Median :     0
Mean   :  3.638                Mean   :  2.668             Mean   :  4207
3rd Qu.:  0.000                3rd Qu.:  0.000             3rd Qu.:     0
Max.   :833.000                Max.   :622.000             Max.   :752186
Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews
Min.   : 0.0000               Min.   : 0.0000             Min.   :    0.00
1st Qu.: 0.0000               1st Qu.: 0.0000             1st Qu.:   14.75
Median : 0.0000               Median : 0.0000             Median :   44.00
Mean   : 0.2553               Mean   : 0.1283             Mean   :  161.52
3rd Qu.: 0.0000               3rd Qu.: 0.0000             3rd Qu.:  130.00
Max.   :36.0000               Max.   :29.0000             Max.   :24306.00
cartremoveTotalvalueinRs  uniqueproductviews       productviewRsvalue      ItemrevenuenRs
Min.   :    0.0              Min.   :    0         Min.   :       0        Min.  :   0.0
1st Qu.:    0.0              1st Qu.:   11         1st Qu.:   11883        1st Qu:   0.0
Median :    0.0              Median :   35         Median :   40194        Median:   0.0
Mean   :  301.3              Mean   :  130         Mean   :  252390        Mean  :  64.8
3rd Qu.:    0.0              3rd Qu.:  104         3rd Qu.:  180365        3rd Qu:   0.0
Max.   :29994.0              Max.   :20498         Max.   :29930894        Max.  :80380.0

Here, we can see that every explanatory variable has Min., 1st Qu., Median, Mean, 3rd Qu. and Max. All sequential values should be near to each other but they are very far. One possible solution for this is to filter data with such conditions that would give more related data. With subset function, we can get subset of our dataset with certain conditions like xcartadd<200, xcartuniqadd<100, xcartaddtotalrs<2e+05, xcartremove<5, xcardtremovetotal<5, xcardtremovetotalrs<5000, xproductviews <5000, xuniqprodview<2500 and  xuniqprodview<2500 by considering histogram graph of  these variables. We have choosed above conditions for formatting our dataset variables such that they might have large fraction of original data and nearly similar values of Min., 1st Qu., Median, Mean, 3rd Qu. and Max.  It will remove the outliers from the dataset and then store the dataset to newdata.

> newdata <- subset(data,xcartadd<200 & xcartuniqadd<100 & xcartaddtotalrs<2e+05 & xcartremove<5 & xcardtremovetotal<5 & xcardtremovetotalrs<5000 & xproductviews <5000 & xuniqprodview<2500 )

After removing outliers from our datasets, summary of newdata looks like

> summary(newdata)
output
Min.   : 0.0000                 Min.   : 0.0000             Min.   :    0.0
1st Qu.: 0.0000                 1st Qu.: 0.0000             1st Qu.:    0.0
Median : 0.0000                 Median : 0.0000             Median :    0.0
Mean   : 0.3275                 Mean   : 0.1857             Mean   :  295.4
3rd Qu.: 0.0000                 3rd Qu.: 0.0000             3rd Qu.:    0.0
Max.   :14.0000                 Max.   :10.0000             Max.   :48400.0
Nofinstancesofcartremoval NofUniqueinstancesofcartremoval   productviews
Min.   :0.0000                  Min.   :0.00000             Min.   : 0.00
1st Qu.:0.0000                  1st Qu.:0.00000             1st Qu.: 9.00
Median :0.0000                  Median :0.00000             Median :24.00
Mean   :0.0436                  Mean   :0.01666             Mean   :30.47
3rd Qu.:0.0000                  3rd Qu.:0.00000             3rd Qu.:47.00
Max.   :4.0000                  Max.   :2.00000             Max.   :99.00
cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue    ItemrevenuenRs
Min.   :   0.00           Min.   : 0.00     Min.   :     0        Min.   :  0.00
1st Qu.:   0.00           1st Qu.: 7.00     1st Qu.:  7077        1st Qu.:  0.00
Median :   0.00           Median :19.00     Median : 19383        Median :  0.00
Mean   :  24.22           Mean   :24.21     Mean   : 45150        Mean   : 33.42
3rd Qu.:   0.00           3rd Qu.:38.00     3rd Qu.: 47889        3rd Qu.:  0.00
Max.   :4190.00           Max.   :91.00     Max.   :942160        Max.   :989.44

Now, we will develop our second model model_out with the newdata object.

model_out <- lm(formula=yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)

We have two models, one(Model1) with outlier values and other(Model2) is without outlier values.

1. Model 1 – model (Model with outliers)
2. Model 2 – model_out (Model without outliers)

In model 2, after removing outliers from explanatory variable we have updated variables names with postfix (_out). We can choose appropriate variables with two techniques like

• Stepwise Regression
• All Subsets Regression

Stepwise Regression:

In stepwise Regression, variables are added to or deleted from  model one at a time until  stopping  criterion  is  reached.  For example, in forward stepwise regression we add predictor variables to the model one at a time, stopping when the addition of variables would no longer improve the model. In backward stepwise regression, you start with a model that includes all predictor variables and then delete them one at a time until  removing  variables  would  degrade  the  quality  of  the  model.  Model with lower AIC value will fit the data better, therefore its appropriate model. We have applied Stepwise Regression with backward direction on above dataset. Here, we have applied stepwise regression with MASS package from R on model_out which is without outliers.

> library(MASS)
> stepAIC(model_out,direction='backward')
output
Start:  AIC=27799.14
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
xuniqprodview_out + xprodviewinrs_out
Df Sum of Sq      RSS   AIC
- xuniqprodview_out    1     25570 53512589 27799
53487020 27799
- xcartaddtotalrs_out  1     47194 53534214 27800
- xcartremove_out      1     48485 53535505 27800
- xproductviews_out    1    185256 53672276 27807
- xprodviewinrs_out    1    871098 54358118 27843
Step:  AIC=27798.49
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
xprodviewinrs_out
Df Sum of Sq      RSS   AIC
53512589 27799
- xcartaddtotalrs_out  1     39230 53551819 27799
- xcartremove_out      1     50853 53563442 27799
- xprodviewinrs_out    1    940137 54452727 27846
- xproductviews_out    1   2039730 55552319 27902
Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
xproductviews_out + xprodviewinrs_out)
Coefficients:
8.8942468           -0.0023806           11.9088716            1.2072294
xprodviewinrs_out
-0.0002675

where RSS – Residual sum of square= Σ(Actual-predicted)2

This method suggests us to consider the four variables in the predictive model which are xcartaddtotalrs_out,  xcartremove_out, xprodviewinrs_out and xprodviews_out. This technique is controversial (by this criticism), there’s no guarantee that it will find the best model. So, we have another technique – All Subsets Regression to cross check this result.

All Subsets Regression:

All  subsets  regression  is  implemented  using  the  regsubsets()  function  from  the leaps package. This regression will suggest the best set of variables graphically. Analyst can prefer this method for variable selection. It will suggest the set of variables having p value less than 0.05. p value denotes significance of the existence of variables into the model. With the following set of command we can get the subsets of variables.

> library(leaps)
> leaps <- regsubsets(yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)
> plot(leaps,scale="adjr2")

And Result Graph is:

From above graph, we can distinguish which variables to include and which not to. You can see, the first row of this graph having black strip on xcartaddtotalrs_out, xcartremove_out, xproductviews_out, xuniqprodview_out and xprodviewinrs_out to be considered in to model.

Now, we will update model_out variables with this output

model_out <- lm(formula=yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = newdata)

2. Model Comparisons:
We can compare models with AIC and anova functions.

• AIC
• anova

AIC:
We can check AIC value of both models (model1 and model2) with this function. And distinguish that smaller AIC value model is a better fit. Command for AIC is given below

> AIC(model,model_out)
output
df      AIC
model     11 72204.46
model_out  7 58937.51

Here, model is with outliers data and model_out is without outliers data. Here, we will choose model_out having smaller AIC value as it is a better than model for prediction.

anova:
We can choose better to fit model among nested models with this function. The probability value which is less than 0.05  or smaller is better model to fit the data values. We are having two models with outliers and without outliers which are not nested model, so it will not be applied in this case. This function is for comparing the two or three models, but for large numbers of model we can prefer stepwise selection or subsets selection.

3. Measure Prediction Accuracy:
For measuring the prediction accuracy of the model, we require model summary parameters to be checked. Like Residual standard error, Degrees of freedom, Multiple R squared and p-values. Model summary of model_out looks like below.

> summary(model_out)
output
Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
xproductviews_out + xuniqprodview_out + xprodviewinrs_out,
data = newdata)
Residuals:
Min      1Q  Median      3Q     Max
-2671.1  -173.6   -83.4   -42.9 14288.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.992e+01  1.254e+01   3.183  0.00147 **
xcartaddtotalrs_out -7.888e-03  2.570e-03  -3.070  0.00216 **
xcartremove_out     -3.410e+01  2.431e+01  -1.403  0.16076
xproductviews_out    1.248e+01  1.222e+00  10.215  < 2e-16 ***
xuniqprodview_out   -1.350e+01  1.487e+00  -9.076  < 2e-16 ***
xprodviewinrs_out    3.705e-04  5.151e-05   7.193 7.62e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 656.4 on 3721 degrees of freedom
Multiple R-squared: 0.1398,	Adjusted R-squared: 0.1386
F-statistic: 120.9 on 5 and 3721 DF,  p-value: < 2.2e-16

We can check the model prediction accuracy based on summary parameters like Residual Standard error, p value and  R squared value. The theta (coefficients) values for all the explanatory variables of a linear model, which describe a positive or negative relationship between a response variable and explanatory variable. e.g. Here we are predicting the product revenue so for 12.48 unit increase in transactional product revenue explained by 1 unit increase in product page view (if we check for xprodviewinrs_out , 0.0003705 unit increase in transactional product revenue explained by 1 unit increase in productview in rs). We can consider following points for choosing the model

• RSS, Residual standard error and R squared error. The RSS should be as small as possible. Logically model with RSS value 0 will predict exact as actual value.
• Variable with low (less than 0.5) p value describes significant to be exist in the model.
• R squared error describes the correct prediction probability. From this we can choose the best model from given two models, with lowest Residual standard error high R squared error.
• The lower AIC valued model is a better fit than others.

4. Cross validation:
We can cross validate our regression model with several ways but we are doing this by two methods:

1. Shrinkage method
2. 80/20 datasets training/testing

Shrinkage method:

With shrinkage method, we can cross check values of R squared values of training datasets and testing datasets. It first folds dataset in k subsets and then picks k-1 for training and rest of them for testing phase. Then calaulate R-squared for training and testing. We can choose the model based on lower Multiple R squared difference of training and testing dataset.

Below  is given snap of cross validation of two models

• model(With outliers)
• model_out(Without outliers)
> shrinkage(model)
output
Original R-square = 0.7109059
10 Fold Cross-Validated R-square = 0.6222664
Change = 0.08863944
> shrinkage(model_out)
output
Original R-square = 0.1397824
10 Fold Cross-Validated R-square = 0.116201
Change = 0.02358148

Here we can see the change value for the model_out is lower than another model. Therefore we are considering model_out because of its small variance on prediction.

80/20 datasets training/testing:
With this technique, we can choose 80% of our dataset for training phase and 20% of our dataset for testing phase. That means we can build our model on 80% of the dataset and then prediction is generated on the input as 20% dataset. The output is compared with actual value from 20% of historical dataset. Therefore on the basis of  ratio of correct predicted values to the total observations(from 20% of dataset), we can measure the prediction accuracy of different model.

In this blog, we have done  model development and evaluation in R. If you need to do it yourself in R, you can download R code + sample dataset. In next of my post(Product revenue prediction with R – part 3), I will explain how to generate prediction for transactional product revenue with our model by input data object and also compare it with Google Prediction API model.

### Vignesh Prajapati

Vignesh is Data Engineer at Tatvic. He loves to play with opensource playground to make predictive solution on Big data with R, Hadoop and Google Prediction API.