(This article was first published on **Tatvic Blog » R**, and kindly contributed to R-bloggers)

After developing a predictive model for transactional product revenue ([Product revenue prediction with R – part 1](http://www.tatvic.com/blog/product-revenue-prediction-with-r/)), we can further improve its predictions by modifying the model. In this post, we will see the steps required for model improvement. With the help of a set of model summary parameters, a data analyst can improve and evaluate a predictive model. Here, I describe how we can choose the best-fitting model for accurate prediction, using the following R functions.
With this technique, we can choose appropriate variables, as well as filter variables, to take into the development of the predictive model. One common and useful trick is to remove outliers from the dataset to make the prediction more accurate.
We can check data ranges and distributions with the histogram function, and take subsets of our dataset to better fit the model and reduce its RSS (Residual Sum of Squares). Removing outliers in this way increases the prediction accuracy of the model. One easy way to detect outliers in our dataset is the hist() function, which plots frequency vs. data values for a single variable; we display it here for just one variable. The output of hist() on the variable xproductviews is given below.
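A minimal sketch of such a call (the column name xproductviews follows the text; the number of breaks is an arbitrary choice):

```r
# Frequency vs. value for a single variable; adjust the column
# name to match your data frame.
hist(data$xproductviews,
     breaks = 50,
     main   = "Distribution of product views",
     xlab   = "xproductviews")
```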
This shows that about 4,000 observations have xproductviews values less than 8,000. Here, we choose observations with xproductviews less than 5,000 when filtering. We can also check the distribution of the data by applying the summary() function to the data variable. The dataset is stored in the data object, whose summary is given below.
Here, we can see the Min., 1st Qu., Median, Mean, 3rd Qu. and Max. values of every explanatory variable. After removing outliers from our dataset, the summary of newdata is shown below. Now, we will develop our second model, model_out, from the newdata object. We then have two models: one (Model1) built with outlier values and the other (Model2) built without them. In model 2, after removing outliers from the explanatory variables, we renamed the variables with the suffix _out. We can choose appropriate variables with two techniques:
In stepwise regression, variables are added to or deleted from the model one at a time until a stopping criterion is reached. For example, in forward stepwise regression we add predictor variables to the model one at a time, stopping when adding variables would no longer improve the model. In backward stepwise regression, we start with a model that includes all predictor variables and then delete them one at a time until removing a variable would degrade the quality of the model.
where RSS (residual sum of squares) = Σ(Actual − Predicted)². This method suggests considering four variables in the predictive model: xcartaddtotalrs_out, xcartremove_out, xprodviewinrs_out and xproductviews_out. The technique is controversial (see this [criticism](http://en.wikipedia.org/wiki/Stepwise_regression#Criticism)), and there is no guarantee that it will find the best model, so we use another technique, all subsets regression, to cross-check the result.
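For a linear model, the quantity that backward stepwise selection minimizes can be written, up to an additive constant, in the standard form (not quoted from the post):

```
AIC = n·log(RSS/n) + 2k
```

where n is the number of observations and k the number of fitted parameters; dropping a variable is accepted only if it lowers this score.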
All subsets regression is implemented by the regsubsets() function from the [leaps](http://cran.r-project.org/web/packages/leaps/index.html) package. This regression suggests the best set of variables graphically, so an analyst may prefer this method for variable selection. It suggests the set of variables having p-values less than 0.05, where the p-value denotes the significance of a variable's presence in the model. With the following commands we can get the subsets of variables. From the resulting graph, we can distinguish which variables to include and which to leave out. The first row of the graph has black strips on xcartaddtotalrs_out, xcartremove_out, xproductviews_out, xuniqprodview_out and xprodviewinrs_out, so these are the variables to include in the model. Now, we will update the model_out variables with this output.
Here, model is the model built with outlier data and model_out the one built without it. We will choose model_out, which has the smaller AIC value, as it is better than model for prediction.
We can check the model's prediction accuracy from summary parameters such as the residual standard error, p-values and R-squared. The theta (coefficient) values for the explanatory variables of a linear model describe a positive or negative relationship between the response variable and each explanatory variable. For example, since we are predicting product revenue, a 1-unit increase in product page views explains a 12.48-unit increase in transactional product revenue (and for xprodviewinrs_out, a 1-unit increase explains a 0.0003705-unit increase in transactional product revenue). We can consider the following points for choosing the model:
With the shrinkage method, we can cross-check the R-squared values on the training and testing datasets. It first splits the dataset into k folds, then uses k−1 of them for the training phase and the remaining one for testing, and calculates R-squared for both. We choose the model with the smaller difference between the training and testing multiple R-squared values.
Below is a snapshot of the cross-validation of the two models. We can see that the change in R-squared for model_out is smaller than for the other model; therefore we choose model_out because of its smaller variance in prediction.

In this post, we have covered model development and evaluation in R. If you want to try it yourself, you can [download the R code + sample dataset](http://www.tatvic.com/blog/downloads/product_revenue_2.rar). In my next post ([Product revenue prediction with R – part 3](http://www.tatvic.com/blog/product-revenue-prediction-with-r-part-3/)), I will explain how to generate predictions for transactional product revenue from our model given an input data object, and also compare it with a Google Prediction API model.

Want us to help you implement or analyze the data for your visitors? [Contact us](http://www.tatvic.com/contact/?ref=blogpost)

**1. Choose Effective variables for the model:**

**Outliers Detection and removal:**

> summary(data)

```
output
Nofinstancesofcartadd Nofuniqueinstancesofcartadd cartaddTotalRsValue
Min. : 0.000 Min. : 0.000 Min. : 0
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0
Median : 0.000 Median : 0.000 Median : 0
Mean : 3.638 Mean : 2.668 Mean : 4207
3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0
Max. :833.000 Max. :622.000 Max. :752186
Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews
Min. : 0.0000 Min. : 0.0000 Min. : 0.00
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 14.75
Median : 0.0000 Median : 0.0000 Median : 44.00
Mean : 0.2553 Mean : 0.1283 Mean : 161.52
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 130.00
Max. :36.0000 Max. :29.0000 Max. :24306.00
cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue ItemrevenuenRs
Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
1st Qu.: 0.0 1st Qu.: 11 1st Qu.: 11883 1st Qu.: 0.0
Median : 0.0 Median : 35 Median : 40194 Median : 0.0
Mean : 301.3 Mean : 130 Mean : 252390 Mean : 64.8
3rd Qu.: 0.0 3rd Qu.: 104 3rd Qu.: 180365 3rd Qu.: 0.0
Max. :29994.0 Max. :20498 Max. :29930894 Max. :80380.0
```

The Min., 1st Qu., Median, Mean, 3rd Qu. and Max. values of each variable should be reasonably close to each other, but here they are very far apart. One possible solution is to filter the data with conditions that keep more closely related observations. With the subset() function, we can take a subset of our dataset using conditions such as xcartadd<200, xcartuniqadd<100, xcartaddtotalrs<2e+05, xcartremove<5, xcardtremovetotal<5, xcardtremovetotalrs<5000, xproductviews<5000 and xuniqprodview<2500, chosen by inspecting the histograms of these variables. We chose these conditions so that the filtered variables retain a large fraction of the original data while having much closer Min., 1st Qu., Median, Mean, 3rd Qu. and Max. values. This removes the outliers from the dataset; the result is stored in newdata.
> newdata <- subset(data,xcartadd<200 & xcartuniqadd<100 & xcartaddtotalrs<2e+05 & xcartremove<5 & xcardtremovetotal<5 & xcardtremovetotalrs<5000 & xproductviews <5000 & xuniqprodview<2500 )

> summary(newdata)

```
output
Nofinstancesofcartadd Nofuniqueinstancesofcartadd cartaddTotalRsValue
Min. : 0.0000 Min. : 0.0000 Min. : 0.0
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0
Median : 0.0000 Median : 0.0000 Median : 0.0
Mean : 0.3275 Mean : 0.1857 Mean : 295.4
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0
Max. :14.0000 Max. :10.0000 Max. :48400.0
Nofinstancesofcartremoval NofUniqueinstancesofcartremoval productviews
Min. :0.0000 Min. :0.00000 Min. : 0.00
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 9.00
Median :0.0000 Median :0.00000 Median :24.00
Mean :0.0436 Mean :0.01666 Mean :30.47
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:47.00
Max. :4.0000 Max. :2.00000 Max. :99.00
cartremoveTotalvalueinRs uniqueproductviews productviewRsvalue ItemrevenuenRs
Min. : 0.00 Min. : 0.00 Min. : 0 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 7077 1st Qu.: 0.00
Median : 0.00 Median :19.00 Median : 19383 Median : 0.00
Mean : 24.22 Mean :24.21 Mean : 45150 Mean : 33.42
3rd Qu.: 0.00 3rd Qu.:38.00 3rd Qu.: 47889 3rd Qu.: 0.00
Max. :4190.00 Max. :91.00 Max. :942160 Max. :989.44
```

model_out <- lm(formula=yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)

**Stepwise Regression:** A model with a lower AIC value fits the data better and is therefore the more appropriate model. We have applied stepwise regression in the backward direction to the above dataset, using the [MASS](http://cran.r-project.org/web/packages/MASS/index.html) package on model_out, which is the model without outliers.

> library(MASS)
> stepAIC(model_out,direction='backward')

```
output
Start: AIC=27799.14
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
    xuniqprodview_out + xprodviewinrs_out

                      Df Sum of Sq      RSS   AIC
- xuniqprodview_out    1     25570 53512589 27799
<none>                             53487020 27799
- xcartaddtotalrs_out  1     47194 53534214 27800
- xcartremove_out      1     48485 53535505 27800
- xproductviews_out    1    185256 53672276 27807
- xprodviewinrs_out    1    871098 54358118 27843

Step: AIC=27798.49
yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out +
    xprodviewinrs_out

                      Df Sum of Sq      RSS   AIC
<none>                             53512589 27799
- xcartaddtotalrs_out  1     39230 53551819 27799
- xcartremove_out      1     50853 53563442 27799
- xprodviewinrs_out    1    940137 54452727 27846
- xproductviews_out    1   2039730 55552319 27902

Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
    xproductviews_out + xprodviewinrs_out)

Coefficients:
        (Intercept)  xcartaddtotalrs_out      xcartremove_out    xproductviews_out
          8.8942468           -0.0023806           11.9088716            1.2072294
  xprodviewinrs_out
         -0.0002675
```

**All Subsets Regression:**

> library(leaps)
> leaps <- regsubsets(yitemrevenue_out ~ xcartadd_out + xcartuniqadd_out + xcartaddtotalrs_out + xcartremove_out + xcardtremovetotal_out + xcardtremovetotalrs_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out,data= newdata)
> plot(leaps,scale="adjr2")

model_out <- lm(formula=yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out + xproductviews_out + xuniqprodview_out + xprodviewinrs_out, data = newdata)

**2. Model Comparisons:**

We can compare models with the AIC() and anova() functions.

**AIC:**

We can check the AIC values of both models (model and model_out) with this function; the model with the smaller AIC fits better. The command for AIC is given below.

> AIC(model,model_out)

```
output
          df      AIC
model     11 72204.46
model_out  7 58937.51
```

**anova:**

We can use this function to choose the better-fitting model among nested models: if the p-value is less than 0.05, the extra terms in the larger model significantly improve the fit. Our two models, with and without outliers, are not nested, so anova() does not apply in this case. This function suits comparing two or three models; for larger numbers of models, stepwise selection or all subsets selection is preferable.

**3. Measure Prediction Accuracy:**

For measuring the prediction accuracy of the model, we check model summary parameters such as the residual standard error, degrees of freedom, multiple R-squared and p-values. The model summary of model_out is shown below.

> summary(model_out)

```
output
Call:
lm(formula = yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
    xproductviews_out + xuniqprodview_out + xprodviewinrs_out,
    data = newdata)

Residuals:
    Min      1Q  Median      3Q     Max
-2671.1  -173.6   -83.4   -42.9 14288.6

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.992e+01  1.254e+01   3.183  0.00147 **
xcartaddtotalrs_out -7.888e-03  2.570e-03  -3.070  0.00216 **
xcartremove_out     -3.410e+01  2.431e+01  -1.403  0.16076
xproductviews_out    1.248e+01  1.222e+00  10.215  < 2e-16 ***
xuniqprodview_out   -1.350e+01  1.487e+00  -9.076  < 2e-16 ***
xprodviewinrs_out    3.705e-04  5.151e-05   7.193 7.62e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 656.4 on 3721 degrees of freedom
Multiple R-squared: 0.1398, Adjusted R-squared: 0.1386
F-statistic: 120.9 on 5 and 3721 DF, p-value: < 2.2e-16
```

**4. Cross validation:**

We can cross-validate our regression model in several ways; here we use two methods:

**Shrinkage method:**
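shrinkage() is not a base-R function; a minimal sketch of one possible implementation, using crossval() from the bootstrap package (an assumption — the post's actual helper may differ):

```r
library(bootstrap)  # provides crossval()

# k-fold cross-validated R-squared for an lm fit; prints the
# shrinkage (Change) between raw and cross-validated R-squared.
shrinkage <- function(fit, k = 10) {
  theta.fit     <- function(x, y) lsfit(x, y)
  theta.predict <- function(fit, x) cbind(1, x) %*% fit$coef

  x <- as.matrix(fit$model[, -1])  # predictor columns
  y <- fit$model[, 1]              # response column

  results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
  r2   <- cor(y, fit$fitted.values)^2  # R-squared on the full data
  r2cv <- cor(y, results$cv.fit)^2     # k-fold cross-validated R-squared

  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2 - r2cv, "\n")
}
```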

> shrinkage(model)

```
output
Original R-square = 0.7109059
10 Fold Cross-Validated R-square = 0.6222664
Change = 0.08863944
```

> shrinkage(model_out)

```
output
Original R-square = 0.1397824
10 Fold Cross-Validated R-square = 0.116201
Change = 0.02358148
```

**80/20 datasets training/testing:**

With this technique, we use 80% of our dataset for the training phase and 20% for the testing phase. That is, we build the model on 80% of the dataset and generate predictions using the remaining 20% as input. The predictions are then compared with the actual values from the 20% holdout. The ratio of correctly predicted values to the total number of observations (in the 20% holdout) then measures the prediction accuracy of the different models.
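A minimal sketch of such a split (assuming the newdata object and model formula from above; the seed and the RMSE error metric are arbitrary choices for illustration):

```r
set.seed(42)                                   # for reproducibility
n <- nrow(newdata)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of rows for training
train <- newdata[train_idx, ]
test  <- newdata[-train_idx, ]

fit <- lm(yitemrevenue_out ~ xcartaddtotalrs_out + xcartremove_out +
            xproductviews_out + xuniqprodview_out + xprodviewinrs_out,
          data = train)

pred <- predict(fit, newdata = test)           # predict on the 20% holdout
rmse <- sqrt(mean((test$yitemrevenue_out - pred)^2))
rmse                                           # compare across candidate models
```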