Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I often need to build a predictive model that estimates rates. The example of our age is: ad click through rates (how often a viewer clicks on an ad estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You could try linear regression, but specialized tools often do much better. For rate problems involving estimating probabilities and frequencies we recommend logistic regression. For non-frequency (and non-categorical) rate problems (such as forecasting yield or purity) we suggest beta regression.

In this note we will work a toy problem and suggest some relevant R analysis libraries.There are three primary settings for rate estimation problems:

1. Each data row represents a single observation and y is a 0/1 variable, categorical or logical. In this case we come to rates by asking for a probability forecast of how likely a given row has y=1.
2. Each data row represents many observations and y is a frequency between 0 and 1. In this case we need an additional weight to represent how many observations each row represents (so we can pass this detail on to various fitters). This representation is just a different encoding of the first setting.
3. Each data row represents a single observation and y is an observed quantity between 0 and 1. We think of y in this case as being a non-frequency rate (since each row represents a single event) and examples include y’s that measure purity, fullness, or yield.

To get away from online advertising for a moment consider the following (artificial) problem: predicting y as a function of x1 and x2 for the following data.

``` d <- data.frame( y= c( 1, 1, 1, 0, 0, 0), x1=c(-1000000,20000,-1, 2,10000,10000), x2=c( 1000000,30000, 0, 1, 1000, 1000)) ```

This problem is just an encoding of the law `x1 < x2`, but because the relative magnitudes of `x1` and `x2` are varying so much a linear regression can not pick up the relation:

``` lm <- lm(y ~ x1 + x2, data = d) predict(lm) ## 1 2 3 4 5 6 ## 1.0039682 0.8441012 0.2216910 0.2217397 0.3542499 0.3542499 ```

Notice how the predictions don’t have a cut-point separating “y” (items 1,2,3) from “n” (items 4,5,6). This is despite the fact that the linear form `x2-x1` is a perfect decision surface. The issue is linear regression is looking for a scoring function (not a decision surface) and is punished if it predicts out of the (0,1) range. If the answer is “1″ and the linear model predicts “5″ this counts as a lot of error. So a lot of the power of the linear model is wasted trying to push values back into the range 0 to 1.

There are many good modeling tools that are specialized to correctly predict categories and probabilities. To name a few: decision trees, random forests, k nearest neighbor, and support vector machines. However one of our favorites is generalized linear models and in particular logistic regression.

We demonstrate logistic regression as follows:

``` lr <- glm(y ~ x1 + x2, data = d, family=binomial(link='logit')) predict(lr,type='response') ## 1 2 3 4 5 6 ## 1.000000e+00 1.000000e+00 9.998699e-01 1.301089e-04 2.220446e-16 2.220446e-16 ```

Notice how the predictions now have a cut-point separating “y=0″ from “y=1″. The logistic regression (the generalized linear model with the logistic link) comes with a built in range of 0 through 1, so it doesn’t need to spend modeling resources (error budget and degrees of freedom) staying in that range (allowing the learned scoring function to in fact ofter work well as a separator or decision procedure).

Logistic regression should always be considered when trying to estimate probabilities of frequencies. It is an efficient method that tends to work well and has useful probabilistic derivations and interpretations. For estimating rates that don’t arise from category probabilities or frequencies you can still try logistic regression (and many other common generalized linear models), but we suggest also trying a method called beta regression. Beta regression allows the user to specify:

1. One formula and link for the rate or mean estimate.
2. One formula and link for a per-example error model.

When you specify the structure of the error model (using a `|` in your model formula) beta regression then fits both a mean estimate and an error model simultaneously. This is pretty powerful, but it does require you to specify what factors are important to the error model.

Consider the following (artificial) non-frequency problem: estimating the yield rate of a gasoline production facility. We take this example from the R package betareg. The code below shows beta regression giving an excellent model where each row of the data is a single measurement whose outcome is a rate (see plot below).

``` install.packages('betareg') library(betareg) data("GasolineYield", package = "betareg") set.seed(52352) GasolineYield\$rgroup <- sample(1:100, size=dim(GasolineYield)[],replace=T) GTrain <- subset(GasolineYield, GasolineYield\$rgroup50) gy <- betareg(yield ~ gravity + pressure + temp | gravity + pressure + temp, data = GTrain) print(summary(gy)) GTest\$model <- predict(gy,newdata=GTest) library(ggplot2) ggplot(data=GTest,aes(x=model,y=yield)) + geom_point() + geom_abline(slope=1) ``` This is a great fit (pseudo R-squared of 0.93). Though we really didn’t see any improvement over what lm() would have delivered (which itself has an R-squared of 0.97). Also notice we get properly bounded predictions, even without specifying a non-identity link. But we do have access to the argument sensitive error model which lets the following code produce our next figure: predictions with uncertainty ranges.

``` GTest\$modelErr <- sqrt(predict(gy,newdata=GTest, type='variance')) ggplot(data=GTest,aes(x=model,y=yield)) + geom_point() + geom_errorbarh(aes(xmin=model-modelErr,xmax=model+modelErr)) + geom_abline(slope=1) ``` This is pretty useful in practice.

So far we have been using beta regression on data where each row is a single measurement carrying a rate (not a frequency or fraction of success, but per-row measurement like how pure a product is). We could also attempt to use beta regression for data where each row is a single example that is a success or failure and we are trying to estimate rates (as we did using glm()). The first issue is to even attempt this we must first make sure our y’s are in the open interval (0,1). One such way to do this is as follows:

``` d\$yCollared <- pmin(pmax(1/dim(d)[],d\$y),1-1/dim(d)[]) bm <- betareg(yCollared ~ x1 + x2, data = d, link='logit') predict(bm,newdata=d,type='response') ## 1 2 3 4 5 6 ## 0.7497785 0.6791359 0.3401818 0.3402066 0.4105687 0.4105687 ```

And we seem to get a poor result very similar to linear regression (not separating the training examples, but at least all in the range (0,1)). Frankly the package seems to not be very sensitive to my attempts to set link and/or link.phi for this example. That is a hint we are not using the method as intended. The observation is: you want to try beta regression when you are estimating non-frequency rates and not when estimating probabilities or frequencies.

For a more on regression and logistic regression please see Zumel, Mount, “Practical Data Science with R” Chapter 7 Using linear and logistic regression.