January 19, 2014
By

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Nassim Nicholas Taleb recently wrote an article advocating the abandonment of the use of standard deviation and advocating the use of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure- but there is a reason that standard deviation is important even if you do not like it: it prefers models that get totals and averages correct. Absolute deviation measures do not prefer such models. So while MAD may be great for reporting, it can be a problem when used to optimize models.

Let’s suppose we have 2 boxes of 10 lottery tickets: all tickets were purchased for $1 each for the same game in an identical fashion at the same time. For our highfalutin data science project let’s look at the payoffs on the tickets in the first box and try to build a best predictive model for the tickets in the second box (without looking at the payoffs in the second box). Now since all tickets are identical if we are making a mere point-prediction (a single number value estimate for each ticket instead of a detailed posterior distribution) then there is an optimal prediction that is a single number V. Let’s explore potential values for V and how they differ if we use different measures of variation (square error, mean absolute variation and median absolute variation). To get the ball rolling let’s further suppose the payoffs of the tickets in the first box are nine zeros and one$5 payoff. We are going to use a general measure of model goodness called a “loss function” or “loss” and ignore any issues of parametric modeling, incorporating prior knowledge or distributional summaries.

Suppose we use mean absolute deviation as our measure of model quality. Then the loss (or badness) of a value V is loss(V) = 9*|V-0| + 1*|V-5| which is minimized V=$0. That is it says the best model under mean absolute error is that all the lottery tickets are worthless. I personally feel that way about lotteries, but the mean absolute deviation is missing a lot of what is going on. In fact if we have nine tickets with zero payoff and a single ticket with a non-zero payoff the mean absolute deviation is minimized for V=0 for any positive payoff on the last ticket. The mean absolute deviation says the best model for a lottery ticket given 9 non-payoffs and one$1,000,000 payoff is that tickets are worth $0. Meaning that we may not want to always think in terms of the mean absolute deviation summary. Here is some R-code demonstrating what models (values of V) mean absolute deviation prefers (for our original problem):  library(ggplot2) d <- data.frame(V=seq(-5,10,by=0.1)) f <- function(V) { 9*abs(V-0) + abs(V-5)} d$loss <- f(d$V) ggplot(data=d,aes(x=V,y=loss)) + geom_line()  Notice while there is a slope-change at V=$5, but the minimum is at $V=0. Suppose instead we use median absolute deviation as our measure of model quality (another possible expansion of the MAD acronym). Things are pretty much as bad: V=$0 is the “optimal model” for 10 tickets 9 of which payoff zero no matter what the payoff of the last ticket is.

Finally suppose instead of trendy MAD measures we use plain old square error like poor old Karl Pearson used in the 19th century. Then for our original example we have: loss(V) = 9*(V-0)^2 + 1*(V-5)^2 which is minimized at V=$0.5. Which says these lottery tickets seem to be worth about$0.5 each while they cost $1 each (typical of lotteries). Also notice we have 10*V equals$5 the actual total value of all of the tickets in the first box of lottery tickets. This is a key advantage of RMSE: it gets group totals and averages right even when it doesn’t know how to value individual tickets. You want this property.

How can we design loss functions that get totals correct? What we want is a loss function that when we optimize to minimize loss we end up recovering totals in our original data. That is the loss function, whatever it is, should have a stationary point when we try to use it to recover a total. So in our original example we should have: d(10*loss(V))/dV = 0 when V=$0.5 (the total we are trying to recover). Notice any loss-function of the form loss(V) = f(9*(V-0)^2 + (V-5)^2) has a stationary point at V=$0.5 (just an application of the chain-rule for derivatives). This is why square error, root mean square error and the standard deviation all pick the same optimal V=\$0.5. This is the core point of regression and logistic regression which both emphasize getting totals correct. This is the other reason you report RMSE: it is what regression optimizers are minimizing (so it is a good diagnostic).

The point is: there are a lot of different useful measures of error and fit. Taleb is correct: the measure you use should not depend on mere habit or ritual. But the measure you use should depend on your intended application (in this case preferring models that get expected values and totals correct) and not merely on your taste and sophistication. We also like non-variance based methods (like quantile regression, see this example) but find for many problems you really have to pick you measure correctly. RMSE itself is often mis-used: it is not the right measure for scoring classification and ranking models (you want to prefer something like precision/recall or deviance).