Predictors, responses and residuals: What really needs to be normally distributed?

February 18, 2013
By

(This article was first published on Are you cereal? » R, and kindly contributed to R-bloggers)

Introduction

Many scientists are concerned about normality or non-normality of variables in statistical analyses. The following and similar sentiments are often expressed, published or taught:

• If you want to do statistics, then everything needs to be normally distributed.”
• We normalized our data in order to meet the assumption of normality.”
• We log-transformed our data as they had strongly skewed distribution.”
• After we fitted the model we checked homoscedasticity of the residuals.”
• We used non-parametric test as our data did not meet the assumption of normality.”

And so on. I know, it is more complex than that, but still, it seems like normal distribution is what people want to see everywhere, and that having things normally distributed opens the door to the clean and powerful statistics and strong results. Many people that I know regularly check if their data are normally distributed prior to the analysis, and then they either try to “normalize” them by e.g. log-transformation, or they adjust the statistical technique accordingly based on the frequency distribution of their data. Here I will examine this more closely and I will show that there may be less assumptions of normality than one might think.

Example of some widely used models

The most general way to find out what needs to be the distribution of your data and residuals is to look at the complete and formal definition of the model that you are fitting – I mean both its deterministic and its stochastic components. Take the example of Normal linear regression of response $y_i$ and predictor $x_i$ (function lm(y~x) or glm(y~x, family="Gaussian") in R):

$y_i\sim Normal(\hat{\mu}_i,\sigma^2)$
$\hat{\mu_i}=\alpha+\beta \times x_{i}$
$i\in1:N$

which can be rewritten to

$y_i=\alpha+\beta \times x_{i} + \epsilon_i$
$\epsilon_i\sim Normal(0,\sigma^2)$
$i\in1:N$

where $\sigma^2$ is variance, $\alpha$ and $\beta$ are intercept and slope respectively, and $N$ is number of observations. Strictly speaking, $\alpha$ and $\beta$ also have their own distributions (e.g. normal), but let’s not go there for now. If you write such definition of the model, you have it all at one place, all assumptions are written here. And you can see clearly that the only assumption of “normality” in this model is that the residuals (or “errors” $\epsilon_i$ ) should be normally distributed. There is no assumption about the distribution of the predictor $x_i$ or the response variable $y_i$ . So there is no justification for “normalizing” any of the variables prior to fitting the model. Your predictors and/or response variables can have distributions skewed like hell and yet all is fine as long as the residuals are normal. The same logic is valid for ANOVA – a skewed and apparently non-normal distribution of the response variable does not necessarily imply that one has to use non-parametric Kruskall-Wallis test. It is only the distribution of ANOVA residuals which needs to meet the assumption of normality.

Now take an example of another widely used model – the log-linear Poisson regression model of count observations $y_i$ and predictors $x_i$ (function glm(y~x, family="Poisson") in R):

$y_i\sim Poisson(\lambda_i)$
$log\lambda_i=\alpha+\beta \times x_{i}$
$i\in1:N$

where $\alpha$ and $\beta$ are model parameters, and $N$ is number of observations. This is the complete description of the model. All of its assumptions are laid here. There is no other hidden assumption somewhere in the literature. And you can see clearly: nothing has to be normally distributed here. Not the predictors, nor the response, nor the residuals.

Finally, let’s examine another hugely popular statistical model, the logistic regression of binary response (or proportion) $y_i$ against predictor $x_i$ (function glm(y~x, family="Binomial") in R):

$y_i\sim Bernoulli(p_i)$
$logit (p_i)=\alpha+\beta \times x_{i}$
$i\in1:N$

Again, the complete model is here, with all its assumptions, and there is not a single Normally distributed thing.

This is hardly surprising to anyone already familiar with the inner structure of Generalized Linear Models (GLM) – a category that encompasses all of the above mentioned models. It is why GLMs were invented at the first place, i.e. in order to cope with non-normal residual (“error”) structures. It is well known that Poisson and logistic regressions have heteroscedastic residuals (Kutner et al. 2005). Moreover, it is also known that exactly for this reason it is problematic to use the residuals for model diagnostics or to judge if the model is appropriate (Kutner et al. 2005). It has been advised to transform the residuals to something called Pearson residuals (in order to normalize them?), but I still do not understand why one would want to do that – as I mentioned, there is no assumption of normality anywhere in Poisson or logistic regressions. Yet still, there is the R function plot.glm() which one can apply to any fitted glm object and it will pop out the residual diagnostic plots. What are these good for in the case of logistic and Poisson regressions?

Why do people still normalize data?

Another puzzling issue is why do people still tend to “normalize” their variables (both predictors and responses) prior to model fitting. Why this practice emerged and prevailed even when there is no assumption that would call for it? I have several theories for this: ignorance, tendency to follow statistical cookbooks, error propagation etc.
Two of the explanations seem more plausible: First, people normalize the data in order to linearize relationships. For example, by log-transforming the predictor one can fit an exponential function using the ordinary least squares machinery. This may seem sort of ok, but then why not specify the non-linear relationship directly in the model (through an appropriate link function for instance)? Also, the practice of log-transforming of the response can introduce serious artifacts, e.g. in the case of count data with zero counts (O’Hara & Kotze 2010).
A second plausible reason for the “normalizing” practice was suggested by my colleague Katherine Mertes-Schwartz: Maybe it is because researchers try to resolve the problem that their data were collected in a highly clumped and uneven way. In other words, it is very common that one works with data that have high number of observations aggregated at particular part of a gradient, while another part of the gradient is relatively under-represented. This results in skewed distributions. Transforming of such distributions then results in seemingly regular spread of observations along the gradient and elimination of outliers. This can be actually done with a good intention in mind. However, it is also fundamentally incorrect.

Summary

I would like to finish this post by a couple of summarizing advices to anyone who is troubled by skewed frequency distributions of the data, by heteroscedastic residuals or by the violations of the assumption of normality.

• If you are unsure about the assumptions of the predefined model that you are just about to use, try to write down its complete formal definition (especially its stochastic part). Try to understand what the model actually really does, dissect it and have a look at its components.
• Learn your probability distributions. Bolker (2008) offers a brilliant bestiary of the most commonly used distributions from which you can learn what are they good for and in which cases do they emerge. My opinion is that such knowledge is an alternative and reliable way to judge if the model is appropriate.
• Learn to specify your models in a Bayesian way, e.g. in BUGS language (used by OpenBUGS or JAGS). This will give you very clear insight into how the models work and what are their assumptions.
• Think twice before transforming your data in order to normalize their distributions (e.g. see O’Hara & Kotze 2010). The only semi-valid reason for transforming you data is to linearize the relationships. However, make sure that you really need to do that as it is usually possible to specify the non-linear relationship directly within the model (e.g. through the log link function in the Poisson regression).

I am aware that my understanding may be limited. Hence, if you have any comments, opinions or corrections, do not hesitate to express them here. Thank you!

References

Bolker, B. (2008) Ecological models and data in R. Princeton University Press.
Kutner, M.H. et al. (2005) Applied linear statistical models. Fifth edition. McGraw & Hill.
O’Hara & Kotze (2010) Do not log-transform count data. Methods in Ecology and Evolution, 1, 118-122.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...