# Not all proportion data are binomial outcomes

March 24, 2013
(This article was first published on Are you cereal? » R, and kindly contributed to R-bloggers)

It really is trivial. Not every proportion is a frequency. Some quantities take values bounded between 0 and 1 and yet are neither probabilities nor frequencies. Why do I even bother to write this? Because some kinds of proportions should be treated as unbounded continuous variables and analyzed using the appropriate statistical machinery (e.g. assuming a normal error structure). This may not be entirely clear after reading the chapter on proportions in Michael Crawley’s The R Book (2007) (Chapter 16: “Proportion data”), which focuses exclusively on proportions that are frequencies.

A proportion is a frequency when we count the binary outcomes of a Bernoulli-distributed random process (e.g. a coin toss). A frequentist can say that the proportion (or frequency) of heads in the total number of flips estimates the bias of the coin, or can link the frequency directly to the probability of heads. Coin tosses are a dull example, so here are other kinds of data in which proportions are frequencies and which follow the same distribution: percentage mortalities, infection rates of diseases, proportions of patients responding to treatments, sex ratios and so on (examples taken from Crawley, 2007).

These data should be modeled assuming a binomial error structure, for example with logistic regression. Here is an example of such data (black dots; the data are artificial) and the fitted model (red line):
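The original figure is not reproduced here, so the following is only a minimal sketch of how such a plot could be produced in R. The sample size, the predictor `x`, and the coefficients used to simulate the data are my assumptions for illustration, not values from the original example.

```r
set.seed(1)                        # reproducible artificial data

n_groups <- 30                     # hypothetical number of groups
trials   <- rep(20, n_groups)      # individuals counted in each group
x        <- seq(-3, 3, length.out = n_groups)   # hypothetical predictor

# True success probability follows a logistic curve (assumed coefficients)
p_true    <- plogis(0.5 + 1.2 * x)
successes <- rbinom(n_groups, size = trials, prob = p_true)
prop      <- successes / trials    # observed proportions, bounded in [0, 1]

# Logistic regression: the response is a two-column matrix
# of successes and failures, so the group sizes are not lost
fit <- glm(cbind(successes, trials - successes) ~ x, family = binomial)

# Black dots: observed proportions; red line: fitted probabilities
plot(x, prop, pch = 19, ylim = c(0, 1), ylab = "Proportion")
x_new <- data.frame(x = seq(-3, 3, length.out = 200))
lines(x_new$x, predict(fit, newdata = x_new, type = "response"), col = "red")
```

Passing the counts via `cbind(successes, failures)` rather than the bare proportions keeps the number of trials behind each proportion, which is what the binomial error structure needs.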

A proportion is not a frequency when we use it to standardize and relativize continuous data. For example, the length of a male leg makes up a smaller proportion of total body height than the length of a female leg. Other examples are percentage weight gains or losses after a medical treatment, or the proportional decrease in the population of an endangered species resulting from the proportional destruction of an area of rain forest. And so on.

Interestingly, these proportions can sometimes have interpretable negative values (e.g. a negative percentage weight loss is a weight gain). Also, it is not as clear as in the previous case what error structure we should assume. I would guess that in most cases it would be the distribution of the original, non-proportional and “non-standardized” variable.

Here is an example of the proportional weight loss of patients (black dots; the data are artificial) after a drug treatment. In this case a normal linear regression model is fitted:
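Again, the original figure is missing, so this is just a hedged sketch of what the example might look like in R. The predictor `dose`, the sample size, and the simulation coefficients are hypothetical choices of mine.

```r
set.seed(2)                        # reproducible artificial data

n    <- 50                         # hypothetical number of patients
dose <- runif(n, 0, 10)            # hypothetical drug dose

# Proportional weight loss = (before - after) / before.
# Negative values are interpretable: they mean weight gain.
loss <- -0.05 + 0.02 * dose + rnorm(n, sd = 0.04)

# Ordinary linear regression, i.e. a normal (Gaussian) error structure
fit_lm <- lm(loss ~ dose)

# Black dots: observed proportions; red line: fitted regression line
plot(dose, loss, pch = 19, xlab = "Dose", ylab = "Proportional weight loss")
abline(fit_lm, col = "red")
abline(h = 0, lty = 2)             # points below this line are weight gains
```

Nothing here constrains the response to lie between 0 and 1, which is the point: the proportion is simply a rescaled continuous measurement.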

As I’ve said, it is quite trivial. However, do let me know if I am trivially mistaken here.
