R’s Garden of Probability Distributions

March 21, 2013

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

If you type ?Distributions at the R console you get a list of the 21 probability
distributions included in the stats package that ships with base R. The same
list appears in the Introduction to R manual on CRAN and in most of the many fine introductory books available for the R language. These are indeed fundamental distributions, sufficient for
most elementary work in probability and statistics. The fact that the R functions
implementing these distributions all follow the same syntax greatly
eases a beginner's task of trying to get some useful work done with a minimum
of memorization.
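
As a minimal sketch of that shared syntax: each distribution in the stats package comes with four functions whose names combine a prefix (d for density, p for cumulative probability, q for quantile, r for random draws) with the distribution's short name. The values of x and the seed below are arbitrary choices for illustration, not taken from the article.

```r
# The four function prefixes, illustrated with the normal distribution
x <- 1.96

dens <- dnorm(x, mean = 0, sd = 1)     # d: density at x
prob <- pnorm(x, mean = 0, sd = 1)     # p: cumulative probability P(X <= x)
quan <- qnorm(prob, mean = 0, sd = 1)  # q: quantile, the inverse of pnorm()

set.seed(42)                           # arbitrary seed, for reproducibility only
draws <- rnorm(5, mean = 0, sd = 1)    # r: five random draws

# The same prefixes apply to other families, e.g. the binomial
binom_prob <- pbinom(3, size = 10, prob = 0.5)  # P(X <= 3) for Binomial(10, 0.5)
```

Once the pattern is familiar, moving to another distribution is mostly a matter of looking up its parameter names.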

The following figure shows plots of the cumulative distribution
function pgamma() and the probability
density function dgamma(), along
with a histogram of random draws generated by rgamma() from a gamma distribution with shape and scale parameters both set to 2.
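
A figure along these lines could be produced with something like the sketch below. It is not the article's original plotting code: the sample size, seed, plotting range and panel layout are all assumptions. The shape and scale arguments are passed by name, because the third positional argument of the gamma functions is rate rather than scale.

```r
# Gamma distribution with shape = 2 and scale = 2, both named explicitly
shape <- 2
scale <- 2
n <- 1000        # sample size: an assumption, not from the article

set.seed(123)    # arbitrary seed, for reproducibility only
draws <- rgamma(n, shape = shape, scale = scale)

op <- par(mfrow = c(1, 3))   # three panels side by side
curve(pgamma(x, shape = shape, scale = scale), from = 0, to = 20,
      main = "pgamma()", ylab = "cumulative probability")
curve(dgamma(x, shape = shape, scale = scale), from = 0, to = 20,
      main = "dgamma()", ylab = "density")
hist(draws, freq = FALSE, main = "rgamma()", xlab = "x")
par(op)                      # restore the previous graphics settings
```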


However, if a person isn’t familiar with how information
about R is organized on CRAN, he or she might conclude that “that’s it”, or most of it anyway, with respect
to R and probability distributions. Imagine the surprise, then, of a person with
such modest expectations about R’s probability distributions accidentally
stumbling into the overgrown garden of R’s Probability Distributions Task View. I think my first reaction was a kind of glazed-over inability
to take it all in.

However, if you just let your eyes relax and pick out a
flower with which you are familiar, binomial for example, you can see that the
chief gardener Christophe Dutang, listed as the maintainer of the Task View, and the eight individuals
whom he acknowledges have done a remarkable job of organizing the distributions
according to their genus (discrete or continuous), species (binomial in this
case) and variety (truncated binomial and zero-inflated binomial). I can’t
imagine the number of volunteer hours it took to assemble this page, and keeping
it up to date can’t be easy either. I
spent a half hour or so just trying to count the distributions. Not counting
copulas, random matrices and other exotica, I came up with 31 discrete, 133
continuous and 9 mixture distributions. Others may count more or fewer depending
on how they group things together. It seems as if few people outside of the
folks at Wikipedia have given much thought to the taxonomy of probability
distributions, and only Mathematica 9, which includes 130 probability distributions, comes close to cultivating so many distributions in one
coherent system. (To be fair, the online documentation for SAS, Matlab and SPSS is so distributed that it is difficult to determine how many probability distributions have been implemented in these software packages.)

While the Probability Distributions Task View may be the
place to start for information about probability distributions, the complete R documentation
is itself an open-ended, organic system that depends on the communication style
of package authors and the experiences of everyone who leaves a record of their
attempts to work with probability distributions.

The entire ecosystem of R documentation
for a probability distribution function starts with the command line help
(e.g. ?pgamma) and the PDF manual on CRAN for the package that includes the function, but may also include vignettes,
external web pages, blog posts, and questions and discussions on help bulletin
boards such as the R mailing lists and StackOverflow. For
some typical examples, consider that the actuar package from Vincent Goulet et al., which provides a number of distributions of interest to actuaries, has six vignettes, while Thomas Yee's VGAM package for Vector Generalized Linear and Additive Models, a source for many R probability distributions, has a web page as well as a vignette.

D. Cook’s clickable diagram for elementary probability distributions is hosted on his private website, while the paper by Delignette-Muller et al. on fitting distributions with R’s
fitdistrplus package is hosted on an academic website. Mage's post from December 2011 on fitting distributions in R is an example of the many blog posts that deserve a second look.

As a final example of how the community comes to play a part
in the extended documentation for R, consider my attempt to get a handle on the
Cauchy distribution. Here I ran the code below and got four very
different looking plots. This is not unexpected given that I’m working with
random draws from a probability distribution for which both the mean and
variance are not defined. But why only two bins for the histograms?

Well, I wasn’t the first person to pause for a moment over this. Someone recently asked
this question on StackOverflow and received some good advice.
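
One way to see what is going on, offered as a sketch and not necessarily the advice given in that thread: the heavy tails of the Cauchy distribution produce a handful of enormous draws, hist() picks breaks that span the whole range, and nearly all of the sample then lands in one or two bins. Restricting the histogram to a central window brings the shape back into view. The window width, the number of breaks and the seed below are hypothetical choices.

```r
n <- 10000
location <- -1
scale <- 4

set.seed(1)   # arbitrary seed, for reproducibility only
y <- rcauchy(n, location, scale)

range(y)      # a few extreme draws stretch the range enormously

# Keep only draws inside a central window (the width is a hypothetical choice)
lo <- location - 10 * scale
hi <- location + 10 * scale
y_central <- y[y >= lo & y <= hi]

# With freq = FALSE the bars integrate to 1 over the window, so rescale the
# theoretical density by the probability mass that the window contains
mass <- pcauchy(hi, location, scale) - pcauchy(lo, location, scale)

hist(y_central, breaks = 50, freq = FALSE,
     main = "rcauchy(-1, 4), central window only", xlab = "y")
curve(dcauchy(x, location, scale) / mass, add = TRUE, lwd = 2)
```

The trade-off is that the discarded tail draws no longer appear in the plot, so the picture describes only the central part of the sample.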

Hats off and thank you to everyone involved in cultivating R’s garden of probability distributions.

# Cauchy plots
n <- 10000
location <- -1
scale <- 4
# Make four plots, each from a fresh random sample
for (i in 1:4) {
  y <- rcauchy(n, location, scale)
  hist(y, freq = FALSE, col = rainbow(6),
       main = "random draw from rcauchy(-1,4)")
  # Overlay the theoretical Cauchy density (location, not the undefined 'shape')
  fd <- function(y) dcauchy(y, location, scale)
  curve(fd, col = "black", add = TRUE, lwd = 2)
}


