# Exploratory Data Analysis: Combining Histograms and Density Plots to Examine the Distribution of the Ozone Pollution Data from New York in R

(This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers)

#### Introduction

This is a follow-up post to my recent introduction of histograms.  Previously, I presented the conceptual foundations of histograms and used a histogram to approximate the distribution of the “Ozone” data from the built-in data set “airquality” in R.  Today, I will examine this distribution in more detail by overlaying the histogram with parametric and non-parametric kernel density plots.  I will finally answer the question that I have asked (and hinted to answer) several times: Are the “Ozone” data normally distributed, or is another distribution more suitable?

Read the rest of this post to learn how to combine histograms with density curves like this above plot!

This is another post in my continuing series on exploratory data analysis (EDA).  Previous posts in this series on EDA include

#### Getting the Data and Summary Statistics

Before plotting the histograms and density curves, let’s get the data and calculate the summary statistics.  In this post, I add the maximum of the “Ozone” data using the max() function for reasons further explained later.

##### Exploratory Data Analysis of Ozone Pollution Data in New York
##### By Eric Cai - The Chemical Statistician

# clear all variables
rm(list = ls(all.names = TRUE))

# view first 6 entries of the "Ozone" data frame

# extract "Ozone" data vector
ozone = airquality$Ozone # sample size of "ozone" length(ozone) # summary of "ozone" summary(ozone) # 3 ways to find number of non-missing values in "ozone" length(ozone[is.na(ozone) == F]) length(ozone[!is.na(ozone)]) n = sum(!is.na(ozone)) # calculate mean of "ozone" mean(ozone) # calculate mean, variance and standard deviation of "ozone" by excluding missing values mean.ozone = mean(ozone, na.rm = T) var.ozone = var(ozone, na.rm = T) sd.ozone = sd(ozone, na.rm = T) var(ozone, na.rm = T) sd(ozone, na.rm = T) max.ozone = max(ozone, na.rm = T) #### Kernel Density Plot: A First Non-Parametric Approximation In a previous post, I introduced the conceptual foundations of kernel density estimation. I then wrote a follow-up post to describe the distributions of the “Ozone” data from the “airquality” data set for New York and my own randomly generated ozone pollution data set for a fictitious city called Ozonopolis. It is always a good idea to explore a data set with multiple techniques, especially when they can be done together for comparison. Here is the R code for the histogram with the kernel density plot below it. • Notice my use of the lines() function to add the kernel density plot. • Ozone concentration is a non-negative quantity. Thus, I wrote the “from = 0″ and “to = max.ozone” options in the density() function to ensure that the support set of the kernel density plot is non-negative. • Recall that, by default, the density() function uses the Gaussian kernel. You may want to experiment with other kernels, though I did mention in my previous post that the bandwidth is much more important in determining the kernel density estimate. # histogram with kernel density estimate png('INSERT YOUR DIRECTORY PATH HERE/histogram and kernel density plot.png') hist(ozone, breaks = 15, freq = F, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Probability', main = 'Histogram of Ozone Pollution Data with Kernel Density Plot') lines(density(ozone, na.rm = T, from = 0, to = max.ozone)) dev.off() The kernel density plot is a good reference for finding a parametric distribution. If a parametric distribution that fits the data well can be found, inference would be easier. Let’s try to find such a well fitted distribution. #### Start with the Normal The normal distribution is a commonly used distribution for continuous variables with many convenient properties, so let’s try to fit the normal distribution to this data and examine if it is consistent with the histogram. • Note my use of the dnorm() function to get the normal density values. I used the sample mean and sample standard deviatio as calculated above to define this particular normal distribution. • I calculated a vector called “ozone.ylim.normal” to get the lowest and largest values of both the histogram and the normal density. I wanted to use it in the y-axis limits of my histogram by setting ylim = ozone.ylim.normal so that the y-values of both the histogram and the normal density curve would be within the vertical limits of the plot. Strangely, the y-axis did not extend to the maximum of the histogram. (Try it yoursefl!) After examining the resulting histogram, I saw that an upper vertical limit of 0.025 would be suitable, so that’s what I used in the ylim option. ozone.ylim.normal was not eventually used for any purpose, but it has been computed here for your own exploration should you wish to do so. • Note my use of the curve() function to plot this normal density curve. # histogram with normal density curve png('INSERT YOUR DIRECTORY PATH HERE/histogram and normal density plot.png') ozone.histogram = hist(ozone, breaks = 50, freq = F) ozone.ylim.normal = range(0, ozone.histogram$density, dnorm(ozone, mean = mean.ozone, sd = sd.ozone), na.rm = T)
hist(ozone, breaks = 15, freq = F, ylim = c(0, 0.025), xlab = 'Ozone (ppb)', ylab = 'Probability', main = 'Histogram of Ozone Pollution Data with Normal Density Curve')
curve(dnorm(x, mean = mean.ozone, sd = sd.ozone), add = T)
dev.off()

This normal density curve’s maximum is even lower than the kernel density estimate’s maximum!  This is not a very good fit; let’s find another parametric distribution.

#### The Gamma Distribution?

Notice how the histogram rises quickly for low values of ozone concentration, then decreases gradually.  The gamma distribution has this behaviour; let’s give that a try!  However, in order to define the gamma function, we need a way to estimate the parameters.

Allow me to parametrize the gamma distribution with the following probability density function (PDF).  (There are several ways of parametrizing the gamma distribution, so it is always a good idea to specfiy the PDF for clarity.)

$f(x; k, \theta) = \theta^{-k}\Gamma(k)^{-1}x^{k-1}\text{exp}(x\theta^{-1})$

where

$x > 0, \ \ k > 0, \ \ \theta > 0$.

$k$ is the called the shape parameter, and $\theta$ is called the scale parameter.

Recall that a gamma random variable has the expected value

$E(X) = k\theta$

and the variance

$V(X) = k\theta^2$.

Since the sample mean, $\bar{X}$, and the sample variance, $s^2$, are consistent estimators of expected value and variance, respectively, sensible estimates of $k$ and $\theta$ would be

$\hat{k} = \bar{X}^2/s^2$

$\hat{\theta} = s^2/\bar{X}$.

Here is the code for plotting the histogram with the gamma density curve.

# histogram with normal density curve
png('INSERT YOUR DIRECTORY PATH HERE/histogram and normal density plot.png')
hist(ozone, breaks = 15, freq = F, xlim = c(0, 170), ylim = c(0, 0.025), xlab = 'Ozone (ppb)', ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data with Gamma Density Curve')
curve(dgamma(x, shape = mean.ozone^2/var.ozone, scale = var.ozone/mean.ozone), add = T)
dev.off()

This is clearly a better fit than the normal density curve, and it is closer to the kernel density plot, too!  (The kernel density plot detects the local maximum at about 75 ppb and has a small “bump” there as a result; the gamma distribution does not have such an accommodation.)

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...