# Exploratory Data Analysis – Kernel Density Estimation and Rug Plots on Ozone Data in New York and Ozonopolis

(This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers)

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first half focused on the conceptual foundations of kernel density estimation.

#### Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation.  It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R.  It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.

This is another post in a recent series on exploratory data analysis, which has included posts on descriptive statistics, box plots, violin plots, the conceptual foundations of empirical cumulative distribution functions (CDFs), and how to plot empirical CDFs in R.

Read the rest of this post to learn how to create the above combination of a kernel density plot and a rug plot!

#### Example: Ozone Pollution Data from New York and Ozonopolis

Recall that I used 2 sets of ozone data in my last post about box plots.  One came from the “airquality” data set that is built into R.  I simulated the other one and named its city of origin “Ozonopolis”.  Here are the code and the plot of the kernel density estimates of the 2 ozone pollution data sets.  I used the default settings in density() – specifically, I used the normal (Gaussian) kernel and the “nrd0” method of choosing the bandwidth.  I encourage you to try the other settings.  I have used the set.seed() function so that you can replicate my random numbers.

```##### Kernel Density Estimation
##### By Eric Cai - The Chemical Statistician

# extract "Ozone" data vector for New York
ozone = airquality\$Ozone

# calculate the number of non-missing values in "ozone"
n = sum(!is.na(ozone))

# calculate mean, variance and standard deviation of "ozone" by excluding missing values
mean.ozone = mean(ozone, na.rm = T)
var.ozone = var(ozone, na.rm = T)
sd.ozone = sd(ozone, na.rm = T)

# simulate ozone pollution data for ozonopolis
# set seed for you to replicate my random numbers for comparison
set.seed(1)
ozone2 = rgamma(n, shape = mean.ozone^2/var.ozone+3, scale = var.ozone/mean.ozone+3)

# obtain values of the kernel density estimates
density.ozone = density(ozone, na.rm = T)
density.ozone2 = density(ozone2, na.rm = T)

# number of points used in density plot
n.density1 = density.ozone\$n
n.density2 = density.ozone2\$n

# bandwidth in density plot
bw.density1 = density.ozone\$bw
bw.density2 = density.ozone2\$bw

png('INSERT YOUR DIRECTORY PATH HERE/kernel density plot ozone.png')
plot(density.ozone2, main = 'Kernel Density Estimates of Ozone \n in New York and Ozonopolis', xlab = 'Ozone (ppb)', ylab = 'Density', ylim = c(0, max(density.ozone\$y, na.rm = T)), lty = 1)

# add second density plot
lines(density.ozone, lty = 3)

# add legends to state sample sizes and bandwidths; notice use of paste()
legend(100, 0.015, paste('New York: N = ', n.density1, ', Bandwidth = ', round(bw.density1, 1), sep = ''), bty = 'n')
legend(100, 0.013, paste('Ozonopolis: N = ', n.density2, ', Bandwidth = ', round(bw.density2, 1), sep = ''), bty = 'n')

# add legend to label plots
legend(115, 0.011, c('New York', 'Ozonopolis'), lty = c(3,1), bty = 'n')
dev.off()```

It is clear that Ozonopolis has more ozone pollution than New York.  The right-skewed shapes of both curves also suggest that the normal distribution may not be suitable.  (If you read my blog post carefully, you will already see evidence of a different distribution!)  In a later post, I will use quantile-quantile plots to illustrate this.  Stay tuned!

To give you a better sense of why the density plots have higher “bumps” at certain places, take a look at the following plot of the ozone pollution just in New York.  Below the density plot, you will find a rug plot – a plot of tick marks along the horizontal axis indicating where the data are located.  Clearly, there are more data in the neighbourhood between 0 and 50, where the highest “bump” is located.  Use the rug() function to get the rug plot in R.

```##### Kernel Density Plot with Rug Plot
##### By Eric Cai - The Chemical Statistician
png('INSERT YOUR DIRECTORY PATH HERE/kernel density plot with rug plot ozone New York.png')
plot(density.ozone, main = 'Kernel Density Plot and Rug Plot of Ozone \n in New York', xlab = 'Ozone (ppb)', ylab = 'Density')
rug(ozone)
dev.off()```

#### References

Trosset, Michael W. An introduction to statistical inference and its applications with R. Chapman and Hall/CRC, 2011.

Everitt, Brian S., and Torsten Hothorn. A handbook of statistical analyses using R. Chapman and Hall/CRC, 2006.

To leave a comment for the author, please follow the link and comment on his blog: The Chemical Statistician » R programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...