Exploratory Data Analysis: Conceptual Foundations of Histograms – Illustrated with New York’s Ozone Pollution Data

(This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers)

Introduction

Continuing my recent series on exploratory data analysis (EDA), today’s post focuses on histograms, which are very useful plots for visualizing the distribution of a data set.  I will discuss how histograms are constructed and use histograms to assess the distribution of the “Ozone” data from the built-in “airquality” data set in R.  In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.

Previous posts in this series on EDA include

What is a Histogram?

A histogram is a plot of the counts or proportions of data in disjoint intervals along the support set.  Note that the actual values of the data are not used; a datum’s presence in a particular interval (or “bin”) is merely tallied, and a histogram displays vertical bars with heights representing the tally or the proportion of the sample size in each bin.
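To make the tallying concrete, here is a minimal sketch (not part of the original analysis) that asks hist() for the bin boundaries and counts without drawing anything.  The variable names are my own, and the missing values in “Ozone” are dropped first so that the counts can be compared with the sample size.

```r
# Ozone readings from the built-in airquality data, with missing values dropped
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Compute the histogram without plotting it
h <- hist(ozone, plot = FALSE)

h$breaks   # boundaries of the bins
h$counts   # tally of observations in each bin

# Proportion of the sample size in each bin
proportions <- h$counts / sum(h$counts)
proportions

# Every observation is tallied exactly once
sum(h$counts) == length(ozone)
```

Only the bin that each observation falls into affects the result; two readings in the same bin are indistinguishable in the plot.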

A histogram is useful for visualizing the distribution of the data and answering the following common and basic questions about a data set.

Different histograms can be made from the same data set because of:

1. Different numbers of intervals
2. Different starting points (lower limit of first interval) and end points (upper limit of last interval)
3. Different rules for assigning which data belong to which intervals – these rules are usually set in terms of the boundaries of the intervals

The first reason accounts for much bigger differences than the last two.  Let’s first walk through the steps of constructing a histogram, then discuss these differences afterward.
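The third reason is the easiest to overlook, so here is a small sketch of it.  It uses a made-up vector of integers (not the “Ozone” data) and the “right” option of hist(), which controls whether each interval is closed on the right or on the left.

```r
# Hypothetical toy data: several values sit exactly on the boundaries
x <- c(1, 2, 2, 3, 4, 4, 4, 5)
breaks <- c(1, 2, 3, 4, 5)

# Intervals closed on the right: [1,2], (2,3], (3,4], (4,5]
right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

# Intervals closed on the left: [1,2), [2,3), [3,4), [4,5]
left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

right_closed   # 3 1 3 1
left_closed    # 1 2 1 4
```

The data and the boundaries are identical in both calls; only the assignment rule differs, and the counts change accordingly.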

How to Construct a Histogram

Here is a typical way of creating a histogram; again, for reasons just stated above, this is not the only way.

1.     Find the range (i.e. the maximum minus the minimum, $X_{(n)} - X_{(1)}$) of the data; denote the range as $R$

2.     Choose the number of intervals (or “bins”) that you want in the histogram; denote this number as $B$.  The choice is arbitrary.  Dividing the range by $B$ gives the interval width, $R/B$.

3.     Use $R$ and $B$ to set the boundaries of the intervals:

• The first interval has the minimum, $X_{(1)}$, as the left boundary and $X_{(1)} + R/B$ as the right boundary.
• The second interval has $X_{(1)} + R/B$ as the left boundary and $X_{(1)} + 2R/B$ as the right boundary.
• The $J$th interval has $X_{(1)} + (J-1)R/B$ as the left boundary and $X_{(1)} + JR/B$ as the right boundary.
• If $J = B$ (i.e. if the $J$th interval is the last interval), then the right boundary is just the maximum, $X_{(n)}$.

4.     Choose a rule on how to assign the data into the intervals such that every datum must be assigned to one and only one interval; here is a common rule:

• For the first interval, include all points less than or equal to $X_{(1)} + R/B$.  This ensures that the minimum is included in the first interval.
• For all other intervals (i.e. the $J$th interval for $J = 2, \dots, B$), include all points greater than $X_{(1)} + (J-1)R/B$ and less than or equal to $X_{(1)} + JR/B$.

5.     Count the number of data that fall into each interval.

6.     Depending on the type of vertical axis that you want, plot vertical bars representing the sample counts or sample proportions of data that fall into each interval.
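The six steps above can be sketched by hand in R, without calling hist().  This is only an illustration under my own assumptions: the choice of $B = 10$ is arbitrary, the variable names are hypothetical, and missing values are dropped before starting.

```r
# Data: Ozone readings with missing values removed
x <- airquality$Ozone[!is.na(airquality$Ozone)]

# Step 1: the range R
R <- max(x) - min(x)

# Step 2: the number of intervals B (arbitrary choice for this sketch)
B <- 10
width <- R / B

# Step 3: boundaries X_(1), X_(1) + R/B, ..., X_(1) + B*R/B
breaks <- min(x) + (0:B) * width
breaks[B + 1] <- max(x)   # force the last boundary to equal the maximum exactly

# Step 4: first interval closed on both sides, all others are (left, right]
bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)

# Step 5: count the data in each interval
counts <- table(bins)

# Step 6: plot the sample proportions as adjacent vertical bars
barplot(as.vector(counts) / length(x), space = 0,
        xlab = 'Interval number', ylab = 'Proportion of sample')
```

Setting the last boundary to the maximum explicitly guards against floating-point rounding, which could otherwise leave the largest observation just outside the final interval.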

This method uses the minimum as the lower limit of the first interval and the maximum as the upper limit of the last interval.  However, these limits can also be widened for convenience.  For example, the histogram for the “Ozone” data set (shown later in this post) uses 0 as the lower limit of the first interval; this is a sensible choice, since ozone concentration is non-negative, and the lowest concentration in this data set is 1 ppb.  The lesson to take away is to look at your data and use your judgment.

For Step #3, I have seen some textbooks that set the starting point as $X_{(1)} - 0.5$ for data sets with integer data; this ensures that no boundary is an integer and prevents any datum from falling exactly on a boundary.
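Here is a short sketch of that textbook trick, assuming an interval width of 10 ppb (my own choice).  Note that the trick only keeps the boundaries away from the data when the width is itself an integer.

```r
# Ozone readings are recorded as integers
x <- airquality$Ozone[!is.na(airquality$Ozone)]

# Hypothetical integer interval width
width <- 10

# Start half a unit below the minimum and extend past the maximum
breaks <- seq(min(x) - 0.5, max(x) + width, by = width)

# No observation falls exactly on a boundary
any(x %in% breaks)   # FALSE

# So the assignment rule at the boundaries no longer matters
counts_right <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts
counts_left <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts
identical(counts_right, counts_left)
```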

Choosing the Number of Intervals or the Interval Width

The above procedure is straightforward except for one aspect: how to choose the number of intervals.  (This is equivalent to choosing the interval width, since the range divided by the number of intervals equals the interval width.)
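The equivalence can be checked directly in R.  In this sketch, the limits of 0 and 180 ppb and the width of 20 ppb are assumptions of mine, chosen only because they cover the “Ozone” data.  Passing the boundaries as a vector fixes them exactly, whereas hist() treats a single number given to “breaks” as a suggestion.

```r
x <- airquality$Ozone[!is.na(airquality$Ozone)]

# Hypothetical limits and interval width
lower <- 0
upper <- 180
width <- 20

# The number of intervals implied by the width
n_bins <- (upper - lower) / width

# Two equivalent ways of writing the same boundaries
breaks_by_width <- seq(lower, upper, by = width)
breaks_by_count <- seq(lower, upper, length.out = n_bins + 1)
all.equal(breaks_by_width, breaks_by_count)

# Explicit boundaries are used exactly as given
h <- hist(x, breaks = breaks_by_width, plot = FALSE)
length(h$counts)   # 9
```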

A histogram with too few intervals will hide key patterns in the distribution.

A histogram with too many intervals will show too much noise about the data and obscure the underlying pattern.

After trying different numbers of intervals, I produced the following histogram, which best shows the distribution without too much of the noise that gives a “choppy” appearance.

This histogram shows a few key attributes about the distribution of the “Ozone” data.

• It is right-skewed with the mode at about 15 ppb and a slight rise again at about 70 ppb
• It is unimodal
• There are some outliers near the high end

There is no “best” rule for choosing the number of intervals; in my experience, it’s best to try multiple numbers of intervals and choose the one number that best shows the underlying pattern that you aim to capture.

There are some guidelines that suggest a number of intervals, and the “breaks” option of hist() accepts them by name: "Sturges" (the default), "Scott" and "FD" (Freedman-Diaconis).  I encourage you to read the “Details” section in the documentation for the hist() function in R for more information.
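If you want to see what these guidelines suggest before plotting, the functions nclass.Sturges(), nclass.scott() and nclass.FD() return the suggested number of intervals.  The sketch below is only a starting point for the trial-and-error described above, not a recommendation of any one rule.

```r
x <- airquality$Ozone[!is.na(airquality$Ozone)]

# Number of intervals suggested by each rule
suggested <- c(Sturges = nclass.Sturges(x),
               Scott = nclass.scott(x),
               FD = nclass.FD(x))
suggested

# Sturges' rule depends only on the sample size
nclass.Sturges(x) == ceiling(log2(length(x)) + 1)

# A rule can also be passed to hist() by name
h_fd <- hist(x, breaks = "FD", plot = FALSE)
length(h_fd$counts)   # number of intervals actually used
```

The number of intervals that hist() ends up using may differ from the suggestion, because the boundaries are rounded to convenient values.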

R Code for Producing Histograms

Here is the R code to generate the above plots.  I used the “breaks” option to set the number of intervals in each histogram.

In a later post, I will assess the distribution of the “Ozone” data in greater depth by combining histograms with various types of density plots.  Stay tuned!

```
##### Exploratory Data Analysis of Ozone Pollution Data in New York
##### By Eric Cai - The Chemical Statistician
# clear all variables
rm(list = ls(all.names = TRUE))

# extract "Ozone" data vector (it contains missing values, which hist() ignores)
ozone = airquality$Ozone

# histogram with too few intervals
# (freq = FALSE plots densities, so the areas of the bars sum to 1)
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too few intervals.png')
hist(ozone, breaks = 3, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data\nToo Few Intervals')
dev.off()

# histogram with too many intervals
png('INSERT YOUR DIRECTORY PATH HERE/histogram with too many intervals.png')
hist(ozone, breaks = 25, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data\nToo Many Intervals')
dev.off()

# histogram with a suitable number of intervals
# (png() call restored to pair with dev.off(); the file name is a placeholder)
png('INSERT YOUR DIRECTORY PATH HERE/histogram.png')
hist(ozone, breaks = 15, freq = FALSE, xlab = 'Ozone (ppb)', ylim = c(0, 0.025), ylab = 'Relative Frequency', main = 'Histogram of Ozone Pollution Data')
dev.off()
```
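One detail about the vertical axis is worth a quick check.  With freq = FALSE, hist() draws densities rather than raw proportions, so each bar's height times its width gives the proportion of the sample in that bin.  The sketch below, with variable names of my own, confirms the relationship.

```r
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Same "breaks" value as the final histogram above, without plotting
h <- hist(ozone, breaks = 15, plot = FALSE)

bin_width <- diff(h$breaks)
proportions <- h$counts / sum(h$counts)

# Density times interval width recovers the sample proportion in each bin
all.equal(h$density * bin_width, proportions)

# The areas of the bars sum to 1
sum(h$density * bin_width)
```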