Exploratory Data Analysis: Conceptual Foundations of Empirical Cumulative Distribution Functions

(This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers)

Introduction

Continuing my recent series on exploratory data analysis (EDA), this post focuses on the conceptual foundations of empirical cumulative distribution functions (CDFs); in a separate post, I will show how to plot them in R.  (Previous posts in this series include descriptive statistics, box plots, kernel density estimation, and violin plots.)

To give you a sense of what an empirical CDF looks like, here is an example created from 100 randomly generated numbers from the standard normal distribution.  The ecdf() function in R was used to generate this plot; the entire code is provided at the end of this post, but read my next post for more detail on how to generate plots of empirical CDFs in R.

Read the rest of this post to learn what an empirical CDF is and how to produce this plot!

What is an Empirical Cumulative Distribution Function?

An empirical cumulative distribution function (CDF) is a non-parametric estimator of the underlying CDF of a random variable.  It assigns a probability of $1/n$ to each datum, orders the data from smallest to largest in value, and calculates the sum of the assigned probabilities up to and including each datum.  The result is a step function that increases by $1/n$ at each datum (or by $k/n$ at a value shared by $k$ data).
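The construction described above can be sketched in a few lines of R.  This is only a minimal illustration: the sample `x` is made up, the variable names are my own choices for this sketch, and it assumes that no two data share the same value.

```r
# hypothetical sample of n = 5 distinct values (made up for illustration)
x <- c(2.3, -0.7, 1.1, 0.4, 3.0)
n <- length(x)

# order the data from smallest to largest
x.sorted <- sort(x)

# assign a probability of 1/n to each datum and accumulate the probabilities
step.heights <- cumsum(rep(1/n, n))

# height of the step function at each ordered datum
data.frame(x = x.sorted, Fhat = step.heights)
```

Each row pairs an ordered datum with the height that the step function reaches there; the heights rise in increments of $1/n$ and end at 1.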

The empirical CDF is usually denoted by $\hat{F}_n(x)$ or $\hat{P}_n(X \leq x)$, and is defined as

$\hat{F}_n(x) = \hat{P}_n(X \leq x) = n^{-1}\sum_{i=1}^{n} I(x_i \leq x)$

$I()$ is the indicator function.  It has 2 possible values: 1 if the event inside the brackets occurs, and 0 if not.

$I(x_i \leq x) = \begin{cases} 1,&\text{when }x_i \leq x\\ 0,&\text{when }x_i > x \end{cases}$

Essentially, to calculate the value of $\hat{F}_n(x)$ at $x$,

1. count the number of data less than or equal to $x$
2. divide the number found in Step #1 by the total number of data in the sample
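
The two steps above translate almost directly into R.  The helper `ecdf.at()` below is a hypothetical function written for this sketch (it is not part of base R), and the sample is made up; the last line checks the result against R's built-in ecdf() function.

```r
# hypothetical sample (made up for illustration)
x <- c(2.3, -0.7, 1.1, 0.4, 3.0)

# hypothetical helper: value of the empirical CDF of 'data' at 'point'
ecdf.at <- function(data, point) {
  count <- sum(data <= point)   # Step 1: count the data <= point
  count / length(data)          # Step 2: divide by the sample size
}

# 3 of the 5 values (-0.7, 0.4, 1.1) are <= 1.1
ecdf.at(x, 1.1)

# the built-in ecdf() returns a function; evaluate it at the same point
ecdf(x)(1.1)
```

Both calls should return the same proportion, since ecdf() implements the same definition.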

Why is the Empirical Cumulative Distribution Useful in Exploratory Data Analysis?

The empirical CDF is useful because

• it approximates the true CDF well if the sample size (the number of data) is large, and knowing the distribution is helpful for statistical inference
• a plot of the empirical CDF can be visually compared to known CDFs of frequently used distributions to check if the data came from one of those common distributions
• it can visually display “how fast” the CDF increases to 1; plotting key quantiles like the quartiles can be useful to “get a feel” for the data
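
As a rough sketch of the second and third points, the code below compares the empirical CDF of 100 simulated standard normal numbers with the theoretical CDF from pnorm() on a grid of points, and then computes the sample quartiles with quantile().  The grid and the variable names are arbitrary choices made for this illustration.

```r
# simulate 100 standard normal numbers and build their empirical CDF
set.seed(1)
normal.numbers <- rnorm(100)
normal.ecdf <- ecdf(normal.numbers)

# evaluate the empirical and theoretical CDFs on an arbitrary grid
grid <- seq(-3, 3, by = 0.5)
comparison <- data.frame(x = grid,
                         empirical = normal.ecdf(grid),
                         theoretical = pnorm(grid))
comparison$difference <- comparison$empirical - comparison$theoretical
print(comparison)

# sample quartiles (the 25th, 50th and 75th percentiles)
sample.quartiles <- quantile(normal.numbers, probs = c(0.25, 0.50, 0.75))
print(sample.quartiles)

# the empirical CDF evaluated at the sample median should be close to 0.5
normal.ecdf(sample.quartiles[2])
```

For a visual version of the same comparison, one option is to call plot(normal.ecdf) and then overlay the theoretical curve with curve(pnorm(x), add = TRUE).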

Some Mathematical Statistics of the Empirical Distribution Function

Some appealing properties of the empirical CDF can be obtained from mathematical statistics.

1) For a fixed $x$, $I(X_i \leq x)$ is a Bernoulli random variable that equals 1 with probability $F(x)$.  Thus, its expected value is

$E[I(X_i \leq x)] = P(X_i \leq x) = F(x)$,

which means that $I(X_i \leq x)$ is an unbiased estimator of $F(x)$ for a fixed $x$.  Also note that its variance is

$V[I(X_i \leq x)] = F(x)[1 - F(x)]$.

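2) If the $X_i$ are independent and identically distributed, then the sum of these $n$ Bernoulli random variables, $n\hat{F}_n(x) = \sum_{i=1}^{n} I(X_i \leq x)$, is a binomial random variable with parameters $n$ and $F(x)$.  Dividing by $n$ gives $\hat{F}_n(x)$.  Thus,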

$E[\hat{F}_n(x)] = F(x)$, so

$\hat{F}_n(x)$ is also an unbiased estimator of $F(x)$.

Also note that

$V[\hat{F}_n(x)] = n^{-1}F(x)[1 - F(x)]$.

Thus, for a fixed $x$, the variance of $\hat{F}_n(x)$ is $1/n$ times the variance of a single indicator $I(X_i \leq x)$, so it shrinks as the sample size grows.
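
The mean and variance formulas above can be checked informally by simulation.  The sketch below assumes standard normal data, a fixed point $x = 0.5$, and 10,000 simulated samples of size 100; all of these settings and variable names are arbitrary choices for illustration.

```r
set.seed(1)
n <- 100          # size of each simulated sample
n.sims <- 10000   # number of simulated samples (arbitrary choice)
x0 <- 0.5         # fixed point at which the CDF is estimated

# for each simulated sample, compute the empirical CDF at x0
Fhat.values <- replicate(n.sims, mean(rnorm(n) <= x0))

# true value of the CDF at x0
true.F <- pnorm(x0)

# compare the simulated mean and variance with the theoretical values
c(simulated.mean = mean(Fhat.values), theoretical.mean = true.F)
c(simulated.var = var(Fhat.values), theoretical.var = true.F * (1 - true.F) / n)
```

The simulated and theoretical values should agree closely, though not exactly, because the simulation itself has sampling error.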

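3) By the Glivenko-Cantelli theorem, $\hat{F}_n(x)$ is a consistent estimator of $F(x)$.  In fact, $\hat{F}_n$ converges uniformly to $F$ almost surely; that is, $\sup_x |\hat{F}_n(x) - F(x)| \rightarrow 0$ with probability 1 as $n \rightarrow \infty$.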
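
To get a feel for uniform convergence, the sketch below computes the largest vertical distance between the empirical CDF and the standard normal CDF for simulated samples of increasing size.  The helper `sup.distance()` is a hypothetical function written for this illustration, and the sample sizes are arbitrary.  A single simulation cannot prove the theorem, and the distances need not decrease at every step.

```r
set.seed(1)

# hypothetical helper: exact sup-distance between the empirical CDF
# of a sample and the standard normal CDF
sup.distance <- function(sample.data) {
  n <- length(sample.data)
  F.sorted <- pnorm(sort(sample.data))
  # the supremum occurs at a jump of the step function, so compare the
  # true CDF with the step height just after and just before each datum
  max((1:n) / n - F.sorted, F.sorted - (0:(n - 1)) / n)
}

# arbitrary sample sizes for illustration
sample.sizes <- c(10, 100, 1000, 10000)
distances <- sapply(sample.sizes, function(n) sup.distance(rnorm(n)))

data.frame(n = sample.sizes, sup.distance = distances)
```

The quantity computed here is the same as the one-sample Kolmogorov-Smirnov statistic that ks.test() reports for continuous data.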

Here is the code for generating the plot of the empirical CDF of the random standard normal numbers; the plot is given again after the code.  For the sake of brevity, I will describe in detail how to generate this and other plots of empirical CDFs in a separate post; in fact, I will show 2 different ways of doing so in R!

##### Empirical Distribution Function
##### By Eric Cai - The Chemical Statistician
# set the seed for consistent replication of random numbers
set.seed(1)

# generate 100 random numbers from the standard normal distribution
normal.numbers = rnorm(100)

# empirical normal CDF of the 100 normal random numbers
normal.ecdf = ecdf(normal.numbers)

# plot normal.ecdf (notice that the only argument needed is normal.ecdf)
# use png() and dev.off() to print this plot to your chosen folder
png('INSERT YOUR DIRECTORY PATH HERE/ecdf standard normal.png')

plot(normal.ecdf, xlab = 'Quantiles of Random Standard Normal Numbers', ylab = '', main = 'Empirical Cumulative Distribution\nStandard Normal Quantiles')

# add label to y-axis with mtext()
# side = 2 denotes the left vertical axis
# line = 2.5 sets the position of the label
mtext(text = expression(hat(F)[n](x)), side = 2, line = 2.5)

dev.off()
