[This article was first published on Ensemble Blogging, and kindly contributed to R-bloggers.]
As a data scientist, you occasionally receive a dataset and would like to know which distribution generated it. In this post, I aim to show how we can answer that question in R. To do that, let’s make an arbitrary dataset by sampling from a Gamma distribution. To make the problem a little more interesting, let’s add Gaussian noise to simulate measurement noise:
num_of_samples = 1000
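A minimal sketch of how such a dataset could be generated is shown below. The shape (10) and scale (3) are the values used later in the post; the seed and the noise standard deviation of 0.1 are assumptions made only for illustration.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

num_of_samples <- 1000

# Draw from the primary distribution: Gamma with shape 10 and scale 3
x <- rgamma(num_of_samples, shape = 10, scale = 3)

# Add zero-mean Gaussian measurement noise; sd = 0.1 is an assumed value
x <- x + rnorm(length(x), mean = 0, sd = 0.1)

summary(x)
```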
Basically, the process of finding the right distribution for a set of data can be broken down into four steps:
- Visualization: plot the histogram of the data
- Guess which distribution would fit the data best
- Use a statistical test for goodness of fit
- Repeat steps 2 and 3 if the measure of goodness of fit is not satisfactory
The first task is fairly simple. In R, we can use hist to plot the histogram of a vector of data.
The following plot shows the histogram of the above dataset:
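As a rough sketch, the histogram can be drawn and stored in one call. The name p matches the p$counts used later in the post; the data are regenerated with an assumed seed and noise level so that the block runs on its own.

```r
set.seed(42)  # assumed seed and noise level, for illustration only
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, mean = 0, sd = 0.1)

# breaks = 50 is only a suggestion to hist(); R may pick a nearby
# "pretty" number of buckets instead of exactly 50
p <- hist(x, breaks = 50, main = "Histogram of x", xlab = "x")

# p$breaks holds the bucket edges, p$counts the number of samples per bucket
length(p$breaks)
sum(p$counts)
```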
The second task is a little bit tricky. It is mainly based on your experience and your knowledge of statistical distributions. Since we created the dataset ourselves, it is easy (surprisingly!) to guess the distribution. Let’s assume we know that the distribution is a Gamma distribution with shape 10 and scale 3.
The third task is to do some statistical testing to see whether the data is actually drawn from the parametric distribution. These tests are called goodness-of-fit tests. There are three well-known and widely used goodness-of-fit tests that also have nice packages in R.
- Chi Square test
- Kolmogorov–Smirnov test
- Cramér–von Mises criterion
All of the above tests are statistical null hypothesis tests. For goodness of fit we have the following hypotheses:
- H0 = The data is consistent with a specified reference distribution.
- H1 = The data is NOT consistent with a specified reference distribution.
For any null hypothesis test, one needs to specify a threshold known as the significance level. Its value depends on the application, but it is usually in the range [.01, .1]. If the p-value returned by the test is above this level, we do not reject the null hypothesis. In other words, a p-value above the threshold means the data give no significant evidence that the observed sample frequencies differ from the expected frequencies specified in the null hypothesis. It does not prove that the reference distribution is the true one.
Before we go further, let’s agree on two definitions:
- The reference distribution is the distribution that we assume fits the data best. Our hypothesis test checks whether this assumption is consistent with the data.
- The primary distribution is the actual distribution that the data was sampled from. In practice this distribution is unknown, and we try to estimate it.
Chi Square test
In R, you can use chisq.test to run the chi-square test. You need to pass the data and the candidate distribution. Two points need to be considered:
- The candidate distribution needs to be a pmf whose values sum to 1. If the distribution is not normalized, set rescale.p to TRUE.
- By default the p-value comes from an asymptotic chi-square approximation, which can be inaccurate when some buckets have small expected counts. To compute the p-value by Monte Carlo simulation instead, set simulate.p.value to TRUE. You can also set the number of replicates with B.
The above test results in a p-value of .2, which is above the significance level. That means we cannot reject the null hypothesis. In other words, the data give no evidence against the hypothesis that p$counts are samples from null.probs.
How to create the null.probs
The chi-square test requires us to specify the pmf of the null distribution. Note that, although the primary distribution that we sampled from is continuous (x ~ Gamma(10,3)), the histogram converts it to discrete samples. Put differently, by using the histogram we first “bucketized” the Gamma distribution into 50 buckets, and p$counts shows the number of samples falling into each bucket.
Since the primary distribution and the samples are bucketized, we need to do the same thing for the reference distribution. In other words, for the reference Gamma distribution we need to calculate the probability of each bucket. We can use the following piece of code to do that:
The first line calculates the CDF at each break point of the histogram of x. To calculate the probability of each bucket over an interval [x1, x2), we need to calculate the following:
pgamma(x2, shape=10, scale=3) - pgamma(x1, shape=10, scale=3)
The second line of code performs a rolling difference to evaluate the above formula for every bucket.
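The two steps described above, followed by the test itself, can be sketched as follows. This is a reconstruction under assumptions rather than the original listing: it uses base R's diff() for the rolling difference, regenerates the data with an assumed seed and noise level, and picks B = 2000 replicates arbitrarily.

```r
set.seed(42)  # assumed seed and noise level, for illustration only
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, mean = 0, sd = 0.1)
p <- hist(x, breaks = 50, plot = FALSE)  # bucket the data without drawing

# CDF of the reference Gamma(shape = 10, scale = 3) at every bucket edge
cdf_at_breaks <- pgamma(p$breaks, shape = 10, scale = 3)

# Probability of each bucket [x1, x2): F(x2) - F(x1)
null.probs <- diff(cdf_at_breaks)

# The buckets do not cover the far tails, so null.probs sums to slightly
# less than 1; rescale.p = TRUE renormalizes it inside chisq.test()
res <- chisq.test(p$counts, p = null.probs, rescale.p = TRUE,
                  simulate.p.value = TRUE, B = 2000)
res$p.value
```

The p-value depends on the random draw and on the simulation, so it will not match the value quoted in the text exactly.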
Cramér–von Mises criterion
The Cramér–von Mises test compares a given empirical distribution with another distribution. Since our hypothesis is that dataset x has a Gamma distribution, we create another Gamma sample with shape 10 and scale 3 and use it as the reference distribution for hypothesis testing. Note that since the second Gamma sample is the basis of the comparison, we use a large sample size to closely approximate the Gamma distribution.
num_of_samples = 100000
The fourth line of the above code converts the Cramér–von Mises U-value to a p-value. This gives a p-value of .45, which is well above the significance level, so we cannot reject the hypothesis that the two distributions are the same.
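For readers who want a version that runs in base R alone, here is a hedged sketch of a Cramér–von Mises test. It is not the two-sample comparison described above: it swaps in the one-sample statistic against the exact Gamma CDF and estimates the p-value by Monte Carlo simulation. The helper cvm_stat, the seed, the noise level, and the 2000 replicates are all assumptions made for illustration.

```r
set.seed(42)  # assumed seed and noise level, for illustration only
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, mean = 0, sd = 0.1)

# One-sample Cramér-von Mises statistic for values u = F(x), where F is the
# reference CDF: W2 = 1/(12 n) + sum((u_(i) - (2 i - 1) / (2 n))^2)
cvm_stat <- function(u) {
  n <- length(u)
  1 / (12 * n) + sum((sort(u) - (2 * seq_len(n) - 1) / (2 * n))^2)
}

# Observed statistic against the reference Gamma(shape = 10, scale = 3)
w2_obs <- cvm_stat(pgamma(x, shape = 10, scale = 3))

# Under H0, F(x) is Uniform(0, 1), so the null distribution of W2 can be
# simulated from uniform samples of the same size
w2_null <- replicate(2000, cvm_stat(runif(length(x))))

# Monte Carlo p-value: share of simulated statistics at least as large
p_value <- (1 + sum(w2_null >= w2_obs)) / (1 + length(w2_null))
p_value
```

Because the statistic and the p-value are computed differently here, the result is not expected to reproduce the .45 reported in the text.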
Kolmogorov–Smirnov test
The Kolmogorov–Smirnov test is a simple nonparametric test for one-dimensional probability distributions. Like the Cramér–von Mises test, it compares the empirical distribution with a reference distribution, so we use it the same way as before:
num_of_samples = 100000
This gives a p-value of .2, which is above the significance level, so we do not reject the null hypothesis.
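A sketch of the two-sample comparison described above, using ks.test from base R's stats package, might look like this. The seed and the noise level are assumptions, so the resulting p-value will differ from the one quoted in the text.

```r
set.seed(42)  # assumed seed and noise level, for illustration only
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, mean = 0, sd = 0.1)

# Large reference sample standing in for the Gamma(10, 3) distribution
num_of_samples <- 100000
y <- rgamma(num_of_samples, shape = 10, scale = 3)

# Two-sample Kolmogorov-Smirnov test: H0 says x and y share one distribution
result <- ks.test(x, y)
result$statistic  # D, the largest gap between the two empirical CDFs
result$p.value
```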
Different Reference Distribution
Now let’s see what the results would be if we used a different reference distribution. Let’s study the results of the above tests when the reference distribution is Gamma(11,3) or the Normal distribution N(30,90). The following table summarizes the results:
(Table: results of the Chi square test and the Cramér–von Mises criterion for each reference distribution.)
Clearly, Gamma(10,3) is a good fit for the sample dataset, which is consistent with the primary distribution.
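One way to run such a comparison yourself is sketched below. It uses the one-sample form of ks.test, which takes the name of a CDF function, rather than the chi-square or Cramér–von Mises tests from the table. Reading N(30,90) as mean 30 and variance 90 is an assumption, as are the seed and the noise level.

```r
set.seed(42)  # assumed seed and noise level, for illustration only
x <- rgamma(1000, shape = 10, scale = 3) + rnorm(1000, mean = 0, sd = 0.1)

# One-sample KS p-value for each candidate reference distribution.
# N(30, 90) is read here as mean 30 and variance 90 (an assumption).
p_values <- c(
  "Gamma(10,3)" = ks.test(x, "pgamma", shape = 10, scale = 3)$p.value,
  "Gamma(11,3)" = ks.test(x, "pgamma", shape = 11, scale = 3)$p.value,
  "N(30,90)"    = ks.test(x, "pnorm", mean = 30, sd = sqrt(90))$p.value
)
round(p_values, 4)
```

A candidate whose p-value falls below the chosen significance level would be rejected as a reference distribution.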