Central Limit Theorem

[This article was first published on R, and kindly contributed to R-bloggers.]

The Central Limit Theorem (CLT) is an important theorem in statistics. It basically says that the mean of a sufficiently large random sample is approximately normally distributed. It does not matter how the population itself is distributed (normal, non-normal, uniform, etc.), as long as it has a finite variance. This is why statistical tools and methods that assume a normal distribution can be applied to sample means, even when the full population is not normal.
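For reference, the classical (Lindeberg-Lévy) form of the theorem can be written as below. The symbols are notation introduced here, not taken from the original post: X_1, ..., X_m are independent draws from a population with mean μ and finite variance σ² > 0, and m is the sample size (m is used to avoid a clash with the n in the R code, which is the population size).

```latex
\[
\sqrt{m}\,\frac{\bar{X}_m - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } m \to \infty,
\qquad \text{where } \bar{X}_m = \frac{1}{m}\sum_{i=1}^{m} X_i .
\]
```

In words: for large m the sample mean behaves roughly like a normal variable with mean μ and standard deviation σ/√m.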

For a mathematical proof of this theorem you can, for example, look at the Wikipedia page. Here I just show a demonstration in R for four distribution types: normal (Gaussian), uniform, exponential and lognormal. Basically a matrix with n data points in 25 columns is created, so that each row holds 25 random draws. For every row the mean of the first 2, 4 and 25 values is calculated, which gives sample means for sample sizes of 2, 4 and 25. These are plotted later. n is a crucial parameter here: it defines the size of the population. The bigger it is, the more closely the population reflects the theoretical shape of the distribution (the top row in the image below). A larger n does require more computing time and memory. On my laptop with 8 GB RAM I run into errors with n = 1·10^9.
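To make the layout concrete, here is a minimal sketch on a tiny population, followed by a rough estimate of the memory that n = 1·10^9 values would need. The names n_small, pop, mean2, n_big and gib are illustrative and do not appear in the original code. The estimate only counts the raw numeric vector at 8 bytes per value, so the real footprint of the full script would be higher.

```r
# Tiny version of the layout used in this post: n_small values in 25 columns.
# n_small, pop and mean2 are illustrative names, not from the original code.
n_small <- 100
pop <- matrix(rnorm(n_small, mean = 10, sd = 1), ncol = 25)

dim(pop)                       # rows = samples, columns = draws per sample
mean2 <- rowMeans(pop[, 1:2])  # mean of the first 2 draws in each row
length(mean2)                  # one mean per row

# Rough memory needed just to hold n_big doubles (8 bytes each), in GiB
n_big <- 1e9
gib <- n_big * 8 / 1024^3
gib
```

With 100 values the matrix has 4 rows, so only 4 sample means are produced. This is why n has to be large before the histograms look smooth.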

As you can see from the plot, the distribution of the sample means is already close to normal at a sample size of 25.

# a: normal distribution
# b: uniform distribution
# c: exponential distribution
# d: lognormal distribution

n <- 1E6  # total number of values drawn per distribution

# Calculations
# The n values are arranged in 25 columns, so every row is one sample of
# 25 draws. n2, n4 and n25 hold the means of the first 2, 4 and 25 draws
# of each row.
a <- data.frame(matrix(rnorm(n, mean = 10, sd = 1), ncol = 25))
a$n2  <- rowMeans(a[1:2])
a$n4  <- rowMeans(a[1:4])
a$n25 <- rowMeans(a[1:25])

b <- data.frame(matrix(runif(n, min = 1, max = 10), ncol = 25))
b$n2  <- rowMeans(b[1:2])
b$n4  <- rowMeans(b[1:4])
b$n25 <- rowMeans(b[1:25])

c <- data.frame(matrix(rexp(n, rate = 1), ncol = 25))
c$n2  <- rowMeans(c[1:2])
c$n4  <- rowMeans(c[1:4])
c$n25 <- rowMeans(c[1:25])

d <- data.frame(matrix(rlnorm(n, meanlog = 10, sdlog = 1), ncol = 25))
d$n2  <- rowMeans(d[1:2])
d$n4  <- rowMeans(d[1:4])
d$n25 <- rowMeans(d[1:25])
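The convergence can also be checked numerically. The sketch below is not part of the original post. It regenerates the exponential case and compares the observed standard deviation and skewness of the row means with the values expected in theory. The helper skewness() and the names x, sizes, means and check are introduced here for illustration, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Same layout as the post: values in 25 columns, one sample per row
x <- matrix(rexp(1e6, rate = 1), ncol = 25)

# Moment-based sample skewness; 0 for a symmetric (e.g. normal) distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

sizes <- c(2, 4, 25)
means <- lapply(sizes, function(k) rowMeans(x[, 1:k, drop = FALSE]))

check <- data.frame(
  size        = sizes,
  sd_obs      = sapply(means, sd),
  sd_theory   = 1 / sqrt(sizes),   # sd of Exp(1) is 1, so sd of the mean is 1/sqrt(size)
  skew_obs    = sapply(means, skewness),
  skew_theory = 2 / sqrt(sizes)    # skewness of Exp(1) is 2; it shrinks by 1/sqrt(size)
)
print(round(check, 3))
```

The skewness shrinks towards 0 as the sample size grows, but it is not yet 0 at a sample size of 25. For a skewed population the normal shape is an approximation that keeps improving with larger samples.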

Plotting these will give the following graphs. From top to bottom: different distributions, samples from the population, and sample means for sample sizes of 2, 4 and 25. With the latter, the distribution of the sample mean is approximately normal (Gaussian), meaning that statistical tools which are based on this distribution can be applied to it (for example, the calculation of a confidence interval for the mean).
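The plotting code is not included in the post. A minimal base-graphics sketch that produces a comparable grid of histograms could look like the one below. The grid layout (one row per sample size, one column per distribution), the use of hist(), and the names n_plot, gens, sizes and means are assumptions made here, not the author's method. A smaller population is used to keep the drawing quick.

```r
set.seed(1)    # arbitrary seed for a reproducible picture
n_plot <- 1e5  # smaller than the post's n so the plot is quick to draw

# One generator per distribution, with the same parameters as in the post
gens <- list(
  Normal      = function(k) rnorm(k, mean = 10, sd = 1),
  Uniform     = function(k) runif(k, min = 1, max = 10),
  Exponential = function(k) rexp(k, rate = 1),
  Lognormal   = function(k) rlnorm(k, meanlog = 10, sdlog = 1)
)
sizes <- c(1, 2, 4, 25)  # size 1 = single draws, i.e. the population itself

# For each distribution: one vector of row means per sample size
means <- lapply(gens, function(g) {
  m <- matrix(g(n_plot), ncol = 25)
  lapply(sizes, function(k) rowMeans(m[, 1:k, drop = FALSE]))
})

# Rows = sample sizes, columns = distributions; mfrow fills the grid row by row
op <- par(mfrow = c(length(sizes), length(gens)), mar = c(2, 2, 2, 1))
for (i in seq_along(sizes)) {
  for (dist_name in names(gens)) {
    hist(means[[dist_name]][[i]], breaks = 50, col = "grey",
         main = paste0(dist_name, ", size ", sizes[i]), xlab = "", ylab = "")
  }
}
par(op)  # restore the previous graphics settings
```

Reading each column from top to bottom shows the histogram narrowing and becoming more symmetric as the sample size increases.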

References

  • Kwong, C.W., 2009, The Use of R Language in the Teaching of Central Limit Theorem, National Institute of Education, Nanyang Technological University, Singapore, Asian Technology Conference on Mathematics.
  • Central Limit Theorem on Wikipedia
