(This article was first published on

**SAS and R**, and kindly contributed to R-bloggers)A colleague recently asked "why should the average get closer to the mean when we increase the sample size?" We should interpret this question as asking why the standard error of the mean gets smaller as n increases. The central limit theorem shows that (under certain conditions, of course) the standard error must do this, and that the mean approaches a normal distribution. But the question was why does it? This seems so natural that it may have gone unquestioned in the past.

The best simple rationale may be that there are more ways to get middle values than extreme values--for example, the mean of a die roll (uniform discrete distribution on 1, 2, ..., 6) is 3.5. With one die, you're equally likely to get an "average" of 3 or of 1. But with two dice there are five ways to get an average of 3, and only one way to get an average of 1. You're 5 times more likely to get the value that's closer to the mean than the one that's further away.

Here's a simple graphic to show

*that*the standard error decreases with increasing n.

**SAS**

We begin by simulating some data-- normal, here, but of course that doesn't matter (assuming that the standard deviation exists for whatever distribution we pick and the sample size is appropriately large). Rather than simulate separate samples with n = 1 ... k, it's easier to add a random variate to a series and keep a running tally of the mean, which is easy with a little algebra. This approach also allows tracking the progress of the mean of each series, which could also be useful.

%let nsamp = 100;

data normal;

do sample = 1 to &nsamp;

meanx = 0;

do obs = 1 to &nsamp;

x = normal(0);

meanx = ((meanx * (obs -1)) + x)/obs;

output;

end;

end;

run;

We can now plot the means vs. the number of observations.

symbol1 i = none v = dot h = .2;

proc gplot data = normal;

plot meanx * obs;

run;

quit;

symbol1 i=join v=none r=&nsamp;

proc gplot data=normal;

plot meanx * obs = sample / nolegend;

run; quit;

The graphic resulting from the first

`proc gplot`is shown above, and demonstrates both how quickly the variability of the estimate of the mean decreases when n is small, and how little it changes when n is larger. A plot showing the means for each sequence converging can be generated with the second block of code. Note the use of the global macro variable

`nsamp`assigned using the

`%let`statement (section A.8.2).

**R**

We'll also generate sequences of variates in R. We'll do this by putting the random variates in a matrix, and treating each row as a sequence. We'll use the

`apply()`function (sections 1.10.6 and B.5.3) to treat each row of the matrix separately.

numsim = 100

matx = matrix(rnorm(numsim^2), nrow=numsim)

runavg = function(x) { cumsum(x)/(1:length(x)) }

ramatx = t(apply(matx, 1, runavg))

The simple function

`runavg()`calculates the running average of a vector and returns the a vector of equal length. By using it as the function in

`apply()`we can get the running average of each row. The result must be transposed (with the

`t()`function, section 1.9.2) to keep the sequences in rows. To plot the values, we'll use the

`type="n"`option to

`plot()`, specifying the first column of the running total as the y variable. While it's possible that the running average will surpass the average when n=1, we ignore that case in this simple demonstration.

plot(x=1:numsim, y = ramatx[,1], type="n",

xlab="number of observations", ylab="running mean")

rapoints = function(x) points(x~seq(1:length(x)), pch=20, cex=0.2)

apply(ramatx,1,rapoints)

plot(x=1:numsim, y = ramatx[,1], type="n",

xlab="number of observations", ylab="running mean")

ralines = function(x) lines(x~seq(1:length(x)))

apply(ramatx, 1, ralines)

Here we define another simple function to plot the values in a vector against the place number, then again use the

`apply()`function to plot each row as a vector. The first set of code generates a plot resembling the SAS graphic presented above. The second set of code will connect the values in each sequence, with results shown below.

To

**leave a comment**for the author, please follow the link and comment on his blog:**SAS and R**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...