Compute the self excluded sample mean by group

February 12, 2013
(This article was first published on R HEAD, and kindly contributed to R-bloggers)

egen (a Stata command) computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable, mean_y.

In Stata, the command would be

egen mean_y = mean(y), by(id)

In R, this task can be done with ave.

Generate dataset:

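id <- rep(1:3, each = 3)
t <- rep(1:3, 3)
# y is drawn at random, so your values will differ from those printed below
y <- sample(1:5, 9, replace = TRUE)
my_data <- data.frame(id = id, time = t, y = y)

Original data: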

> my_data
  id time y
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3
> within(my_data, {mean_y = ave(y,id)} )
  id time y   mean_y
1  1    1 4 3.000000
2  1    2 1 3.000000
3  1    3 4 3.000000
4  2    1 2 2.666667
5  2    2 3 2.666667
6  2    3 3 2.666667
7  3    1 4 3.666667
8  3    2 4 3.666667
9  3    3 3 3.666667
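
To make clear how ave differs from a plain grouped summary, the following sketch compares it with tapply. The data is entered by hand from the printout above so the numbers are reproducible; group_means and row_means are names introduced here for illustration. tapply returns one value per group, while ave returns one value per row, which is why its result can be stored directly as a new column.

```r
# Data entered by hand from the printout above, so results are reproducible
my_data <- data.frame(id   = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# One value per group: a named result of length 3
group_means <- tapply(my_data$y, my_data$id, mean)

# One value per row: a vector of length 9, aligned with the rows of my_data
row_means <- ave(my_data$y, my_data$id)

# ave repeats each group's mean for every row of that group
all.equal(row_means, as.vector(group_means[as.character(my_data$id)]))
```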

The default summary statistic is the mean. However, we can supply another function through the FUN argument. For example, to compute the standard deviation of y by id, we can use

> within(my_data, {sd_y = ave(y,id,FUN=sd)} )
  id time y      sd_y
1  1    1 4 1.7320508
2  1    2 1 1.7320508
3  1    3 4 1.7320508
4  2    1 2 0.5773503
5  2    2 3 0.5773503
6  2    3 3 0.5773503
7  3    1 4 0.5773503
8  3    2 4 0.5773503
9  3    3 3 0.5773503

Remark: within evaluates an expression in an environment created from the data frame and returns a modified copy of the data frame (in our case, with the new variable mean_y or sd_y). The original my_data is unchanged unless the result is assigned back to it.
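
Because within returns its result rather than changing my_data in place, the new column is lost unless the result is stored. A minimal sketch of this, with the data entered by hand from the printout above and tmp as a throwaway name introduced here:

```r
my_data <- data.frame(id   = rep(1:3, each = 3),
                      time = rep(1:3, 3),
                      y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3))

# within() returns a modified copy; my_data itself is unchanged here
tmp <- within(my_data, { mean_y <- ave(y, id) })
"mean_y" %in% names(my_data)   # FALSE
"mean_y" %in% names(tmp)       # TRUE

# To keep the new column, assign the result back ...
my_data <- within(my_data, { mean_y <- ave(y, id) })

# ... or add a column directly, without within()
my_data$sd_y <- ave(my_data$y, my_data$id, FUN = sd)
```

Either form works; within is mostly a convenience that saves typing my_data$ in front of every variable.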

Here is another use of ave. We would like to create a self excluded sample mean by group.

Suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id, excluding the value of y in the current time period.

id <- rep(1:3, each = 3)
t <- rep(1:3, 3)
y <- sample(1:5, 9, replace = TRUE)
my_data <- data.frame(id = id, time = t, y = y)

Original data:

> my_data
  id time y
1  1    1 4
2  1    2 1
3  1    3 4
4  2    1 2
5  2    2 3
6  2    3 3
7  3    1 4
8  3    2 4
9  3    3 3

First, we need a function to compute the self excluded mean. This function takes a vector and a function (the default is mean) as arguments. It applies the function to the vector with one element removed at a time. The return value is a vector whose i-th element is FUN(x[-i]).

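excludeSelfSummary <- function(x, FUN = mean) {
	# i-th result is FUN applied to x with its i-th element dropped;
	# seq_along(x) is safe for empty x, unlike 1:length(x)
	sapply(seq_along(x), function(i) FUN(x[-i]))
}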
> excludeSelfSummary(1:5,mean)
[1] 3.50 3.25 3.00 2.75 2.50
> excludeSelfSummary(1:5,min)
[1] 2 1 1 1 1
> excludeSelfSummary(1:5,max)
[1] 5 5 5 5 4
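
For the mean specifically, the loop is not strictly necessary: removing x[i] from n values leaves a total of sum(x) - x[i] spread over n - 1 values. The sketch below uses a helper called excludeSelfMean (a hypothetical name, not from the original post) and checks it against excludeSelfSummary. It assumes x has no missing values and at least two elements.

```r
# Loop-based version from the post, repeated so the block is self-contained
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

# Hypothetical helper: closed-form leave-one-out mean.
# Assumes x has no NA values and length(x) >= 2.
excludeSelfMean <- function(x) {
  (sum(x) - x) / (length(x) - 1)
}

x <- c(4, 1, 4)
excludeSelfMean(x)                                          # 2.5 4.0 2.5
all.equal(excludeSelfMean(x), excludeSelfSummary(x, mean))  # TRUE
```

The closed form only covers the mean; for min, max or any other summary, the general excludeSelfSummary is still needed.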

Then we pass excludeSelfSummary to ave as the FUN argument.

> within(my_data, {excl_mean_y = ave(y,id,FUN=excludeSelfSummary)} )
  id time y excl_mean_y
1  1    1 4         2.5
2  1    2 1         4.0
3  1    3 4         2.5
4  2    1 2         3.0
5  2    2 3         2.5
6  2    3 3         2.5
7  3    1 4         3.5
8  3    2 4         3.5
9  3    3 3         4.0

Of course, we could compute the self excluded minimum or maximum.

> within(my_data, {excl_min_y = ave(y,id,FUN=function(x) excludeSelfSummary(x,min) )})
  id time y excl_min_y
1  1    1 4          1
2  1    2 1          4
3  1    3 4          1
4  2    1 2          3
5  2    2 3          2
6  2    3 3          2
7  3    1 4          3
8  3    2 4          3
9  3    3 3          4
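
One edge case is worth keeping in mind: if a group has only one observation, nothing is left after excluding it. mean of an empty vector gives NaN, and min of an empty vector gives Inf with a warning. The sketch below shows one possible guard; excludeSelfSafe is a hypothetical wrapper introduced here, and the small id and y vectors are made up for illustration.

```r
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

# Hypothetical wrapper: return NA for groups with fewer than two observations
excludeSelfSafe <- function(x, FUN = mean) {
  if (length(x) < 2) return(rep(NA_real_, length(x)))
  excludeSelfSummary(x, FUN)
}

# Illustrative data: group 2 has a single row
id <- c(1, 1, 1, 2)
y  <- c(4, 1, 4, 5)

loop_res <- ave(y, id, FUN = excludeSelfSummary)  # last value is NaN
safe_res <- ave(y, id, FUN = excludeSelfSafe)     # last value is NA
```

Whether NA is the right answer for singleton groups depends on the analysis; the point is only that the case should be handled deliberately.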
