egen (a Stata command) computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable mean_y.
In Stata, the command would be
egen mean_y = mean(y), by(id)
In R, this task can be done with ave.
Generate the dataset:
id <- rep(1:3, each=3)
t <- rep(1:3, 3)
y <- sample(1:5, 9, replace=TRUE)
my_data <- data.frame(id=id, time=t, y=y)
Original data:
> my_data
id time y
1 1 1 4
2 1 2 1
3 1 3 4
4 2 1 2
5 2 2 3
6 2 3 3
7 3 1 4
8 3 2 4
9 3 3 3
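Because sample() draws random values, rerunning the code above will generally give a different y. To reproduce the tables in this post, one option is to type in the y values from the printed output directly, as in the sketch below (calling set.seed() before sample() would also make the draw repeatable, though not necessarily equal to this table):

```r
# Rebuild the example data with the y values copied from the printed table,
# so that later results match the output shown in this post.
id <- rep(1:3, each = 3)
t  <- rep(1:3, times = 3)
y  <- c(4, 1, 4, 2, 3, 3, 4, 4, 3)
my_data <- data.frame(id = id, time = t, y = y)
print(my_data)
```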
> within(my_data, {mean_y = ave(y,id)} )
id time y mean_y
1 1 1 4 3.000000
2 1 2 1 3.000000
3 1 3 4 3.000000
4 2 1 2 2.666667
5 2 2 3 2.666667
6 2 3 3 2.666667
7 3 1 4 3.666667
8 3 2 4 3.666667
9 3 3 3 3.666667
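What makes ave a match for egen is that it returns one value per row rather than one value per group, which is what tapply or aggregate would give. A small sketch of the difference, with the y values from the table above hard-coded so that the snippet runs on its own:

```r
id <- rep(1:3, each = 3)
y  <- c(4, 1, 4, 2, 3, 3, 4, 4, 3)

group_means <- tapply(y, id, mean)  # one value per group (length 3)
row_means   <- ave(y, id)           # one value per row (length 9)

# Expanding the group means back to the rows gives the same result as ave().
expanded <- as.vector(group_means[as.character(id)])
print(all.equal(row_means, expanded))
```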
The default summary statistic is the mean. However, we can supply another function through the FUN argument. For example, to compute the standard deviation of y by id, we can use
within(my_data, {sd_y = ave(y,id,FUN=sd)} )
id time y sd_y
1 1 1 4 1.7320508
2 1 2 1 1.7320508
3 1 3 4 1.7320508
4 2 1 2 0.5773503
5 2 2 3 0.5773503
6 2 3 3 0.5773503
7 3 1 4 0.5773503
8 3 2 4 0.5773503
9 3 3 3 0.5773503
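Note that FUN has to be passed by name: the signature is ave(x, ..., FUN = mean), so any unnamed argument after x is treated as a grouping variable. A short sketch with a few other statistics follows; the column names median_y, max_y and n_id are illustrative and do not come from the original example.

```r
id <- rep(1:3, each = 3)
y  <- c(4, 1, 4, 2, 3, 3, 4, 4, 3)
my_data <- data.frame(id = id, time = rep(1:3, times = 3), y = y)

my_data$median_y <- ave(my_data$y, my_data$id, FUN = median)
my_data$max_y    <- ave(my_data$y, my_data$id, FUN = max)
# Group size: FUN returns a single value, which is recycled within the group.
my_data$n_id     <- ave(my_data$y, my_data$id, FUN = length)
print(my_data)
```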
Remark: within evaluates an expression in an environment created from the data frame and returns a modified copy of it (in our case, with the new variable mean_y or sd_y added). The original my_data is unchanged unless the result is assigned back to it.
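The following sketch illustrates the point about within: the result has to be stored if the new column is to be kept. The name res is only a placeholder for this illustration.

```r
id <- rep(1:3, each = 3)
y  <- c(4, 1, 4, 2, 3, 3, 4, 4, 3)
my_data <- data.frame(id = id, time = rep(1:3, times = 3), y = y)

res <- within(my_data, { mean_y <- ave(y, id) })
print("mean_y" %in% names(my_data))  # FALSE: the original is untouched
print("mean_y" %in% names(res))      # TRUE: the returned copy has the column

# Equivalent without within(): assign the new column directly.
my_data$mean_y <- ave(my_data$y, my_data$id)
```

Writing my_data <- within(my_data, ...) has the same effect as the direct assignment in the last line.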
Here is another use of ave: creating a self-excluded sample mean by group.
Suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id, excluding the value of y in the current time period.
id <- rep(1:3, each=3)
t <- rep(1:3, 3)
y <- sample(1:5, 9, replace=TRUE)
my_data <- data.frame(id=id, time=t, y=y)
Original data:
> my_data
id time y
1 1 1 4
2 1 2 1
3 1 3 4
4 2 1 2
5 2 2 3
6 2 3 3
7 3 1 4
8 3 2 4
9 3 3 3
First, we need a function to compute the self-excluded mean. This function takes a vector and a function (mean by default) as arguments. It applies the function to the vector with one element removed at a time. The return value is a vector whose i-th element is FUN(x[-i]).
excludeSelfSummary <- function(x, FUN=mean) {
  # the i-th result is FUN applied to x without its i-th element
  sapply(seq_along(x), function(i) FUN(x[-i]))
}
> excludeSelfSummary(1:5,mean)
[1] 3.50 3.25 3.00 2.75 2.50
> excludeSelfSummary(1:5,min)
[1] 2 1 1 1 1
> excludeSelfSummary(1:5,max)
[1] 5 5 5 5 4
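For the mean specifically, the loop over i is not strictly needed, since the mean of the other elements equals (sum(x) - x[i]) / (length(x) - 1). Below is a sketch of this shortcut. The helper name excludeSelfMean is hypothetical, and the sketch assumes no missing values in x.

```r
# Hypothetical helper: leave-one-out mean without an explicit loop.
excludeSelfMean <- function(x) {
  (sum(x) - x) / (length(x) - 1)
}

# Loop-based version from the text, repeated so the snippet is self-contained.
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

print(excludeSelfMean(1:5))
# [1] 3.50 3.25 3.00 2.75 2.50
print(all.equal(excludeSelfMean(1:5), excludeSelfSummary(1:5, mean)))
```

The general excludeSelfSummary is still the one to use for statistics such as min or max, which have no comparable closed form.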
Then we pass excludeSelfSummary to ave as the FUN argument.
> within(my_data, {excl_mean_y = ave(y,id,FUN=excludeSelfSummary)} )
id time y excl_mean_y
1 1 1 4 2.5
2 1 2 1 4.0
3 1 3 4 2.5
4 2 1 2 3.0
5 2 2 3 2.5
6 2 3 3 2.5
7 3 1 4 3.5
8 3 2 4 3.5
9 3 3 3 4.0
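Of course, we can also compute the self-excluded minimum or maximum. Because every unnamed argument of ave after the first is treated as a grouping variable, the extra argument is passed to excludeSelfSummary through a wrapper function.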
> within(my_data, {excl_min_y = ave(y,id,FUN=function(x) excludeSelfSummary(x,min) )})
id time y excl_min_y
1 1 1 4 1
2 1 2 1 4
3 1 3 4 1
4 2 1 2 3
5 2 2 3 2
6 2 3 3 2
7 3 1 4 3
8 3 2 4 3
9 3 3 3 4
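The self-excluded maximum works the same way. The sketch below hard-codes the y values from the table so that it runs on its own, and also shows an edge case worth keeping in mind: in a group with a single observation nothing is left after the exclusion, so the self-excluded mean is NaN. The names max_excl and single are only illustrative.

```r
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

id <- rep(1:3, each = 3)
y  <- c(4, 1, 4, 2, 3, 3, 4, 4, 3)

max_excl <- ave(y, id, FUN = function(x) excludeSelfSummary(x, max))
print(max_excl)
# [1] 4 4 4 3 3 3 4 4 4

# A group of size one has no other observations to summarise.
single <- excludeSelfSummary(7, mean)
print(is.nan(single))  # TRUE
```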