**R HEAD**, and kindly contributed to R-bloggers)

Stata's `egen` command computes a summary statistic by group and stores it in a new variable. For example, suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id and store it as a new variable mean_y.

In Stata, the command would be

    egen mean_y = mean(y), by(id)

In R, this task can be done with `ave`.

Generate dataset:

    id <- rep(1:3, each = 3)
    t <- rep(1:3, 3)
    # y is drawn at random, so a fresh run will give different values
    y <- sample(1:5, 9, replace = TRUE)
    my_data <- data.frame(id = id, time = t, y = y)

Original data:

    > my_data
      id time y
    1  1    1 4
    2  1    2 1
    3  1    3 4
    4  2    1 2
    5  2    2 3
    6  2    3 3
    7  3    1 4
    8  3    2 4
    9  3    3 3

    > within(my_data, {mean_y = ave(y, id)})
      id time y   mean_y
    1  1    1 4 3.000000
    2  1    2 1 3.000000
    3  1    3 4 3.000000
    4  2    1 2 2.666667
    5  2    2 3 2.666667
    6  2    3 3 2.666667
    7  3    1 4 3.666667
    8  3    2 4 3.666667
    9  3    3 3 3.666667
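To reproduce the table above exactly, here is a minimal sketch that hard-codes the y values printed in this post instead of drawing them with `sample()`, and adds the column by direct assignment rather than with `within`. Both choices are assumptions made for this illustration, not part of the original code.

```r
# Rebuild the example data with the y values shown in the printed table,
# so the result does not depend on the random draw from sample().
my_data <- data.frame(
  id   = rep(1:3, each = 3),
  time = rep(1:3, times = 3),
  y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3)
)

# ave() returns a vector as long as y, holding each row's group mean.
my_data$mean_y <- ave(my_data$y, my_data$id)

print(my_data)
```

Direct assignment changes `my_data` in place, which is convenient when only one or two columns are added.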

The default summary statistic is `mean`. However, we can pass a different function through the `FUN` argument. For example, to compute the standard deviation of y by id:

    > within(my_data, {sd_y = ave(y, id, FUN = sd)})
      id time y      sd_y
    1  1    1 4 1.7320508
    2  1    2 1 1.7320508
    3  1    3 4 1.7320508
    4  2    1 2 0.5773503
    5  2    2 3 0.5773503
    6  2    3 3 0.5773503
    7  3    1 4 0.5773503
    8  3    2 4 0.5773503
    9  3    3 3 0.5773503
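One way to see what `ave` does is to compare it with `tapply`, which returns one value per group. `ave` computes the same per-group values and then repeats them so that every row carries its group's statistic. The sketch below assumes the y values printed in this post, hard-coded here for reproducibility.

```r
id <- rep(1:3, each = 3)
y  <- c(4, 1, 4, 2, 3, 3, 4, 4, 3)

# One standard deviation per group (length 3, named by id).
sd_by_group <- tapply(y, id, sd)

# ave() gives one value per row (length 9), repeated within each group.
sd_per_row <- ave(y, id, FUN = sd)

# Expanding the per-group result by id reproduces the ave() output.
expanded <- as.numeric(sd_by_group[as.character(id)])
print(all.equal(sd_per_row, expanded))
```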

Remark: `within` evaluates an expression in an environment created from the data frame and returns a modified copy of the data frame (in our case, with the new variable mean_y or sd_y added). The original `my_data` changes only if the result is assigned back to it.
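The behaviour described in the remark can be checked directly. This sketch again hard-codes the y values printed in this post (an assumption made for reproducibility); the name `with_mean` is introduced only for illustration.

```r
my_data <- data.frame(
  id   = rep(1:3, each = 3),
  time = rep(1:3, times = 3),
  y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3)
)

# within() hands back a data frame with the new column;
# my_data itself is not altered by this call.
with_mean <- within(my_data, {
  mean_y <- ave(y, id)
})

print(names(my_data))    # original columns only
print(names(with_mean))  # original columns plus mean_y

# To keep new columns in my_data, assign the result back.
my_data <- within(my_data, {
  mean_y <- ave(y, id)
  sd_y   <- ave(y, id, FUN = sd)
})
```

Several variables can be created in one `within` call, as in the last statement.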

Here is another use of `ave`. We would like to create a self-excluded sample mean by group.

Suppose the data has three variables, id, time and y, and we want to compute the mean of y for each id, excluding the value of y in the current time period.

    id <- rep(1:3, each = 3)
    t <- rep(1:3, 3)
    # y is drawn at random, so a fresh run will give different values
    y <- sample(1:5, 9, replace = TRUE)
    my_data <- data.frame(id = id, time = t, y = y)

Original data:

    > my_data
      id time y
    1  1    1 4
    2  1    2 1
    3  1    3 4
    4  2    1 2
    5  2    2 3
    6  2    3 3
    7  3    1 4
    8  3    2 4
    9  3    3 3

First, we need a function to compute the self-excluded mean. This function takes a vector and a function (the default is `mean`) as arguments. It applies the function to the vector with one element removed at a time. The return value is a vector whose i-th element is FUN(x[-i]).

    excludeSelfSummary <- function(x, FUN = mean) {
      # i-th element of the result is FUN applied to x without its i-th element
      sapply(seq_along(x), function(i) FUN(x[-i]))
    }

    > excludeSelfSummary(1:5, mean)
    [1] 3.50 3.25 3.00 2.75 2.50
    > excludeSelfSummary(1:5, min)
    [1] 2 1 1 1 1
    > excludeSelfSummary(1:5, max)
    [1] 5 5 5 5 4
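For the special case of the mean, the loop can be avoided, because the mean of all elements except x[i] equals (sum(x) - x[i]) / (length(x) - 1). The sketch below compares the two approaches. `excludeSelfMean` is a hypothetical helper introduced here, not part of the original post.

```r
# Loop-based helper from the post, written with seq_along().
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

# Hypothetical vectorized helper (not in the original post):
# subtract each element from the total, then divide by n - 1.
excludeSelfMean <- function(x) {
  (sum(x) - x) / (length(x) - 1)
}

x <- 1:5
print(excludeSelfSummary(x, mean))
print(excludeSelfMean(x))
print(all.equal(excludeSelfSummary(x, mean), excludeSelfMean(x)))

# A group with a single observation has nothing left after exclusion,
# so both versions return NaN for it.
print(excludeSelfSummary(7, mean))
print(excludeSelfMean(7))
```

The vectorized form only works for the mean; the loop-based helper remains the general tool for `min`, `max` and other functions.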

Then we pass `excludeSelfSummary` to `ave` as the `FUN` argument.

    > within(my_data, {excl_mean_y = ave(y, id, FUN = excludeSelfSummary)})
      id time y excl_mean_y
    1  1    1 4         2.5
    2  1    2 1         4.0
    3  1    3 4         2.5
    4  2    1 2         3.0
    5  2    2 3         2.5
    6  2    3 3         2.5
    7  3    1 4         3.5
    8  3    2 4         3.5
    9  3    3 3         4.0

Of course, we could also compute the self-excluded minimum or maximum.

    > within(my_data, {excl_min_y = ave(y, id, FUN = function(x) excludeSelfSummary(x, min))})
      id time y excl_min_y
    1  1    1 4          1
    2  1    2 1          4
    3  1    3 4          1
    4  2    1 2          3
    5  2    2 3          2
    6  2    3 3          2
    7  3    1 4          3
    8  3    2 4          3
    9  3    3 3          4
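The maximum works the same way. Here is a minimal sketch, assuming the y values printed in this post (hard-coded for reproducibility) and a column name, `excl_max_y`, chosen only for this illustration.

```r
excludeSelfSummary <- function(x, FUN = mean) {
  sapply(seq_along(x), function(i) FUN(x[-i]))
}

my_data <- data.frame(
  id   = rep(1:3, each = 3),
  time = rep(1:3, times = 3),
  y    = c(4, 1, 4, 2, 3, 3, 4, 4, 3)
)

# Wrap the helper so that ave() applies it with FUN = max within each id.
my_data$excl_max_y <- ave(my_data$y, my_data$id,
                          FUN = function(x) excludeSelfSummary(x, max))

print(my_data)
```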
