Means By

October 4, 2010

(This article was first published on R Blog, and kindly contributed to R-bloggers)

The other day a coworker asked me how to do a SAS Means By statement in R. Embarrassingly, I did not know how, so I wrote something up. This is what I came up with: it takes a data.frame and an indexing variable and computes the means for each group defined by INDEX. I'm not terribly happy with this solution, since it involves unlisting and transposing matrices and whatnot.

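means.by <- function(data, INDEX) {
  # column means within each group defined by INDEX
  b <- by(data, INDEX, function(d) apply(d, 2, mean))
  # flatten the per-group vectors, then reshape to one row per group
  return(structure(
    t(matrix(unlist(b), nrow = length(b[[1]]))),
    dimnames = list(names(b), names(b[[1]]))
  ))
}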
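
One way to sidestep the unlisting and transposing, sketched here under the assumption that every column of data is numeric, is to split the data frame by group and row-bind the per-group colMeans. The name means.by2 and the toy data are made up for illustration.

```r
# Hypothetical alternative to means.by; assumes all columns are numeric.
means.by2 <- function(data, INDEX) {
  # split() yields one data frame per group; colMeans() gives a named vector
  per.group <- lapply(split(data, INDEX), colMeans)
  # rbind the vectors: one row per group, row names taken from the groups
  do.call(rbind, per.group)
}

# Made-up toy data, not from the original example
toy <- data.frame(x = c(1, 3, 10, 20), y = c(2, 4, 6, 8))
grp <- c("a", "a", "b", "b")
res <- means.by2(toy, grp)
print(res)
```

The result is a numeric matrix with one row per group and one column per variable, the same shape means.by returns.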

I figured someone out there must already know how to do this, so I posted the question on StackOverflow, hoping to hear from someone who has faced and solved this problem before. It is such a common kind of statistics question that I have a hard time believing everyone writes this sort of thing from scratch every time. Please post responses on StackOverflow.

Update:
Thanks to the people on StackOverflow, I found the answer I was looking for: aggregate (a similar plyr function, ddply, was also mentioned). One interesting thing I noticed is that neither of these is as intelligent as I would like. For instance, if the grouping variable is inside the data frame, the function still tries to apply the aggregating function to that variable, when it would make sense to exclude it. It would also make sense to assume that the grouping variable may live in the data frame.

> data<-data.frame(I=as.factor(rep(letters[1:10],each=3)),x=rnorm(30),y=rbinom(30,5,.5))
> aggregate(data,list(data$I),mean)
   Group.1  I           x        y
1        a NA -0.33443645 3.000000
2        b NA -0.68481744 1.666667
3        c NA  0.24380887 3.000000
4        d NA  0.54361000 1.000000
5        e NA  0.49409608 2.000000
6        f NA  1.45637561 2.333333
7        g NA  0.04765426 2.333333
8        h NA -0.25969667 2.666667
9        i NA -0.49345794 2.666667
10       j NA -0.10109013 3.000000
> iagg<-function(data,ind,FUN, cols=setdiff(names(data), names(ind)), ...){
+   indi<-substitute(ind)
+   eval(indi,data)->ind
+   if(!is.list(ind))in .... [TRUNCATED] 
> iagg(data,I,mean)
   I           x        y
1  a -0.33443645 3.000000
2  b -0.68481744 1.666667
3  c  0.24380887 3.000000
4  d  0.54361000 1.000000
5  e  0.49409608 2.000000
6  f  1.45637561 2.333333
7  g  0.04765426 2.333333
8  h -0.25969667 2.66

Here iagg is an intelligent aggregate function that makes these assumptions:
1. Grouping variables may be part of the data frame.
2. Grouping variables are excluded from the computations.
3. Grouping variables are added back, appropriately named, with their unique values.

The full code for iagg is here:

 
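iagg <- function(data, ind, FUN, cols = setdiff(names(data), names(ind)), ...) {
  # capture the grouping expression unevaluated, then evaluate it inside data
  indi <- substitute(ind)
  ind <- eval(indi, data)
  if (!is.list(ind)) ind <- list(ind)
  # name the grouping variables after the expressions that were passed in
  if (is.null(names(ind)))
    names(ind) <- sapply(indi, paste)[ifelse(length(ind) == 1, 1, -1)]
  # cols is evaluated lazily here, so its default sees the named ind
  aggregate(subset(data, select = cols), ind, FUN, ...)
}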
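
As a usage sketch, iagg can also take several grouping variables wrapped in list(...). The toy data frame below is made up for illustration. For comparison, the block also calls the formula method of base R's aggregate, which likewise looks the grouping variables up in the data frame and leaves them out of the computation; it is not part of the StackOverflow answers discussed above, so treat it as a side note.

```r
# iagg as given in the post, repeated so this sketch runs on its own
iagg <- function(data, ind, FUN, cols = setdiff(names(data), names(ind)), ...) {
  indi <- substitute(ind)
  ind <- eval(indi, data)
  if (!is.list(ind)) ind <- list(ind)
  if (is.null(names(ind)))
    names(ind) <- sapply(indi, paste)[ifelse(length(ind) == 1, 1, -1)]
  aggregate(subset(data, select = cols), ind, FUN, ...)
}

# Made-up toy data with two grouping variables, I and J
toy <- data.frame(
  I = factor(rep(c("a", "b"), each = 4)),
  J = factor(rep(c("u", "u", "v", "v"), times = 2)),
  x = 1:8
)

# Group by both I and J; the grouping columns come back named I and J
by.iagg <- iagg(toy, list(I, J), mean)

# Same grouping through aggregate's formula method
by.formula <- aggregate(x ~ I + J, data = toy, FUN = mean)

print(by.iagg)
print(by.formula)
```

Both calls should return one row per combination of I and J, with the mean of x in the last column.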
