Grouped means (again)

June 26, 2012
By

(This article was first published on Insights of a PhD student » R, and kindly contributed to R-bloggers)

So, the post I did yesterday on aggregate seemed to go down well.

One of the comments suggested I add an example. Other comments had other useful hints which I thought I’d pass on more formally. So here goes…

The mtcars dataset in base has data on various aspects of cars – miles per gallon, number of cylinders etc etc etc. This makes it a nice example dataset for aggregate as some of the variables make for suitable grouping factors.

First we get hold of the dataset. Thats easy as its a base dataset:

data(mtcars)

If we look at the help file (?mtcars) for the dataset we notice that the last column is the number of carburettors (“carb”), which could be an good grouping variable. Another is  whether the car is an automatic or not (“am”). As our dependant variable we’ll use the horsepower (“hp”)

To make the group means we simply use aggregate as I mentioned yesterday.

aggregate(mtcars$hp, by=list(carb=mtcars$carb, am=mtcars$am), mean)
#  carb am        x
#1    1  0 104.0000
#2    2  0 134.5000
#3    3  0 180.0000
#4    4  0 198.0000
#5    1  1  72.5000
#6    2  1  91.2500
#7    4  1 161.3333
#8    6  1 175.0000
#9    8  1 335.0000

We can make this code a bit shorter though by using the formula interface:

aggregate(hp ~ carb + am, mtcars, mean)

Easy huh? Of course in order to use that table in further procedures (plotting or whatever) you should give it a name:

mtcarsagg <- aggregate(hp ~ carb + am, mtcars, mean)

We can also use other functions in aggregate. Min, max, range for example. A natural one that comes to mind is the standard error. First we’ll define a function to calculate standard error*, then apply it to the example above:

st.err <- function(x) {
    sd(x)/sqrt(length(x))
     }
SE <- aggregate(hp ~ carb+am, mtcars, st.err)
#  carb am        hp
#1    1  0  3.785939
#2    2  0 18.777202
#3    3  0  0.000000
#4    4  0 20.136439
#5    1  1  6.837397
#6    2  1 13.930632
#7    4  1 51.333333
#8    6  1        NA
#9    8  1        NA

It could be that you come up with NA values in the table if you have NAs in the dependant variable or is for st.err for example there is only a single value like for automatic cars with 6 or 8 carburettors. If so, adding the na.rm=TRUE argument to the  end of the argument list should clear it if theres NAs**. If theres only a single value, then mean and SE are probably not so useful anyway.

We can of course bind them together:

mtcarsagg <- cbind(mtcarsagg, SE[,3])
mtcarsagg
#  carb am       hp     SE$hp
#1    1  0 104.0000  3.785939
#2    2  0 134.5000 18.777202
#3    3  0 180.0000  0.000000
#4    4  0 198.0000 20.136439
#5    1  1  72.5000  6.837397
#6    2  1  91.2500 13.930632
#7    4  1 161.3333 51.333333
#8    6  1 175.0000        NA
#9    8  1 335.0000        NA

A note about using aggregate with the by argument – it doesnt give the dependant variable column a name, so you have to do that manually
(names(mtcarsagg)[3] <- “hp” for instance). Using the formula interface does name the column automatically though, and as its shorter, its probably easier to use anyway!

As with anything in R, theres multiple ways to do everything. Other methods for the above include the ave function, Hadley Wickham’s plyr package and aggregate(data["Y"], data[c("trt", "alt")], mean)

Hope that helps!!!

* A slightly more advanced standard error function with NA removal support is:

st.err <- function(x, na.rm=FALSE) {
     if(na.rm==TRUE) x <- na.omit(x)
     sd(x)/sqrt(length(x))
     }

**I attempted to find a dataset with NAs to provide an example, but it ignored the NAs anyway, despite mean giving an NA result (which is weird). Sorry about that.


To leave a comment for the author, please follow the link and comment on his blog: Insights of a PhD student » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.