File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”
Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements:
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20)) df.orig # vars obs1 obs2 # 1 A 1 11 # 2 B 2 12 # 3 C 3 13 # 4 D 4 14 # 5 E 5 15 # 6 A 6 16 # 7 B 7 17 # 8 C 8 18 # 9 D 9 19 # 10 E 10 20
Now, let’s say we want only the rows that contain the maximum values of obs1 for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:
# use aggregate to create new data frame with the maxima df.agg <- aggregate(obs1 ~ vars, df.orig, max) # then simply merge with the original df.max <- merge(df.agg, df.orig) df.max # vars obs1 obs2 # 1 A 6 16 # 2 B 7 17 # 3 C 8 18 # 4 D 9 19 # 5 E 10 20
This also works using min() and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.
With thanks to this mailing list thread.
Filed under: programming, R, research diary, statistics
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Zero Inflated Models and Generalized Linear Mixed Models with R.
Zuur, Saveliev, Ieno (2012).