**What You're Doing Is Rather Desperate » R**, and kindly contributed to R-bloggers)

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame which contains five variables, *vars*, named A – E, each of which appears twice, along with some measurements:

df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20)) df.orig # vars obs1 obs2 # 1 A 1 11 # 2 B 2 12 # 3 C 3 13 # 4 D 4 14 # 5 E 5 15 # 6 A 6 16 # 7 B 7 17 # 8 C 8 18 # 9 D 9 19 # 10 E 10 20

Now, let’s say we want only the rows that contain the maximum values of *obs1* for A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (6 – 10), but one procedure looks like this:

# use aggregate to create new data frame with the maxima df.agg <- aggregate(obs1 ~ vars, df.orig, max) # then simply merge with the original df.max <- merge(df.agg, df.orig) df.max # vars obs1 obs2 # 1 A 6 16 # 2 B 7 17 # 3 C 8 18 # 4 D 9 19 # 5 E 10 20

This also works using *min()* and, I guess, using any function that returns a single value per variable mapping to a value in the original data frame.

With thanks to this mailing list thread.

Filed under: programming, R, research diary, statistics

**leave a comment**for the author, please follow the link and comment on his blog:

**What You're Doing Is Rather Desperate » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...