**Stat Bandit » R**, and kindly contributed to R-bloggers)

We have a data set *dat* with multiple observations per subject. We want to create a subset of this data such that each subject (with **ID** giving the unique identifier for the subject) contributes the observation where the variable **X** takes it’s maximum value for that subject.

## An R solution

Using the excellent R package *dplyr*, we can do this using windowing functions included in *dplyr*. The following solution is available on StackOverflow, by junkka, and gets around the real possibility that multiple observations might have the same maximum value of **X** by choosing one of them.

library(dplyr) dat_max <- dat %>% group_by(ID) %>% filter(row_number(X)==n())

To be a bit more explicit, `row_number`

is a wrapper around `rank`

, using the option `ties.method=“first”`

. So you can use the `rank`

function explicitly here as well.

A solution using the plyr package might be

library(plyr) dat_max <- ddply(dat, .(ID), function(u) u[rank(u$X, ties.method=’first’)==nrow(u),])

## A Python solution

In Python, the excellent pandas package allows us to do similar operations. The following example is from this thread on StackOverflow.

import pandas as pd df = DataFrame({’Sp’:[‘a’,’b‘,’c’,’d‘,’e’,’f’], ‘Mt’:[‘s1’, ‘s1’, ‘s2’,’s2‘,’s2’,’s3’], ‘Value’:[1,2,3,4,5,6], ‘count’:[3,2,5,10,10,6]}) df.iloc[df.groupby([‘Mt’]).apply(lambda x: x[‘count’].idxmax())]

You could also do (from the same thread)

df.sort(‘count’, ascending=False).groupby(‘Mt’, as_index=False).first()

but it is about 50% slower.

**leave a comment**for the author, please follow the link and comment on their blog:

**Stat Bandit » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...