# Selecting the max value from each group, a case study: base R

**Random R Ramblings**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

Let’s say we wish to group some data by a variable, then for each group we wish to find the row of the maximum value of another variable, and then finally extract the entire row. This is a fairly common task and in fact I’ve had to do this exact data exploration technique on several occasions in the last week using different syntax, `data.table`

and `sparklyr`

; so I thought I would share my code with you so you can compare the different options. In fact for this blog entry, I will be using base R before looking at `data.table`

, `dplyr`

and `sparklyr`

in subsequent posts.

For this exercise, I will be using the classic `datasets::mtcars`

data. The aim here is to find the row for each cylinder group (`cyl`

) where the miles per gallon (`mpg`

) value is at its highest. I am not interested in finding multiple rows, I just want one row for each group even if there are cars which share the same `mpg`

.

# Base R Solution

For this blog entry, I am going to look at using base R. This is a classic `split()`

+ `lapply()`

problem. First I’ll do some pre-processing to `mtcars`

just so that we can see which cars have the maximum mpg for each cylinder group.

```
# Create a column of car names
mtcars_colnames <- colnames(mtcars)
mtcars[, "car"] <- rownames(mtcars)
mtcars <- mtcars[, c("car", mtcars_colnames)]
```

Now we can find the rows we want.

```
max_mpg <- do.call(rbind, lapply(
# Split the data into groups for each cylinder
split(mtcars, mtcars[, "cyl"]),
function(x) {
# For the group `x`, select the row which has the maximum mpg
x[which.max(x[, "mpg"]), ]
}
))
rownames(max_mpg) <- NULL
max_mpg
# car mpg cyl disp hp drat wt qsec vs am gear carb
# 1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 2 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 3 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
```

So what we did here is `split()`

the `data.frame`

into separate lists, one for each cylinder group, and then used `lapply()`

(list apply) to apply a function which, for each group, selects the row which has the maximum mpg. Finally we `rbind()`

each list back together into a `data.frame`

.

You may be curious about my use of `which.max(x[, "mpg"])`

over say `mtcars[mtcars$mpg == max(mtcars$mpg), ]`

, well it’s because the former will only return a single row, whereas the latter will return multiple matching rows (check out `?which.max`

) and in this instance I was only interested in a single row for each group.

Another approach for base R is to use the `by`

function.

```
do.call(rbind, by(datasets::mtcars, mtcars$cyl, function(x) x[which.max(x$mpg), ]))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 4 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 6 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 8 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
```

This acts in a similar way to the `split`

+ `lapply`

approach in that the `data.frame`

is split by row into `data.frame`

s subset by the values of one or more factors, in this case `cyl`

, and a function is then applied to each subset in turn. However as we can see from the benchmarks below, it isn’t quite as fast (in this case) as the former approach.

# Benchmarks

# Conclusion

To conclude, should you need to perform an operation on a `data.frame`

in R, you can `split()`

your data into lists of `data.frame`

s of the required groups, `lapply()`

a function to each `data.frame`

in the list; and finally `rbind()`

those `data.frame`

s back into a single `data.frame`

.

Are you aware of any other base R solutions to this problem? If so, let me know in the comments!

In my next blog entry, we will be looking at how to perform this task using the `data.table`

package.

**leave a comment**for the author, please follow the link and comment on their blog:

**Random R Ramblings**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.