The complete catalog of argument variations of select() in dplyr

July 28, 2015
By

(This article was first published on HOXO-M - anonymous data analyst group in Japan - , and kindly contributed to R-bloggers)

When I read the dplyr vignette, I found a convenient way to select sequential columns such as select(data, year:day).
Because I had inputted only column names to select() function, I was deeply affected by the convenient way.

On closer inspection, I found that the select() function accepts many types of input.
Here, I will enumerate the variety of acceptable inputs for select() function.

By the way, these column selection methods also can use in the summarise_each(), mutate_each() and some functions in tidyr package(e.g. gather()).

1. Whole codes.

At first, whole codes were shown for perspicuity.
In the sections below, the details of each command were shown.

# Data preparation ------------------------------------------------------------------
library(dplyr)
library(nycflights13)
set.seed(123)
data <- sample_n(flights, 3)

glimpse(data)

# Basic method of use select() ----------------------------------------------------------------------
select(data, year)
select(data, year, month, day)
select(data, year:day)

select(data, -year, -month, -day)
select(data, -(year:day))

select(data, 1, 2, 3)
select(data, 1:3)

select(data, -1, -2, -3)
select(data, -(1:3))

select(data, year:day, -month)
select(data, -(year:day), month)
select(data, 1:3, -2)
select(data, -(1:3), 2)

# Utility functions of select() --------------------------------------------------------------
select(data, starts_with("arr"))
select(data, ends_with("time"))
select(data, contains("_"))
select(data, matches("^(dep|arr)_"))

data2 <- data
colnames(data2) <- sprintf("x%d", 1:16)
select(data2, num_range("x", 8:11))
select(data2, num_range("x", c(9, 11)))

data3 <- data
colnames(data3) <- sprintf("x%02d", 1:16)
select(data3, num_range("x", 8:11, width=2))

col_vector <- c("year", "month", "day")
select(data, col_vector)
select(data, one_of(col_vector))

select(data, everything())

select(data, -starts_with("arr"))

# Standard evaluation --------------------------------------------------------------------
select_(data, "year", "month", "day")

col_vector <- c("year", "month", "day")
select_(data, .dots = col_vector)

select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

2. Data preparation.

To follow the dplyr vignette, flights data set in nycflights13 package were used as an example.

library(dplyr)
library(nycflights13)

set.seed(123)
data <- sample_n(flights, 3)

glimpse(data)
Variables:
$ year (int) 2013, 2013, 2013
$ month (int) 12, 7, 3
$ day (int) 15, 17, 2
$ dep_time (int) 2124, 651, 1636
$ dep_delay (dbl) -4, -9, 1
$ arr_time (int) 2322, 936, 1800
$ arr_delay (dbl) 1, -28, 0
$ carrier (chr) "UA", "DL", "WN"
$ tailnum (chr) "N801UA", "N194DN", "N475WN"
$ flight (int) 289, 763, 1501
$ origin (chr) "EWR", "JFK", "LGA"
$ dest (chr) "DTW", "LAX", "MKE"
$ air_time (dbl) 88, 306, 103
$ distance (dbl) 488, 2475, 738
$ hour (dbl) 21, 6, 16
$ minute (dbl) 24, 51, 36

This data set includes the 16 columns shown above.

3. Basic method of use select().

At first, the ways of using select() were shown.

select(data, year)
  year
1 2013
2 2013
3 2013

This process shows the way to take the year column out of data. To pick multiple columns, you can write the following.

select(data, year, month, day)
  year month day
1 2013 12 15
2 2013 7 17
3 2013 3 2

If columns were sequential in the dataset, you could write the following to pick sequential columns.

select(data, year:day)
  year month day
1 2013 12 15
2 2013 7 17
3 2013 3 2

If you want to remove a specific column, add - in the head of the column name as follows.

select(data, -year, -month, -day)
  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2124 -4 2322 1 UA N801UA 289 EWR DTW 88 488 21 24
2 651 -9 936 -28 DL N194DN 763 JFK LAX 306 2475 6 51
3 1636 1 1800 0 WN N475WN 1501 LGA MKE 103 738 16 36

To remove sequential columns, put sequential columns in brackets () connected with a colon.

select(data, -(year:day))
  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2124 -4 2322 1 UA N801UA 289 EWR DTW 88 488 21 24
2 651 -9 936 -28 DL N194DN 763 JFK LAX 306 2475 6 51
3 1636 1 1800 0 WN N475WN 1501 LGA MKE 103 738 16 36

It is also possible to pick columns by choosing the column number.

select(data, 1, 2, 3)
select(data, 1:3)
  year month day
1 2013 12 15
2 2013 7 17
3 2013 3 2

Next topics are slightly advanced.

It is possible to pick sequential columns temporarily and remove some of these.

select(data, year:day, -month)
  year day
1 2013 15
2 2013 17
3 2013 2

It is also possible to remove sequential columns and keep a part of these.

select(data, -(year:day), month)
  dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute month
1 2124 -4 2322 1 UA N801UA 289 EWR DTW 88 488 21 24 12
2 651 -9 936 -28 DL N194DN 763 JFK LAX 306 2475 6 51 7
3 1636 1 1800 0 WN N475WN 1501 LGA MKE 103 738 16 36 3

Even using a column number can give the same result with the column name. (The results are omitted.)

select(data, 1:3, -2)
select(data, -(1:3), 2)

4. Utility functions of select().

Utility functions existing in select(), summarise_each() and mutate_each() in dplyr as well as some functions in the tidyr package.

Seven functions existed in the utility functions of select().

  • starts_with(match, ignore.case = TRUE)
  • ends_with(match, ignore.case = TRUE)
  • contains(match, ignore.case = TRUE)
  • matches(match, ignore.case = TRUE)
  • num_range(prefix, range, width = NULL)
  • one_of(...)
  • everything()

We now check the respective commands and how to use them.

First, starts_with() picks columns whose name starts with the specified string.

select(data, starts_with("arr"))
  arr_time arr_delay
1 2322 1
2 936 -28
3 1800 0

The argument ignore.case specifies whether the lowercase is classified as a capital letter(default is TRUE).

The ends_with() picks columns whose name ends with the specified string .

select(data, ends_with("time"))
  dep_time arr_time air_time
1 2124 2322 88
2 651 936 306
3 1636 1800 103

The contains() picks columns whose name contains the specified string.

select(data, contains("_"))
  dep_time dep_delay arr_time arr_delay air_time
1 2124 -4 2322 1 88
2 651 -9 936 -28 306
3 1636 1 1800 0 103

The matches() picks columns based on a regular expression matching string.

select(data, contains("_"))
  dep_time dep_delay arr_time arr_delay air_time
1 2124 -4 2322 1 88
2 651 -9 936 -28 306
3 1636 1 1800 0 103

When the numbers were included in column names, num_range() might be useful.
In this example, we change the column names to be x1x16 and execute num_range() command for the data set.

data2 <- data
colnames(data2) <- sprintf("x%d", 1:16)
select(data2, num_range("x", 8:11))
  x8     x9  x10 x11
1 UA N801UA 289 EWR
2 DL N194DN 763 JFK
3 WN N475WN 1501 LGA

By specifying as num_range("x", 8:11), columns x8 to x11 can be identified.
Numbers in the column name are not necessarily sequential.

select(data2, num_range("x", c(9, 11)))
      x9 x11
1 N801UA EWR
2 N194DN JFK
3 N475WN LGA

When column names were padded, the column name was shown as x01.
Here, the argument width in num_range() might be useful.
We now try this process for a data that changes the column names as x01x16.

data3 <- data
colnames(data3) <- sprintf("x%02d", 1:16)
select(data3, num_range("x", 8:11, width=2))
  x08    x09  x10 x11
1 UA N801UA 289 EWR
2 DL N194DN 763 JFK
3 WN N475WN 1501 LGA

By specifying as width=2, the zero filled columns can be picked out.

When a column is named as a vector or character string, one_of() might be useful.
An error occurs in the following case with select().

col_vector <- c("year", "month", "day")
select(data, col_vector)
Error: All select() inputs must resolve to integer column positions.
The following do not:
* col_vector

However, an intended process occurs in the case of one_of().

select(data, one_of(col_vector))
  year month day
1 2013 12 15
2 2013 7 17
3 2013 3 2

The everything() selects all columns. (Result was omitted.)

select(data, everything())

If add - is in the head of a utility function name, we can pick out all except for the area specified in the utility function.

select(data, -starts_with("arr"))
  year month day dep_time dep_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 12 15 2124 -4 UA N801UA 289 EWR DTW 88 488 21 24
2 2013 7 17 651 -9 DL N194DN 763 JFK LAX 306 2475 6 51
3 2013 3 2 1636 1 WN N475WN 1501 LGA MKE 103 738 16 36

5. Standard evaluation.

Thus far, we explained the normal select() function; however, the normal select() function cannot handle character strings as arguments.
This might become a problem when column names are given as a string vector for example.
To solve this problem, the select_() function was equipped in dplyr. (Caution: An underscore was added in the function name.)
The use of the select_() function is the same as the select() except specifying columns by string; however, attention is needed when specifying a column name by a vector.

select_(data, "year", "month", "day")
  year month day
1 2013 12 15
2 2013 7 17
3 2013 3 2

When specifying column names by a vector, the vector should be given the .dot argument.

col_vector <- c("year", "month", "day")
select_(data, .dots = col_vector)
  year month day
1 2013 12 15
2 2013 7 17
3 2013 3 2

All arguments that can use the select() function are also possible candidates for the select_() function.

select_(data, 'year:day')
select_(data, 'year:day', '-month')
select_(data, '-(year:day)')
select_(data, 'starts_with("arr")')
select_(data, '-ends_with("time")')

Furthermore, also in this case, the argument vector should be given the .dot argument in the select_() function.

select_(data, .dots = c('starts_with("arr")', '-ends_with("time")'))

6. References.

Introduction to dplyr
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Help page for select() function
> help("select", package = "dplyr")

Original entry in Japanese by hoxo_m
https://qiita.com/hoxo_m/items/f2f1793c6f086d381340

Translated by siero

To leave a comment for the author, please follow the link and comment on their blog: HOXO-M - anonymous data analyst group in Japan - .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)