Lesser known dplyr tricks

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr.

Removing unneeded columns

Did you know that you can use - in front of a column name to remove it from a data frame?

mtcars %>% 
    select(-disp) %>% 
    head()
##                    mpg cyl  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 105 2.76 3.460 20.22  1  0    3    1

Re-ordering columns

Still using select(), it is easy te re-order columns in your data frame:

mtcars %>% 
    select(cyl, disp, hp, everything()) %>% 
    head()
##                   cyl disp  hp  mpg drat    wt  qsec vs am gear carb
## Mazda RX4           6  160 110 21.0 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       6  160 110 21.0 3.90 2.875 17.02  0  1    4    4
## Datsun 710          4  108  93 22.8 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      6  258 110 21.4 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   8  360 175 18.7 3.15 3.440 17.02  0  0    3    2
## Valiant             6  225 105 18.1 2.76 3.460 20.22  1  0    3    1

As its name implies everything() simply means all the other columns.

Renaming columns with rename()

mtcars <- rename(mtcars, spam_mpg = mpg)
mtcars <- rename(mtcars, spam_disp = disp)
mtcars <- rename(mtcars, spam_hp = hp)

head(mtcars)
##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1
## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1
## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0
## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0
## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1

Selecting columns with a regexp

It is easy to select the columns that start with “spam” with some helper functions:

mtcars %>% 
    select(contains("spam")) %>% 
    head()
##                   spam_mpg spam_disp spam_hp
## Mazda RX4             21.0       160     110
## Mazda RX4 Wag         21.0       160     110
## Datsun 710            22.8       108      93
## Hornet 4 Drive        21.4       258     110
## Hornet Sportabout     18.7       360     175
## Valiant               18.1       225     105

take also a look at starts_with(), ends_with(), contains(), matches(), num_range(), one_of() and everything().

Create new columns with mutate() and if_else()

mtcars %>% 
    mutate(vs_new = if_else(
        vs == 1, 
        "one", 
        "zero", 
        NA_character_)) %>% 
    head()
##   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb vs_new
## 1     21.0   6       160     110 3.90 2.620 16.46  0  1    4    4   zero
## 2     21.0   6       160     110 3.90 2.875 17.02  0  1    4    4   zero
## 3     22.8   4       108      93 3.85 2.320 18.61  1  1    4    1    one
## 4     21.4   6       258     110 3.08 3.215 19.44  1  0    3    1    one
## 5     18.7   8       360     175 3.15 3.440 17.02  0  0    3    2   zero
## 6     18.1   6       225     105 2.76 3.460 20.22  1  0    3    1    one

You might want to create a new variable conditionally on several values of another column:

mtcars %>% 
    mutate(carb_new = case_when(.$carb == 1 ~ "one",
                                .$carb == 2 ~ "two",
                                .$carb == 4 ~ "four",
                                 TRUE ~ "other")) %>% 
    head(15)
##    spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb
## 1      21.0   6     160.0     110 3.90 2.620 16.46  0  1    4    4
## 2      21.0   6     160.0     110 3.90 2.875 17.02  0  1    4    4
## 3      22.8   4     108.0      93 3.85 2.320 18.61  1  1    4    1
## 4      21.4   6     258.0     110 3.08 3.215 19.44  1  0    3    1
## 5      18.7   8     360.0     175 3.15 3.440 17.02  0  0    3    2
## 6      18.1   6     225.0     105 2.76 3.460 20.22  1  0    3    1
## 7      14.3   8     360.0     245 3.21 3.570 15.84  0  0    3    4
## 8      24.4   4     146.7      62 3.69 3.190 20.00  1  0    4    2
## 9      22.8   4     140.8      95 3.92 3.150 22.90  1  0    4    2
## 10     19.2   6     167.6     123 3.92 3.440 18.30  1  0    4    4
## 11     17.8   6     167.6     123 3.92 3.440 18.90  1  0    4    4
## 12     16.4   8     275.8     180 3.07 4.070 17.40  0  0    3    3
## 13     17.3   8     275.8     180 3.07 3.730 17.60  0  0    3    3
## 14     15.2   8     275.8     180 3.07 3.780 18.00  0  0    3    3
## 15     10.4   8     472.0     205 2.93 5.250 17.98  0  0    3    4
##    carb_new
## 1      four
## 2      four
## 3       one
## 4       one
## 5       two
## 6       one
## 7      four
## 8       two
## 9       two
## 10     four
## 11     four
## 12    other
## 13    other
## 14    other
## 15     four

Mind the .$ before the variable carb. There is a github issue about this, and it is already fixed in the development version of dplyr, which means that in the next version of dplyr, case_when() will work as any other specialized dplyr function inside mutate().

Apply a function to certain columns only, by rows

mtcars %>%
    select(am, gear, carb) %>%
    purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)
## # A tibble: 6 × 4
##      am  gear  carb sum_am_gear_carb
##                 
## 1     1     4     4                9
## 2     1     4     4                9
## 3     1     4     1                6
## 4     0     3     1                4
## 5     0     3     2                5
## 6     0     3     1                4

For this, I had to use purrr’s by_row() function. You can then add this column to your original data frame:

mtcars <- cbind(mtcars, "sum_am_gear_carb" = mtcars2$sum_am_gear_carb)
head(mtcars)
##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1
## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1
## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0
## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0
## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0
##                   gear carb sum_am_gear_carb
## Mazda RX4            4    4                9
## Mazda RX4 Wag        4    4                9
## Datsun 710           4    1                6
## Hornet 4 Drive       3    1                4
## Hornet Sportabout    3    2                5
## Valiant              3    1                4

Use do() to do any arbitrary operation

mtcars %>% 
    group_by(cyl) %>% 
    do(models = lm(spam_mpg ~ drat + wt, data = .)) %>% 
    broom::tidy(models)
## Source: local data frame [9 x 6]
## Groups: cyl [3]
## 
##     cyl        term   estimate  std.error  statistic     p.value
##                                   
## 1     4 (Intercept) 33.2493403 17.0987286  1.9445504 0.087727622
## 2     4        drat  1.3244329  3.4519717  0.3836743 0.711215433
## 3     4          wt -5.2400608  2.2150213 -2.3656932 0.045551615
## 4     6 (Intercept) 30.6544931  7.5141648  4.0795609 0.015103868
## 5     6        drat -0.4435744  1.1740862 -0.3778039 0.724768945
## 6     6          wt -2.9902720  1.5685053 -1.9064468 0.129274249
## 7     8 (Intercept) 29.6519180  7.0878976  4.1834574 0.001527613
## 8     8        drat -1.4698722  1.6285054 -0.9025897 0.386081744
## 9     8          wt -2.4518017  0.7985112 -3.0704664 0.010651044

do() is useful when you want to use any R function (user defined functions work too!) with dplyr functions. First I grouped the observations by cyl and then ran a linear model for each group. Then I converted the output to a tidy data frame using broom::tidy().

Using dplyr() functions inside your own functions

extract_vars <- function(data, some_string){
    
  data %>%
    select_(lazyeval::interp(~contains(some_string))) -> data
    
  return(data)
}

extract_vars(mtcars, "spam")
##                     spam_mpg spam_disp spam_hp
## Mazda RX4               21.0     160.0     110
## Mazda RX4 Wag           21.0     160.0     110
## Datsun 710              22.8     108.0      93
## Hornet 4 Drive          21.4     258.0     110
## Hornet Sportabout       18.7     360.0     175
## Valiant                 18.1     225.0     105
## Duster 360              14.3     360.0     245
## Merc 240D               24.4     146.7      62
## Merc 230                22.8     140.8      95
## Merc 280                19.2     167.6     123
## Merc 280C               17.8     167.6     123
## Merc 450SE              16.4     275.8     180
## Merc 450SL              17.3     275.8     180
## Merc 450SLC             15.2     275.8     180
## Cadillac Fleetwood      10.4     472.0     205
## Lincoln Continental     10.4     460.0     215
## Chrysler Imperial       14.7     440.0     230
## Fiat 128                32.4      78.7      66
## Honda Civic             30.4      75.7      52
## Toyota Corolla          33.9      71.1      65
## Toyota Corona           21.5     120.1      97
## Dodge Challenger        15.5     318.0     150
## AMC Javelin             15.2     304.0     150
## Camaro Z28              13.3     350.0     245
## Pontiac Firebird        19.2     400.0     175
## Fiat X1-9               27.3      79.0      66
## Porsche 914-2           26.0     120.3      91
## Lotus Europa            30.4      95.1     113
## Ford Pantera L          15.8     351.0     264
## Ferrari Dino            19.7     145.0     175
## Maserati Bora           15.0     301.0     335
## Volvo 142E              21.4     121.0     109

About this last point, you can read more about it here.

Hope you liked this small list of tricks!

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)