My Favorite dplyr 1.0.0 Features
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As you are likely aware by now, the dplyr 1.0.0 release is right around the corner. I am very excited about this huge milestone for dplyr. In this post, we’ll go over my favorite new features coming in the 1.0.0 release.
# Install development version of dplyr remotes::install_github( "tidyverse/dplyr", ref = "23c166fa7cc247f0ee1a4ee5ac31cd19dc63868d" )
Note: in the above call to install_github(), I passed the most recent (as of the writing of this post) GitHub commit hash to the “ref” argument to install the dplyr package exactly as it existed as of that above commit. This way we can make sure we are using the exact same package version even as the development version progresses on GitHub.
Requisite packages and data:
library(dplyr) # data library(AmesHousing) # Load the housing data, clean names, then grab just six columns # to simplify the dataset for display purposes. ames_data <- make_ames() %>% janitor::clean_names() %>% select(sale_price, bsmt_fin_sf_1, first_flr_sf, total_bsmt_sf, neighborhood, gr_liv_area)
Superceding Functions
There are two major families of functions that supercede old functionality
The Replacements:- `across()`
- `slice()`
In addition, there are some deprecated functions that are worth noting as well as some new mutate() arguments. In this post, I’ll walk through some examples of each of these changes.
Across
All *_if(), *_at() and *_all() function variants were superseded in favor of across(). across() makes manipulating multiple columns more intuitive and consistent with other dplyr syntax.
across() is my favorite new dplyr function because I’ve always had to stop and think and pull up the docs when using mutate_if() and mutate_at(). Most notably, I appreciate the use of tidy selection rather than the vars() method used in mutate_at().
Let’s see across() in action. Let’s say we want to convert all the square foot variables to square yards. When we take a look at our data, we see that all of the square foot variables either contain “area” or “_sf” in their names.
feet_to_yards <- function(x) {x / 9}
Here is the old way this was done with mutate_at():
ames_data %>% mutate_at(.vars = vars(contains("_sf") | contains("area")) , .funs = feet_to_yards)
Now we use across() in combination with a vector. In this case, we used contains() to grab variable names that contain "_sf" or “area”.
ames_data %>% mutate(across(.cols = c(contains("_sf"), contains("area")), .fns = feet_to_yards)) %>% head() ## # A tibble: 6 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <int> <dbl> <dbl> <dbl> <fct> <dbl> ## 1 215000 0.222 184 120 North_Ames 184 ## 2 105000 0.667 99.6 98 North_Ames 99.6 ## 3 172000 0.111 148. 148. North_Ames 148. ## 4 244000 0.111 234. 234. North_Ames 234. ## 5 189900 0.333 103. 103. Gilbert 181 ## 6 195500 0.333 103. 103. Gilbert 178.
across() can also replace mutate_if() in combination with where().
Old way with mutate_if():
ames_data %>% mutate_if(is.numeric, log)
New way with across(where()):
## new dplyr(log transform numeric values) ames_data %>% mutate(across(where(is.numeric), log)) %>% head() ## # A tibble: 6 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> ## 1 12.3 0.693 7.41 6.98 North_Ames 7.41 ## 2 11.6 1.79 6.80 6.78 North_Ames 6.80 ## 3 12.1 0 7.19 7.19 North_Ames 7.19 ## 4 12.4 0 7.65 7.65 North_Ames 7.65 ## 5 12.2 1.10 6.83 6.83 Gilbert 7.40 ## 6 12.2 1.10 6.83 6.83 Gilbert 7.38
summarize() now uses the same across() and where() syntax that we used above with mutate. Let’s find the average of all numerics columns for each neighborhood.
ames_data %>% group_by(neighborhood) %>% summarize(across(where(is.numeric), mean, .names = "mean_{col}")) %>% head() ## `summarise()` ungrouping output (override with `.groups` argument) ## # A tibble: 6 x 6 ## neighborhood mean_sale_price mean_bsmt_fin_s… mean_first_flr_… ## <fct> <dbl> <dbl> <dbl> ## 1 North_Ames 145097. 3.66 1175. ## 2 College_Cre… 201803. 4.01 1173. ## 3 Old_Town 123992. 5.80 945. ## 4 Edwards 130843. 4.27 1115. ## 5 Somerset 229707. 4.59 1188. ## 6 Northridge_… 322018. 3.99 1613. ## # … with 2 more variables: mean_total_bsmt_sf <dbl>, mean_gr_liv_area <dbl>
As you can see, we calculated the neighborhood average for all numeric values. On top of that, we were able to easily prefix the column names with “mean_” thanks to another useful across() argument called “.names”.
Not only that, in conjunction with the where() helper, across() unifies "_if" and "_at" semantics, allowing more intuitive and elegant column selection. For example, let’s mutate the square footage variables that are integers (like mutate_if()), and the square footage variables that end with "_sf" (like mutate_at()) to make them doubles.
ames_data %>% mutate(across(where(is.integer) & ends_with("_sf"), as.double)) ## # A tibble: 2,930 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <int> <dbl> <dbl> <dbl> <fct> <int> ## 1 215000 2 1656 1080 North_Ames 1656 ## 2 105000 6 896 882 North_Ames 896 ## 3 172000 1 1329 1329 North_Ames 1329 ## 4 244000 1 2110 2110 North_Ames 2110 ## 5 189900 3 928 928 Gilbert 1629 ## 6 195500 3 926 926 Gilbert 1604 ## 7 213500 3 1338 1338 Stone_Brook 1338 ## 8 191500 1 1280 1280 Stone_Brook 1280 ## 9 236500 3 1616 1595 Stone_Brook 1616 ## 10 189000 7 1028 994 Gilbert 1804 ## # … with 2,920 more rows
Notice, the “first_flr_sf” was converted to a double, but the “gr_living_area” remains an integer because it doesn’t fit the criteria aends_with("_sf").
across() can also perform mutate_all() functionality with across(everything(), …
Slice
top_n(), sample_n(), and sample_frac() have been superseded in favor of a new family of slice() helpers.
Reasons for future deprecation:
- top_n() - has a confusing name that might reasonably be considered to filter for the min or the max rows. For example, let’s stay we have data for a track and field race that records lap times. One might reasonable assume that top_n() would return the fastest times but they actually return the times that took the longest. top_n() has been superseded by slice_min(), and slice_max().
- sample_n() and sample_frac() - it’s easier to remember (and pull up documentation for) two mutually exclusive arguments to one function called slice_sample().
# Old way to grab the five most expensive homes by sale price ames_data %>% top_n(n = 5, wt = sale_price) # New way to grab the five most expensive homes by sale price ames_data %>% slice_max(sale_price, n = 5) ## # A tibble: 5 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <int> <dbl> <int> <dbl> <fct> <int> ## 1 755000 3 2444 2444 Northridge 4316 ## 2 745000 3 2411 2396 Northridge 4476 ## 3 625000 3 1831 1930 Northridge 3627 ## 4 615000 3 2470 2535 Northridge_He… 2470 ## 5 611657 3 2364 2330 Northridge_He… 2364 # You can also grab the five cheapest homes ames_data %>% slice_min(sale_price, n = 5) ## # A tibble: 5 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <int> <dbl> <int> <dbl> <fct> <int> ## 1 12789 7 832 678 Old_Town 832 ## 2 13100 5 733 0 Iowa_DOT_and_… 733 ## 3 34900 6 720 720 Iowa_DOT_and_… 720 ## 4 35000 7 498 498 Edwards 498 ## 5 35311 2 480 480 Iowa_DOT_and_… 480 # Old way to sample four random rows(in this case properties) ames_data %>% sample_n(4) # New way to sample four random rows(in this case properties) ames_data %>% slice_sample(n = 4) ## # A tibble: 4 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <int> <dbl> <int> <dbl> <fct> <int> ## 1 119000 6 948 948 Edwards 948 ## 2 156000 1 990 990 College_Creek 990 ## 3 245700 3 1614 1614 Northridge_He… 1614 ## 4 108000 2 1032 1032 Old_Town 1032 # Old way to sample a random 0.2% of the rows ames_data %>% sample_frac(0.002) # New way to sample a random 0.2% of the rows ames_data %>% slice_sample(prop = 0.002) ## # A tibble: 5 x 6 ## sale_price bsmt_fin_sf_1 first_flr_sf total_bsmt_sf neighborhood gr_liv_area ## <int> <dbl> <int> <dbl> <fct> <int> ## 1 110000 7 682 440 Old_Town 1230 ## 2 136000 6 1040 1040 North_Ames 1040 ## 3 208000 1 1182 572 Crawford 1982 ## 4 115000 1 789 789 Old_Town 789 ## 5 145500 1 1053 1053 North_Ames 1053
Additionally, slice_head() and slice_tail() can be used to grab the first or last rows, respectively.
Nest By
nest_by() works similar to group_by() but is more visual because it changes the structure of the tibble instead of just adding grouped metadata. With nest_by(), the tibble transforms into a rowwised dataframe (Run vignette(“rowwise”) to learn more about the revised rowwise funtionality in dplyr 1.0.0).
First, for the sake of comparison, let’s calculate the average sale price by neighborhood using group_by() and summarize():
ames_data %>% group_by(neighborhood) %>% summarise(avg_sale_price = mean(sale_price)) %>% ungroup() %>% head() ## `summarise()` ungrouping output (override with `.groups` argument) ## # A tibble: 6 x 2 ## neighborhood avg_sale_price ## <fct> <dbl> ## 1 North_Ames 145097. ## 2 College_Creek 201803. ## 3 Old_Town 123992. ## 4 Edwards 130843. ## 5 Somerset 229707. ## 6 Northridge_Heights 322018.
The summarize() operation works well with group_by(), particularly if the output of the summarization function are single numeric values. But what if we want to perform a more complicated operation on the grouped rows? Like, for example, a linear model. For that, we can use nest_by() which stores grouped data not as metadata but as lists in a new column called “data”.
nested_ames <- ames_data %>% nest_by(neighborhood) head(nested_ames) ## # A tibble: 6 x 2 ## # Rowwise: neighborhood ## neighborhood data ## <fct> <list<tbl_df[,5]>> ## 1 North_Ames [443 × 5] ## 2 College_Creek [267 × 5] ## 3 Old_Town [239 × 5] ## 4 Edwards [194 × 5] ## 5 Somerset [182 × 5] ## 6 Northridge_Heights [166 × 5]
As you can see, nest_by() fundementally changes the structure of the dataframe unlike group_by(). This feature becomes useful when you want to apply a model to each row of the nested data.
For example, here is a linear model that uses square footage to predict sale price applied to each neighborhood.
nested_ames_with_model <- nested_ames %>% mutate(linear_model = list(lm(sale_price ~ gr_liv_area, data = data))) head(nested_ames_with_model) ## # A tibble: 6 x 3 ## # Rowwise: neighborhood ## neighborhood data linear_model ## <fct> <list<tbl_df[,5]>> <list> ## 1 North_Ames [443 × 5] <lm> ## 2 College_Creek [267 × 5] <lm> ## 3 Old_Town [239 × 5] <lm> ## 4 Edwards [194 × 5] <lm> ## 5 Somerset [182 × 5] <lm> ## 6 Northridge_Heights [166 × 5] <lm>
It’s important to note that the model must be vectorized, a tranformation performed here with list(). Let’s take a look at the model that was created for the “North_Ames” neighborhood.
north_ames_model <- nested_ames_with_model %>% filter(neighborhood == "North_Ames") %>% pull(linear_model) north_ames_model ## [[1]] ## ## Call: ## lm(formula = sale_price ~ gr_liv_area, data = data) ## ## Coefficients: ## (Intercept) gr_liv_area ## 74537.97 54.61
The model shows that for each additional square foot, a house in the North Ames neighborhood is expected to sell for about $54.61 more.
Additional Mutate arguments
Control what columns are retained with “.keep”
# For example "used" retains only the columns involved in the mutate ames_data %>% mutate(sale_price_euro = sale_price / 1.1, .keep = "used") %>% head() ## # A tibble: 6 x 2 ## sale_price sale_price_euro ## <int> <dbl> ## 1 215000 195455. ## 2 105000 95455. ## 3 172000 156364. ## 4 244000 221818. ## 5 189900 172636. ## 6 195500 177727.
Control where the new columns are added with “.before” and “.after”
# For example, make the "sale_price_euro" column appear to the left of the "sale_price" column like this ames_data %>% mutate( sale_price_euro = sale_price / 1.1, .keep = "used", .before = sale_price ) %>% head() ## # A tibble: 6 x 2 ## sale_price_euro sale_price ## <dbl> <int> ## 1 195455. 215000 ## 2 95455. 105000 ## 3 156364. 172000 ## 4 221818. 244000 ## 5 172636. 189900 ## 6 177727. 195500
Conculsion
This was a short, high level look at my favorite new features coming in dplyr 1.0.0. The two major changes were the addition of across() and slice() which supercede old functionality. across() makes it easy to mutate specific columns or rows in a more intuitive, consistent way. slice() makes similar improvements to data sampling methods. I am also a big fan of the new nest_by() functionality, and plan to search for elegant ways to incorporate it in my upcoming R projects. These changes align dplyr syntax more closely with conventions common in the tidyverse. Thanks tidyverse team for continually pushing the boundaries to make data analytics easier in practice and to learn/teach!
Not all dplyr 1.0.0 changes were covered in this post. Learn more at https://www.tidyverse.org/.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.