Create a data transformation pipeline
All data transformation functions in dplyr can be connected through the pipe %>% operator to create powerful and yet expressive data transformation pipelines.
- Use the pipe operator %>% to combine multiple dplyr functions into one pipeline
%>% filter(___) %>% select(___) %>% arrange(___)
Using the %>% operator
The pipe operator %>% is a special part of the tidyverse universe. It is used to combine multiple functions and run them one after the other, so that the input of each function is the output of the previous function. Imagine we have the pres_results data frame and want to create a smaller, more transparent data frame for answering the question: In which states was the Democratic Party the most popular choice in the 2016 US presidential election? To accomplish this task we would need to take the following steps:
- filter() the data frame for the rows where the year variable equals 2016.
- select() the two variables state and dem, since we are not interested in the rest of the columns.
- arrange() the filtered and selected data frame based on the dem column in descending order.
The steps and functions described above should be run one after the other, where the input of each function is the output of the previous step. Applying the things you learned so far, you could accomplish this task by taking the following steps:
result <- filter(pres_results, year == 2016)
result <- select(result, state, dem)
result <- arrange(result, desc(dem))
result
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
The first function takes the pres_results data frame, filters it according to the task description and assigns it to the variable result. Then, each subsequent function takes the result variable as input and overwrites it with its own output.
The %>% operator provides a practical way to combine the steps above into what looks like a single step. It takes a data frame as the initial input, then applies a sequence of functions, passing the output of each function on as the input to the next. The same task as above can be accomplished using the pipe operator %>% like this:
pres_results %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
We can interpret the code in the following way:
- We define the original data set as a starting point.
- Using the %>% operator right after the data frame tells R that a function is coming, which takes the previously defined data frame as input.
- We use each function as usual, but skip the first parameter. The data frame input is automatically provided by the output of the previous step.
- As long as we add the %>% operator after a step, R will expect an additional step.
- In our example the pipeline closes with an arrange() function. It gets the filtered and selected version of the pres_results data frame as input and sorts it based on the dem column in descending order. Finally, it gives back the output.
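To make the "skip the first parameter" point concrete, here is a minimal sketch comparing a piped pipeline with the same calls written as nested functions. It assumes dplyr is installed; toy_results is a hypothetical stand-in for pres_results, and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for pres_results (values are made up).
toy_results <- tibble(
  state = c("AA", "BB", "CC", "AA"),
  year  = c(2016, 2016, 2016, 2012),
  dem   = c(0.40, 0.60, 0.50, 0.45)
)

# Piped version: each step receives the previous result as its first argument.
piped <- toy_results %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

# Nested version: the same three calls, written inside out.
nested <- arrange(
  select(filter(toy_results, year == 2016), state, dem),
  desc(dem)
)

# Both approaches should yield the same data frame.
identical(piped, nested)
```

The nested form has to be read from the innermost call outwards, while the piped form reads top to bottom in the order the steps are executed.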
One difference between the two approaches is that the %>% operator does not permanently save the intermediate or final results. To save the resulting data frame we need to assign the output to a variable:
result <- pres_results %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))
result
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
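The following sketch illustrates that a pipeline leaves its input untouched and that only an explicit assignment stores the result. It assumes dplyr is installed; toy_results and top_2016 are hypothetical names, and the values are made up.

```r
library(dplyr)

# Hypothetical toy data (values are made up for illustration).
toy_results <- tibble(
  state = c("AA", "BB", "CC"),
  year  = c(2016, 2016, 2012),
  dem   = c(0.40, 0.60, 0.50)
)

# Without an assignment the pipeline result is only printed, not stored.
toy_results %>%
  filter(year == 2016) %>%
  arrange(desc(dem))

# The input data frame is unchanged and still holds all of its rows.
nrow(toy_results)

# Assigning the pipeline output keeps the result under a new name.
top_2016 <- toy_results %>%
  filter(year == 2016) %>%
  arrange(desc(dem))

nrow(top_2016)
```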
Exercise: Austrian Life Expectancy
Use the %>% operator on the gapminder data set and create a simple data frame to answer the following question: How did the life expectancy in Austria change over the last decades? Required packages are already loaded.
- Define the gapminder data frame as the base data frame.
- Filter only the rows where the country column is equal to Austria by piping gapminder to the filter() function.
- Select only the columns year and lifeExp from the filtered result.
- Arrange the selected columns by the year column.
Exercise: European GDP Per Capita
Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which European country had the highest GDP per capita in 2007? Required packages are already loaded.
- Define the gapminder tibble as the input.
- Filter only the rows where the year column is equal to 2007.
- Use a second layer of filter and keep only the rows where the continent column is equal to Europe.
- Select only the columns country and gdpPercap.
- Arrange the results based on the gdpPercap column in descending order.
Exercise: Americas Population
Use the %>% operator on the gapminder dataset and create a simple tibble to answer the following question: Which country on the continent Americas had the largest population in 2007?
- Define the gapminder tibble as the input.
- Filter only the rows where the year column is equal to 2007.
- Use a second layer of filter and keep only the rows where the continent column is equal to Americas.
- Select only the columns country and pop.
- Arrange the results based on the pop column in descending order.
Quiz: Malformed Code
gapminder %>%
  filter(year == 2007, continent == "Americas") %>%
  select(gapminder, country, pop) %>%
  arrange(desc(pop)) %>%

Take a look at the code above. What mistakes does it contain?
- The gapminder tibble should not be defined in the select() function.
- There should be no %>% applied after the last line.
- There will be no output, because you cannot use these functions in this order.
- The desc() function should be applied on the whole arrange() function and not on a single column.
Create a data transformation pipeline is an excerpt from the course Introduction to R, which is available for free at quantargo.com