- Learn what dplyr does
- Get an overview of Select, Filter and Sort
- Learn what Joins, Aggregations and Pipelines are
What is dplyr?
There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data.
Anthony Goldbloom, Founder and CEO of Kaggle
Clean data is essential in any data science project, because the results can only be as good as the data they are based on. Cleaning data is also usually the most time-consuming part of a project and the biggest source of pain for data scientists. R already offers a broad set of tools and functions to manipulate data frames. However, due to its long history, the available base R toolset is fragmented and hard for new users to learn.
The dplyr package facilitates the data transformation process through a consistent collection of functions. These functions support different transformations on data frames, including:
- filter rows
- select columns
- sort data
- aggregate data
Multiple data frames can also be joined together by common attribute values.
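To make the transformations listed above more concrete, here is a minimal sketch that applies each of them once. The data frames orders and customers, their columns, and all values are invented for illustration. The functions used (filter, select, arrange, group_by, summarise, inner_join) are dplyr functions, but this is only one possible way to express each step.

```r
library(dplyr)

# Hypothetical example data, invented for this sketch
orders <- data.frame(
  order_id    = 1:6,
  customer_id = c(1, 2, 1, 3, 2, 1),
  amount      = c(20, 35, 15, 50, 10, 40)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  name        = c("Ada", "Ben", "Cleo")
)

# Filter rows: keep only orders with an amount of at least 20
large_orders <- filter(orders, amount >= 20)

# Select columns: keep only the order id and the amount
order_amounts <- select(orders, order_id, amount)

# Sort data: largest amount first
sorted_orders <- arrange(orders, desc(amount))

# Aggregate data: total amount per customer
totals <- summarise(group_by(orders, customer_id), total = sum(amount))

# Join: attach the customer name to each order via the shared customer_id
orders_named <- inner_join(orders, customers, by = "customer_id")
```

Every call takes a data frame as its first argument and returns a new data frame, which is what makes the functions easy to combine.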
The consistency of dplyr functions improves usability and enables users to connect transformations together to form data pipelines. These pipelines can also be seen as a high-level query language, much like SQL for database queries. It is even possible to translate data pipelines to other backends, including databases.
Quiz: dplyr Facts
Which of the statements below are correct?
- dplyr provides a consistent set of functions for data visualization
- dplyr functions can be connected to form data pipelines
- dplyr queries can be translated to database queries
- dplyr supports data transformations like aggregations and joins
- dplyr is built for vector transformations
Every data transformation function in dplyr accepts a data frame as its first input parameter and returns the transformed data frame as its output. A blueprint for a typical dplyr function looks like this:
transformed <- dplyr_function(my_data_frame, param_one, param_two, ...)
dplyr_function can be customized further through additional arguments (param_one, param_two) placed after the first data frame parameter (my_data_frame).
The real power of dplyr comes with the pipe operator %>%, which allows users to chain dplyr functions into data pipelines. The pipe injects the data frame resulting from the previous calculation as the first argument of the next one. A data transformation consisting of three functions looks like
dplyr_function_three( dplyr_function_two( dplyr_function_one(my_data_frame)))
but can be written with the pipe as
my_data_frame %>% dplyr_function_one() %>% dplyr_function_two() %>% dplyr_function_three()
Because the functions in a pipeline appear in the same order in which they are applied, pipelines are easier to read than nested function calls, which have to be read from the inside out.
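The blueprint and the two notations can be tried out on a small example. The sketch below assumes a made-up data frame scores with invented columns and values. It first calls a single dplyr function with the data frame as the first argument and further arguments after it, then writes the same three-step transformation once as nested calls and once as a pipeline.

```r
library(dplyr)

# Hypothetical example data, invented for this sketch
scores <- data.frame(
  student     = c("Ana", "Bo", "Cy", "Di"),
  points      = c(72, 91, 58, 85),
  grade_level = c(9, 10, 9, 10)
)

# Blueprint: data frame first, additional arguments afterwards
passed_level_10 <- filter(scores, points >= 60, grade_level == 10)

# Nested calls: read from the inside out (filter, then arrange, then select)
nested <- select(
  arrange(
    filter(scores, points >= 60),
    desc(points)
  ),
  student, points
)

# Pipeline: read from top to bottom, in the order the steps are applied
piped <- scores %>%
  filter(points >= 60) %>%
  arrange(desc(points)) %>%
  select(student, points)

# Both notations describe the same transformation
all.equal(nested, piped)
```

In the piped version, each function call omits its first argument because the pipe supplies the data frame from the previous step.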
Quiz: Valid Functions
dplyr_function specifies the transformation function, param_one the parameter for the dplyr function and input_data_frame the data frame to be transformed. Which of the code lines below are valid according to the dplyr function framework?