One of the most useful (and most popular) tools in R is the set of functions in the dplyr package. With functions like select, filter, arrange, and mutate, you can restructure a data set to get it looking just the way you want it. The problem is that doing so can take multiple steps. As a result, you either end up creating a bunch of extraneous objects to keep your work organized, or you end up cramming everything into one long, convoluted line of nested functions. Is there a better way to write cleaner code with dplyr? Let’s have a look…
### import education expenditure data set and assign column names
education <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) <- c("X","State","Region","Urban.Population","Per.Capita.Income","Minor.Population","Education.Expenditures")
View(education)
First, we’ve taken a data set on education expenditures by state and given the columns appropriate names. For a more detailed explanation on ways to subset this data set, visit this post. Here’s a snapshot of what the first half of the data set looks like:
Now, let’s suppose we are tasked with answering a very specific question:
Which states in the Midwestern region of the United States have the highest and lowest education expenditures per minor resident (that is, per child)?
Let’s use the dplyr functions to filter this information from the data set–one step at a time…
### Load dplyr, which provides filter, select, mutate, and arrange
library(dplyr)
### Filter for Region 2
ed_exp1 <- filter(education, Region == 2)
### Select the State, Minor Population, and Education Expenditures columns
ed_exp2 <- select(ed_exp1, c(State, Minor.Population, Education.Expenditures))
### Add a column for the Expenditures Per Child
ed_exp3 <- mutate(ed_exp2, Expenditures.Per.Child = Education.Expenditures / Minor.Population)
### Arrange the data set to sort by Expenditures.Per.Child
ed_exp4 <- arrange(ed_exp3, desc(Expenditures.Per.Child))
Building our data frame this way, we create four separate objects to reach our goal. With each activity, we assign a new object and then feed that object as the new data frame into the next activity. We first filter the original data set, creating ed_exp1. Then, we apply the select function on ed_exp1, creating ed_exp2, and so on until we end up with our final result in ed_exp4. And, sure enough, this method works:
We can now answer our question: Ohio spends the least amount per child and Minnesota spends the most.
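Because the data frame is sorted in descending order, the highest and lowest values sit in the first and last rows. Here is a minimal sketch of pulling those two rows out with head and tail. The data frame `toy` and its values are made up for illustration and are not taken from the education data set.

```r
library(dplyr)

# Made-up values for illustration only; not from the education data set
toy <- data.frame(
  State = c("AA", "BB", "DD"),
  Expenditures.Per.Child = c(0.8, 1.1, 0.6),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest, mirroring the arrange step above
ranked <- arrange(toy, desc(Expenditures.Per.Child))

# With a descending sort, the first row is the highest and the last row is the lowest
highest <- head(ranked, 1)
lowest  <- tail(ranked, 1)

highest$State
lowest$State
```

The same two calls should work on ed_exp4, assuming it was built as shown above.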
That being said, ed_exp4 is not the only data frame we’ve created. In getting our result, we have created several intermediate objects. We have no use for ed_exp1, ed_exp2, or ed_exp3. The final result, which we’ve called ed_exp4, is the only revised data frame we care about. And yet, these other three data sets are taking up space in our workspace:
None of these subsets gives us the complete information we need to answer our question. All we need is the final result, ed_exp4. So, is there a way to get to ed_exp4 without creating the first three objects? Yes, there is, but it’s a little tricky…
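If you want to check for leftover objects yourself, base R can list them, estimate their size, and remove them. This is only a sketch: `tmp1`, `tmp2`, and `tmp3` are hypothetical placeholders standing in for ed_exp1, ed_exp2, and ed_exp3.

```r
# Hypothetical placeholder objects standing in for ed_exp1, ed_exp2, and ed_exp3
tmp1 <- data.frame(x = 1:3)
tmp2 <- data.frame(x = 4:6)
tmp3 <- data.frame(x = 7:9)

# ls() lists the objects currently in the workspace
ls()

# object.size() gives an estimate of the memory an object occupies
print(object.size(tmp1), units = "Kb")

# rm() removes objects that are no longer needed
rm(tmp1, tmp2, tmp3)

# Check whether tmp1 can still be found after removal
exists("tmp1")
```

Cleaning up by hand like this works, but it is an extra chore every time you build a data frame in several steps.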
### Create final result using a single nested function
ed_exp5 <- arrange(mutate(select(filter(education, Region == 2),c(State,Minor.Population, Education.Expenditures)), Expenditures.Per.Child = Education.Expenditures / Minor.Population),desc(Expenditures.Per.Child))
So, what is happening in this long, convoluted line of code? Instead of saving each intermediate result, we pass each function call directly as the data frame argument of the next function. The innermost function, filter, creates the result that serves as the data frame for the select function, and the code builds outward from there to our last activity, arrange. As we see below, ed_exp5 gives us the same result as ed_exp4, and we only have to create one object.
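To see the inside-out evaluation on a smaller scale, here is a sketch with just two steps. The data frame `toy` and its values are invented for illustration; only the column names mirror the ones used in this post.

```r
library(dplyr)

# Made-up values for illustration only; not from the education data set
toy <- data.frame(
  State = c("AA", "BB", "CC", "DD"),
  Region = c(2, 2, 1, 2),
  Minor.Population = c(300, 320, 310, 350),
  Education.Expenditures = c(240, 352, 200, 210),
  stringsAsFactors = FALSE
)

# Step by step, saving an intermediate object along the way
step1 <- filter(toy, Region == 2)
step2 <- mutate(step1, Expenditures.Per.Child = Education.Expenditures / Minor.Population)

# Nested: filter() runs first, and its result becomes the first argument of mutate()
nested <- mutate(
  filter(toy, Region == 2),
  Expenditures.Per.Child = Education.Expenditures / Minor.Population
)

# Check that both approaches produce the same data frame
identical(step2, nested)
```

Even with two steps, the condition `Region == 2` ends up in the middle of the call, away from the function name that reads first.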
The downside to using this method is rather obvious: it’s too complicated! Sure, we save space by not creating extraneous variables, but the trade-off is that we have a long line of code that’s difficult to understand. The more activities we do to create our resulting data frame, the farther apart our arguments will get from the functions we are trying to apply to them. Sooner or later, mistakes will become inevitable.
But there is a fix even for this! When you load dplyr, you also get the “piping” operator, which dplyr re-exports from the magrittr package. It essentially does the same thing as nesting functions does, but it’s a lot cleaner. Let’s have a look at the code…
### Create final result using the piping operator
ed_exp6 <- education %>%
  filter(Region == 2) %>%
  select(c(State, Minor.Population, Education.Expenditures)) %>%
  mutate(Expenditures.Per.Child = Education.Expenditures / Minor.Population) %>%
  arrange(desc(Expenditures.Per.Child))
The piping operator, written as “%>%”, passes the object on its left as the first argument to the function on its right. In other words…
education %>% filter(Region == 2)
is the same thing as…
filter(education, Region == 2)
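You can check this equivalence yourself. The sketch below uses a small made-up data frame called `toy` (not the education data set) and compares the piped form against the ordinary function call.

```r
library(dplyr)

# Made-up values for illustration only; not from the education data set
toy <- data.frame(
  State = c("AA", "BB", "CC", "DD"),
  Region = c(2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# The pipe places `toy` in the first argument slot of filter()
piped  <- toy %>% filter(Region == 2)
direct <- filter(toy, Region == 2)
identical(piped, direct)

# Any remaining arguments stay where they are; only the first one moves to the left
piped_cols  <- toy %>% select(State, Region)
direct_cols <- select(toy, State, Region)
identical(piped_cols, direct_cols)
```

Note the double equals sign in the condition: `==` tests for equality, while a single `=` would be read as a named argument.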
You simply continue linking the chain, or “extending the pipe,” all the way down to your last action. In our case, the final action is to arrange the data set, so that’s where our pipe ends.
So, the moment of truth–does the piping operator give us the result we’re looking for?
Indeed, it does! And this time we’re only creating a single object, and the code is much, much cleaner.
Pretty cool, huh?