Note that this post is a continuation of Part 1 of this series on data manipulation with dplyr in R. The code in this post builds on the variables and objects defined in Part 1.
In the previous post, I talked about how dplyr provides a grammar of sorts to manipulate data, and consists of 5 verbs to do so:
The 5 verbs of dplyr
select – keeps a subset of the columns of a dataset
filter – keeps a subset of the rows of a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics
I went on to discuss examples using select() and mutate(). Let’s now talk about filter(). R comes with a set of logical operators that you can use inside filter(). These operators are:
x < y, TRUE if x is less than y
x <= y, TRUE if x is less than or equal to y
x == y, TRUE if x equals y
x != y, TRUE if x does not equal y
x >= y, TRUE if x is greater than or equal to y
x > y, TRUE if x is greater than y
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
The following call, for example, filters df so that only the observations where the variable a is greater than the variable b are kept:
filter(df, a > b)
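To make the operators above concrete, here is a minimal sketch on a small made-up data frame. The data frame df and its columns a, b and g are hypothetical stand-ins that only mirror the generic names used in the text; they are not the objects from Part 1.

```r
library(dplyr)

# Hypothetical toy data (an assumption, not the data from Part 1)
df <- data.frame(
  a = c(5, 2, 9, 4),
  b = c(3, 6, 1, 4),
  g = c("u", "v", "w", "u")
)

# Keep rows where a is strictly greater than b (rows 1 and 3)
kept_gt <- filter(df, a > b)

# Keep rows where a is greater than or equal to b (rows 1, 3 and 4)
kept_ge <- filter(df, a >= b)

# Keep rows where a differs from b (rows 1, 2 and 3)
kept_ne <- filter(df, a != b)

# Keep rows whose g value appears in the given vector (rows 1, 3 and 4)
kept_in <- filter(df, g %in% c("u", "w"))
```

Each call returns a new data frame; df itself is left unchanged.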
Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following calls are equivalent:
filter(df, a > b & c > d)
filter(df, a > b, c > d)
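The equivalence can be checked on a small example. The sketch below again uses a hypothetical data frame df with columns a, b, c and d (made-up values, not the data from Part 1), and also shows | and !.

```r
library(dplyr)

# Hypothetical toy data (an assumption, not the data from Part 1)
df <- data.frame(
  a = c(5, 2, 9, 4),
  b = c(3, 6, 1, 4),
  c = c(7, 8, 2, 9),
  d = c(1, 9, 5, 3)
)

# Two ways of asking for rows that pass both tests (only row 1 does)
both_amp   <- filter(df, a > b & c > d)
both_comma <- filter(df, a > b, c > d)

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(both_amp, both_comma)))

# | keeps rows that pass at least one test (rows 1, 3 and 4)
either <- filter(df, a > b | c > d)

# ! negates a test: rows where a is NOT greater than b (rows 2 and 4)
negated <- filter(df, !(a > b))
```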
The is.na() function will also come in handy very often. The expression filter(df, !is.na(x)), for example, keeps the observations in df for which the variable x is not NA.
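As a runnable illustration of filtering on missing values, the sketch below uses airquality, a data set that ships with R and has missing Ozone readings. Using airquality here is my own choice for the example; it is not one of the objects from Part 1.

```r
library(dplyr)

# Keep only the rows where Ozone was actually recorded
ozone_known <- filter(airquality, !is.na(Ozone))

# The opposite: keep only the rows where Ozone is missing
ozone_missing <- filter(airquality, is.na(Ozone))

# A comparison against NA evaluates to NA, and filter() drops those rows,
# so this result contains no missing Ozone values either
high_ozone <- filter(airquality, Ozone > 50)
```

Together, ozone_known and ozone_missing account for every row of airquality.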
A recap on select(), mutate() and filter():
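A short sketch of the three verbs used one after another might look like this. It assumes the built-in mtcars data set rather than the data from Part 1, and the column name hp_per_wt is made up for the example.

```r
library(dplyr)

# select(): keep four of the columns of mtcars
recap <- select(mtcars, mpg, cyl, hp, wt)

# mutate(): add a new column computed from existing ones
recap <- mutate(recap, hp_per_wt = hp / wt)

# filter(): keep only four-cylinder cars that do more than 25 miles per gallon
recap <- filter(recap, cyl == 4, mpg > 25)
```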
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).
By default, arrange() arranges the rows from smallest to largest, so rows with the smallest value of the variable appear at the top of the data set. You can reverse this behaviour with the desc() function: arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().
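The ordering rules described above can be sketched as follows. The example assumes the built-in mtcars data set and a small made-up data frame called fruit (both are stand-ins of my own, not objects from Part 1).

```r
library(dplyr)

# Numeric column: smallest mpg first, then largest mpg first
by_mpg      <- arrange(mtcars, mpg)
by_mpg_desc <- arrange(mtcars, desc(mpg))

# Hypothetical toy data with a character column and a factor column
fruit <- data.frame(
  name = c("pear", "apple", "fig"),
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large")),
  stringsAsFactors = FALSE
)

# Character column: alphabetical order (apple, fig, pear)
by_name <- arrange(fruit, name)

# Factor column: order of levels(fruit$size) (small, medium, large)
by_size <- arrange(fruit, size)
```

Note that by_size follows the level order given when the factor was created, not the alphabetical order of the labels.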
summarise(), the last of the 5 verbs, follows the same syntax as mutate(). The difference is in the result: where mutate() adds an entire new column to the dataset, summarise() returns a dataset consisting of a single row.
In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.
Note: summarise() and summarize() both work the same!
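The contrast with mutate() is easy to see side by side. This is a minimal sketch on the built-in mtcars data set (an assumption; any data frame would do), with made-up names for the summary columns.

```r
library(dplyr)

# mutate() keeps every row of mtcars and adds one column
with_mean <- mutate(mtcars, mean_mpg = mean(mpg))

# summarise() builds a new one-row data set holding only the statistics
stats_s <- summarise(mtcars, mean_mpg = mean(mpg), max_hp = max(hp))

# The American spelling gives the same result
stats_z <- summarize(mtcars, mean_mpg = mean(mpg), max_hp = max(hp))
```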
You can use any function you like in summarise(), so long as the function can take a vector of data and return a single value. R contains many aggregating functions. Here are some of the most useful:
min(x) – minimum value of vector x.
max(x) – maximum value of vector x.
mean(x) – mean value of vector x.
median(x) – median value of vector x.
quantile(x, p) – pth quantile of vector x.
sd(x) – standard deviation of vector x.
var(x) – variance of vector x.
IQR(x) – interquartile range (IQR) of vector x.
diff(range(x)) – total range of vector x.
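All of the functions listed above can be used in a single summarise() call. The sketch below assumes the built-in mtcars data set, and the names on the left of each = are made up for the example. Note that p in quantile(x, p) is a probability between 0 and 1.

```r
library(dplyr)

base_stats <- summarise(
  mtcars,
  min_mpg    = min(mpg),
  max_mpg    = max(mpg),
  mean_mpg   = mean(mpg),
  median_mpg = median(mpg),
  # 90th percentile; names = FALSE drops the "90%" label quantile() attaches
  q90_mpg    = quantile(mpg, 0.9, names = FALSE),
  sd_mpg     = sd(mpg),
  var_mpg    = var(mpg),
  iqr_mpg    = IQR(mpg),
  # Width of the interval between the smallest and largest value
  range_mpg  = diff(range(mpg))
)
```

mtcars has no missing values. On a column that contains NA, most of these functions return NA, and quantile() stops with an error, unless you pass na.rm = TRUE.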
dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:
first(x) – The first element of vector x.
last(x) – The last element of vector x.
nth(x, n) – The nth element of vector x.
n() – The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) – The number of unique values in vector x.
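The dplyr helpers work inside summarise() in the same way as the base R functions. This is a minimal sketch on the built-in mtcars data set (an assumption, not the data from Part 1), with made-up result names.

```r
library(dplyr)

dplyr_stats <- summarise(
  mtcars,
  first_mpg = first(mpg),      # mpg of the first row
  last_mpg  = last(mpg),       # mpg of the last row
  third_mpg = nth(mpg, 3),     # mpg of the third row
  n_cars    = n(),             # number of rows: 32
  n_cyl     = n_distinct(cyl)  # distinct cylinder counts (4, 6, 8): 3
)
```

n() takes no arguments and only works inside verbs such as summarise(), mutate() and filter().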
That's it for Part 2 of this series on data manipulation with dplyr. Part 3 will focus on the pipe operator, group_by() and working with databases.