Exploring R Packages – plyr

[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers.]

In this post, we explore the functionality provided by the plyr package. The ideas behind this package are described in this paper by Hadley Wickham. However, rather than trying to understand the theoretical underpinnings of the package, we look at some of the useful functions provided by this package and how they work.

Anyone using R seriously will have come across the apply family of functions. In simple terms, these functions allow you to loop over a collection of objects and call a function for each object in the collection. Depending on which function you are using, the argument names or the output may be different.

The first set of useful functions provided by the plyr package is llply, ldply, laply, dlply, ddply, daply, alply, adply and aaply. While this may look like a lot of functions, the naming scheme is simple. l stands for list, d for data frame and a for array. The first letter represents the input type and the second letter the output type. All of these functions share the suffix ply.
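
The naming scheme can be spelled out mechanically. The short sketch below builds the nine names from the three type letters and checks that each one is exported by plyr. The object names types, grid and fn_names are ours and not part of the package.

```r
library(plyr)

# One letter per supported type; input and output share the same letters.
types <- c(l = "list", d = "data frame", a = "array")

# All input/output combinations (the output letter varies fastest).
grid <- expand.grid(
  output = names(types),
  input = names(types),
  stringsAsFactors = FALSE
)

# <input letter><output letter>ply
fn_names <- paste0(grid$input, grid$output, "ply")
fn_names

# Every generated name should be an object exported by plyr.
all(fn_names %in% ls("package:plyr"))
```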

Let us look at some simple examples where the input is a list.

library(plyr)

l <- list(
  x = rnorm(10),
  y = runif(10),
  z = rpois(10, 1)
)
llply(l, mean)
## $x
## [1] 0.1418481
## 
## $y
## [1] 0.5988205
## 
## $z
## [1] 1.3
ldply(l, mean)
##   .id        V1
## 1   x 0.1418481
## 2   y 0.5988205
## 3   z 1.3000000
laply(l, mean)
## [1] 0.1418481 0.5988205 1.3000000

In the examples above, the function mean, which is applied to each element of the list, returns a single value. This is reflected in the output. llply returns a list of three elements with the same names as those provided in l. ldply returns a data frame in which the variables .id and V1 are created automatically, while laply returns an array (in this case, a vector).
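
For readers coming from the base apply family, it may help to see rough base R counterparts of these three calls. This is only a sketch for the single-value case; the names base_list, base_vec and base_df are ours.

```r
library(plyr)

l <- list(x = rnorm(10), y = runif(10), z = rpois(10, 1))

# llply is close to lapply: a named list in, a named list out.
base_list <- lapply(l, mean)

# laply is close to vapply: one value per element, collected into a vector.
base_vec <- vapply(l, mean, numeric(1))

# ldply has no single base counterpart; one way is to build the data frame by hand.
base_df <- data.frame(.id = names(l), V1 = unname(base_vec))

# The values agree with the plyr results.
all.equal(as.numeric(unlist(llply(l, mean))), as.numeric(unlist(base_list)))
all.equal(as.numeric(laply(l, mean)), unname(base_vec))
all.equal(ldply(l, mean)$V1, base_df$V1)
```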

Now let us look at an example where the function which is being applied returns a vector.

llply(l, sample, 3)
## $x
## [1] -1.65372495  0.71552275 -0.03883335
## 
## $y
## [1] 0.6828394 0.6145224 0.7873703
## 
## $z
## [1] 1 2 1
ldply(l, sample, 3)
##   .id         V1         V2         V3
## 1   x -0.6725610 -1.6537250 -0.4180975
## 2   y  0.5729277  0.8648463  0.9501068
## 3   z  2.0000000  1.0000000  2.0000000
laply(l, sample, 3)
##              1          2          3
## [1,] 1.7907078 -1.6537250 -1.4636092
## [2,] 0.6145224  0.2937731  0.2525297
## [3,] 1.0000000  1.0000000  1.0000000

While the llply output is fairly obvious, the outputs of ldply and laply are more interesting. In these outputs, each row corresponds to an element of l and each column to one of the values returned by the function being applied. For example, laply returns a matrix in which each row corresponds to an element of l, while the values in the columns are those returned by the sample function. Because sample draws a fresh random sample on every call, the values differ between the three outputs. Since we are sampling three elements from each element of l, there are three columns. Sampling five elements would result in five columns.

laply(l, sample, 5)
##               1          2           3         4         5
## [1,] -1.4636092 -0.6725610 -0.05967467 1.7907078 1.8692739
## [2,]  0.5729277  0.8648463  0.61452238 0.9501068 0.6828394
## [3,]  0.0000000  1.0000000  1.00000000 2.0000000 2.0000000
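
Because sample is random, the shapes above are easier to verify with a deterministic function. The sketch below uses range, which always returns two values, so we would expect a 3 x 2 matrix from laply and two value columns from ldply. The names as_matrix and as_frame are ours.

```r
library(plyr)

l <- list(x = rnorm(10), y = runif(10), z = rpois(10, 1))

# range() always returns two values (min, max), so the shape is predictable.
as_matrix <- laply(l, range)
as_frame <- ldply(l, range)

dim(as_matrix)    # one row per list element, one column per returned value
names(as_frame)   # .id plus automatically named value columns
```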

What happens when the function being applied returns a data frame? Let’s try it out.

f <- function(x) {
  data.frame(dx = sample(x, 2))
}
llply(l, f)
## $x
##            dx
## 1 -0.41809752
## 2 -0.03883335
## 
## $y
##          dx
## 1 0.6828394
## 2 0.9456604
## 
## $z
##   dx
## 1  1
## 2  2
ldply(l, f)
##   .id        dx
## 1   x 0.7155227
## 2   x 1.3494772
## 3   y 0.7873703
## 4   y 0.8648463
## 5   z 2.0000000
## 6   z 0.0000000
laply(l, f)
##      1         2        
## [1,] Numeric,2 Numeric,2
## [2,] Integer,2 NULL     
## [3,] NULL      NULL

The first two outputs are fairly obvious. llply returns a list where each element is a data frame. ldply returns a single data frame where the .id variable is the same as before but, instead of V1, the column takes the name dx used in the function. One might expect laply to return a 3 x 2 matrix in which each row holds the values of the dx column for one element of l. However, while it does create a 3 x 2 matrix, its cells hold the three results as lists, filled in row-major order, and the remaining cells are set to NULL. Using the code below, we verify that each element of the matrix is actually a list. We will explore in a separate post why this happens.

x <- laply(l, f)
class(x[1, 1])
## [1] "list"
class(x[3, 1])
## [1] "list"
x[1, 1]
## $`1`
## [1] -0.4180975  1.8692739
x[3, 1]
## $`1`
## NULL

The functions dlply, ddply and daply work slightly differently. Rather than iterating over each row or column of a data frame, they apply a function to each subset of the data frame defined by the values of one or more of its variables. Let’s take a look at some simple examples first.

f <- function(df) {
  mean(df[["disp"]], na.rm = TRUE)
}
dlply(mtcars, "cyl", f)
## $`4`
## [1] 105.1364
## 
## $`6`
## [1] 183.3143
## 
## $`8`
## [1] 353.1
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   cyl
## 1   4
## 2   6
## 3   8
ddply(mtcars, "cyl", f)
##   cyl       V1
## 1   4 105.1364
## 2   6 183.3143
## 3   8 353.1000
daply(mtcars, "cyl", f)
##        4        6        8 
## 105.1364 183.3143 353.1000

Here we are splitting the built-in mtcars dataset by cyl (the number of cylinders), and applying a function f which returns the mean of disp (displacement).
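
The same split, apply and combine steps can be written out by hand in base R, which may make it clearer what these functions do internally. This is a sketch of the idea and not plyr's actual implementation; pieces, results and by_hand are our own names.

```r
library(plyr)

f <- function(df) mean(df[["disp"]], na.rm = TRUE)

# Split: one data frame per distinct value of cyl.
pieces <- split(mtcars, mtcars$cyl)

# Apply: call f on each piece.
results <- lapply(pieces, f)

# Combine: here into a named numeric vector, much as daply does.
by_hand <- unlist(results)

by_hand
all.equal(as.numeric(by_hand), ddply(mtcars, "cyl", f)$V1)
```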

We can split by more than one variable.

dlply(mtcars, c("cyl", "gear"), f)
## $`4.3`
## [1] 120.1
## 
## $`4.4`
## [1] 102.625
## 
## $`4.5`
## [1] 107.7
## 
## $`6.3`
## [1] 241.5
## 
## $`6.4`
## [1] 163.8
## 
## $`6.5`
## [1] 145
## 
## $`8.3`
## [1] 357.6167
## 
## $`8.5`
## [1] 326
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   cyl gear
## 1   4    3
## 2   4    4
## 3   4    5
## 4   6    3
## 5   6    4
## 6   6    5
## 7   8    3
## 8   8    5
ddply(mtcars, c("cyl", "gear"), f)
##   cyl gear       V1
## 1   4    3 120.1000
## 2   4    4 102.6250
## 3   4    5 107.7000
## 4   6    3 241.5000
## 5   6    4 163.8000
## 6   6    5 145.0000
## 7   8    3 357.6167
## 8   8    5 326.0000
daply(mtcars, c("cyl", "gear"), f)
##    gear
## cyl        3       4     5
##   4 120.1000 102.625 107.7
##   6 241.5000 163.800 145.0
##   8 357.6167      NA 326.0

Note that there are no records in the data with cyl equal to 8 and gear equal to 4. While dlply and ddply do not create an element or row for this combination, daply returns a full cyl by gear array, so the corresponding cell is present and its value is simply NA.
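
A quick way to see the difference is to compare the number of cells in the daply result with the number of rows in the ddply result. The sketch below does this and locates the empty cell; as_array and as_frame are our own names.

```r
library(plyr)

f <- function(df) mean(df[["disp"]], na.rm = TRUE)

as_array <- daply(mtcars, c("cyl", "gear"), f)
as_frame <- ddply(mtcars, c("cyl", "gear"), f)

# The array has a cell for every cyl x gear pair, observed or not.
length(as_array)

# The data frame only has rows for the combinations that occur.
nrow(as_frame)

# Locate the empty cell(s) in the array.
which(is.na(as_array), arr.ind = TRUE)
```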

Instead of a character vector, the variable names can also be passed using the . function in plyr.

ddply(mtcars, .(cyl, gear), f)
##   cyl gear       V1
## 1   4    3 120.1000
## 2   4    4 102.6250
## 3   4    5 107.7000
## 4   6    3 241.5000
## 5   6    4 163.8000
## 6   6    5 145.0000
## 7   8    3 357.6167
## 8   8    5 326.0000
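
As a small check, the two ways of naming the grouping variables can be compared directly. The sketch below assumes the same f as above; by_string and by_quote are our own names.

```r
library(plyr)

f <- function(df) mean(df[["disp"]], na.rm = TRUE)

by_string <- ddply(mtcars, c("cyl", "gear"), f)
by_quote <- ddply(mtcars, .(cyl, gear), f)

# Both ways of naming the grouping variables give the same result.
all.equal(by_string, by_quote)

# .() captures the names unevaluated; the result has class "quoted".
class(.(cyl, gear))
```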

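What if f returns more than one element? The results follow the same pattern as before, with one caveat. The groups with a single record (cyl 4 with gear 3, and cyl 6 with gear 5) pass a single number to sample, which then samples from 1:floor(x) rather than from the value itself, so the numbers shown for these two groups are not actual displacements.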

f <- function(df) {
  sample(df[["disp"]], 2)
}
dlply(mtcars, c("cyl", "gear"), f)
## $`4.3`
## [1] 60 23
## 
## $`4.4`
## [1] 140.8  71.1
## 
## $`4.5`
## [1]  95.1 120.3
## 
## $`6.3`
## [1] 258 225
## 
## $`6.4`
## [1] 160 160
## 
## $`6.5`
## [1]  74 101
## 
## $`8.3`
## [1] 275.8 472.0
## 
## $`8.5`
## [1] 351 301
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   cyl gear
## 1   4    3
## 2   4    4
## 3   4    5
## 4   6    3
## 5   6    4
## 6   6    5
## 7   8    3
## 8   8    5
ddply(mtcars, c("cyl", "gear"), f)
##   cyl gear    V1    V2
## 1   4    3  77.0  88.0
## 2   4    4 121.0  75.7
## 3   4    5  95.1 120.3
## 4   6    3 225.0 258.0
## 5   6    4 167.6 160.0
## 6   6    5  49.0  31.0
## 7   8    3 275.8 460.0
## 8   8    5 351.0 301.0
daply(mtcars, c("cyl", "gear"), f)
## , ,  = 1
## 
##    gear
## cyl   3   4     5
##   4 117 108 120.3
##   6 258 160  78.0
##   8 460  NA 301.0
## 
## , ,  = 2
## 
##    gear
## cyl   3     4     5
##   4  83  78.7  95.1
##   6 225 160.0 139.0
##   8 350    NA 351.0
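
The values reported for the single-record groups, such as 60 and 23 for cyl 4 and gear 3, come from a documented quirk of base sample(). The sketch below reproduces it and shows one possible way around it. safe_sample is a hypothetical helper of ours and not part of plyr or base R.

```r
# A group with one row hands sample() a single number.
one_value <- 120.1   # disp for the only car with cyl = 4 and gear = 3

# For a single number >= 1, sample() draws from 1:floor(x) instead.
draws <- sample(one_value, 2)
all(draws %in% 1:120)

# Hypothetical helper: sample positions, then index, so values always come from x.
# Falls back to sampling with replacement when x is shorter than size.
safe_sample <- function(x, size) {
  x[sample.int(length(x), size, replace = length(x) < size)]
}

safe_sample(one_value, 2)
```

Passing safe_sample(df[["disp"]], 2) inside f would keep every returned value within the group's own displacements, at the cost of repeating the value when a group has fewer than two rows.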

What if f returns a linear model object? In this case, dlply works as expected, but ddply and daply throw errors because the output of f cannot be converted to a data frame or an array.

f <- function(df) {
  lm(mpg ~ disp + wt, data = df)
}
dlply(mtcars, "cyl", f)
## $`4`
## 
## Call:
## lm(formula = mpg ~ disp + wt, data = df)
## 
## Coefficients:
## (Intercept)         disp           wt  
##     41.1350      -0.1225      -0.6978  
## 
## 
## $`6`
## 
## Call:
## lm(formula = mpg ~ disp + wt, data = df)
## 
## Coefficients:
## (Intercept)         disp           wt  
##    28.18835      0.01914     -3.83510  
## 
## 
## $`8`
## 
## Call:
## lm(formula = mpg ~ disp + wt, data = df)
## 
## Coefficients:
## (Intercept)         disp           wt  
##   24.077855    -0.002512    -2.023138  
## 
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   cyl
## 1   4
## 2   6
## 3   8
ddply(mtcars, "cyl", f)
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results must be all atomic, or all data frames
daply(mtcars, "cyl", f)
## Error: Results must have one or more dimensions.
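
One common way around these errors is to return something atomic, such as the coefficient vector, instead of the model object itself. The sketch below shows two variants under that assumption; coef_by_group, coefs, models and r_squared are our own names.

```r
library(plyr)

# Return the coefficient vector rather than the fitted model object.
coef_by_group <- function(df) {
  coef(lm(mpg ~ disp + wt, data = df))
}

# One row per cyl value, one column per coefficient.
coefs <- ddply(mtcars, "cyl", coef_by_group)
coefs

# Alternatively, keep the models in a list and extract from them afterwards.
models <- dlply(mtcars, "cyl", function(df) lm(mpg ~ disp + wt, data = df))
r_squared <- laply(models, function(m) summary(m)$r.squared)
r_squared
```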

Finally, alply, adply and aaply apply functions to the margins of an array. A margin can be the rows, the columns or a higher dimension, depending on the dimensions of the array to which we want to apply a function.

Let’s start with a simple example of calculating the row or column means of a matrix.

m <- matrix(rnorm(20), nrow = 4, ncol = 5)
m
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  1.6567853 -0.8003669  0.1643759  0.8915022  0.2456789
## [2,]  0.9842877 -0.5195872  0.7050530 -1.3335945 -0.5600941
## [3,] -0.2289117  0.2849420 -1.0549861 -0.9702475  0.7347813
## [4,]  0.5323479  1.0536352  1.0766656  1.1427620 -1.4998437
alply(m, 1, mean)
## $`1`
## [1] 0.4315951
## 
## $`2`
## [1] -0.144787
## 
## $`3`
## [1] -0.2468844
## 
## $`4`
## [1] 0.4611134
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3
## 4  4
alply(m, 2, mean)
## $`1`
## [1] 0.7361273
## 
## $`2`
## [1] 0.004655771
## 
## $`3`
## [1] 0.2227771
## 
## $`4`
## [1] -0.06739445
## 
## $`5`
## [1] -0.2698694
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
adply(m, 1, mean)
##   X1         V1
## 1  1  0.4315951
## 2  2 -0.1447870
## 3  3 -0.2468844
## 4  4  0.4611134
adply(m, 2, mean)
##   X1           V1
## 1  1  0.736127310
## 2  2  0.004655771
## 3  3  0.222777074
## 4  4 -0.067394455
## 5  5 -0.269869415
aaply(m, 1, mean)
##          1          2          3          4 
##  0.4315951 -0.1447870 -0.2468844  0.4611134
aaply(m, 2, mean)
##            1            2            3            4            5 
##  0.736127310  0.004655771  0.222777074 -0.067394455 -0.269869415

Since we have a 4 x 5 matrix, the dimensions of the object being returned depend on the margin. So aaply(m, 1, mean) returns a vector of length 4, while aaply(m, 2, mean) returns a vector of length 5.
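
The margin argument has the same meaning as in base apply(), so the results can be cross-checked against rowMeans, colMeans and apply. A minimal sketch:

```r
library(plyr)

m <- matrix(rnorm(20), nrow = 4, ncol = 5)

# Margin 1 = rows, margin 2 = columns, as in base apply().
all.equal(as.numeric(aaply(m, 1, mean)), rowMeans(m))
all.equal(as.numeric(aaply(m, 2, mean)), colMeans(m))
all.equal(as.numeric(aaply(m, 1, mean)), apply(m, 1, mean))
```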

And finally, one more example where the function applied returns a vector of two elements.

alply(m, 1, sample, 2)
## $`1`
## [1] -0.8003669  1.6567853
## 
## $`2`
## [1] -1.3335945 -0.5195872
## 
## $`3`
## [1] -0.2289117 -0.9702475
## 
## $`4`
## [1] 1.076666 1.142762
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3
## 4  4
alply(m, 2, sample, 2)
## $`1`
## [1] 0.9842877 1.6567853
## 
## $`2`
## [1] 1.053635 0.284942
## 
## $`3`
## [1] 0.7050530 0.1643759
## 
## $`4`
## [1]  1.1427620 -0.9702475
## 
## $`5`
## [1] -0.5600941 -1.4998437
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
adply(m, 1, sample, 2)
##   X1        V1         V2
## 1  1 0.2456789  0.1643759
## 2  2 0.9842877 -1.3335945
## 3  3 0.2849420 -1.0549861
## 4  4 0.5323479 -1.4998437
adply(m, 2, sample, 2)
##   X1        V1         V2
## 1  1 0.5323479 -0.2289117
## 2  2 1.0536352 -0.5195872
## 3  3 0.1643759 -1.0549861
## 4  4 0.8915022 -1.3335945
## 5  5 0.2456789 -0.5600941
aaply(m, 1, sample, 2)
##    
## X1           1          2
##   1  0.1643759  0.8915022
##   2 -0.5600941  0.7050530
##   3  0.2849420 -0.9702475
##   4  0.5323479 -1.4998437
aaply(m, 2, sample, 2)
##    
## X1           1         2
##   1 -0.2289117 0.5323479
##   2 -0.8003669 0.2849420
##   3  1.0766656 0.1643759
##   4 -0.9702475 1.1427620
##   5 -1.4998437 0.2456789

Another useful function provided by the plyr package is arrange(), which sorts a dataset by one or more columns. The traditional way to do this in R is to use the order function, but passing the dataset to arrange followed by the column names saves a lot of typing.

d <- data.frame(
  l = sample(letters[1:10], 10, replace = TRUE),
  x = rnorm(10),
  stringsAsFactors = FALSE
)
arrange(d, x)
##    l          x
## 1  f -1.0594373
## 2  c -0.9824473
## 3  a -0.7180118
## 4  a -0.5163577
## 5  i -0.4262763
## 6  a -0.1782782
## 7  i -0.1406105
## 8  f  0.2550276
## 9  d  0.3105578
## 10 c  0.3984060
arrange(d, l, x)
##    l          x
## 1  a -0.7180118
## 2  a -0.5163577
## 3  a -0.1782782
## 4  c -0.9824473
## 5  c  0.3984060
## 6  d  0.3105578
## 7  f -1.0594373
## 8  f  0.2550276
## 9  i -0.4262763
## 10 i -0.1406105

In the first example above, we sort the dataset using the variable x. In the second example, we first sort it by l and then by x for each value of l.

The desc() function can be used to sort in descending order.

arrange(d, l, desc(x))
##    l          x
## 1  a -0.1782782
## 2  a -0.5163577
## 3  a -0.7180118
## 4  c  0.3984060
## 5  c -0.9824473
## 6  d  0.3105578
## 7  f  0.2550276
## 8  f -1.0594373
## 9  i -0.1406105
## 10 i -0.4262763
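
For comparison, here is the same sorting written with base order(). The sketch assumes a data frame shaped like d above; the with_* names are ours.

```r
library(plyr)

d <- data.frame(
  l = sample(letters[1:10], 10, replace = TRUE),
  x = rnorm(10),
  stringsAsFactors = FALSE
)

# Ascending by l, then ascending by x within l.
with_order <- d[order(d$l, d$x), ]
with_arrange <- arrange(d, l, x)

# Ascending by l, then descending by x within l.
with_order_desc <- d[order(d$l, -d$x), ]
with_arrange_desc <- arrange(d, l, desc(x))

# Row names differ (order keeps the originals), so compare the values only.
all.equal(with_order$x, with_arrange$x)
all.equal(with_order_desc$x, with_arrange_desc$x)
```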

The function colwise() turns a function which operates on a vector into one which operates on each column of a data frame. For example, to get the range of each variable in the mtcars dataset, we can use the following.

colwise(range)(mtcars)
##    mpg cyl  disp  hp drat    wt qsec vs am gear carb
## 1 10.4   4  71.1  52 2.76 1.513 14.5  0  0    3    1
## 2 33.9   8 472.0 335 4.93 5.424 22.9  1  1    5    8

Look at the examples provided in the documentation for this function for more use cases.
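
As one such use case, the sketch below stores the wrapped function under a name of our own (col_range), compares it with a base sapply call, and uses the related numcolwise() on iris, a built-in dataset that mixes numeric columns with a factor.

```r
library(plyr)

# colwise(f) returns a new function; calling it applies f to every column.
col_range <- colwise(range)
col_range(mtcars)

# The same numbers from base R, as a matrix rather than a data frame.
sapply(mtcars, range)

# numcolwise() restricts the wrapped function to numeric columns,
# which is convenient when a data frame also holds factors or strings.
numcolwise(mean)(iris)
```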

count() allows you to count the number of occurrences of each value of a variable, or of each combination of values of several variables. It excludes combinations with a count of zero.

count(mtcars, "cyl")
##   cyl freq
## 1   4   11
## 2   6    7
## 3   8   14
count(mtcars, c("cyl", "gear"))
##   cyl gear freq
## 1   4    3    1
## 2   4    4    8
## 3   4    5    2
## 4   6    3    2
## 5   6    4    4
## 6   6    5    1
## 7   8    3   12
## 8   8    5    2
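
The difference from base table() is mainly in the shape of the result and in how empty combinations are handled. A short sketch, with counted and tabled as our own names:

```r
library(plyr)

# Single variable: count() and table() agree on the frequencies.
counted <- count(mtcars, "cyl")
tabled <- table(mtcars$cyl)
all(counted$freq == as.vector(tabled))

# Two variables: table() keeps the empty cyl = 8, gear = 4 cell, count() drops it.
nrow(count(mtcars, c("cyl", "gear")))
length(table(mtcars$cyl, mtcars$gear))
sum(table(mtcars$cyl, mtcars$gear) == 0)
```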

mapvalues() and revalue() allow you to replace elements in a vector or factor.

l <- sample(c(NA, letters[1:10]), 20, replace = TRUE)
l
##  [1] "a" "g" "j" NA  "e" "c" "d" NA  "h" "j" "i" "f" NA  "g" "c" "b" NA 
## [18] "i" "h" NA
mapvalues(l, NA, "z") # convert missings to z
##  [1] "a" "g" "j" "z" "e" "c" "d" "z" "h" "j" "i" "f" "z" "g" "c" "b" "z"
## [18] "i" "h" "z"
revalue(l, c("a" = "z")) # convert a's to z
##  [1] "z" "g" "j" NA  "e" "c" "d" NA  "h" "j" "i" "f" NA  "g" "c" "b" NA 
## [18] "i" "h" NA
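
The two functions differ in how the replacements are specified. The sketch below uses small made-up vectors (codes and grades are our own examples) to show mapvalues on a numeric vector and revalue on a factor.

```r
library(plyr)

# mapvalues() takes parallel from/to vectors and works on any atomic vector.
codes <- c(1, 2, 3, 2, 1)
mapvalues(codes, from = c(1, 3), to = c(10, 30))

# revalue() takes a named character vector (old = "new") and is meant for
# character vectors and factors.
grades <- factor(c("a", "b", "a", "c"))
revalue(grades, c(a = "excellent", b = "good"))
```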

mutate() allows you to create new variables in a dataset. Note that in the example below, we are using z in the definition of w where z is itself created by the mutate function.

d <- data.frame(
  x = rnorm(10),
  y = rnorm(10)
)
mutate(d, z = x + y, w = log(abs(z)))
##              x           y           z          w
## 1  -0.95250986  0.56975469 -0.38275518 -0.9603597
## 2  -1.44871677  1.05125632 -0.39746045 -0.9226598
## 3  -1.50057747  0.41648489 -1.08409257  0.0807433
## 4   0.63543431  0.08208207  0.71751637 -0.3319595
## 5  -0.06153740 -0.05421987 -0.11575727 -2.1562598
## 6   0.23454277 -0.19975350  0.03478927 -3.3584462
## 7   0.37479472 -0.59963071 -0.22483598 -1.4923841
## 8  -0.14365318  0.87849446  0.73484129 -0.3081007
## 9   0.43586000 -0.31622884  0.11963116 -2.1233419
## 10  0.02582448  0.43899923  0.46482371 -0.7660971
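
This sequential evaluation is the main difference from base transform(). The sketch below contrasts the two on a tiny data frame; the column names total and doubled are ours, and the transform() call is wrapped in tryCatch because we expect it to fail when no object named total exists elsewhere.

```r
library(plyr)

d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# mutate() evaluates its arguments one after another, so `total` exists
# by the time `doubled` is computed.
mutate(d, total = x + y, doubled = 2 * total)

# transform() evaluates all arguments against the original columns, so
# `total` is not found (unless an object with that name exists elsewhere).
base_attempt <- tryCatch(
  transform(d, total = x + y, doubled = 2 * total),
  error = function(e) conditionMessage(e)
)
base_attempt
```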

rename() allows you to rename variables in a dataset using a named character vector. The names are the old names while the values are the new names.

d <- data.frame(
  l = sample(letters[1:10], 10, replace = TRUE),
  x = rnorm(10),
  stringsAsFactors = FALSE
)
rename(d, c("l" = "letters", "x" = "random_normal"))
##    letters random_normal
## 1        j     0.3168342
## 2        h     0.6857034
## 3        h     0.4438535
## 4        e     0.1356904
## 5        h     1.6360101
## 6        e     0.5000295
## 7        d    -1.7132048
## 8        g     1.0090552
## 9        d    -1.0668415
## 10       g    -0.8507992

Finally, summarise() creates a new data frame, typically to store summarised information from the original dataset.

d <- data.frame(
  x = rnorm(10),
  y = rnorm(10)
)
summarise(d, xrange = max(x) - min(x), yrange = max(y) - min(y))
##     xrange   yrange
## 1 2.441068 2.569499
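
summarise() is often combined with ddply to produce one summary row per group. The sketch below does this for mtcars grouped by cyl; by_cyl and the summary column names are our own choices.

```r
library(plyr)

# ddply() passes each cyl subset to summarise(), along with the named expressions.
by_cyl <- ddply(mtcars, "cyl", summarise,
  n = length(disp),
  mean_disp = mean(disp),
  disp_range = max(disp) - min(disp)
)
by_cyl
```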

In this post, we explored some of the functionality provided by the plyr package. The most useful is the set of consistent apply functions described in the first part of the post. It is probably a good idea to gain familiarity with these functions and use them in your code instead of the base functions.
