R tutorial on the Apply family of functions

Posted on July 28, 2015 by DataCamp in R bloggers | 0 Comments

[This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers].


Introduction

In our previous tutorial, Loops in R: Usage and Alternatives, we discussed one of the most important constructs in programming: the loop. We ended up recommending vectorized functions over explicit loops in R. In this post we highlight some of the most used of these functions: the apply functions.

In the present post we show the use of apply, its variants, and a few of its relatives, applied to different data structures. We will not exhaust all the variants (googling might be of help here) but when possible, we will illustrate the use of these functions in cooperation via a couple of slightly more beefy examples. Hope you will enjoy the read!

P.S. You may find it useful to look at this introduction to R tutorial to better understand lists, vectors, arrays and dataframes (although this is not necessary to follow this post). A forthcoming post will specifically deal with examples of data structures in R.

Apply: what are these functions in R?

The apply family pertains to the R base package, and is populated with functions to manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs. They act on an input list, matrix or array, and apply a named function with one or several optional arguments. The called function could be:

an aggregating function, like for example the mean, or the sum (that return a number or scalar);
other transforming or sub-setting functions;
and other vectorized functions, which return more complex structures like list, vectors, matrices and arrays.

The apply functions form the basis of more complex combinations and help to perform operations with very few lines of code. The family comprises: apply, lapply, sapply, vapply, mapply, rapply, and tapply.

But how and when should we use these? This depends on the structure of the data we wish to operate on, and the format of the output we need.

Using apply in R

We start with the godfather of the family, apply, which operates on arrays (for simplicity we limit here to 2D arrays, aka, matrices). The R base manual tells us that it’s called as follows: apply(X, MARGIN, FUN, ...)

where:

X is an array (a matrix if the dimension of the array is 2);
MARGIN is a variable defining how the function is applied: when MARGIN=1, it applies over rows, whereas with MARGIN=2, it works over columns. Note that with the construct MARGIN=c(1,2) the function is applied over both rows and columns, that is, to each individual element of the matrix;
FUN is the function we want to apply and can be any R function, including a User Defined Function (more on functions in a separate post).
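To make the role of MARGIN and of the extra arguments concrete, here is a minimal sketch on a small matrix with known values. The names M, M_na and the result variables are illustrative and do not appear elsewhere in this post.

```r
# A small matrix with known values, so the results are easy to check by hand:
# first row is 1 3 5, second row is 2 4 6
M <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(M, 1, sum)   # one value per row: 9 12
col_sums <- apply(M, 2, sum)   # one value per column: 3 7 11

# MARGIN = c(1, 2) calls FUN on every single cell and keeps the matrix shape
squared <- apply(M, c(1, 2), function(v) v^2)

# Arguments placed after FUN are passed on to FUN (here na.rm for mean)
M_na <- M
M_na[1, 1] <- NA
col_means <- apply(M_na, 2, mean, na.rm = TRUE)   # 2.0 3.5 5.5
```

The last call shows what the ... in the signature is for: anything that is not X, MARGIN or FUN is handed to FUN at each call.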

Now, beginners may have difficulties in visualizing what is actually happening, so a few pictures will help figuring it out. Let’s construct a 5 x 6 matrix and imagine we want to sum the values of each column: we can write something like

X<-matrix(rnorm(30), nrow=5, ncol=6)
apply(X,2 ,sum)

Remember that in R a matrix can be seen as a collection of line vectors, when crossing the matrix from top to bottom (along the vertical direction, which specifies the dimension or margin 1); or as a list of column vectors, spanning the matrix left to right (along the dimension or margin 2).

So the instruction we entered, depicted in figure 1, translates into: apply the function ‘sum’ to the matrix X along margin 2, thus by column, summing up the values of each column (To avoid cluttering the picture, we just highlighted one of the columns, the third). We end up with a line vector containing the sums of the values of each column.

Figure 1

Note that we would also have obtained a plain vector, displayed on a single line, had we summed along the rows of the matrix (MARGIN=1), only with 5 elements instead of 6. This is just how R displays the result.

In most cases R can return a value even if the latter has not been specified, or more precisely the return value of the function has not been assigned to a variable. R simply returns the last object evaluated. In practice however (see a more advanced example), when we wish to check the return value and, more importantly, when we need to do further operations on those return values, it is best to assign the results of a given function to a variable explicitly.

Using lapply in R

We wish to apply a given function to every element of a list and obtain a list as the result. Typing ?lapply shows that its syntax resembles that of apply. The differences are that:

It can be used for other objects like dataframes, lists or vectors.
The output returned is a list (thus the l in the function name) which has the same number of elements as the object passed to it.
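As a quick illustration of these two points, the sketch below runs lapply over a plain vector and over a data frame. The objects squares, df and col_ranges are made up for this example.

```r
# lapply over a plain vector: the result is a list with one element per input value
squares <- lapply(1:3, function(i) i^2)
length(squares)   # 3

# lapply over a data frame visits its columns; column names are carried over
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
col_ranges <- lapply(df, range)
col_ranges$b      # 10 30
```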

To see how this works, let’s create a few matrices and extract from each a given column. This is a quite common operation performed on real data when making comparisons or aggregations from different dataframes.

Figure 2

Our toy example, depicted in figure 2 can be coded as:

#create a list of matrices:
A<-matrix(1:9, 3,3)
B<-matrix(4:15, 4,3)
C<-matrix(8:10, 3,2)
MyList<-list(A,B,C) # collect the three matrices in a list

# extract the second column from the list of matrices, using the selection operator "["
lapply(MyList,"[", , 2)

## [[1]]
## [1] 4 5 6
## 
## [[2]]
## [1]  8  9 10 11
## 
## [[3]]
## [1]  8  9 10

# Another example: we now  extract the first row from the list of matrices, using the selection operator "["
lapply(MyList,"[", 1, )

## [[1]]
## [1] 1 4 7
## 
## [[2]]
## [1]  4  8 12
## 
## [[3]]
## [1] 8 8

The first call is shown in the left part of figure 2. Again, we start by specifying the object of interest, now the list MyList, and we use the standard R selection operator [; then we omit the first parameter (which therefore translates into “any”, that’s why you see the two commas); then we specify the second parameter, which is 2: our margin is ‘column’. So we extract the second column from all the matrices within the list. A few notes:

The [ notation is the select operator. Recall, for example, that to extract all the elements of the third line of B we write B[3,] (the nothing after the comma means “any”)
The [[ ]] notation expresses the fact that we are dealing with lists: [[2]] means the second element of the list. This is also shown in the output given by R
The output is a list with as many elements as the input has
Note that we could also have extracted a single element from each matrix, like this:

lapply(MyList,"[", 1, 2)

## [[1]]
## [1] 4
## 
## [[2]]
## [1] 8
## 
## [[3]]
## [1] 8

In the right hand side of figure 2, we show the alternative extraction: this time we omit the second parameter and get the first row from each of the matrices (try it yourself, for instance selecting the second row from each matrix in the list).

Using sapply in R

sapply works as lapply, but it tries to simplify the output to the most elementary data structure that is possible. In effect, as can be seen in the base manual, sapply is a ‘wrapper’ function for lapply.

An example may help. Say we want to repeat the extraction operation of a single element as in the last example, now taking the first element of the second row (indexes 2 and 1) for each matrix. As we know, lapply would give us a list

lapply(MyList,"[", 2,1 )

## [[1]]
## [1] 2
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 9

but sapply returns a vector instead.

sapply(MyList,"[", 2,1 )

## [1] 2 5 9

unless we tell simplify=FALSE as parameter to sapply, in which case a list will be returned:

sapply(MyList,"[", 2,1, simplify=FALSE)

## [[1]]
## [1] 2
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 9

Conversely, a function like unlist can turn the list returned by lapply into a vector:

unlist(lapply(MyList,"[", 2,1 ))

## [1] 2 5 9

Anyway, to avoid confusion, it is best to use these functions in their ‘native format’ and avoid conversions unless strictly necessary.
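Because the shape of what sapply returns depends on the results themselves, it can help to see both cases side by side, together with vapply, the family member that lets us declare the expected result up front. This is a sketch that rebuilds MyList so it can be run on its own; the names dims, picked and n_rows are illustrative.

```r
A <- matrix(1:9, 3, 3)
B <- matrix(4:15, 4, 3)
C <- matrix(8:10, 3, 2)
MyList <- list(A, B, C)

# When every result has the same length (> 1), sapply simplifies to a matrix:
# one column per list element, here holding (rows, columns) of each matrix
dims <- sapply(MyList, dim)

# vapply takes a template (FUN.VALUE) as third argument: here "exactly one
# integer per element"; it stops with an error if a result does not fit
picked <- vapply(MyList, "[", integer(1), 2, 1)   # 2 5 9
n_rows <- vapply(MyList, nrow, integer(1))        # 3 4 3
```

Stating the template makes the return type predictable, which is convenient inside functions or scripts where a surprise list or matrix would break later steps.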

The rep function

We present this amidst the others as it is often used in conjunction with apply functions.
Given a vector or a factor x, the function replicates its values a specified number of times.
Let’s reuse the list MyList we worked on above with lapply; this time, though, we only select the element in the first line and first column of each element of MyList (and we use sapply to get a vector):

Z=sapply(MyList,"[", 1,1 )

Now replicate their values a number of times as established by c(3,1,2): three times the first, one time the second and two times the third:

Z=rep(Z,c(3,1,2))
Z

## [1] 1 1 1 4 8 8

Handy, no?
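The same function has a few other commonly used arguments. A short sketch, with made-up variable names:

```r
whole  <- rep(c("a", "b"), times = 2)   # repeat the whole vector: "a" "b" "a" "b"
each   <- rep(c("a", "b"), each = 2)    # repeat each value in turn: "a" "a" "b" "b"
capped <- rep(1:3, length.out = 5)      # recycle up to a given length: 1 2 3 1 2
```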

Using mapply in R

mapply stands for ‘multivariate’ apply. Its purpose is to be able to vectorize arguments to a function that is not usually accepting vectors as arguments. In short, mapply applies a Function to Multiple List or multiple Vector Arguments.

For example, we may create a 4 x 4 matrix using a call to the rep function repeatedly, four times with:

Q=matrix(c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),4,4)
Q

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    1    2    3    4
## [3,]    1    2    3    4
## [4,]    1    2    3    4

where we join the results of the rep function with c (for “combine”) into a single vector, and ask for a 4 x 4 matrix.
We could have done, more concisely:

Q=mapply(rep,1:4,4)
Q

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    1    2    3    4
## [3,]    1    2    3    4
## [4,]    1    2    3    4

where the mapply has been called to vectorize the action of the function rep.
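In the call above only the first argument really varies. The sketch below, with illustrative names, shows mapply pairing up two vectors element by element, and the MoreArgs argument for values that should stay fixed across calls.

```r
# Calls rep(1, 3), rep(2, 2), rep(3, 1): arguments are paired up by position.
# The results have different lengths, so mapply returns a list, not a matrix
reps <- mapply(rep, 1:3, 3:1)

# MoreArgs holds arguments that are NOT vectorized over
labels <- mapply(function(x, y, sep) paste(x, y, sep = sep),
                 c("a", "b"), c("X", "Y"),
                 MoreArgs = list(sep = "-"),
                 USE.NAMES = FALSE)
labels   # "a-X" "b-Y"
```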

Related functions

Similarly structured functions are occasionally used in conjunction with the members of the apply family: we cite only a few of these.

The sweep function in R

sweep is probably the closest to the apply family. Its use is suggested whenever we wish to replicate different actions on the MARGIN elements we have chosen (limiting here to the matrix case). A typical scenario occurs in clustering, where you may need to repeatedly produce normalized and centered data (“standardised” data). What does this mean?
Assume you have a number of data points in a group of data. You first find the center of the data (center of mass) and look at how dispersed the data are with respect to this center. Two basic quantities will give you this information: the mean and the standard deviation.

Say your data points are the column vectors in a matrix of your data and let’s use the matrix B created at the start of this post.

MyPoints=B  # just give an illustrative name to our matrix B
MyPoints

##      [,1] [,2] [,3]
## [1,]    4    8   12
## [2,]    5    9   13
## [3,]    6   10   14
## [4,]    7   11   15

You first find the means per column using one of the apply functions:

MyPoints_means=apply(MyPoints,2,mean)
MyPoints_means

## [1]  5.5  9.5 13.5

Similarly, you find their dispersion (the standard deviation) using another call to apply:

MyPoints_sdev=apply(MyPoints,2,sd)
MyPoints_sdev

## [1] 1.290994 1.290994 1.290994

Then you shift all the points with respect to their center, i.e. the mean you found above (it’s like changing your system of reference); and then normalize with respect to their standard deviation (you do this when you wish to make comparisons among data represented in different scales). We’ll do this in two steps to illustrate the use of the function.
First, let’s produce the centered points with one call to sweep:

MyPoints_Trans1=sweep(MyPoints,2,MyPoints_means,"-")
MyPoints_Trans1

##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5

where we named the result as MyPoints_Trans1, (Trans for translated, as we are moving these points).
Let’s analyze this line of code.
sweep expects an input array (thus a matrix, as in our case, is ok); a MARGIN, for us 2=columns; a summary statistic (here we use the vector of column means, MyPoints_means); and a function to be applied (we use the arithmetic operator -, subtraction). This means: take the elements of the columns of the dataset MyPoints, and subtract the corresponding mean, taken from MyPoints_means, from each of them.

Now, step 2, we call sweep again, to divide all the values just found by their own standard deviation (this step is called normalization):

MyPoints_Trans2=sweep(MyPoints_Trans1,2,MyPoints_sdev,"/")
MyPoints_Trans2

##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

Again, we select MARGIN=2, i.e. columns; then provide the vector of the standard deviations, MyPoints_sdev, as an operand; and then tell sweep to use the ‘divide by’ operator, /.
Thus, we are asking R: take the elements of the columns of the new object you just created, MyPoints_Trans1, and divide these (/) by their standard deviation MyPoints_sdev.
We could have obtained the same result more rapidly and concisely (as often is the case in R!) and without using different names, all this in a single line of code by a nested call to sweep:

MyPoints_Trans=sweep(sweep(MyPoints,2,MyPoints_means,"-"),2,MyPoints_sdev,"/")
MyPoints_Trans

##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

[Just compare MyPoints_Trans and MyPoints_Trans2].

You are basically referring your points to this mean, the center of mass, rather than to the previous reference system based on the origin of your coordinates, and normalizing them by their standard deviation.
Statistically, you have just standardized your data (computed what are often called z-scores). Standardized data are at the base of several more advanced procedures (like computing correlation matrices, dimensionality reduction via PCA, signal analysis and others).
Would you like to test yourself and reproduce this example using a nested for structure?
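One possible answer is sketched below; it is only one way of writing the loops. It rebuilds the matrix B so it runs on its own, and compares the loop result with the nested sweep and with the base function scale, which by default centers each column on its mean and divides by its standard deviation. The names Z_loop, Z_sweep and Z_scale are illustrative.

```r
B <- matrix(4:15, 4, 3)

# Explicit loops: standardise column by column, cell by cell
Z_loop <- matrix(NA_real_, nrow = nrow(B), ncol = ncol(B))
for (j in seq_len(ncol(B))) {
  col_mean <- mean(B[, j])
  col_sd   <- sd(B[, j])
  for (i in seq_len(nrow(B))) {
    Z_loop[i, j] <- (B[i, j] - col_mean) / col_sd
  }
}

# The nested sweep used in the text
Z_sweep <- sweep(sweep(B, 2, apply(B, 2, mean), "-"), 2, apply(B, 2, sd), "/")

# scale() stores the centers and scales as attributes, so ignore those when comparing
Z_scale <- scale(B)

same_as_sweep <- isTRUE(all.equal(Z_loop, Z_sweep))
same_as_scale <- isTRUE(all.equal(Z_loop, Z_scale, check.attributes = FALSE))
```

The loop version needs an index for each dimension and a pre-allocated result, which is exactly the bookkeeping that sweep and apply take off our hands.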

The Aggregate function in R

This function is contained in the stats package, which ships with R and is loaded by default. Because of its importance, we lift the ban on using functions outside the base package in an introductory blog like this. From the manual we get the usage as:

aggregate(x, by, FUN, ..., simplify = TRUE)

In other words, it works similarly to the apply (specify the object, the function and tell whether we wish to simplify as in the sapply). The key difference is the use of the ‘by’ clause, which sets the variable or dataframe field by which we want to perform the aggregation. We see in the next subsection how this works.

Another example

Consider the toy dataset called Mydf, which contains data about the sales of a product, and where some of the values of the DepPC column repeat; it is a variable classifying the data by geographical location, like the portion of a post code (here we took the numbers corresponding to the departments of the Île-de-France, the region comprising Paris).

We want to do some stats on the sales columns. These are DProgr, a progressive number in increasing time order, and the sales of the product (the quantity Qty), plus a logical variable, Delivered, telling us whether the product has been delivered (T) or not (F).

First, we can do a number of very simple things to get acquainted with the data set, other than showing it all, by just typing its name (here we only have 120 records, but imagine doing this for a real file with thousands of lines!).

So first, let’s create this dataset (don’t worry about the details of how this is done, we will see this in a dedicated post on data structures):

Mydf <- data.frame(DepPC=c("90","91","92","93","94","75"), DProgr=1:120,
                   Qty=c(7:31,9:23,99:124,2:28,14:19,21:29,4,3,1:9,66),
                   Delivered=rnorm(120)>0,  # random: your values will differ
                   stringsAsFactors=TRUE)   # factor columns, as in R < 4.0.0

So looking at the first 15 records or the last 5 (a bit of ‘data exploration’):

head(Mydf,15) # show first 15 records...

##    DepPC DProgr Qty Delivered
## 1     90      1   7     FALSE
## 2     91      2   8     FALSE
## 3     92      3   9     FALSE
## 4     93      4  10      TRUE
## 5     94      5  11      TRUE
## 6     75      6  12     FALSE
## 7     90      7  13      TRUE
## 8     91      8  14     FALSE
## 9     92      9  15     FALSE
## 10    93     10  16     FALSE
## 11    94     11  17     FALSE
## 12    75     12  18     FALSE
## 13    90     13  19      TRUE
## 14    91     14  20      TRUE
## 15    92     15  21     FALSE

tail(Mydf,5) # ...or last 5

##     DepPC DProgr Qty Delivered
## 116    91    116   6     FALSE
## 117    92    117   7      TRUE
## 118    93    118   8      TRUE
## 119    94    119   9      TRUE
## 120    75    120  66     FALSE

Looking at the type of variables the dataset is made of (a rather common use of sapply!):

sapply(Mydf, class)    # show data types for each column using sapply

##     DepPC    DProgr       Qty Delivered 
##  "factor" "integer" "numeric" "logical"

Seeing how many data points (“records”):

dim( Mydf) # how many rows (records) and columns

## [1] 120   4

nrow(Mydf); ncol(Mydf) # the same, but separately

## [1] 120

## [1] 4

Listing all the departments:

unique(Mydf$DepPC) # how many departments are there?

## [1] 90 91 92 93 94 75
## Levels: 75 90 91 92 93 94

Many other enquiries on the data are possible. Here we are interested in knowing where the product sells best, i.e. in which department. Therefore we regroup the data by department, summing up the sales, Qty, for each department (DepPC):

aggregate(Mydf$Qty,by=Mydf["DepPC"],FUN=sum)

##   DepPC   x
## 1    75 878
## 2    90 689
## 3    91 684
## 4    92 701
## 5    93 707
## 6    94 802

So, aggregate tells R that (1) we wish to sum (FUN=sum) (2) over all the Qty (first parameter is Mydf$Qty, i.e. the field Qty of the dataframe Mydf) (3) that belong to the same department (the ‘by’ clause: by=Mydf["DepPC"]). Note that R named the column holding the sums ‘x’, because we didn’t tell it otherwise [more generally, we did not assign the result to a variable of our choice, as noted above].
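If the generic column name is a nuisance, aggregate also has a formula interface that keeps the original name, and tapply (one of the family members listed at the start) performs the same grouping but returns a named vector. The sketch below uses a tiny made-up data frame called sales rather than Mydf, so that the totals can be checked by hand.

```r
# Hypothetical mini dataset: three departments, six sales records
sales <- data.frame(DepPC = c("75", "91", "75", "92", "91", "75"),
                    Qty   = c(10, 5, 20, 7, 3, 1))

# Formula interface: "Qty grouped by DepPC"; the result column is named Qty, not x
by_dep <- aggregate(Qty ~ DepPC, data = sales, FUN = sum)
# departments 75, 91, 92 get totals 31, 8, 7

# tapply(values, groups, FUN) does the same grouping, returning a named vector
totals <- tapply(sales$Qty, sales$DepPC, sum)
totals["75"]   # 31
```

A data frame (aggregate) is usually more convenient for plotting with ggplot2, while the named vector from tapply is handy for quick lookups.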

The output is quite readable as is, but for a higher number of departments (say for the whole country: in France there are 96 metropolitan departments plus 5 overseas) this might be less readable, so we can resort to some graphical output. Thus we plot the results using one of R’s graphical output systems, (ggplot2, more on this in a later post, here we do a simple unpolished use of this), by incorporating the aggregate function in it:

library(ggplot2)
ggplot(data=aggregate(Mydf$Qty,by=Mydf["DepPC"],FUN=sum), aes(x=DepPC, y=x)) +
  geom_point()+
  ggtitle("Sales per department - All")

This gives us the sales for each department.

We might ask the same question, but only for the goods that were delivered. To do this, we first subset the data for which Delivered is true (T) using the now familiar subsetting operator “[”. Note that here we assign the result to a new variable Y, which is a new dataframe that inherits the same column names from the parent dataframe Mydf. We do this to keep the subsetting out of the call to the plotting function, for readability:

# select only for delivered=True
Y<-Mydf[Mydf$Delivered==TRUE,]

So we can repeat the plot:

ggplot(data=aggregate(Y$Qty,by=Y["DepPC"],FUN=sum), aes(x=DepPC, y=x)) +
  geom_point()+
  ggtitle("Sales per department - Delivered")

In the same way we could pose many other questions to the data with aggregate, without writing explicit loops, often in conjunction with a handy plotting system like ggplot2. Note that to get all this we only needed very few lines of code.

Summary and conclusions

We have seen some variations on the same theme, which is acting on a structured set of data in a repetitive way. In this sense, these functions can be seen as (i) an alternative to loops and (ii) a vectorized form of doing things. Vectorized here in the loose sense: we won’t enter the debate on whether, and which of, the apply functions are truly vectorized.

In practice, in order to choose which apply function to use, we need to consider:

The data type of the input: this is the object we will act upon (vector, matrix, array…, list, data frame or perhaps a combination of those)
What we intend to do: the FUN function we pass
The subsets of that data : rows, columns, or perhaps all?
What type of data do we want to get from the function, because we might want to perform further operations on it (and do we want a new object, or do we want to transform the input object directly?)

These are quite general questions and we may ask for the help of related functions like aggregate, by, sweep… (there are many more!). Also, as is very common in R, there may be equivalent ways of doing the same things, especially because of the large number of libraries available nowadays. For example, libraries like plyr, with the very useful ddply function, and especially dplyr. (Tip: learn more on dplyr here.)

If you want to learn more on using the functions in the apply family, have a look at DataCamp’s Intermediate R tutorial.

About the author: Carlo Fanara

Carlo Fanara – After a career in IT, Carlo spent 20 years in physics, first gaining an MSc in Nuclear Physics in Turin (Italy), and then a PhD in plasma physics in the United Kingdom. After several years in academia, in 2008 Carlo moved to France to work on R&D projects in private companies and more recently as a freelancer. His interests include technological innovation and programming for Data Mining and Data Science.