It is extremely common to have a dataframe containing a bunch of variables, and to do the exact same thing to all of these variables.
For instance, let's say we have a dataframe that holds a bunch of limb bone measurements from different animals, and we want to see if they are related to a categorical predictor variable after controlling for the body mass of the animal.
Plotting all our variables against body mass – the long way
We have created 6 variables that are all correlated with our 3-level categorical variable. They also have an increasing correlation with body mass, which we can see in a plot. Your first inclination might be to set up a plotting space with room for 6 plots, and then type out each plot command, like so.
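The data and plotting commands are not shown here, so the following is only a sketch of what the long way might look like. The simulated values and the column names (`group`, `bodyMass`, `limb1` to `limb6`) are assumptions made for illustration; the only things taken from the text are the object name `myData`, the 3-level categorical variable, and the six variables in columns 3 through 8.

```r
# Hypothetical stand-in data: names and values are assumptions, not the original data.
set.seed(1)
n <- 90
myData <- data.frame(
  group    = factor(rep(c("A", "B", "C"), each = n / 3)),  # 3-level predictor
  bodyMass = runif(n, min = 1, max = 10)                   # covariate
)
# Columns 3 through 8: six measurements that depend on group, and on
# body mass with a slope that grows from one variable to the next.
for (i in 1:6) {
  myData[[paste0("limb", i)]] <-
    as.numeric(myData$group) + 0.2 * i * myData$bodyMass + rnorm(n)
}

# The long way: a 2 x 3 plotting space and one plot() call per variable.
par(mfrow = c(2, 3))
plot(myData$bodyMass, myData$limb1, xlab = "body mass", ylab = "limb1", col = myData$group)
plot(myData$bodyMass, myData$limb2, xlab = "body mass", ylab = "limb2", col = myData$group)
plot(myData$bodyMass, myData$limb3, xlab = "body mass", ylab = "limb3", col = myData$group)
plot(myData$bodyMass, myData$limb4, xlab = "body mass", ylab = "limb4", col = myData$group)
plot(myData$bodyMass, myData$limb5, xlab = "body mass", ylab = "limb5", col = myData$group)
plot(myData$bodyMass, myData$limb6, xlab = "body mass", ylab = "limb6", col = myData$group)
```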
That’s not too bad with just 6 variables, but it would be annoying with 30. And what if we want to change something about the way we are doing the plot? We would have to change every one of the plotting commands, which I am way too lazy to do. Only the variable name changes each time; everything else is exactly the same. We have a clear case here for replacing our 6 plot commands with a single use of `lapply()`. Note: there are reasons (many of them stylistic) to avoid explicit `for()` loops in R. Here is a good introduction to using the apply family of R functions.
lapply() goes through an object, applies a function to each piece, and returns a list of the same length as the original object. In short, there is an implicit for loop that gets written for you. You can use lapply() to iterate over anything: a list, a dataframe (which is just a special type of list), a vector of numbers, a vector of characters, whatever. In our case, the variables of interest are stored in columns 3 through 8 of our data frame, so we can use lapply() to go through the numbers 3 through 8 and do the same thing each time.

The hardest part of using lapply() is writing the function that is to be applied to each piece. We need to write our own function for lapply() to use; in this case, we’ll call it myPlot(). This function takes an index number (corresponding to the number of the column in the dataframe), plots the corresponding column in myData against the body mass column in myData, and then applies the appropriate labels to the axes. Once we have written the function, we can apply it to all of our columns with a single lapply() line of code.
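As a minimal sketch of that idea, here is one way myPlot() and the lapply() call could be written. The data are simulated so the block runs on its own, and the column names (`group`, `bodyMass`, `limb1` to `limb6`) are assumptions; only `myData`, `myPlot()`, and the use of columns 3 through 8 come from the text.

```r
# Hypothetical stand-in data: names and values are assumptions.
set.seed(1)
n <- 90
myData <- data.frame(
  group    = factor(rep(c("A", "B", "C"), each = n / 3)),
  bodyMass = runif(n, min = 1, max = 10)
)
for (i in 1:6) {
  myData[[paste0("limb", i)]] <-
    as.numeric(myData$group) + 0.2 * i * myData$bodyMass + rnorm(n)
}

# Plot one column of myData (picked by its column number) against body mass,
# labelling the y axis with that column's name.
myPlot <- function(index) {
  plot(myData$bodyMass, myData[, index],
       xlab = "body mass",
       ylab = names(myData)[index],
       col  = myData$group)
}

# One line replaces the six separate plot() calls.
par(mfrow = c(2, 3))
lapply(3:8, myPlot)
```

Because myPlot() looks up myData in the enclosing environment, changing the plot style now means editing one function rather than six commands.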
The graphs are identical, and we did it with much less code! Notice that, besides the plots, the output is a bunch of NULL values. This isn’t a mistake: our function myPlot() only plots to the screen; it doesn’t return any values. In this case all we wanted was the side effect of making the plots, but other times we want to return values.
Getting something useful from the return value of lapply()
What if we weren’t interested in the plots, but we wanted to do an ANCOVA on each variable and summarize the results in a readable format? Well, we can do that with lapply() as well. First, we need to create the formulae that describe the ANCOVA model for each variable, then we will use lapply() to loop over each one. And wouldn’t you know, when we look at the effect of body mass in this ANCOVA, it increases in just the way we modeled it! Note that with this method the names of the variables don’t get preserved, but the results are in the order in which they were called. So we could save the results of the lapply() instead of printing them to the screen, and then assign them names using the names() function.
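The steps just described could be sketched roughly as follows. This is an illustration under assumptions, not the original analysis: the data are simulated, the column names and the helper names (`responses`, `myFormulae`, `ancovaResults`) are hypothetical, and the ANCOVA is fitted here as `lm()` followed by `anova()` with body mass entered before the grouping variable.

```r
# Hypothetical stand-in data: names and values are assumptions.
set.seed(1)
n <- 90
myData <- data.frame(
  group    = factor(rep(c("A", "B", "C"), each = n / 3)),
  bodyMass = runif(n, min = 1, max = 10)
)
for (i in 1:6) {
  myData[[paste0("limb", i)]] <-
    as.numeric(myData$group) + 0.2 * i * myData$bodyMass + rnorm(n)
}

# Step 1: build one model formula per response variable (columns 3 to 8).
responses  <- names(myData)[3:8]
myFormulae <- lapply(responses, function(v) {
  as.formula(paste(v, "~ bodyMass + group"))
})

# Step 2: fit each model and keep its ANOVA table instead of printing it.
ancovaResults <- lapply(myFormulae, function(f) {
  anova(lm(f, data = myData))
})

# Step 3: lapply() returns an unnamed list here, so attach the names by hand.
names(ancovaResults) <- responses

# Pull out one number per model, e.g. the F statistic for body mass.
sapply(ancovaResults, function(a) a["bodyMass", "F value"])
```

Saving the list first and naming it afterwards keeps each table attached to its variable, which matters once there are more than a handful of models to read through.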