# Scalable plotting with ggplot2 – Part I

[This article was first published on **Rsome - A blog about some R stuff**, and kindly contributed to R-bloggers].


## Introduction

This series discusses how we can use ggplot2 to produce plots for each column of a data frame that depend on characteristics of this column (e.g. the class of a column) in a scalable manner.

To this end, we integrate the following concepts / functions:

- the `ggplot2` package
- `lapply`
- anonymous functions
- non-standard evaluation
- lexical scoping

The reader should be familiar with these concepts. If not, Hadley Wickham’s Advanced R might be a good starting point to read up on all but the first topic.

The approach discussed here generalizes to other situations in which one wants to customize plots based on the characteristics of input data.

## The problem

For this blog post, we are going to use a subset of the diamonds data set.
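The post does not say how the subset was drawn, so the following is only one possible setup. It assumes the first 1000 rows and the four columns named later in the text (`cut`, `color`, `price`, `carat`); the object name `dta` is an assumption as well.

```r
library(ggplot2)

# Assumed subset: first 1000 rows, two factor and two numeric columns.
# The original post may have sampled rows or kept other columns.
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

# First class of every column, to see that they differ
vapply(dta, function(x) class(x)[1], character(1))
```

In `diamonds`, `cut` and `color` are ordered factors, `price` is stored as an integer and `carat` as a double.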

Now, imagine you want a visual summary for each variable. Unfortunately, the variables are not all of the same class. If they were, you could transform the data into long format and use facets.

For the factors, you could draw a bar chart; for the numerical variables, you might want to use a density plot. Let’s have a look at a first approach. You could do the following for `cut` and `color`:
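A minimal sketch of this first approach, with one hand-written call per factor column. The subset `dta` is an assumption (first 1000 rows of `diamonds`), as are the plot object names.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

# One bar chart per factor column, written out by hand
p_cut   <- ggplot(dta, aes(cut)) + geom_bar()
p_color <- ggplot(dta, aes(color)) + geom_bar()

print(p_cut)
print(p_color)
```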

Similarly, you can do the following for `price` and `carat`:
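The numeric columns could be handled with the same pattern, swapping the geom. Again, `dta` is an assumed subset and the object names are illustrative.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

# One density plot per numeric column, written out by hand
p_price <- ggplot(dta, aes(price)) + geom_density()
p_carat <- ggplot(dta, aes(carat)) + geom_density()

print(p_price)
print(p_carat)
```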

Now, you can note two issues:

- There is a lot of code duplication. For each plot, you need another line of code that is almost identical to the ones you have already. This is not scalable to data sets with many columns. This problem will be addressed in this post.
- You might want to further customize your plots. For example, changing the x-axis of the density plots from linear to logarithmic might be desirable to make better use of space. This problem will be addressed in the second part of the series.

## A solution

To address the first problem, we can create a function that behaves differently depending on whether the input is a factor or numeric.
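Such a function might look like the sketch below. The name `geom_bar_or_density` is hypothetical, and the sketch assumes that `current_class()` returns the class vector of the column currently being plotted, so the helper can only be called once `current_class()` exists.

```r
library(ggplot2)

# Hypothetical helper: picks a geom based on the class of the current column.
# Relies on current_class(), which is assumed to return that class vector.
geom_bar_or_density <- function(...) {
  cls <- current_class()
  if (any(c("factor", "ordered") %in% cls)) {
    geom_bar(...)       # categorical column: counts per level
  } else if (any(c("numeric", "integer") %in% cls)) {
    geom_density(...)   # continuous column: kernel density estimate
  } else {
    stop("No plot type defined for class: ", paste(cls, collapse = "/"))
  }
}
```

Any arguments are passed through `...` to the chosen geom, so styling options such as `fill` can still be set at the call site.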

`current_class` is a function that magically gets the class of the variable that you used in `aes` of `ggplot`. It will be explained at a later stage.

Having defined that function, you could rewrite the above as follows:

This is a slight improvement on the first solution because you always call the same functions for all plots. Hence, we can use an `apply`-style approach to reduce the redundancy. You might think of the following:
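A sketch of that naive attempt, iterating over the data frame itself. To keep it runnable on its own, the class-dependent geom is left out and only the base plot is built; `dta` is an assumed subset. The second part inspects what the anonymous function actually receives as `g`.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

# Naive attempt: iterate over the columns themselves.
# The class-dependent geom would be added with `+` inside the function.
attempt <- lapply(dta, function(g) ggplot(dta, aes(g)))

# What does g hold in each iteration? A full column of values, no name.
received <- lapply(dta, function(g) {
  list(is_full_column = length(g) == nrow(dta), class = class(g)[1])
})
vapply(received, function(x) x$is_full_column, logical(1))
```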

Unfortunately, this does not quite work because for each iteration in `lapply`, `g` will be the actual values from each column, but in `aes`, you need the name of the column, not the actual values. There is no way to get from the values to the names, but if we have the names, we can get the values. So the trick is to loop over the names of the data frame.
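The following check illustrates why iterating over names loses nothing: from a name, both the values and their class can be recovered. `dta` is again an assumed subset, and `info` is just an illustrative name.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

# Iterating over the names keeps name, values and class all within reach
info <- lapply(names(dta), function(g) {
  list(name = g, class = class(dta[[g]]), n_values = length(dta[[g]]))
})

info[[1]]$name   # "cut"
```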

However, we are not quite there yet. Due to non-standard evaluation, we need to further change two things:

- use `aes_` instead of `aes`, so `g` is not actually g, but points to something else.
- use `as.name(g)` instead of `g`, because `g` is just the name of an object (i.e. “cut” for the first iteration), not the object itself.
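For a single column, the two changes can be sketched as follows, with `geom_bar()` standing in for the class-dependent geom and `dta` being an assumed subset. Note that recent ggplot2 documentation marks `aes_()` as deprecated in favour of tidy evaluation, so an equivalent form using the `.data` pronoun is shown as an alternative that is not part of the original post.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

g <- "cut"  # what the anonymous function receives in the first iteration

# as.name() turns the string "cut" into the symbol cut;
# aes_() evaluates its argument first, so the mapping becomes x = cut
p <- ggplot(dta, aes_(as.name(g))) + geom_bar()

# Alternative from newer ggplot2 versions (not the post's approach)
p_new <- ggplot(dta, aes(.data[[g]])) + geom_bar()
```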

The only explanation I still owe you is how the function `current_class()` works. It only works because it is called from within `lapply`. More precisely, the function that *calls* `current_class` is itself called by the anonymous function that `lapply` runs for each column. For a given iteration, the value of `g` is available in the environment of that anonymous function. `current_class` simply needs to go *up the tree* of calling frames until it reaches that environment, get the value of `g`, and figure out its class. That is done as follows.
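One way to write this is sketched below; it is not necessarily the author's exact implementation. It assumes that the data frame is called `dta`, that the loop variable is called `g`, and that `current_class()` is called exactly one function level below the anonymous function. `show_class` is a hypothetical stand-in for the geom helper.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

current_class <- function() {
  # parent.frame(1) is the frame of the helper that called current_class();
  # parent.frame(2) is the frame of the anonymous function, where g lives
  caller_env <- parent.frame(2)
  g <- get("g", envir = caller_env)
  class(dta[[g]])   # dta is found through lexical scoping
}

# Hypothetical stand-in for the geom helper, one level above current_class()
show_class <- function() current_class()

classes <- lapply(names(dta), function(g) show_class())
names(classes) <- names(dta)
classes
```

The hard-coded `2` ties the function to this exact call depth, so calling `current_class()` from a different nesting level would look for `g` in the wrong frame.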

Now, we are done. This is all code we need to get our solution.

Finally, we can plot the result.
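Putting the pieces together, a consolidated sketch of the approach could look like this, ending with the plotting step. The subset `dta`, the helper name `geom_bar_or_density` and the body of `current_class()` are assumptions rather than the author's verbatim code.

```r
library(ggplot2)

# Assumed subset of diamonds, not necessarily the one used in the post
dta <- as.data.frame(diamonds[1:1000, c("cut", "color", "price", "carat")])

# Looks up g two frames up (the anonymous function) and returns the column class
current_class <- function() {
  caller_env <- parent.frame(2)
  g <- get("g", envir = caller_env)
  class(dta[[g]])
}

# Hypothetical helper: bar chart for factors, density for numeric columns
geom_bar_or_density <- function(...) {
  cls <- current_class()
  if (any(c("factor", "ordered") %in% cls)) {
    geom_bar(...)
  } else if (any(c("numeric", "integer") %in% cls)) {
    geom_density(...)
  } else {
    stop("No plot type defined for class: ", paste(cls, collapse = "/"))
  }
}

# Loop over column names; the same call works for every column
plots <- lapply(names(dta), function(g) {
  ggplot(dta, aes_(as.name(g))) + geom_bar_or_density()
})
names(plots) <- names(dta)

# Draw the plots one after another
invisible(lapply(plots, print))
```

Adding a column to `dta` requires no change to this code as long as its class is one the helper knows about.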

## Conclusion

In this blog post, a few advanced concepts from the R toolbox were integrated in order to create column-wise visual data summaries. To this end, we created a set of functions which can be used to generate plots for different data types (numeric and factor). This set of functions can be used in conjunction with `lapply` to create summary plots, which would not be possible if different functions had to be called for the different data types. The solution presented above is scalable to data sets with an arbitrary number of columns without altering the code.

## Outlook

We will expand on this by customizing the appearance of the plots further. Namely, the second part of the series centers on using a log-transformed x-scale for continuous data and how to generate appropriate breaks, but the principles we will develop there can be generalized well to other customization needs.
