Scalable plotting with ggplot2 – Part II


Introduction

This blog post is the follow-up to part I on programming with ggplot2. If you have not read the first post of the
series, I strongly recommend doing so before continuing with this second part;
otherwise it might prove difficult to follow.

Having developed a scalable approach to column-wise and data
type-dependent visualization, we will continue to customize our plots. Specifically,
the focus of this post is how we can use a log-transformed x-axis with nice
breakpoints for continuous data.
If you don’t like the idea of a
non-linear scale, don’t stop reading here. The principles developed below
generalize well to other plot customizations that depend on the data itself.

The problem

Recall from part one that we ended
up with the following code to produce graphs for two different data types in
our data frame with four columns.

##  ............................................................................
##  get set up
library(tidyverse)
source("multiplot.R") # http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
dta <- select(diamonds, price, cut, carat, color)


##  ............................................................................
##  define helper functions
current_class <- function() {
  # returns the first class of the current iteration*
  class(dta[[get("g", parent.frame(n = 2))]])[1]
}
# * subset first element since object can have multiple classes

geom_hist_or_bar <- function() {
  color <- "gray"
  if (current_class() %in% c("integer", "numeric")) {
    geom_density(color = color, fill = color)
  } else if (current_class() %in% c("factor", "ordered")) {
    geom_bar(color = color, fill = color)
  }
}

##  ............................................................................
##  create all plots
all_plots <- lapply(names(dta), function(g) {
  ggplot(dta, aes_(as.name(g))) + 
    geom_hist_or_bar()
})
##  ............................................................................
##  compose plots
# use plotlist not ... as input
do.call("multiplot", list(plotlist = all_plots, cols = 2)) 

Our goal is to alter the x-axis from a linear to a log-transformed scale to make
better use of the space in the plot.
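
To get a feel for what we are aiming at, here is a minimal sketch for a single continuous column, simply using scale_x_log10() from ggplot2. The adaptive version developed below will instead pick a suitable scale per column automatically.

# what we are aiming for, sketched for one column only
ggplot(dta, aes(price)) +
  geom_density(color = "gray", fill = "gray") +
  scale_x_log10()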

A first solution

At first glance, the solution to the problem seems easy.
Similarly to the first post of this series,
we can create a new function scale_x_adapt which returns a continuous scale
for numeric data and a discrete scale otherwise. Then, we could pass the trans argument
via ... to scale_x_continuous and integrate it with our current framework.

##  ............................................................................
##  create a scale function 
scale_x_adapt <- function(...) {
  if (current_class() %in% c("integer", "numeric")) {
    scale_x_continuous(...)
  } else {
    scale_x_discrete()
  }
}


##  ............................................................................
##  integrate this function in our lapply framework
all_plots <- lapply(names(dta), function(g) { 
  ggplot(dta, aes_(as.name(g))) + 
    geom_hist_or_bar() +
    scale_x_adapt(trans = "log")
})
all_plots[[1]]

This seems fine, except that the break ticks are not chosen very wisely. There
are various ways to go about that:

  • Resort to functionality from existing packages like trans_breaks (from the
    scales package), annotation_logticks (ggplot2) and others (sketched right
    after this list).
  • Create your own function that returns pretty breaks.
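
For completeness, the first option could look roughly like the following sketch for a single column. It relies on trans_breaks() from the scales package and annotation_logticks() from ggplot2; the exact appearance will differ from the approach we develop below.

# sketch of the first option: let the scales package compute log breaks
library(scales)
ggplot(dta, aes(price)) +
  geom_density(color = "gray", fill = "gray") +
  scale_x_continuous(trans = "log10",
                     breaks = trans_breaks("log10", function(x) 10^x)) +
  annotation_logticks(sides = "b")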

We go for the second option because it is a slightly more general approach and I
was not able to find a solution that pleased me for our specific case.

A second solution

We need to change the way the breaks are created within scale_x_adapt.
To produce appropriate breaks, we need to know the maximum and the minimum of the
data we are dealing with (that is, the column that lapply currently passes over)
and then create a sequence between the minimum and the maximum with some function.
Recall that in part I we used a function current_class that does
something similar to what we want: it gets the class of the current data. Hence,
we can generalize this function to extract any property from our current data (and
give it a more general name).

current_property <- function(f) {
  # returns a property of the current iteration object
  f(dta[[get("g", parent.frame(n = 2))]])
}

# the following two calls are equivalent
current_class() 
current_property(f = class)[1] 

Note the new argument f, which allows us to fetch a wider range of properties from
the current data, not just the class, as current_class did.

This is key
for every customization that depends on the input data, because this function
can now get us virtually any information out of the data we could possibly want.
In our case, we are interested in the minimum and maximum
values for the current batch of data. As a finer detail, also note that
current_class called class and returned only the first element, since objects can
have multiple classes and we were only interested in the first one (otherwise
the %in% comparison inside if() could return more than one logical value).
current_property now returns everything that f returns, since we can always
perform the subsetting outside the function, which makes it more flexible.
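
To make this concrete: outside of the lapply context, current_property(f) simply amounts to applying f to the column of the current iteration. Written out by hand for two (arbitrarily picked) columns, that would be:

# what current_property(class)[1] yields when the current column is carat
class(dta[["carat"]])[1]
# what current_property(min) and current_property(max) yield for price
range(dta[["price"]])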

Next, we need to create a function that, given a range, computes
some nice break values we can pass to the breaks argument of
scale_x_continuous. This task is independent of the rest of the framework we
are developing here. One function that does something close to what we want
is the following.

### .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
### break calculation
calc_log_breaks <- function(min, max,
                            length = 5,
                            correction = 2) {
  
  # calculate sequence of log values
  sequence <- seq(log(min), log(max), length.out = length)
  
  # exponentiate sequence and round depending on range
  round(exp(sequence), digits = -log10(max - min) + correction)
}

Let me break these lines into pieces.

  • The basic idea is to create a sequence of breaks between the minimum and the
    maximum value of the current batch of data using seq.
  • Let us assume we want break points that are equidistant on the log scale.
    Since our plot is going to have a logarithmic x-axis, we create a linear
    sequence between log(min) and log(max) and transform it with exp, so we end
    up with breaks that are equally spaced on the logarithmic scale. It becomes
    evident that the solution presented above is tailored to a log-transformed
    axis; if you choose another transformation, e.g. the square root
    transformation, you need to adapt the function.
  • We want to round the values depending on their absolute value. For example,
    the values for carat (which are in the range of 0.2 to 5) should be rounded
    to one decimal place, whereas the values of price (ranging up to 18,000)
    should be rounded to thousands or tens of thousands. Note that log10(10) = 1,
    log10(100) = 2 and log10(0.1) = -1, which is exactly what we need. In other
    words, we make the rounding dependent on the log of the difference between
    the maximum and the minimum of the input data for each plot (see the quick
    check after this list).
  • A constant correction is added so that the rounding can be adjusted manually
    towards more or fewer digits.
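
As a quick sanity check, we can call calc_log_breaks directly on the ranges of the two continuous columns and inspect the resulting break values:

# break values for the two continuous columns of dta
calc_log_breaks(min = min(dta$carat), max = max(dta$carat))
calc_log_breaks(min = min(dta$price), max = max(dta$price))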

Finally, we can put it all together:

##  ............................................................................
##  define helper functions
current_property <- function(f, n = 2) {
  # returns a property of the current iteration object
  f(dta[[get("g", parent.frame(n = n))]])
}

# returns a density (for numeric columns) or a bar geom, depending on class
geom_hist_or_bar <- function() {
  color <- "gray"
  if (current_property(class)[1] %in% c("integer", "numeric")) {
    geom_density(color = color, fill = color)
  } else {
    geom_bar(color = color, fill = color)
  }
}

### .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
### break calculation

calc_log_breaks <- function(min, max,
                            length = 5,
                            correction = 2) {
  
  # calculate sequence of log values
  sequence <- seq(log(min), log(max), length.out = length)
  
  # exponentiate sequence and round depending on range
  round(exp(sequence), digits = -log10(max - min) + correction)
}

### .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
### returns appropriate scale 

scale_x_adapt <- function(...) {
  if (current_property(class)[1] %in% c("integer", "numeric")) {
    scale_x_continuous(breaks = calc_log_breaks(min = current_property(min), 
                                                max = current_property(max)), ...)
  } else {
    scale_x_discrete()
  }
}


##  ............................................................................
##  generate plots
all_plots <- lapply(names(dta), function(g) { 
  ggplot(dta, aes_(as.name(g))) + 
    geom_hist_or_bar() +
    scale_x_adapt(trans = "log")
})
##  ............................................................................
##  compose the plots 
# use plotlist not ... as input
do.call("multiplot", list(plotlist = all_plots, cols = 2)) 

Conclusion

In this blog post, we wanted to further customize the plots created in the first
post of the series.
We introduced a new function, scale_x_adapt, that returns a
predefined scale for a given data type and can be integrated with our framework
similarly to geom_hist_or_bar. We also created a more general version of
current_class, namely current_property, which takes a function as an argument and
evaluates it on the current data column.
In our example, this is helpful because with current_property(min)
and current_property(max) we can determine the range of the column being
processed and hence construct nice breakpoints with calc_log_breaks, which are
then used in scale_x_adapt. current_property is a key function in the framework
developed here since it can extract virtually any information from the batch of
data we are processing within lapply.
