
Right now, many people are pursuing data science because they want to learn artificial intelligence and machine learning. And for good reason. Machine learning is white hot right now, and it will probably reshape almost every sector of the world economy.

Having said that, if you want to be a great machine learning expert, and a great data scientist in general, you need to master data visualization too.

This is because data visualization is a critical prerequisite for advanced topics (like machine learning), and also because visualization is very useful for getting things done in its own right.

So let’s talk a little more about data visualization. As you begin learning data visualization in R, you should master the basics: how to use ggplot2, how to think about data visualization, and how to make basic plots (like the bar chart, line chart, histogram, and scatterplot).

And I really mean that you need to master these basics. To be a great data scientist, you need to be “fluent” in these basics. You should be able to write the code for basic charts and plots without even thinking about it. If you can’t, you should go back and practice them. Don’t get shiny object syndrome and try to move on to advanced topics before you do. Show some discipline. Master the foundations.

After you do master the foundations though, you’ll need to learn some intermediate tools.

One of these tools is color. Specifically, you’ll need to learn how to manipulate the “fill” color of things like density plots (as well as heatmaps).

With that in mind, let’s take a look at using color in density plots.

As always, we’re going to use the meta-learning strategy of learning the tools with basic cases. Essentially, we’ll learn and practice how to modify the fill aesthetic for very simple plots. Later, once you get the hang of it, you can move on to more advanced applications.

First, we will load the packages we need.

#--------------
# LOAD PACKAGES
#--------------

library(tidyverse)
library(viridis)
library(RColorBrewer)



Next, we will set a “seed” that will make the dataset exactly reproducible when we create our data. Without this, running rnorm() and runif() would produce data that is similar to, but not exactly the same as, the data you see here. To be clear, it won’t make much difference for the purposes of this blog post, but it’s a best practice when you want to show reproducible results, so we will do it anyway.

#---------
# SET SEED
#---------
set.seed(19031025)
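
To see what the seed buys you, here is a minimal sketch (the variable names first.draw and second.draw are just illustrative). Resetting the same seed before each call makes rnorm() return identical values. The last line resets the seed again, so the data-creation code that follows still starts from the same random state.

```r
# Draw three random normal values right after setting the seed
set.seed(19031025)
first.draw <- rnorm(3)

# Reset the same seed and draw again
set.seed(19031025)
second.draw <- rnorm(3)

# TRUE: both draws are identical because they started from the same seed
identical(first.draw, second.draw)

# Reset the seed once more so the code that follows starts from it
set.seed(19031025)
```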



Now, we will create our data frame.

We will use runif() to create 20,000 uniformly distributed values for our x variable, and rnorm() to create 20,000 random normal values for our y variable. We’re also using the tibble() function to create a tibble (a type of data frame) out of these two new variables.

#------------------
# CREATE DATA FRAME
#------------------
df.data <- tibble(x = runif(20000)
,y = rnorm(20000)
)
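
Before plotting, it can help to sanity-check what we just built. The following is a small sketch, not part of the original workflow: it recreates the data so it runs on its own, then checks the dimensions and the rough shape of each variable.

```r
library(tibble)

# Recreate the data so this snippet runs on its own
set.seed(19031025)
df.data <- tibble(x = runif(20000)
                  ,y = rnorm(20000)
                  )

dim(df.data)        # number of rows and columns
range(df.data$x)    # runif() defaults to values between 0 and 1
summary(df.data$y)  # rnorm() defaults to mean 0 and standard deviation 1
```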



Now that we have a dataset created, let’s create a simple plot of the data. Ultimately, we will be working with density plots, but it will be useful to first plot the data points as a simple scatter plot.

Here, we’re using the typical ggplot syntax: we’re specifying the data frame inside of ggplot() and specifying our variable mappings inside of aes().

#--------------------
# CREATE SCATTER PLOT
#--------------------

ggplot(df.data, aes(x = x, y = y)) +
  geom_point()



Ok. We can at least see the data points and the general structure of the data (i.e., the horizontal band).

Having said that, these data are very heavily overplotted. There are a few ways to mitigate this overplotting (e.g., manipulating the alpha aesthetic), but a great way is to create a density plot.
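
For comparison, here is a rough sketch of the alpha approach. The value 0.05 is an arbitrary choice of mine, and the snippet uses a plain data.frame so that it only needs ggplot2 to run on its own.

```r
library(ggplot2)

# Recreate the data so this snippet runs on its own
set.seed(19031025)
df.data <- data.frame(x = runif(20000), y = rnorm(20000))

# Smaller alpha values make each point more transparent,
# so dense regions show up as darker areas
p.alpha <- ggplot(df.data, aes(x = x, y = y)) +
  geom_point(alpha = 0.05)

p.alpha
```

This helps, but you still have to judge density by how dark the smear of points is, which is why the density plot is usually the cleaner option.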

To create the density plot, we’re using stat_density2d(). I won’t explain this in detail here, but essentially in this application, stat_density2d() calculates the density of observations in each region of the plot, and then fills in that region with a color that corresponds to the density.

#------------------------------
# DENSITY PLOT, w/ DEFAULT FILL
#------------------------------

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE)
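
If you’re curious about what is being calculated, the ggplot2 documentation describes this stat as a 2D kernel density estimate computed with MASS::kde2d(). The sketch below calls kde2d() directly so you can look at the numbers behind the colors. The 100 x 100 grid is my assumption about the stat’s default grid size, so check the help page for your version of ggplot2.

```r
# Recreate the data so this snippet runs on its own
set.seed(19031025)
x <- runif(20000)
y <- rnorm(20000)

# Estimate the density on a 100 x 100 grid.
# The MASS:: prefix avoids attaching MASS, which would mask dplyr::select()
dens <- MASS::kde2d(x, y, n = 100)

# A list with the grid coordinates (x and y) and a matrix of densities (z)
str(dens)

# Find the grid cell with the highest density and report its y position;
# rows of z correspond to x values, columns to y values
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
dens$y[peak[1, "col"]]
```

Each cell of the z matrix is what gets mapped to a fill color in the density plot.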



This isn’t bad. It gives us a sense of the density of the data (you can see the thick band across the middle). However, there are two issues.

First, the differences in density are not completely obvious, because of the color scale. The default light blue/dark blue color scheme doesn’t illuminate the differences in data density.

Second, this is just not very aesthetically appealing. It just doesn’t look that good.

To fix these issues, let’s modify the color scheme. I’ll show you a few options. Some will work better than others, and after you see these, I’ll encourage you to experiment with other color palettes.

Let’s first take a look at the color palette options.

You can examine a large number of ready-made color palettes from the RColorBrewer package by using display.brewer.all().

#-----------------------
# DISPLAY COLOR PALETTES
#-----------------------

display.brewer.all()
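
If you want to look at a palette more closely, RColorBrewer also lets you pull out the colors and the palette metadata. This is a short sketch using functions from the package; the choice of 'Greens' and 9 colors is just for illustration.

```r
library(RColorBrewer)

# Hex codes for the 9-color version of the 'Greens' palette
greens <- brewer.pal(9, 'Greens')
greens

# Metadata for the palettes: maximum number of colors, category
# (seq, div, or qual), and whether the palette is colorblind friendly
head(brewer.pal.info)

# Show only the sequential palettes, which suit a density scale
display.brewer.all(type = 'seq')
```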



As you can see, there are quite a few palettes from RColorBrewer. We’ll use a few of these to change the fill color of our plot.

#--------------------------------
# FILL WITH COLOR PALETTE: Greens
#--------------------------------

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_distiller(palette = 'Greens')
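
One thing to be aware of: scale_fill_distiller() defaults to direction = -1, which in the plot above maps the highest densities to the lightest greens. If you would rather have dark mean dense, a sketch like the following flips the scale. It uses after_stat(density), which is the spelling that newer ggplot2 releases (3.3.0 and later) accept in place of ..density...

```r
library(ggplot2)

# Recreate the data so this snippet runs on its own
set.seed(19031025)
df.data <- data.frame(x = runif(20000), y = rnorm(20000))

# direction = 1 keeps the palette's original light-to-dark order,
# so the densest regions get the darkest green
p.greens <- ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_distiller(palette = 'Greens', direction = 1)

p.greens
```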



Now let’s use a few more color palettes.

#------------------------------------------------------
# FILL WITH COLOR PALETTE: Reds, Greys, Red/Yellow/Blue
#------------------------------------------------------

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_distiller(palette = 'Reds')



… grey

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_distiller(palette = 'Greys')



… and a scale from red, to yellow, to blue.

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_distiller(palette = 'RdYlBu')
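
Since the only thing changing between these plots is the palette name, you can wrap the plot in a small function and experiment faster. This is just a sketch: plot_density_palette() is a hypothetical helper I made up for illustration, and the three palette names are arbitrary picks from the RColorBrewer list.

```r
library(ggplot2)

# Recreate the data so this snippet runs on its own
set.seed(19031025)
df.data <- data.frame(x = runif(20000), y = rnorm(20000))

# Hypothetical helper: draw the same density plot with a given Brewer palette
plot_density_palette <- function(data, palette) {
  ggplot(data, aes(x = x, y = y)) +
    stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
    scale_fill_distiller(palette = palette) +
    ggtitle(palette)
}

# Build one plot per palette name
palettes <- c('Blues', 'YlOrRd', 'PuBuGn')
plots <- lapply(palettes, function(p) plot_density_palette(df.data, p))

# Print the first one; use plots[[2]] and plots[[3]] for the others
plots[[1]]
```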



I think they work a little better than the default color scheme, but I think we can do better, so let’s try one more.

The following plot uses a custom color palette from the viridis package.

#---------------------------------
# FILL WITH COLOR PALETTE: Viridis
#---------------------------------

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_viridis()
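
The viridis package ships several related palettes, which you select with the option argument. Here is a quick sketch using 'magma'; the other options I’m aware of are 'inferno', 'plasma', 'cividis', and the default 'viridis'. Recent versions of ggplot2 also include scale_fill_viridis_c(), which should give you the same palettes without loading the viridis package.

```r
library(ggplot2)
library(viridis)

# Recreate the data so this snippet runs on its own
set.seed(19031025)
df.data <- data.frame(x = runif(20000), y = rnorm(20000))

# Same density plot, filled with the 'magma' palette
p.magma <- ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = after_stat(density)), geom = 'tile', contour = FALSE) +
  scale_fill_viridis(option = 'magma')

p.magma

# The palette functions also return hex codes directly,
# which is handy if you want to reuse the colors elsewhere
viridis(5)
```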



I should have called this blog post “ggplot for people who love Mark Rothko.”

Ok, the viridis color palette (and a related set of palettes in the viridis package) is probably my favorite option. Not only do I think this color palette is one of the most aesthetically attractive, it’s also more functional. As noted in the documentation for the package, the viridis color palette is “designed in such a way that it will analytically be perfectly perceptually-uniform … It is also designed to be perceived by readers with the most common form of color blindness.”

## Master these techniques with simple cases

Admittedly, when you move on to more complex datasets later, it will take a bit of finesse to properly apply these color palettes.

But as I noted earlier, when you’re trying to master R (or any programming language), you should first master the syntax and techniques on basic cases, and then increase the complexity as you attain basic competence.

With that in mind, if you want to master these color and fill techniques, learn and practice these tools with simple cases like the ones shown here, and you can attempt more advanced applications later.
