# End-to-end visualization using ggplot2

August 13, 2017

`ggplot2` is a household name among R users. I’ve come to rely on it during complex data munging and wrangling work, where I need clarity on different aspects of the data, especially different views and slices of it, presented as clean visualizations. Somewhere along the way, I stopped using more traditional plotting functions like `plot()`, `matplot()`, `barplot()`, etc.

This article is an end-to-end data visualization exercise, using only `ggplot2`. It has been helpful for me to see such pieces online on the endless possibilities of `ggplot2`, so I wanted to give back to the community by doing one of my own.

## 1. Pima Indian Diabetes data

Consider the Pima Indian Diabetes dataset, available in R through the `MASS` package. It covers a population of women who were at least 21 years of age, of Pima Indian heritage and living near Phoenix, Arizona, and who were tested for diabetes according to WHO criteria. In this exercise, I will use the 332 test data subjects (`Pima.te`). There are no missing values in this data. It is a very simple dataset, but my goal is to use it to demonstrate the tools available in `ggplot2` to visually investigate a dataset we know very little about. This is part of the important data exploration phase of a data science project, which helps prepare for the modeling phase.

```r
library(MASS)
d <- Pima.te
summary(d)
```
```
##      npreg             glu              bp              skin
##  Min.   : 0.000   Min.   : 65.0   Min.   : 24.00   Min.   : 7.00
##  1st Qu.: 1.000   1st Qu.: 96.0   1st Qu.: 64.00   1st Qu.:22.00
##  Median : 2.000   Median :112.0   Median : 72.00   Median :29.00
##  Mean   : 3.485   Mean   :119.3   Mean   : 71.65   Mean   :29.16
##  3rd Qu.: 5.000   3rd Qu.:136.2   3rd Qu.: 80.00   3rd Qu.:36.00
##  Max.   :17.000   Max.   :197.0   Max.   :110.00   Max.   :63.00
##       bmi             ped              age         type
##  Min.   :19.40   Min.   :0.0850   Min.   :21.00   No :223
##  1st Qu.:28.18   1st Qu.:0.2660   1st Qu.:23.00   Yes:109
##  Median :32.90   Median :0.4400   Median :27.00
##  Mean   :33.24   Mean   :0.5284   Mean   :31.32
##  3rd Qu.:37.20   3rd Qu.:0.6793   3rd Qu.:37.00
##  Max.   :67.10   Max.   :2.4200   Max.   :81.00
```
```r
head(d)
```
```
##   npreg glu bp skin  bmi   ped age type
## 1     6 148 72   35 33.6 0.627  50  Yes
## 2     1  85 66   29 26.6 0.351  31   No
## 3     1  89 66   23 28.1 0.167  21   No
## 4     3  78 50   32 31.0 0.248  26  Yes
## 5     2 197 70   45 30.5 0.158  53  Yes
## 6     5 166 72   19 25.8 0.587  51  Yes
```

The target variable, `type`, tells us whether a patient is diabetic or not.
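Before plotting anything, it can be worth confirming the two claims made so far: that the data has no missing values, and how the target is balanced. The following is a minimal base-R sketch; the names `na_counts` and `type_counts` are my own and not part of the original analysis.

```r
library(MASS)  # provides the Pima.te data frame

d <- Pima.te

# Count missing values per column; all zeros means the data is complete
na_counts <- colSums(is.na(d))
print(na_counts)

# Class balance of the target, as counts and as proportions
type_counts <- table(d$type)
print(type_counts)
print(round(prop.table(type_counts), 3))
```

The counts should agree with the `type` column of the `summary(d)` output above.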

## 2. Distributions across categories

When the target is categorical, as `type` is here, I like to start by examining distributions for the continuous input columns. This gives us an overall sense of which inputs are likely to be useful. To do this, I like to draw both boxplots and a density plot, since each has a different goal.

### 2.1. Boxplots

First, I’ll use boxplots, but `ggplot2`-style. I really like the look of a `ggplot2` boxplot. It also allows me to seamlessly arrange multiple plots in a grid, as well as tinker with the plotting parameters more flexibly than in a classical `boxplot()` approach, and end up with a nice-looking plot. We can see below how some inputs clearly vary across the 2 target categories, and others don’t.

```r
df <- subset(d, select = c(glu, bp, skin, bmi, ped, age, type))

library(gridExtra)
library(ggplot2)
p <- list()

# One boxplot per continuous input, split by diabetes type
for (j in colnames(df)[1:6]) {
  p[[j]] <- ggplot(data = df, aes_string(x = "type", y = j)) +    # dataset, grouping column and Y column
    geom_boxplot(aes(fill = factor(type))) +                      # boxplot filled by diabetes type
    guides(fill = "none") +                                       # suppress the redundant fill legend
    theme(axis.title.y = element_text(face = "bold", size = 14))  # make the Y-axis labels bigger/bolder
}

# Arrange the six plots in a grid with 3 columns
do.call(grid.arrange, c(p, ncol = 3))
```
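As an alternative to looping and arranging the plots with `grid.arrange()`, the same grid of boxplots can be sketched with a single long-format data frame and `facet_wrap()`. This is not the approach used above, just another way to get a similar picture; the names `inputs`, `long` and `p_box` are my own.

```r
library(MASS)
library(ggplot2)

d <- Pima.te
inputs <- c("glu", "bp", "skin", "bmi", "ped", "age")

# stack() turns the six input columns into two: the values and the column name
long <- data.frame(
  stack(d[inputs]),
  type = rep(d$type, times = length(inputs))  # repeat the target once per input
)
names(long)[1:2] <- c("value", "variable")

# One panel per input; each panel gets its own Y scale
p_box <- ggplot(long, aes(x = type, y = value, fill = type)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~ variable, ncol = 3, scales = "free_y") +
  theme(strip.text = element_text(face = "bold", size = 12))

print(p_box)
```

The trade-off is that the variable name moves from the Y-axis title into the panel strip, and all panels share one set of theme settings.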

### 2.2. Density plots

I have used various overlay-density tools in the past, `sm.density.compare()` from the `sm` package for example. I find the overlay-density rendering in `ggplot2` more visually pleasing, and it needs little tuning of plotting parameters. For example, the plot below suggests that diabetic patients tend to have had a higher number of pregnancies. I really like the `alpha` parameter, which controls the transparency of the overlaid densities.

```r
df$npreg <- d$npreg  # add the pregnancy count to the plotting data frame
g <- ggplot(df, aes(npreg))
g + geom_density(aes(fill = factor(type)), alpha = 0.8) +  # semi-transparent overlaid densities
  labs(title = "Density plot",
       subtitle = "# Pregnancies Grouped by Diabetes Type",
       x = "# Pregnancies",
       fill = "Diabetes Type")
```
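To back the visual impression with numbers, the pregnancy counts can be summarized per diabetes type. This is a small base-R sketch; `npreg_summary` is a name I am introducing here.

```r
library(MASS)

d <- Pima.te

# Split the pregnancy counts by diabetes type and summarize each group
npreg_summary <- sapply(
  split(d$npreg, d$type),
  function(x) c(mean = mean(x), median = median(x), n = length(x))
)

# Columns are the diabetes types, rows are the summary statistics
print(round(npreg_summary, 2))
```

If the density plot is read correctly, the mean for the diabetic group should come out higher than for the non-diabetic group.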

## 3. Grid views

Next, I want to mix things up a little, so that I can have multi-dimensional views. By this, I mean that I want to know how the target is distributed across a few important inputs, but I want to link those inputs up as well. Sort of like a 3-way table, but visualized nicely instead of as numbers. I came across this problem recently in one of my projects, and while it seems like a basic must-have output for digging deeper, I really needed something like `ggplot2` to implement it. Using `facet_grid()` was amazing, even more so on account of the smooth control one has over the plotting parameters within a `ggplot2` setup.

### 3.1. Data preparation

Facet-wrapping and gridding are must-have tools for deeper data views, but the process is a multi-step one. It is not too complicated though, and very intuitive under `ggplot2`. We start by creating some new categorical columns from the continuous ones. Note that this can be done in different ways: appending new columns directly to the data frame, or using the sleeker `dplyr` in combination with `magrittr`, which I absolutely love. This integrates a number of operations into a single chunk, making it quite seamless. I am also loading `plyr`, since I will be using it later.

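```r
library(magrittr)
library(plyr)   # loaded before dplyr so that dplyr's verbs take precedence
library(dplyr)

df_grid <- d %>%
  mutate(Skin = ifelse(skin <= 29, "low skin fold", "high skin fold"),
         BMI  = ifelse(bmi <= 33, "low BMI", "high BMI"),
         # The same cut-off (0.3134) closes "low" and opens "medium",
         # so no value can fall through to "high" by accident
         Ped  = ifelse(ped <= 0.3134, "low pedigree",
                       ifelse(ped <= 0.5844, "medium pedigree", "high pedigree"))) %>%
  # Order the pedigree levels so that facets appear low -> medium -> high
  mutate(Ped = factor(Ped, levels = c("low pedigree", "medium pedigree", "high pedigree")))
```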
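For a three-level split like the pedigree one, base R's `cut()` is a compact alternative to nested `ifelse()` calls. The sketch below assumes the 0.3134 and 0.5844 cut-offs that appear in the code above; the names `ped_breaks`, `ped_labels` and `Ped` are only for illustration.

```r
library(MASS)

d <- Pima.te

# Interval boundaries and the label for each interval
ped_breaks <- c(-Inf, 0.3134, 0.5844, Inf)
ped_labels <- c("low pedigree", "medium pedigree", "high pedigree")

# cut() uses right-closed intervals by default, i.e. (-Inf, 0.3134],
# (0.3134, 0.5844], (0.5844, Inf], and returns a factor in label order
Ped <- cut(d$ped, breaks = ped_breaks, labels = ped_labels)

print(table(Ped))
```

Because the breaks are listed once, every value lands in exactly one interval, and the factor levels already come out in the desired order.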

### 3.2. Reshaping the data

Next, we need to prepare the data a little more before throwing it into the `facet_grid()` mix. Most importantly, we need to “reshape” it, i.e., while our data is a “wide”-form data frame, we need to convert it to “long” form so that `facet_grid()` can easily pick up what it needs to “facet” the plot by. We will also add a “size” column, which will allow us to make more granular adjustments in our plot. I will also rename a column to enable easier axis labeling when plotting. Again, notice that instead of using `reshape2`, which I have used for many years, we’re using `gather()` from `tidyr`, all sewn together with the pipe from `magrittr`.

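```r
library(tidyr)
DF <- df_grid %>%
  subset(select = c(type, Skin, BMI, Ped)) %>%
  gather(variable, value, -c(Skin, Ped, BMI))  # long form: one row per patient, keyed by "type"

colnames(DF)[5] <- "Diabetes_Value"  # rename the "value" column for easier axis labeling
DF$size <- rep(1.5, nrow(DF))        # constant line width, used for the facet borders later
s <- 1.5
```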

### 3.3. Facet Grid

We’ll try the basic `facet_grid()` plot first, and then go in and make some adjustments. For now, our goal is the following: to see a “matrix” or “grid” of the BMI distribution across diabetes type, laid out as a 2×3 table of skin fold/pedigree combinations (two skin fold levels by three pedigree levels). In other words, for low pedigree/low skin fold, how does BMI distribute across diabetes type? You can see the amount of information you can pack into just one plot. I have found this to be useful when presenting to an end-user or customer. It becomes all the more useful since it’s a very clear representation of this slice/dice, with little room for ambiguity.

```r
# Simple
library(ggplot2)
ggplot(data = DF, aes(x = Diabetes_Value, fill = BMI)) +
  geom_bar() +            # Barplot
  facet_grid(Skin ~ Ped)  # rows by skin fold, columns by pedigree
```

This looks nice, but I would like to add more of a “pop”. I am going to outline each box and make the fonts bold. Note that you can also color the “grid strips”, but I won’t do that right now.

```r
# More color
p <- ggplot(data = DF, aes(x = Diabetes_Value, fill = BMI)) +
  geom_bar() +  # Barplot
  # Box drawn around each facet to cleanly separate them; its legend is suppressed
  geom_rect(aes(fill = NA, size = size), xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf,
            alpha = 0.0002, colour = "black", show.legend = FALSE) +
  scale_size(range = c(s, s), guide = "none") +  # fixed line width for the facet borders
  facet_grid(Skin ~ Ped) +                       # rows by skin fold, columns by pedigree
  theme(strip.text.x = element_text(face = "bold", size = 12)) +
  theme(strip.text.y = element_text(face = "bold", size = 12))
# optional changes in strip
#+ theme(strip.text.x = element_text(face="bold", size=12, colour="white")) +
#  theme(strip.text.y = element_text(face="bold", size=12, color="white")) +
#  theme(strip.background = element_rect(fill="black"))
plot(p)
```

Much better. Look how nicely this granular plot adjustment in `ggplot2` makes each “block” in the matrix pop out. It’s very clear how BMI is distributed across diabetes type, and how that in turn is distributed across both pedigree function and skin fold. We see that (as expected):

1. A higher triceps skin fold thickness is associated with a higher BMI, as well as a higher count of diabetic people.
2. This holds more strongly for a higher diabetes pedigree function.

This kind of grid plot is a powerful tool for such multi-dimensional data views.
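For comparison, here is roughly what the numeric multi-way table behind such a grid looks like. This base-R sketch rebuilds the categorical columns on its own so that it runs independently; it assumes the skin fold, BMI and pedigree cut-offs used in the data preparation step, and the names `grid_df` and `counts` are hypothetical.

```r
library(MASS)

d <- Pima.te

# Rebuild the categorical columns (cut-offs assumed from the data preparation step)
grid_df <- data.frame(
  type = d$type,
  Skin = ifelse(d$skin <= 29, "low skin fold", "high skin fold"),
  BMI  = ifelse(d$bmi <= 33, "low BMI", "high BMI"),
  Ped  = cut(d$ped, breaks = c(-Inf, 0.3134, 0.5844, Inf),
             labels = c("low pedigree", "medium pedigree", "high pedigree"))
)

# Cross-tabulate all four categorical columns
counts <- xtabs(~ Skin + Ped + BMI + type, data = grid_df)

# Flatten the table: facets (Skin x Ped) as rows, bar segments (type x BMI) as columns
print(ftable(counts, row.vars = c("Skin", "Ped")))
```

Each row of the flattened table corresponds to one facet of the grid plot, which makes it clear how much easier the visual version is to scan.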

## 4. Heatmaps

I like heatmaps – there’s a sense of drama in the way you can see where “something is happening”. I’ve used `heatmap.2()` to run hierarchical clustering and translate the result into a heatmap. But I wanted to use `ggplot2` to simply look at a dataset as a heatmap, without any underlying analysis, to detect patterns before any analysis begins.

In this case, I want `ggplot2` to show me patterns across different input columns for the two diabetes types, i.e., which inputs seem to differ between diabetic and non-diabetic patients. This will be clear once we render our dataset as a nice `ggplot2` heatmap.

### 4.1. Data preparation

As usual, we need to prep our data before pushing it into `ggplot()`. We’ll reshape and scale the data first, all within the `plyr`, `dplyr`, `tidyr` and `magrittr` framework. I’ll also specify some plotting parameters that I will pass to my `ggplot2` calls. I’m going to rely on `RColorBrewer` for these.

```r
# Sort patients by diabetes type so that the two groups form contiguous bands
df_heat <- d[order(d$type), 1:8]
DF_Heat <- df_heat %>%
  mutate(id = 1:nrow(df_heat)) %>%
  dplyr::select(c(npreg:age, id)) %>%     # explicit namespace, since MASS also exports select()
  gather(variable, value, -id) %>%
  ddply(.(variable), transform,
        rescale = as.vector(scale(value)))  # standardize within each variable; note this reorders by "variable"

# Color scale for heatmap
library(RColorBrewer)
colors <- brewer.pal(9, 'Reds')

# Line to split patients into non-diabetic (first 223 rows) and diabetic
my.lines <- data.frame(x1 = 0.5, x2 = 7.5, y1 = 223.5, y2 = 223.5)
```
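Two details of this preparation step can be checked numerically: that per-variable scaling centers each input at 0 with unit standard deviation, and that the separator line at 223.5 sits just above the last non-diabetic patient. The following base-R sketch uses `scale()` on the wide data, which should be equivalent to scaling within each variable; the names `scaled`, `n_no` and `separator_y` are my own.

```r
library(MASS)

d <- Pima.te

# Standardize the seven input columns (npreg through age) column by column
scaled <- scale(d[, 1:7])

# After scaling, every column should have mean ~0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))

# With patients sorted by type, non-diabetics occupy rows 1..n_no,
# so the dividing line belongs halfway between row n_no and row n_no + 1
n_no <- sum(d$type == "No")
separator_y <- n_no + 0.5
print(separator_y)
```

Deriving the separator from the data rather than hard-coding it keeps the line in the right place if a different subset of patients is used.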

### 4.2. Rendering the heatmap

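```r
# Basic plot
p <- ggplot(DF_Heat, aes(as.factor(variable), as.factor(id), group = id)) +
  geom_tile(aes(fill = rescale), colour = "white") +
  # Assumption: the fill scale is not shown in the source; the `colors`
  # palette defined during data preparation is presumably applied here
  scale_fill_gradientn(colours = colors)

base_size <- 9
p_adj <- p + theme_grey(base_size = base_size) + labs(x = "", y = "") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  # Horizontal line separating non-diabetic from diabetic patients
  geom_segment(data = my.lines, aes(x = x1, y = y1, xend = x2, yend = y2),
               size = 1, inherit.aes = FALSE) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

p_adj  # render the adjusted heatmap
```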

Note that I have suppressed the ticks on the Y-axis. We can clearly see regions of interest on the heatmap. To make these pop out at the viewer, I am going to outline them with `geom_rect()`.

```r
# Borders of rectangles to indicate areas of interest on heatmap
my.lines.rect.1 <- data.frame(xmin = 1.5, xmax = 2.5, ymin = 223.5, ymax = 255.5)
my.lines.rect.2 <- data.frame(xmin = 3.5, xmax = 4.5, ymin = 223.5, ymax = 332)
my.lines.rect.3 <- data.frame(xmin = 5.5, xmax = 6.5, ymin = 223.5, ymax = 280.5)

# One unfilled rectangle per area of interest, each with its own line type
p_adj +
  geom_rect(data = my.lines.rect.1,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = NA, col = "black", lty = 2, inherit.aes = FALSE) +
  geom_rect(data = my.lines.rect.2,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = NA, col = "black", lty = 5, inherit.aes = FALSE) +
  geom_rect(data = my.lines.rect.3,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = NA, col = "black", lty = 4, inherit.aes = FALSE)
```

Much better.

## 5. Segmentation in a scatterplot

Finally, I want to try to implement some “basic-level clustering”. This is not model-based clustering; rather, it simply uses a scatterplot and a few nice plotting parameters in `ggplot2` to make some things pop right out at the viewer, again with little room for ambiguity. What I like most here is the boxes that we can draw to showcase the “clusters” a little better, along with the multi-layered information, e.g., age, BMI, glucose, etc.

The conclusions from the following plot are logical and obvious, but it nicely illustrates the use of `ggplot2` for such a specific purpose.

```r
d$Age <- ifelse(d$age < 30, "<30 yrs", ">= 30 yrs")  # two age bands, shown as point shapes

ggplot(d, aes(x = glu, y = bmi)) +
  # Boxes marking the two hand-picked segments; the line type identifies each box
  geom_rect(aes(linetype = "High BMI - Diabetic"), xmin = 160, ymax = 40, fill = NA, xmax = 200,
            ymin = 25, col = "black") +
  geom_rect(aes(linetype = "Low BMI - Not Diabetic"), xmin = 0, ymax = 25, fill = NA, xmax = 120,
            ymin = 10, col = "black") +
  geom_point(aes(col = factor(type), shape = factor(Age)), size = 3) +  # color by type, shape by age band
  scale_color_brewer(name = "Type", palette = "Set1") +
  scale_shape(name = "Age") +
  scale_linetype_manual(values = c("High BMI - Diabetic" = "dotted", "Low BMI - Not Diabetic" = "dashed"),
                        name = "Segment")
```
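Since the two boxes are drawn by hand, it can be useful to count how many patients of each type actually fall inside them. The sketch below uses the same box limits as the `geom_rect()` calls above; the names `in_high` and `in_low` are introduced only for this check.

```r
library(MASS)

d <- Pima.te

# Membership in each box, using the same limits as the rectangles in the plot
in_high <- with(d, glu >= 160 & glu <= 200 & bmi >= 25 & bmi <= 40)
in_low  <- with(d, glu >= 0   & glu <= 120 & bmi >= 10 & bmi <= 25)

# Diabetes type counts inside each box
print(table(d$type[in_high]))  # "High BMI - Diabetic" box
print(table(d$type[in_low]))   # "Low BMI - Not Diabetic" box
```

If the segment labels are apt, the first table should be dominated by `Yes` and the second by `No`.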

Hopefully, this little exercise will be helpful for someone wanting to use `ggplot2` for an innovative slice/dice of a complex dataset, and to visualize it nicely.
