R Training – Data Visualization

[This article was first published on R – SLOW DATA, and kindly contributed to R-bloggers.]

ggplot iris example dataset in R

This is the fourth module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Graphics in data projects can be useful for several tasks including:

  • understand data properties
  • find patterns in data
  • communicate results

First of all, let’s load some useful packages.

library(dplyr)
library(MASS)
library(rworldmap)
library(ggplot2)
library(RColorBrewer)

Understand data properties

We will start with some exploratory graphics to summarize data and highlight broad features. This is useful to explore basic questions and hypotheses, suggest modeling strategies and so on.

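x <- rnorm(100)
y <- x + rnorm(100, mean=0.2, sd=2)
# lab is made an explicit factor: since R 4.0 data.frame() no longer converts strings to factors
df <- data.frame(lab = factor(LETTERS[1:7]), g = rgamma(7, shape = 100))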

plot() is a generic function to plot R objects. It is generic because it adapts to the input provided:

  • if you provide a numeric vector, the default is to plot its values on the y axis against an integer index on the x axis
  • if you provide two numeric vectors, the default is to plot the points determined by the (x, y) pairs (a scatterplot)
  • if you provide a data frame with a factor and a numeric column, you get one box per factor level (a boxplot); with a single value per level each box collapses to a line, so the result reads like a barplot
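
The mechanism behind this behaviour can be sketched with a toy generic. The name describe and its three methods are made up for illustration; plot() works in the same spirit, picking a method according to the class of its first argument.

```r
# A toy S3 generic (hypothetical name) that dispatches on the class of its input,
# in the same way plot() does
describe <- function(obj, ...) UseMethod("describe")

describe.numeric <- function(obj, ...) "numeric vector: values against an index"
describe.data.frame <- function(obj, ...) "data frame: first column against second column"
describe.default <- function(obj, ...) "no specific method: fall back to the default"

describe(rnorm(5))
describe(data.frame(a = 1:3, b = 4:6))
describe("some text")

# The real method that plot() uses for data frames can be retrieved like this
plot_df_method <- getS3method("plot", "data.frame")
is.function(plot_df_method)
```

Typing methods(plot) at the console lists every class for which a dedicated plot method is currently available.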

plot(x) # values against integer index

plot one way

plot(x, y) # scatterplot

scatterplot

plot(df) # one box per level; with a single value each it reads like a barplot

barplot

Let’s load some example data

data("iris")
dt <- iris
str(dt) # to have an idea of what kind of data you have read-in

summary(dt) # to obtain a summary of data

Once you know your data is clean you may want to explore some features in more detail.

dt_spec <- dt %>% group_by(Species) %>% summarise(Petal.Length=sum(Petal.Length))
plot(dt_spec)

barplot iris

There is also a specific function to create barplots in R, but the input has to be provided in a slightly different way:

barplot(dt_spec$Petal.Length, names.arg = dt_spec$Species)

barplot iris standard

Looking at the summary we see that the minimum sepal length is 4.3, the maximum 7.9 and the median 5.8. The summary also reports the quartiles, but for a more thorough view of the distribution you should draw a histogram.

hist(dt$Sepal.Length)

hist iris 1

hist(dt$Sepal.Length, nclass = 30) # show finer detail by increasing the number of bins

hist iris 2
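
To see what the bin setting actually does, a small sketch is to compute the histogram without drawing it and inspect the result. The number passed as breaks (or nclass) is only a suggestion: R rounds to "pretty" break points, so the number of bins you get can differ from the number you asked for. The object names below are arbitrary.

```r
# Compute the histograms without drawing them
h_default <- hist(iris$Sepal.Length, plot = FALSE)
h_fine <- hist(iris$Sepal.Length, breaks = 30, plot = FALSE)

length(h_default$counts)  # number of bins chosen by default
length(h_fine$counts)     # number of bins after asking for about 30
h_fine$breaks             # the break points R actually used
sum(h_fine$counts)        # every observation falls in exactly one bin
```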

Another way to get a quick visualization of a distribution is to use boxplots.

boxplot(dt$Sepal.Length)

boxplot one-way

In this case we see clearly that:

  • the middle 50% of the values lies between about 5 and 6.5
  • maximum value excluding outliers is somewhere between 7.5 and 8.0
  • right tail is longer than left tail

In R boxplots the box corresponds to the interquartile range (from the 25th to the 75th percentile) and the black line inside the box is the median. The lines extending vertically from the box (whiskers) reach the most extreme data points that lie within 1.5 times the interquartile range of the box. Points beyond the whiskers, if any, are plotted individually as outliers.
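
The numbers behind the drawing can be checked with boxplot.stats(), which boxplot() relies on. This is a small sketch with arbitrary variable names; the 1.5 multiplier is the default whisker coefficient.

```r
bs <- boxplot.stats(iris$Sepal.Length)  # default whisker coefficient is 1.5
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond the whiskers, drawn individually by boxplot()

# Fences beyond which a point is flagged as an outlier
hinges <- bs$stats[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)
fences
```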

 

Find patterns

Usually it is a good idea to investigate relations using graphics since we are naturally prone to detect trends, relationships, etc. in a visual way.

When we talk about patterns in data we usually refer to relationships between two or more variables. Options to visualize two dimensions are:

  • draw multiple boxplots in one window
  • scatterplots
  • etc.

To add a 3rd dimension one option is to use different colors, shapes, sizes, etc. (rather than using 3D graphics, which are typically hard to interpret).

Say we want to see if the sepal length distribution changes according to species.

# boxplot function supports formula (~) statements
boxplot(dt$Sepal.Length ~ dt$Species, col="salmon2")

box plot multi

The hist() function does not support formula statements, but you can directly modify the global graphical parameters to split the graphics device into multiple slots. Before changing global parameters it is a good idea to save a copy of the original settings, so you can easily go back to them once you are done with the plot.

parOriginal <- par(no.readonly = TRUE) # save a copy of original graphical parameters
par(mfrow=c(2,2)) # par can be used to set or query graphical parameters
hist(dt[dt$Species=="setosa","Sepal.Length"], nclass = 30)
hist(dt[dt$Species=="virginica","Sepal.Length"], nclass = 30)
hist(dt[dt$Species=="versicolor","Sepal.Length"], nclass = 30)
hist(dt$Sepal.Length, nclass = 30) # full sepal length distribution

hist multi par

par(parOriginal) # restore the original graphical parameters
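
The save-and-restore pattern can also be wrapped in a function, so the reset cannot be forgotten. The sketch below is one possible way to do it; hist_by_group is a made-up helper, not part of any package.

```r
# Hypothetical helper: one histogram per level of a grouping variable
hist_by_group <- function(values, groups, ...) {
  groups <- factor(groups)
  n <- nlevels(groups)
  # par() returns the previous values of the parameters it changes
  old <- par(mfrow = c(1, n))
  on.exit(par(old), add = TRUE)  # restored even if a hist() call fails
  for (g in levels(groups)) {
    hist(values[groups == g], main = g, xlab = "", ...)
  }
  invisible(NULL)
}

hist_by_group(iris$Sepal.Length, iris$Species, breaks = 15)
par("mfrow")  # back to the layout that was active before the call
```

Using on.exit() means the original layout comes back whether the function finishes normally or stops with an error.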

 

Scatterplot

Let’s simulate some numbers and draw scatterplots.

# two normal populations, with mean 2 and 4 respectively 
x_a <- rnorm(50, 2)
x_b <- rnorm(50, 4)
x <- c(x_a, x_b)

# another two normal populations respectively correlated with previous ones
y_a <- x_a + rnorm(50, 0.2, 0.5)
y_b <- x_b + rnorm(50, 0.2, 1)
y <- c(y_a, y_b)

# a variable to label the two populations
l <- c(rep("A", 50), rep("B", 50))

# a dataframe including x, y and l (l stored as a factor so it can drive the colours below)
df <- data.frame(x=x, y=y, l=factor(l))

# scatterplot 2-d
plot(df$x, df$y)

scatterplot 2way

# add a third dimension with colour
with(df, plot(x, y, col = l))

scatterplot 3way
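
When a factor is passed to col, R uses its integer codes to pick colours from the current palette. If you prefer to choose the colours yourself, one option is a named lookup vector, as in this sketch (the data are simulated again so the block runs on its own; colour and symbol choices are arbitrary).

```r
# Simulated data with a grouping factor (same shape as the example above)
set.seed(1)
d <- data.frame(x = c(rnorm(50, 2), rnorm(50, 4)),
                l = factor(rep(c("A", "B"), each = 50)))
d$y <- d$x + rnorm(100, 0.2, 0.75)

# Named lookup vectors: one colour and one symbol per level
cols <- c(A = "steelblue", B = "tomato")
pchs <- c(A = 16, B = 17)

plot(d$x, d$y,
     col = cols[as.character(d$l)],
     pch = pchs[as.character(d$l)],
     xlab = "x", ylab = "y")
legend("topleft", legend = names(cols), col = cols, pch = pchs, bty = "n")
```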

 

Spatial analysis

If you are interested in the visualization of a geographical attribute then a map is probably what you need. R can be used as a fast, user-friendly and extremely powerful command-line Geographic Information System (GIS).

In R there is a large and growing number of spatial data packages. Here we will focus on rworldmap, a package for visualising global data referenced by country.

The package stores multiple maps, which can be accessed through the getMap function.

newmap <- getMap(resolution = "coarse")  
class(newmap)

Maps in R are classified as spatial (sp) objects. Spatial objects are made up of a number of different slots (that can be accessed through the @ operator):

  • bbox (bounding box, mostly used for setting up plots)
  • data (the attribute table, one row per feature)
  • polygons/lines/points/… (the geometry instructing R on how to plot maps)
  • proj4string (define the coordinate reference system)

Inside each slot you may have multiple components which, as usual, can be accessed with the $ operator.
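
The difference between @ and $ is easier to see on a small object. The class below is a toy, made up for illustration, that only mimics the slot layout described above; it is not the class used by sp or rworldmap, and the values are placeholders.

```r
library(methods)  # attached by default in interactive R, explicit for scripts

# Toy S4 class (hypothetical) with two slots, loosely shaped like a spatial object
setClass("ToyMap", representation(bbox = "matrix", data = "data.frame"))

toy <- new("ToyMap",
           bbox = matrix(c(0, 0, 10, 5), nrow = 2,
                         dimnames = list(c("x", "y"), c("min", "max"))),
           data = data.frame(NAME = c("A", "B"), VALUE = c(1, 2)))

slotNames(toy)   # which slots exist
toy@bbox         # a slot is reached with @
toy@data$NAME    # a component inside a slot is reached with $
```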

plot() is a generic function and it also works with spatial objects.

To add some information to this map we need some attribute at country level. The package rworldmap itself offers some interesting environmental datasets.

The package rworldmap provides a function to join country-level attributes to an internal map. All you need to do is provide the name of the column in your data that contains the join key (nameJoinColumn = "ISO3V10") and specify which type of country code that key holds (joinCode = "ISO3").

dat <- joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10")
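
Conceptually the join behaves much like a keyed merge between your table and the map's attribute table. The sketch below illustrates the idea with base merge() on two tiny made-up tables; it is not how rworldmap implements the join, and the score values are placeholders.

```r
# Hypothetical attribute table keyed by ISO3 code ("XXX" has no match on purpose)
attrs <- data.frame(ISO3V10 = c("ITA", "FRA", "XXX"), score = c(1.5, 2.5, 3.5))
# Hypothetical stand-in for the map's own attribute table
map_table <- data.frame(ISO3 = c("ITA", "FRA", "DEU"),
                        NAME = c("Italy", "France", "Germany"))

# Keep every map row; countries without data get NA
joined <- merge(map_table, attrs, by.x = "ISO3", by.y = "ISO3V10", all.x = TRUE)
joined
setdiff(attrs$ISO3V10, map_table$ISO3)  # codes in the data that failed to join
```

Checking the codes that failed to join is worth doing on real data too, since a typo in a country code silently leaves that country blank on the map.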

Function mapCountryData in rworldmap draws a map of country-level data, allowing countries to be coloured.

mapCountryData(dat, nameColumnToPlot="BIODIVERSITY")

biodiversity hot map

 

Using spatial data in R can be challenging because there are many types and formats, and many packages coming from diverse user communities. Still, there is a growing trend towards harmonization and the capabilities on offer are vast. A good start is the CRAN tutorial, or one of the many tutorials on GitHub.

Communicate results

Typically the findings of a data analysis are shared with an audience, and visual aids generally help people digest complex messages. In this context sizes, shapes, widths, labels, margins, fonts, etc. all become important because they help make the visualization clearer.

 

Additional graphical parameters

Where applicable, the plot function lets you specify many additional graphical parameters. For a list of them type ?par

Let’s take the histogram created before and clean it a bit with additional graphical parameters.

hist(dt$Sepal.Length, 
     nclass = 30, # number of bins
     probability = TRUE, 
     col="wheat", # color of bars
     border = "black", # color of border of bars
     xlab = "Sepal Length", # label of x axis
     ylab = "", # label of y axis
     main = "Iris Sepal Length density distribution" # title
     ) 

fit <- fitdistr(dt$Sepal.Length, "normal") # Maximum-likelihood fitting of univariate distributions
curve(dnorm(x, mean = fit$estimate["mean"], sd = fit$estimate["sd"]), add=TRUE, col = "red") # draws the fitted normal density on top of the histogram


# Also legends can be added

legend("topright", # position of legend box
       bty = "n", # box type = none
       legend = c("Observed", "theoretical normal"), # text to be displayed
       col = c("wheat", "red"), # colors
       lty = c(1,1), # line type 
       lwd = c(10, 1) # line width
       )

hist formatted
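
For the normal distribution the maximum-likelihood estimates have a closed form, so the fit can be checked by hand. The sketch below compares the two; variable names are arbitrary.

```r
library(MASS)  # for fitdistr()

x_obs <- iris$Sepal.Length
fit_n <- fitdistr(x_obs, "normal")

# Closed-form maximum-likelihood estimates for a normal distribution:
# the sample mean, and the standard deviation computed with denominator n
ml_mean <- mean(x_obs)
ml_sd <- sqrt(mean((x_obs - ml_mean)^2))

rbind(fitdistr = fit_n$estimate, by_hand = c(ml_mean, ml_sd))
sd(x_obs)  # slightly larger, because sd() divides by n - 1
```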

Ggplot

All functions used until now belong to the base plotting system. In R there are 3 different plotting systems available:

  • base
  • lattice
  • ggplot

ggplot is an implementation of the Grammar of Graphics by Leland Wilkinson (a set of principles for graphics). The grammar of graphics is a description of how graphics can be broken down into abstract components (much as a language is broken down into nouns, adjectives, etc.). This abstraction is a very powerful way to organize all kinds of graphics and has become extremely popular in recent years.

ggplot2, like lattice, is built upon the grid package, which gives low-level control over the details of the graphics system in R. This is why ggplot lets you produce a wide variety of visualizations for almost any need and purpose. For the same reason ggplot is typically the first choice for high-quality, publication-ready work in R.

Briefly, from the ggplot book,

the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.

Another key feature of ggplot graphics is that they are built in layers, which explains the plus sign (+) you will see in the code.
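
The layering idea can be sketched by building a plot one step at a time and looking at the intermediate objects. The variables and geoms chosen here are just an example on the iris data.

```r
library(ggplot2)

# Start from data plus aesthetic mappings: nothing is drawn yet
p0 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species))

# Each + adds one component on top of the previous object
p1 <- p0 + geom_point()                                  # layer 1: points
p2 <- p1 + geom_smooth(method = "lm", se = FALSE)        # layer 2: a linear fit per species
p3 <- p2 + labs(x = "Sepal Length", y = "Petal Length")  # labels are not a layer

length(p0$layers)  # no layers yet
length(p2$layers)  # one per geom added so far
p3                 # printing the object draws it
```

Because every intermediate step is an ordinary R object, you can keep a common base plot and reuse it with different layers.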

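# on a Windows machine the "Times" family has to be mapped to an installed font for ggplot;
# the call is guarded because windowsFonts() only exists on Windows
if (.Platform$OS.type == "windows") {
  windowsFonts(Times=windowsFont("TT Times New Roman"))
}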

gg1 <- ggplot(dt, aes(x = Sepal.Length, group = Species, fill = Species)) + # Sepal Length on the x-axis, grouped and filled (with color) by Species
  geom_density(alpha = .4) + # summarise sepal length as a density curve per group
  xlab("Sepal Length") + # set x-axis label
  ylab("") + # set y-axis label
  ggtitle("Sepal Length distributions by Species") + # set plot title
  guides(fill=guide_legend("Species")) + # set the title of the fill legend
  theme(plot.title = element_text(hjust = 0, vjust=5, size = 14, family = "Times"), # set position, size and font for title
        axis.text.x = element_text(size = 12, family = "Times"), # set size and font for x-axis tick labels
        axis.text.y = element_text(size = 12, family = "Times"), # set size and font for y-axis tick labels
        panel.background = element_rect(fill = "white") # set background color
        )
     
gg1 # printing a ggplot object is what draws it

ggplot density plot

 

Colours

A careful choice of colors can help to draw better visualizations. R has 657 built-in color names. Use colors() for a list of all colors known by R.
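
A few base functions are handy for exploring the built-in names; the colour used below is just an example.

```r
length(colors())                        # number of built-in colour names
grep("salmon", colors(), value = TRUE)  # names can be searched like any character vector
rgb_vals <- col2rgb("salmon2")          # red, green, blue components of a named colour
rgb_vals
hex <- rgb(t(rgb_vals), maxColorValue = 255)  # the same colour as a hex string
hex
```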

When we need to show a range of colors we can use palettes. In the map created before the palette was not specified so mapCountryData function used its default value (in that case a heat palette, with colors ranging gradually from yellow to red). We can customize palettes to our needs.

A reference package for color palettes is RColorBrewer. The function to create palettes is brewer.pal. It takes two arguments:

  • n –> Number of different colors in the palette, minimum 3, maximum depending on palette
  • name –> a palette name
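
As a short sketch of how the two arguments are used (the palette name is only an example), you can build a palette, look up its limits, and interpolate when you need more shades than it offers:

```r
library(RColorBrewer)

pal5 <- brewer.pal(n = 5, name = "Purples")  # five hex colours, light to dark
pal5

# Limits and type of a given palette are listed in brewer.pal.info
brewer.pal.info["Purples", ]

# When more shades are needed than the palette offers, interpolate between them
pal20 <- colorRampPalette(brewer.pal(9, "Purples"))(20)
length(pal20)
```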

To have a look at all available palettes you can use:

display.brewer.all(n=NULL, type="all") # diverging, sequential, qualitative

colours display 1

display.brewer.all(n=NULL, type="seq") # only sequential

colours display 2

For an interactive viewer of palettes you can visit this page.

# using output from RColorBrewer
mapCountryData(dat, nameColumnToPlot="BIODIVERSITY",
               colourPalette = brewer.pal(7, "Purples"))

biodiversity2

 

Graphical devices

Once your plot is complete you may want to export it for reporting purposes. There are many graphic devices in R. A graphic device is a place where you can make a plot appear:

  • a window on your computer (screen device)
  • a PDF file (file device)
  • a PNG or JPEG (file device)
  • a scalable vector graphics (SVG) file (file device)

When you make a plot in R it has to be “sent” to a specific graphic device. The most common destination is the screen. On Mac the screen device is launched with quartz(), on Windows with windows(), on Unix/Linux with x11().

In an interactive session, functions like plot(), hist() and ggplot() all draw on the screen device by default. If you want to send the graphics to a different device you have to:

  • explicitly launch a graphic device
  • call a plotting function to make a plot (note that if you are using a file device no plot will appear on the screen!)
  • annotate plot if necessary (add legends, etc.)
  • explicitly close the graphics device with dev.off()

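# save the ggplot as a PDF
pdf(file = "myplot.pdf")
print(gg1) # explicit print() so the plot is also drawn when the code is run with source() or inside a function
dev.off()

# save the ggplot as a PNG
png(filename = "myplot.png")
print(gg1)
dev.off()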
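
The four steps can also be wrapped in a small function so the device is always closed, even when the plotting code fails. This is a sketch only: save_pdf is a made-up helper, and the file is written to a temporary directory.

```r
# Hypothetical helper: draw into a PDF file and always close the device
save_pdf <- function(path, draw, width = 7, height = 5) {
  pdf(file = path, width = width, height = height)  # step 1: open the device
  on.exit(dev.off(), add = TRUE)                    # step 4: close it, even on error
  draw()                                            # steps 2-3: plot and annotate
  invisible(path)
}

out <- file.path(tempdir(), "sepal_hist.pdf")
save_pdf(out, function() {
  hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
  abline(v = median(iris$Sepal.Length), col = "red")
})
file.exists(out)  # the file is complete only after dev.off() has run
```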

R graphical capabilities are enormous and we have only scratched the surface. To get inspired, consider taking a tour of the R graph gallery.

 

That’s it for this module! If you have gone through all this code you should have learnt the basics of R graphical capabilities.

 

The post R Training – Data Visualization appeared first on SLOW DATA.
