R Training – Data Visualization
This is the fourth module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.
Graphics in data projects can be useful for several tasks including:
- understand data properties
- find patterns in data
- communicate results
First of all, let’s load some useful packages.
library(dplyr)
library(MASS)
library(rworldmap)
library(ggplot2)
library(RColorBrewer)
Understand data properties
We will start with some exploratory graphics to summarize data and highlight broad features. This is useful to explore basic questions and hypotheses, suggest modeling strategies and so on.
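x <- rnorm(100)
y <- x + rnorm(100, mean = 0.2, sd = 2)
# lab is an explicit factor: since R 4.0 data.frame() no longer converts strings to factors
df <- data.frame(lab = factor(LETTERS[1:7]), g = rgamma(7, shape = 100))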
plot() is a generic function to plot R objects. It is generic because it adapts to the input provided:
- if you provide a numeric vector, the default is to plot its values on the y axis against an integer index on the x axis
- if you provide two numeric vectors, the default is to plot the points determined by the (x, y) pairs (a scatterplot)
- if you provide a data frame with a factor and a numeric column, you get a boxplot of the numeric values for each factor level (with a single value per level each box collapses to a line)
plot(x) # values against integer index
plot(x, y) # scatterplot
plot(df) # one box per factor level
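To see what "generic" means in practice, here is a minimal sketch that gives plot() a method for a made-up class. The class name temp_series and its fields are hypothetical, chosen only for illustration; the mechanism itself (a function named plot.<class>) is standard S3 dispatch.

```r
# A hypothetical S3 class holding a short daily series (illustration only)
temps <- structure(
  list(day = 1:7, value = c(12.1, 13.4, 11.8, 14.2, 15.0, 13.7, 12.9)),
  class = "temp_series"
)

# plot() looks for a method named plot.<class> and calls it when it exists
plot.temp_series <- function(x, ...) {
  plot(x$day, x$value, type = "b", xlab = "Day", ylab = "Value", ...)
  invisible(x)
}

plot(temps)        # dispatches to plot.temp_series()
plot(temps$value)  # a plain numeric vector falls back to the default method
```

The same mechanism is what lets plot() draw data frames, factors and, later in this module, spatial objects.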
Let’s load some example data
data("iris")
dt <- iris
str(dt) # to get an idea of what kind of data you have read in
summary(dt) # to obtain a summary of data
Once you know your data is clean, you may want to explore some features in more detail.
dt_spec <- dt %>%
  group_by(Species) %>%
  summarise(Petal.Length = sum(Petal.Length))
plot(dt_spec)
R also has a dedicated function for bar charts, barplot(), but the input has to be provided in a slightly different way:
barplot(dt_spec$Petal.Length, names.arg = dt_spec$Species)
Looking at the summary we see that the minimum sepal length is 4.3, the maximum 7.9 and the median 5.8. The summary also reports the quartiles, but for a more thorough view of the distribution you should draw a histogram.
hist(dt$Sepal.Length)
hist(dt$Sepal.Length, nclass = 30) # more bins give a finer, less smoothed view
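The numbers behind a histogram can also be inspected without drawing anything. The sketch below uses iris directly (rather than the dt copy) so that it runs on its own; the variable names h and h30 are just illustrative.

```r
# Compute the histogram without drawing it
h <- hist(iris$Sepal.Length, plot = FALSE)
h$breaks  # bin boundaries chosen by the default rule
h$counts  # number of observations falling in each bin
sum(h$counts) == nrow(iris)  # every observation lands in exactly one bin

# A single number passed as breaks is only a suggestion:
# R picks "pretty" boundaries close to the requested count
h30 <- hist(iris$Sepal.Length, breaks = 30, plot = FALSE)
length(h30$counts)
```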
Another way to get a quick visualization of a distribution is to use boxplots.
boxplot(dt$Sepal.Length)
In this case we see clearly that:
- the bulk of the distribution (the central 50%) lies between about 5 and 6.5
- the maximum value excluding outliers is somewhere between 7.5 and 8.0
- the right tail is longer than the left tail
In R boxplots the box corresponds to the interquartile range (from the 25th to the 75th percentile) and the black line inside the box is the median. The lines extending vertically from the box (whiskers) show the spread outside the quartiles: by default each whisker reaches the most extreme observation lying within 1.5 times the interquartile range of the box. Observations beyond the whiskers, if any, are plotted as individual points (outliers).
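The values a boxplot is built from can be obtained with boxplot.stats(). The sketch below also reproduces the whisker limits by hand, using the default rule that a whisker stops at the most extreme observation within 1.5 times the box height; the names hinges, fence and inside are illustrative.

```r
s <- boxplot.stats(iris$Sepal.Length)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers (drawn individually); empty here

# The whisker limits can be reproduced by hand
hinges <- fivenum(iris$Sepal.Length)[c(2, 4)]
fence <- hinges + c(-1.5, 1.5) * diff(hinges)
inside <- iris$Sepal.Length[iris$Sepal.Length >= fence[1] &
                              iris$Sepal.Length <= fence[2]]
range(inside)  # matches s$stats[c(1, 5)]
```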
Find patterns
Usually it is a good idea to investigate relations using graphics since we are naturally prone to detect trends, relationships, etc. in a visual way.
When we talk about patterns in data we usually refer to relationships between two or more variables. Options to visualize two dimensions are:
- draw multiple boxplots in one window
- scatterplots
- etc.
To add a 3rd dimension one option is to use different colors, shapes, sizes, etc. (rather than using 3D graphics, which are typically hard to interpret).
Say we want to see if the sepal length distribution changes according to species.
# boxplot function supports formula (~) statements
boxplot(dt$Sepal.Length ~ dt$Species, col = "salmon2")
The hist() function does not support formula statements, but you can modify the global graphical parameters directly in order to split the graphical device into multiple slots. Before changing global parameters it is a good idea to save a copy of the original settings, so that you can easily go back to them once done with the plot.
parOriginal <- par(no.readonly = TRUE) # save a copy of the original graphical parameters
par(mfrow = c(2, 2)) # par can be used to set or query graphical parameters
hist(dt[dt$Species == "setosa", "Sepal.Length"], nclass = 30)
hist(dt[dt$Species == "virginica", "Sepal.Length"], nclass = 30)
hist(dt[dt$Species == "versicolor", "Sepal.Length"], nclass = 30)
hist(dt$Sepal.Length, nclass = 30) # full sepal length distribution
par(parOriginal) # restore the original graphical parameters
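The save-and-restore pattern can be wrapped in a small helper so that the settings are restored even if a plot fails. This is only a sketch: plot_hist_by_group is a hypothetical function name, and it relies on the fact that par() returns the previous values of the parameters it changes.

```r
# Hypothetical helper: one histogram per group, side by side
plot_hist_by_group <- function(values, groups, ...) {
  groups <- as.factor(groups)
  old_par <- par(mfrow = c(1, nlevels(groups)))  # par() returns the old settings
  on.exit(par(old_par), add = TRUE)              # restored even if hist() fails
  for (g in levels(groups)) {
    hist(values[groups == g], main = g, xlab = "", ...)
  }
  invisible(levels(groups))
}

plot_hist_by_group(iris$Sepal.Length, iris$Species)
par("mfrow")  # back to the value it had before the call
```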
Scatterplot
Let’s simulate some numbers and draw scatterplots.
# two normal populations, with mean 2 and 4 respectively
x_a <- rnorm(50, 2)
x_b <- rnorm(50, 4)
x <- c(x_a, x_b)
# another two normal populations, each correlated with one of the previous ones
y_a <- x_a + rnorm(50, 0.2, 0.5)
y_b <- x_b + rnorm(50, 0.2, 1)
y <- c(y_a, y_b)
# a variable to label the two populations
l <- c(rep("A", 50), rep("B", 50))
# a dataframe including x, y and l
df <- data.frame(x = x, y = y, l = l)
# scatterplot 2-d
plot(df$x, df$y)
# add a third dimension with colour (col needs a factor or colour names, not arbitrary strings)
with(df, plot(x, y, col = as.factor(l)))
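A colour-coded scatterplot is easier to read with a legend, and the group can be encoded with the plotting symbol as well. The sketch below regenerates similar data so that it runs on its own; the seed and the names grp and codes are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
x <- c(rnorm(50, 2), rnorm(50, 4))
y <- x + rnorm(100, 0.2, 0.75)
grp <- factor(rep(c("A", "B"), each = 50))

# A factor is used through its integer codes: level 1 gets colour/symbol 1, and so on
codes <- as.integer(grp)
plot(x, y, col = codes, pch = codes)
legend("topleft",
       legend = levels(grp),
       col = seq_along(levels(grp)),
       pch = seq_along(levels(grp)),
       bty = "n")
```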
Spatial analysis
If you are interested in the visualization of a geographical attribute then a map is probably what you need. R can be used as a fast, user-friendly and extremely powerful command-line Geographic Information System (GIS).
In R there is a large and growing number of spatial data packages. Here we will focus on rworldmap, a package for visualising global data referenced by country.
The package stores multiple maps, which can be accessed through the getMap() function.
newmap <- getMap(resolution = "coarse")
class(newmap)
Maps in R are classified as spatial (sp) objects. Spatial objects are made up of a number of different slots (that can be accessed through the @ operator):
- bbox (bounding box, mostly used for setting up plots)
- data (the attribute table, one row per feature)
- polygons/lines/points/… (the geometry instructing R on how to plot maps)
- proj4string (define the coordinate reference system)
Inside each slot you may have multiple components which, as usual, can be accessed with the $ operator.
plot() is a generic function, so it also works with spatial objects.
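The slots listed above can be inspected directly, and the map drawn with plot(). A minimal sketch, assuming rworldmap (and the sp package it builds on) is installed and that getMap() returns an sp object as described:

```r
library(rworldmap)

newmap <- getMap(resolution = "coarse")
slotNames(newmap)         # names of the slots of the spatial object
newmap@bbox               # bounding box: min/max of the coordinates
head(names(newmap@data))  # first few columns of the attribute table
plot(newmap)              # the plot method for sp objects draws the polygons
```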
To add some information to this map we need some attributes at country level. The package rworldmap itself offers some interesting environmental datasets.
The package rworldmap provides a function to join country-level attributes to an internal map. All you need to do is name the column of your data that contains the country key (nameJoinColumn = "ISO3V10") and state which kind of country code that key holds (joinCode = "ISO3").
dat <- joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10")
The mapCountryData() function in rworldmap draws a map of country-level data, colouring each country according to the chosen attribute.
mapCountryData(dat, nameColumnToPlot="BIODIVERSITY")
Using spatial data in R can be challenging because there are many types and formats, and many packages coming from diverse user communities. That said, the ecosystem is becoming more harmonized and the capabilities on offer are vast. A good start is the CRAN tutorial, or one of the many tutorials on GitHub.
Communicate results
Typically the findings of a data analysis are shared with an audience and in general visual aids help people to digest complex messages. In this context the sizes, shapes, widths, labels, margins, fonts, etc. are all things that become important because they can contribute to make the visualization clearer.
Additional graphical parameters
Where applicable, the plot function lets you specify many additional graphical parameters. To get a list of them type ?par
Let’s take the histogram created before and clean it a bit with additional graphical parameters.
hist(dt$Sepal.Length,
     nclass = 30, # number of bins
     probability = TRUE,
     col = "wheat", # color of bars
     border = "black", # color of border of bars
     xlab = "Sepal Length", # label of x axis
     ylab = "", # label of y axis
     main = "Iris Sepal Length density distribution" # title
)
fit <- fitdistr(dt$Sepal.Length, "normal") # Maximum-likelihood fitting of univariate distributions
curve(dnorm(x, mean = fit$estimate["mean"], sd = fit$estimate["sd"]), add = TRUE, col = "red") # Draws a curve corresponding to a function
# Also legends can be added
legend("topright", # position of legend box
       bty = "n", # box type = none
       legend = c("Observed", "theoretical normal"), # text to be displayed
       col = c("wheat", "red"), # colors
       lty = c(1, 1), # line type
       lwd = c(10, 1) # line width
)
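For the normal distribution the maximum-likelihood estimates returned by fitdistr() have a closed form, so they can be checked by hand. A short sketch using iris directly (the names x_obs, mle_mean and mle_sd are illustrative):

```r
library(MASS)

x_obs <- iris$Sepal.Length
fit <- fitdistr(x_obs, "normal")

# For the normal distribution the ML estimates have a closed form
n <- length(x_obs)
mle_mean <- mean(x_obs)
mle_sd <- sqrt(sum((x_obs - mle_mean)^2) / n)  # divides by n, not n - 1

fit$estimate                     # estimates returned by fitdistr()
c(mean = mle_mean, sd = mle_sd)  # same values computed by hand
sd(x_obs)                        # slightly larger: sd() divides by n - 1
```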
Ggplot
All functions used until now belong to the base plotting system. In R there are three main plotting systems available:
- base
- lattice
- ggplot
ggplot is an implementation of the Grammar of Graphics by Leland Wilkinson (a set of principles for graphics). The grammar of graphics describes how graphics can be broken down into abstract components (much as a language is broken down into nouns, adjectives, etc.). This abstraction is a very powerful way to organize all kinds of graphics and has become extremely popular in recent years.
ggplot2, like lattice, is built upon the grid package, which gives control over all details of the graphics system in R. This is why ggplot lets you produce a wide variety of visualizations for virtually any need and purpose. For the same reason ggplot is typically the first choice for high-quality, publication-ready work in R.
Briefly, from the ggplot book,
the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.
Another key feature of ggplot graphics is that they are built in layers, which explains the plus sign (+) you will see in the code.
# on a Windows machine the font family must be registered before ggplot can use it
if (.Platform$OS.type == "windows") {
  windowsFonts(Times = windowsFont("TT Times New Roman"))
}
gg1 <- ggplot(dt, aes(x = Sepal.Length, group = Species, fill = Species)) + # Sepal Length on the x-axis, group and fill (with color) by Species
  geom_density(alpha = .4) + # summarise sepal length as a density curve
  xlab("Sepal Length") + # set x-axis label
  ylab("") + # set y-axis label
  ggtitle("Sepal Length distributions by Species") + # set plot title
  guides(fill = guide_legend("Species")) + # title of the fill legend
  theme(plot.title = element_text(hjust = 0, vjust = 5, size = 14, family = "Times"), # position, size and font of the title
        axis.text.x = element_text(size = 12, family = "Times"), # size and font of the x-axis tick labels
        axis.text.y = element_text(size = 12, family = "Times"), # size and font of the y-axis tick labels
        panel.background = element_rect(fill = "white") # background color
  )
gg1 # a ggplot object is drawn when it is printed
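The layering idea is easier to see in a smaller sketch that builds a plot one step at a time and stores each intermediate object. The object names (p_base, p_points, p_full) and the choice of a linear smoother are illustrative assumptions, not part of the plot above.

```r
library(ggplot2)

# Data and aesthetic mappings only: no layer yet, so nothing would be drawn
p_base <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species))

# Each + adds one layer (or a label/theme component) on top of what is already there
p_points <- p_base + geom_point()
p_full <- p_points +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Sepal Length", y = "Petal Length")

length(p_base$layers)  # no layers
length(p_full$layers)  # points and smoother
print(p_full)
```

Because every step returns a complete plot object, intermediate versions can be kept and reused with different layers on top.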
Colours
A careful choice of colors can help to draw better visualizations. R has 657 built-in color names. Use colors() for a list of all colors known by R.
When we need to show a range of colors we can use palettes. In the map created before the palette was not specified so mapCountryData function used its default value (in that case a heat palette, with colors ranging gradually from yellow to red). We can customize palettes to our needs.
A reference package for color palettes is RColorBrewer. The function to create palettes is brewer.pal. It takes two arguments:
- n –> Number of different colors in the palette, minimum 3, maximum depending on palette
- name –> a palette name
To have a look at all available palettes you can use:
display.brewer.all(n=NULL, type="all") # diverging, sequential, qualitative
display.brewer.all(n=NULL, type="seq") # only sequential
For an interactive viewer of palettes you can visit this page.
# using output from RColorBrewer
mapCountryData(dat, nameColumnToPlot = "BIODIVERSITY", colourPalette = brewer.pal(7, "Purples"))
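A palette returned by brewer.pal() is simply a character vector of hex colour codes, so it can be passed to any function that accepts colours. A small sketch, with the palette name and sizes chosen arbitrarily:

```r
library(RColorBrewer)

pal <- brewer.pal(5, "Blues")  # five hex colour codes, light to dark
pal

# colorRampPalette() (base R, grDevices) interpolates when more shades are needed
pal_12 <- colorRampPalette(pal)(12)

# Use the palette in a base plot: one colour per bar
barplot(rep(1, length(pal_12)), col = pal_12, border = NA, axes = FALSE)
```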
Graphical devices
Once your plot is complete you may want to export it for reporting purposes. There are many graphics devices in R. A graphics device is something where you can make a plot appear:
- a window on your computer (screen device)
- a PDF file (file device)
- a PNG or JPEG (file device)
- a scalable vector graphics (SVG) file (file device)
When you make a plot in R it has to be “sent” to a specific graphics device. The most common destination is the screen. The screen device is launched with quartz() on Mac, windows() on Windows and x11() on Unix/Linux.
Functions like plot(), hist() and ggplot() all use the screen as the default device. If you want to send the graphics to a different device you have to:
- explicitly launch a graphic device
- call a plotting function to make a plot (note that if you are using a file device no plot will appear on the screen!)
- annotate plot if necessary (add legends, etc.)
- explicitly close the graphics device with dev.off()
# save the ggplot as PDF
pdf(file = "myplot.pdf")
print(gg1) # print() draws the plot even when the code runs inside a function or loop
dev.off()
# save the ggplot as PNG
png(filename = "myplot.PNG")
print(gg1)
dev.off()
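The same open, plot, close sequence works with base graphics. The sketch below writes two plots to one PDF in the session's temporary directory (the file name is a hypothetical choice); with the pdf() device each new plot starts a new page.

```r
out_file <- file.path(tempdir(), "iris_plots.pdf")  # hypothetical output path

pdf(file = out_file, width = 7, height = 5)  # size in inches
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")  # page 1
boxplot(Sepal.Length ~ Species, data = iris, ylab = "cm")    # page 2
dev.off()  # the file is only complete once the device is closed

file.exists(out_file)
```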
R's graphical capabilities are enormous and we have only scratched the surface. To get inspired, consider taking a tour of the R graph gallery.
That’s it for this module! If you have gone through all this code you should have learnt the basics of R graphical capabilities.
The post R Training – Data Visualization appeared first on SLOW DATA.