(This article was first published on **R – SLOW DATA**, and kindly contributed to R-bloggers)

This is the fourth module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Graphics in data projects can be useful for several tasks, for example to:

- understand data properties
- find patterns in data
- communicate results

First of all, let’s load some useful packages.

library(dplyr)
library(MASS)
library(rworldmap)
library(ggplot2)
library(RColorBrewer)

**Understand data properties**

We will start with some exploratory graphics to summarize data and highlight broad features. This is useful to explore basic questions and hypotheses, suggest modeling strategies and so on.

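x <- rnorm(100)
y <- x + rnorm(100, mean = 0.2, sd = 2)
# lab is stored as a factor explicitly (since R 4.0 strings are no longer converted to factors automatically)
df <- data.frame(lab = factor(LETTERS[1:7]), g = rgamma(7, shape = 100))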

plot( ) is a generic function to plot R objects. It is generic because it adapts to the input provided:

- if you provide a numeric vector, the default is to plot its values on the y axis against an integer index on the x axis
- if you provide two numeric vectors, the default is to plot the points determined by the (x, y) pairs (a scatterplot)
- if you provide a data frame with a factor and a numeric column, you get one boxplot per factor level (with a single value per level, as in df, each box collapses to a horizontal line)

plot(x) # values against integer index

plot(x, y) # scatterplot

plot(df) # one boxplot per level of lab
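To see how plot( ) adapts to its input, here is a minimal sketch. The object names (num, fac) are illustrative, and pdf(NULL) is used only so that nothing is written to disk when the code runs non-interactively.

```r
# Illustrative sketch: plot() picks a method based on the class of its input.
pdf(NULL)                      # throw-away device, no file is created

num <- rnorm(20)
fac <- factor(sample(c("a", "b", "c"), 20, replace = TRUE))

class(num)                     # "numeric"
class(fac)                     # "factor"

plot(num)                      # numeric vector: points against an integer index
plot(fac)                      # factor alone: bar chart of the level counts
plot(fac, num)                 # factor + numeric: one boxplot per level

# Methods registered for the plot generic in the loaded packages
plot_methods <- as.character(methods(plot))
head(plot_methods)

invisible(dev.off())
```

Running methods(plot) after loading more packages typically returns a longer list, which is why plot( ) also works on objects such as maps.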

Let’s load some example data

data("iris")
dt <- iris
str(dt) # to get an idea of what kind of data you have read in

summary(dt) # to obtain a summary of data

Once you know your data is clean you may want to explore some features in more detail.

dt_spec <- dt %>% group_by(Species) %>% summarise(Petal.Length = sum(Petal.Length))
plot(dt_spec)

R also has a dedicated function for barplots, barplot(), but the input has to be provided in a slightly different way:

barplot(dt_spec$Petal.Length, names.arg = dt_spec$Species)

Looking at the summary we see that the minimum sepal length is 4.3, the maximum 7.9 and the median 5.8. The summary reports other quantiles too, but for a more thorough view of the distribution you should draw a histogram.

hist(dt$Sepal.Length)

hist(dt$Sepal.Length, nclass = 30) # more bins give a finer-grained (less smoothed) view
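The binning behind a histogram can also be inspected as plain numbers. The sketch below calls hist() with plot = FALSE, which returns the computed bins without drawing anything; the object names h and h30 are illustrative.

```r
# Sketch: look at the bins hist() computes, without drawing the plot.
h <- hist(iris$Sepal.Length, plot = FALSE)    # default binning
h$breaks        # bin boundaries
h$counts        # number of observations in each bin

# Ask for more bins; the number is a suggestion, R picks "pretty" boundaries
h30 <- hist(iris$Sepal.Length, breaks = 30, plot = FALSE)
length(h30$counts)

# Every observation lands in exactly one bin
sum(h$counts) == nrow(iris)
```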

Another way to get a quick visualization of a distribution is to use boxplots.

boxplot(dt$Sepal.Length)

In this case we see clearly that:

- the bulk of the distribution (the central 50%) lies between about 5 and 6.5
- maximum value excluding outliers is somewhere between 7.5 and 8.0
- right tail is longer than left tail

In R boxplots the box corresponds to the interquartile range (from the 25th to the 75th percentile) and the black line inside the box is the median. The lines extending vertically from the box (whiskers) show the spread outside the quartiles: by default they reach the most extreme observations lying within 1.5 times the interquartile range from the box. Any points beyond the whiskers are plotted individually as outliers.
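The elements of a boxplot can be checked numerically with boxplot.stats(), the base function that computes them. This is a minimal sketch with illustrative variable names; it rebuilds the whisker limits by hand using the default coefficient of 1.5.

```r
# Sketch: the numbers behind boxplot(iris$Sepal.Length)
x  <- iris$Sepal.Length
bs <- boxplot.stats(x)         # coef = 1.5 by default

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers (drawn individually as outliers)

# Rebuild the whisker limits by hand from the hinges
box_height  <- bs$stats[4] - bs$stats[2]
lower_fence <- bs$stats[2] - 1.5 * box_height
upper_fence <- bs$stats[4] + 1.5 * box_height

# Whiskers end at the most extreme observations inside the fences
c(min(x[x >= lower_fence]), max(x[x <= upper_fence]))
```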

**Find patterns**

Usually it is a good idea to investigate relations using graphics since we are naturally prone to detect trends, relationships, etc. in a visual way.

When we talk about patterns in data we usually refer to relationships between two or more variables. Options to visualize two dimensions are:

- draw multiple boxplots in one window
- scatterplots
- etc.

To add a 3rd dimension one option is to use different colors, shapes, sizes, etc. (rather than using 3D graphics, which are typically hard to interpret).

Say we want to see if the sepal length distribution changes according to species.

# the boxplot function supports formula (~) statements
boxplot(dt$Sepal.Length ~ dt$Species, col = "salmon2")
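The formula interface also accepts a data argument, which avoids repeating the data frame name. The sketch below additionally keeps the value that boxplot() returns, so the per-group statistics can be read as numbers; b is an illustrative name and pdf(NULL) is there only to avoid creating a file.

```r
# Sketch: formula interface with a data argument
pdf(NULL)                                   # throw-away device
b <- boxplot(Sepal.Length ~ Species, data = iris,
             col = "salmon2", ylab = "Sepal length")
invisible(dev.off())

b$names     # group labels, in plotting order
b$n         # observations per group
b$stats     # one column of five numbers per group

# The group medians, computed directly
tapply(iris$Sepal.Length, iris$Species, median)
```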

The *hist()* function does not support the formula statement, but you can directly modify the global graphical parameters to split the graphical device into multiple slots. Before changing global parameters it is a good idea to save a copy of the original settings, so you can easily go back to them once done with the plot.

parOriginal <- par(no.readonly = TRUE) # save a copy of original graphical parameters
par(mfrow = c(2, 2)) # par can be used to set or query graphical parameters
hist(dt[dt$Species == "setosa", "Sepal.Length"], nclass = 30)
hist(dt[dt$Species == "virginica", "Sepal.Length"], nclass = 30)
hist(dt[dt$Species == "versicolor", "Sepal.Length"], nclass = 30)
hist(dt$Sepal.Length, nclass = 30) # full sepal length distribution

par(parOriginal) # restore the original graphical parameters
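When the same save-and-restore pattern lives inside a function, on.exit() can do the restoring automatically. The helper below, hist_by_group, is hypothetical (it is not part of any package) and is only a sketch of the idea.

```r
# Hypothetical helper: one histogram per group on a shared grid.
# par() settings are restored automatically when the function exits.
hist_by_group <- function(values, groups, nclass = 30) {
  lv  <- unique(as.character(groups))
  old <- par(mfrow = c(1, length(lv)))   # par() returns the previous values
  on.exit(par(old), add = TRUE)          # restore even if hist() fails
  for (g in lv) {
    hist(values[groups == g], nclass = nclass,
         main = g, xlab = "value")
  }
  invisible(lv)
}

pdf(NULL)                                # throw-away device
before <- par("mfrow")
shown  <- hist_by_group(iris$Sepal.Length, iris$Species)
after  <- par("mfrow")
invisible(dev.off())
```

Because the restore is registered with on.exit(), the layout goes back to its previous value even when one of the plotting calls stops with an error.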

**Scatterplot**

Let’s simulate some numbers and draw scatterplots.

# two normal populations, with mean 2 and 4 respectively
x_a <- rnorm(50, 2)
x_b <- rnorm(50, 4)
x <- c(x_a, x_b)
# another two normal populations respectively correlated with previous ones
y_a <- x_a + rnorm(50, 0.2, 0.5)
y_b <- x_b + rnorm(50, 0.2, 1)
y <- c(y_a, y_b)
# a variable to label the two populations
l <- c(rep("A", 50), rep("B", 50))
# a dataframe including x, y and l
df <- data.frame(x = x, y = y, l = l)

# scatterplot 2-d
plot(df$x, df$y)

# add a third dimension with colour (the label is converted to a factor so that it maps to colours)
with(df, plot(x, y, col = factor(l)))
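If you want to control which colour each group gets, one option is a named vector used as a lookup table, together with a legend. This is a sketch on freshly simulated data; the variable names and the colour choices are arbitrary.

```r
# Sketch with simulated data; the colour names are arbitrary choices.
set.seed(1)
grp <- rep(c("A", "B"), each = 50)
px  <- c(rnorm(50, 2), rnorm(50, 4))
py  <- px + rnorm(100, 0.2, 0.5)

# Named vector used as a lookup table: group label -> colour
cols <- c(A = "steelblue", B = "tomato")

pdf(NULL)                                   # throw-away device
plot(px, py, col = cols[grp], pch = 19,
     xlab = "x", ylab = "y")
legend("topleft", legend = names(cols), col = cols, pch = 19, bty = "n")
invisible(dev.off())
```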

**Spatial analysis**

If you are interested in the visualization of a geographical attribute then a map is probably what you need. R can be used as a fast, user-friendly and extremely powerful command-line Geographic Information System (GIS).

In R there is a large and growing number of spatial data packages. Here we will focus on *rworldmap*, a package for visualising global data referenced by country.

The package stores multiple maps, which can be accessed through the *getMap* function.

newmap <- getMap(resolution = "coarse")
class(newmap)

The map returned by *getMap* is a SpatialPolygonsDataFrame, one of the spatial object classes defined in the sp package. Spatial objects are made up of a number of different slots (that can be accessed through the @ operator):

- bbox (bounding box, mostly used for setting up plots)
- data (the attribute table)
- polygons/lines/points/… (the geometry instructing R on how to plot maps)
- proj4string (defines the coordinate reference system)

Inside each slot you may have multiple components which, as usual, can be accessed with the $ operator.

plot( ) is a generic function and it also works with spatial objects.
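As a quick illustration of the slots and of plotting a spatial object, here is a minimal sketch. It assumes rworldmap (and the sp package it attaches) is installed; world is just an illustrative name.

```r
library(rworldmap)       # attaches sp as well

world <- getMap(resolution = "coarse")

class(world)             # the spatial class of the map object
slotNames(world)         # names of the slots
world@bbox               # bounding box, accessed with the @ operator
dim(world@data)          # attribute table: one row per country polygon
head(names(world@data))  # first few attribute columns

pdf(NULL)                # throw-away device
plot(world)              # plot() dispatches to the method for spatial objects
invisible(dev.off())
```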

To add some information to this map we need some attributes at country level. The package rworldmap itself offers some interesting environmental datasets.

The package *rworldmap* provides a function to join country-level attributes to an internal map. All you need to do is provide the name of the column containing the join key (nameJoinColumn = "ISO3V10") and specify which type of country code that key uses (joinCode = "ISO3").

dat <- joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10")

Function *mapCountryData* in rworldmap draws a map of country-level data, allowing countries to be coloured.

mapCountryData(dat, nameColumnToPlot="BIODIVERSITY")
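The same two steps work with your own data, as long as it has a column of country codes. The sketch below uses a made-up data frame: the column names iso3 and score and all the values are hypothetical and serve only to show which argument refers to what.

```r
library(rworldmap)

# Made-up values for illustration only; `iso3` and `score` are hypothetical
# column names
my_data <- data.frame(
  iso3  = c("ITA", "FRA", "DEU", "ESP", "GBR"),
  score = c(3.2, 4.1, 5.0, 2.7, 4.4)
)

my_map <- joinCountryData2Map(my_data,
                              joinCode = "ISO3",        # type of country code
                              nameJoinColumn = "iso3")  # column holding it

pdf(NULL)                                   # throw-away device
mapCountryData(my_map, nameColumnToPlot = "score",
               mapTitle = "Hypothetical score")
invisible(dev.off())
```

joinCountryData2Map also reports how many codes were matched, which is a handy check that the join key is of the type you declared.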

Using spatial data in R can be challenging because there are many types and formats, and many packages coming from diverse user communities. Still, there is a growing trend towards harmonization and the capabilities on offer are vast. A good start is the CRAN tutorial, or one of the many tutorials on GitHub.

**Communicate results**

Typically the findings of a data analysis are shared with an audience, and in general visual aids help people to digest complex messages. In this context sizes, shapes, widths, labels, margins, fonts, etc. all become important because they help make the visualization clearer.

**Additional graphical parameters**

Where applicable, the plot function lets you specify many additional graphical parameters. For a list of them type ?par

Let’s take the histogram created before and clean it a bit with additional graphical parameters.

hist(dt$Sepal.Length,
  nclass = 30, # number of bins
  probability = TRUE, # plot densities instead of counts
  col = "wheat", # color of bars
  border = "black", # color of border of bars
  xlab = "Sepal Length", # label of x axis
  ylab = "", # label of y axis
  main = "Iris Sepal Length density distribution" # title
)
fit <- fitdistr(dt$Sepal.Length, "normal") # maximum-likelihood fitting of univariate distributions
# draw the curve of the fitted normal density on top of the histogram
curve(dnorm(x, mean = fit$estimate["mean"], sd = fit$estimate["sd"]), add = TRUE, col = "red")
# legends can be added too
legend("topright", # position of legend box
  bty = "n", # box type = none
  legend = c("Observed", "theoretical normal"), # text to be displayed
  col = c("wheat", "red"), # colors
  lty = c(1, 1), # line type
  lwd = c(10, 1) # line width
)

**Ggplot**

All functions used until now belong to the *base* plotting system. In R there are three main plotting systems:

- base
- lattice
- ggplot2

ggplot2 is an implementation of the Grammar of Graphics by Leland Wilkinson (a set of principles for graphics). The grammar of graphics describes how graphics can be broken down into abstract components (much as a language is divided into nouns, adjectives, etc.). This abstraction is a very powerful way to organize all kinds of graphics and has become extremely popular in recent years.

ggplot2, like lattice, is built upon the grid package, which gives control over all details of the graphics system in R. This is why ggplot2 lets you produce a wide variety of visualizations for virtually every need and purpose. For the same reason ggplot2 is typically the first choice for high-quality, publication-ready work in R.

Briefly, from the ggplot2 book:

"the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system."

Another key feature of ggplot2 graphics is that they are built in layers, which explains the plus sign (+) you will see in the code.

# on a Windows machine you need to map the font family for ggplot; windowsFonts() only exists on Windows
if (.Platform$OS.type == "windows") windowsFonts(Times = windowsFont("TT Times New Roman"))
gg1 <- ggplot(dt, aes(x = Sepal.Length, group = Species, fill = Species)) + # Sepal Length on x-axis, group and fill (with color) by Species
  geom_density(alpha = .4) + # summarise sepal length as a density distribution
  xlab("Sepal Length") + # set x-axis label
  ylab("") + # set y-axis label
  ggtitle("Sepal Length distributions by Species") + # set plot title
  guides(fill = guide_legend("Species")) + # set the title of the fill legend
  theme(
    plot.title = element_text(hjust = 0, vjust = 5, size = 14, family = "Times"), # position, size and font for title
    axis.text.x = element_text(size = 12, family = "Times"), # size and font for x-axis tick labels
    axis.text.y = element_text(size = 12, family = "Times"), # size and font for y-axis tick labels
    panel.background = element_rect(fill = "white") # background color
  )
gg1 # to plot ggplot plots you have to call them
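To make the idea of layers more concrete, the sketch below builds a plot one "+" at a time and counts the layers along the way. The object names (p0 to p3) are illustrative, and the choice of a scatterplot with a linear fit is only an example, not part of the plot above.

```r
library(ggplot2)

# Start from data + aesthetic mapping only: no geometry yet, so no layers
p0 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# Each "+" adds one component: a layer, a set of labels, a theme, ...
p1 <- p0 + geom_point(aes(colour = Species))            # layer 1: points
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ x,  # layer 2: linear fit
                       se = FALSE, colour = "grey30")
p3 <- p2 + labs(x = "Sepal length", y = "Petal length") + theme_minimal()

length(p0$layers)   # layers before adding any geom
length(p2$layers)   # layers after adding two geoms

p3                  # at top level, typing the name draws the plot
```

Labels and themes change how the plot looks but do not add layers, which is why p3 has as many layers as p2.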

**Colours**

A careful choice of colors can help to draw better visualizations. R has 657 built-in color names. Use colors() for a list of all colors known by R.
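A short sketch of how the built-in colour names can be explored from the console; all_cols is an illustrative name and "salmon2" is simply the colour used earlier for the boxplots.

```r
all_cols <- colors()          # character vector of built-in colour names
length(all_cols)              # how many names are available
head(all_cols)

# Filter by name, e.g. every variant of "salmon"
grep("salmon", all_cols, value = TRUE)

# Translate a name into its RGB components and a hex code
col2rgb("salmon2")
rgb(t(col2rgb("salmon2")), maxColorValue = 255)
```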

When we need to show a range of colors we can use palettes. In the map created before the palette was not specified so *mapCountryData* function used its default value (in that case a heat palette, with colors ranging gradually from yellow to red). We can customize palettes to our needs.

A reference package for color palettes is *RColorBrewer*. The function to create palettes is *brewer.pal*. It takes two arguments:

- n: number of different colors in the palette, minimum 3, maximum depending on palette
- name: a palette name
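A minimal sketch of the two arguments in action. brewer.pal.info is the lookup table shipped with RColorBrewer; the names pal, max_n, ramp and more are illustrative, and the interpolation step with colorRampPalette is an optional extra for when a palette offers fewer colours than you need.

```r
library(RColorBrewer)

pal <- brewer.pal(n = 7, name = "Purples")
pal                                   # seven hex colour codes

# brewer.pal.info lists, for each palette, its type and maximum size
brewer.pal.info["Purples", ]
max_n <- brewer.pal.info["Purples", "maxcolors"]

# For more colours than a palette offers, interpolate between its colours
ramp <- colorRampPalette(brewer.pal(max_n, "Purples"))
more <- ramp(20)
```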

To have a look at all available palettes you can use:

display.brewer.all(n=NULL, type="all") # diverging, sequential, qualitative

display.brewer.all(n=NULL, type="seq") # only sequential

For an interactive viewer of palettes you can visit this page.

# using output from RColorBrewer
mapCountryData(dat, nameColumnToPlot = "BIODIVERSITY", colourPalette = brewer.pal(7, "Purples"))

**Graphical devices**

Once your nice plot is completed you may want to export it for reporting purposes. There are many graphic devices in R. A graphic device is a place where you can make a plot appear:

- a window on your computer (screen device)
- a PDF file (file device)
- a PNG or JPEG (file device)
- a scalable vector graphics (SVG) file (file device)

When you make a plot in R it has to be “sent” to a specific graphic device. The most common destination is the screen. On Mac the screen device is launched with quartz(), on Windows with windows(), on Unix/Linux with x11().

Functions like plot(), hist() and ggplot() all use the screen as the default device. If you want to send the graphics to a device other than the screen you have to:

- explicitly launch a graphic device
- call a plotting function to make a plot (note that if you are using a file device no plot will appear on the screen!)
- annotate plot if necessary (add legends, etc.)
- explicitly close the graphics device with dev.off()

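# save the ggplot in PDF
pdf(file = "myplot.pdf")
print(gg1) # print() also works inside functions and loops, where typing gg1 alone draws nothing
dev.off()
# save the ggplot in PNG
png(filename = "myplot.PNG")
print(gg1)
dev.off()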
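The four steps listed above can be sketched with a base plot as follows. The file name and the size are arbitrary, and the file is written to a temporary directory so the example leaves no trace in your working folder.

```r
# Sketch: the file name and sizes below are arbitrary
out_file <- file.path(tempdir(), "myplot_sketch.pdf")

pdf(file = out_file, width = 7, height = 5)   # 1. open the device (inches)
dev_open <- dev.cur()                         #    the pdf device is now current
plot(iris$Sepal.Length, iris$Petal.Length,    # 2. draw; nothing shows on screen
     xlab = "Sepal length", ylab = "Petal length")
legend("topleft", legend = "iris", pch = 1,   # 3. annotate
       bty = "n")
invisible(dev.off())                          # 4. close: the file is complete

file.exists(out_file)
file.size(out_file)
```

Until dev.off() is called the file may be incomplete, so forgetting step 4 is a common reason for empty or unreadable output files.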

R graphical capabilities are enormous and we have only scratched the surface. To get inspired, consider taking a tour of the R graph gallery.

That’s it for this module! If you have gone through all this code you should have learnt the basics of R graphical capabilities.

The post R Training – Data Visualization appeared first on SLOW DATA.
