This post is based on the free and open source Creating-maps-in-R teaching resource for introducing R as a command-line GIS.
R is well known as an language ideally suited for data processing, statistics and modelling. R has a number of spatial packages, allowing analyses that would require hundreds of lines of code in other languages to be implemented with relative ease. Geographically weighted regression, analysis of space-time data and raster processing are three niche areas where R outperform much of the competition, thanks to community contributions such as spgwr, spacetime and the wonderfully straightforward raster packages.
What seems to be less well known is that R performs well as a self standing Geographical Information System (GIS) in its own right. Everyday tasks such as reading and writing geographical data formats, reprojecting, joining, subsetting and overlaying spatial objects can be easy and intuitive in R, once you understand the slightly specialist data formats and syntax of spatial R objects and functions. These basic operations are the basic foundations of GIS. Mastering them will make much more advanced operations much easier. Based on the saying ‘master walking before trying to run’, this mini tutorial demonstrates how to load and plot a simple geographical object in R, illustrating that the ease with which continuous and binned choropleth map color schemes can be created using ggmap, an extension of the popular ggplot2 graphics package. Crucially, we will also see how to join spatial and non spatial datasets, resulting in a map of where the Conservative party succeeded and failed in gaining council seats in the 2014 local elections.
As with any project, the starting point is to load the data we’ll be using. In this case we can download all the datasets from a single souce: the Creating-maps-in-R github repository which is designed to introduce R’s basic geographical functionality to beginners. We can use R to download and unzip the files using the following commands (from a Linux-based operating system). This ensures reproducibility:
# load the packages we'll be using for this tutorial x <- c("rgdal", "dplyr", "ggmap", "RColorBrewer") lapply(x, library, character.only = TRUE) # download the repository: download.file("https://github.com/Robinlovelace/Creating-maps-in-R/archive/master.zip", destfile = "rmaps.zip", method = "wget") unzip("rmaps.zip") # unzip the files
Once ‘in’ the folder, R has easy access to all the datasets we need for this tutorial. As this is about GIS, the first stage is to load and plot some spatial data: a map of London:
setwd("/home/robin/Desktop/Creating-maps-in-R-master/") # navigate into the unzipped folder london <- readOGR("data/", layer = "london_sport") ## OGR data source with driver: ESRI Shapefile ## Source: "data/", layer: "london_sport" ## with 33 features and 4 fields ## Feature type: wkbPolygon with 2 dimensions plot(london)
The data has clearly loaded correctly and can be visualised, but where is it? The
london is simply printed, a load of unreadable information is printed, including the coordinates defining the geographical extent of each zone and additional non-geographical attributes. The polymophic means that generic functions behave differently depending on the type of data they are fed. The following command, for example, is actually calling
mean.Date behind the scenes, allowing R to tell us that the the 2nd of July was half way through the year. The default
mean.default function does not work:
mean(as.Date(c("01/01/2014", "31/12/2014"), format = "%d/%m/%Y")) ##  "2014-07-02"
In the same way, we can use the trusty
summary function to summarise our R object:
summary(london) ## Object of class SpatialPolygonsDataFrame ## Coordinates: ## min max ## x 503571.2 561941.1 ## y 155850.8 200932.5 ## Is projected: TRUE ## proj4string : ## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 ## +y_0=-100000 +ellps=airy +units=m +no_defs] ## Data attributes: ## ons_label name Partic_Per Pop_2001 ## 00AA : 1 Barking and Dagenham: 1 Min. : 9.10 Min. : 7181 ## 00AB : 1 Barnet : 1 1st Qu.:17.60 1st Qu.:181284 ## 00AC : 1 Bexley : 1 Median :19.40 Median :216505 ## 00AD : 1 Brent : 1 Mean :20.05 Mean :217335 ## 00AE : 1 Bromley : 1 3rd Qu.:21.70 3rd Qu.:248917 ## 00AF : 1 Camden : 1 Max. :28.40 Max. :330584 ## (Other):27 (Other) :27
This has outputed some very useful information: the bounding box of the object, its coordinate reference system (CRS) and even summaries of the attributes associated with each zone.
nrow(london) will tell us that there are 33 polygons represented within the object.
To gain a fuller understanding of the structure of the
london object, we can use the
str function (but only on the first polygon, to avoid an extrememly long output):
str(london[1,]) ## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots ## ..@ data :'data.frame': 1 obs. of 4 variables: ## .. ..$ ons_label : Factor w/ 33 levels "00AA","00AB",..: 6 ## .. ..$ name : Factor w/ 33 levels "Barking and Dagenham",..: 5 ## .. ..$ Partic_Per: num 21.7 ## .. ..$ Pop_2001 : int 295535 ## ..@ polygons :List of 1 ## .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots ## .. .. .. ..@ Polygons :List of 1 ## .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots ## .. .. .. .. .. .. ..@ labpt : num [1:2] 542917 165647 ## .. .. .. .. .. .. ..@ area : num 1.51e+08 ## .. .. .. .. .. .. ..@ hole : logi FALSE ## .. .. .. .. .. .. ..@ ringDir: int 1 ## .. .. .. .. .. .. ..@ coords : num [1:63, 1:2] 541178 541872 543442 544362 546662 ... ## .. .. .. ..@ plotOrder: int 1 ## .. .. .. ..@ labpt : num [1:2] 542917 165647 ## .. .. .. ..@ ID : chr "0" ## .. .. .. ..@ area : num 1.51e+08 ## ..@ plotOrder : int 1 ## ..@ bbox : num [1:2, 1:2] 533569 156481 550541 173556 ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr [1:2] "x" "y" ## .. .. ..$ : chr [1:2] "min" "max" ## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots ## .. .. ..@ projargs: chr "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"
This shows us that the fundamental structure of a
SpatialPolygonsDataFrame is actually rather complicated. This complexity is useful, allowing R to store the full range of information needed to describe almost any polygon-based dataset. The
@ symbol in the structure represents slots which are specific to the S4 object class and contain specific pieces of information within the wider
london object. The basic slots within the london object are:
@data, which contains the the attribute data for the zones
@polygons, the geographic data associated with each polygon (this confusingly contains the
@Polygonsslot: each polygon feature can contain multiple
Polygons, e.g. if an administrative zone is non-contiguous)
@plotOrderis simply the order in which the polygons are plotted
@bboxis a slot associated with all spatial objects, representing its spatial extent
@proj4stringthe CRS associated with the object
Critically for exploring the attributes of
london is the
data slot. We can look at and modify the attributes of the subdivisions of
london easily using the
head(london@data) ## ons_label name Partic_Per Pop_2001 ## 0 00AF Bromley 21.7 295535 ## 1 00BD Richmond upon Thames 26.6 172330 ## 2 00AS Hillingdon 21.5 243006 ## 3 00AR Havering 17.9 224262 ## 4 00AX Kingston upon Thames 24.4 147271 ## 5 00BF Sutton 19.3 179767
Having seen his notation, many (if not most) R beginners will tend to always use it to refer to attribute data in spatial objects. Yet
@ is often not needed. To refer to the population of London, for example, the following lines of code yield the same result:
mean(london@data$Pop_2001) ##  217335.1 mean(london$Pop_2001) ##  217335.1
Thus we can treat the S4 spatial data classes as if they were regular data frames in some contexts, which is extremely useful for concise code. To plot the population of London zones on a map, the following code works:
cols <- brewer.pal(n = 4, name = "Greys") lcols <- cut(london$Pop_2001, breaks = quantile(london$Pop_2001), labels = cols) plot(london, col = as.character(lcols))
Now, how about joining additional variables to the spatial object? To join information to the existing variables, the join functions from dplyr (which replaces and improves on plyr) are a godsend. The following code loads a non-geographical dataset and joins an additional variable to
ldat <- read.csv("/home/robin/Desktop/Creating-maps-in-R-master/data/london-borough-profiles-2014.csv") dat <- select(ldat, Code, contains("Anxiety")) dat <- rename(dat, ons_label = Code, Anxiety = Anxiety.score.2012.13..out.of.10.) dat$Anxiety <- as.numeric(as.character(dat$Anxiety)) ## Warning: NAs introduced by coercion london@data <- left_join(london@data, dat) ## Joining by: "ons_label" head(london@data) # the new data has been added ## ons_label name Partic_Per Pop_2001 Anxiety ## 1 00AF Bromley 21.7 295535 3.20 ## 2 00BD Richmond upon Thames 26.6 172330 3.56 ## 3 00AS Hillingdon 21.5 243006 3.34 ## 4 00AR Havering 17.9 224262 3.17 ## 5 00AX Kingston upon Thames 24.4 147271 3.23 ## 6 00BF Sutton 19.3 179767 3.34
Plotting maps with ggplot
In order to plot the average anxiety scores across london we can use ggplot2:
lf <- fortify(london, region = "ons_label") ## Loading required package: rgeos ## rgeos version: 0.2-19, (SVN revision 394) ## GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 ## Polygon checking: TRUE lf <- rename(lf, ons_label = id) lf <- left_join(lf, london@data) ## Joining by: "ons_label" ggplot(lf) + geom_polygon(aes(long, lat, group = group, fill = Anxiety))
Using the skills you have learned in the above tutorial, see if you can replicate the graph below: the proportion of Conservative councilors selected in different parts of London. Hint: the data is contained in
ldat, as downloaded from here: http://data.london.gov.uk/dataset/london-borough-profiles.