The visualization of spatial data is one of the most popular applications when using R. This tutorial is an introduction to the visualization of spatial data and map making in R.
The following issues will be addressed:
- Import anduse of shapefiles in R (shapefile is a file format designed for spatial data).
- Data management and respectively restructuring of data
- Map Making
- Visualization by combining the spatial information of static maps from OpenStreetMap
As an example, a map of Kassel boroughs will be created, which illustrates the number of households by each zip code area. Such a map type is also known as Choropleth map.
In a further step, the stations of the Kassel bicycle sharing system “Konrad” will be added.
Import of shapefiles
The information with the map outlines are saved in a shapefile. Such a shapefile consists of at least three files:
- A file with the extension “shp” which contains all geometric information.
- A file with the extension “dbf” which contains the attributes of geometric figures.
- In our example, it contains the zip codes and the associated city names.
- An index file with the file extension “shx” – in order to distinguish geometric data according to any attributes. In our example, we use a dataset of all German postal code areas but require only those from Kassel city.
In R, a shapefile can be imported e.g. with the command readShapeSpatial by using package “maptools”.
We will receive an s4 object of the class Large SpatialPolygonsDataFrame. In comparison to lists, S4 classes provide a formal object structure and enable an object oriented workflow. These objects usually can be further indexed just like a data frame.
shapes <- readShapeSpatial(“plz/post_pl.shp”)
#subsetting only Kassel shapes
ks_shapes <- shapes[shapes$PLZORT99 == “Kassel”,]
The command plot(ks_shapes) visualizes the outlines in R.
Now wie Combine the map with further Information.
The dataset is from datamarket.com and handed to the department of Statistics. The data is given for neighborhoods, while the map is based on the postal code. Therefore, another data set is necessary to assign the appropriate neighborhoods to the postal codes. Such an assignment can be found on Wikipedia, for example (http://de.wikipedia.org/wiki/Kategorie:Stadtteil_von_Kassel).
We read the data into R with read.csv2. They look like follow:
districts_ks <- read.csv2(“data/mapping_postalcodes_districts_kassel.csv”,
stringsAsFactors = FALSE)
districts_data <- read.csv2(“data/districts_information.csv”, stringsAsFactors = FALSE)
District Households private.used.cars.per.1000.residents
To combine two different datasets with each other, it is necessary to unify the district variables. For example, in one data set we have got the description “Philippinenhof-Warteberg“, in the other one we have got the description „Philippinenhof/Warteberg“. In these small data sets, we change the names simply by indexing and assignment. (For larger datasets, a systematically data cleaning should be done with regular expressions, for example, to search for patterns in a given text.)
#correction of district names in districts_data
districts_data$District[districts_data$District == “Nord (Holland)”] <- “Nord-Holland”
districts_data$District[districts_data$District == “Philippinenhof/Warteberg”] <- “Philippinenhof-Warteberg”
districts_data$District[districts_data$District == “Süsterfeld/Helleböhn”] <- “Süsterfeld”
districts_data$District[districts_data$District == “Wolfsanger/Hasenhecke”] <- “Wolfsanger-Hasenhecke”
The two datasets are merged.
districts_data <- merge(districts_data, districts_ks, all.x = TRUE)
The dataset looks like:
It causes duplicate entries for district and postal code. As a next step, we will summarize the data so we get unique entries for the variable postal code. We sum the respective values of Households, and write the corresponding Districts into a string. For aggregation, we use the data management package dplyr which has its own grammar of data management. An important component is the Pipe Operator %>%. Originally from the magrittr package, function calls can be chained together with %>%, without calling the dataset manually every time.
#aggregation by postalcode
summarise(District = paste(unique(District),collapse = ” & “) , Households = sum(Households),
cars = sum(private.used.cars.per.1000.residents))
The resulting dataset looks like:
An inaccuracy caused by this method is that we do not know exactly how the population is distributed in the districts across the postal code areas. For example, how many of the 5.040 households in the district center accounts to the zip codes 34121 and on how many to 34117?
Thus, the data can be finally visualized on the map, it still needs to be divided into classes.
With the cut function, we create a factor that contains the appropriate classes. We assign names to the classes with the next command.
classes <- cut(data.agg$Households[position], c(0,6000,10000,15000,20000,25000))
levels(classes) <- c(“under 6000″, “6000 – 10000″, “10000 – 15000″, “15000 – 20000″,”20000 – 25000″)
We create a color vector that contains 5 different shades of green.
colours<- c(“#73AC97″, “#489074″, “#267356″, “#0E563B”, “#003924″)
To display the map, we use the plot function in base R. First, we set global parameters for the output device.
par(mar = c(0,0,1,9),oma = c(3,0,2,1),
bg = “lightgrey”,xpd = TRUE)
We define the inner margins of the plot with mar = c(0,0,1,9). We don’t want an inner margin below and left, above an inner of a row and on the right side, the legend should have with nine lines a slightly wider margin. With oma = c(3,0,2,1) we define the size of the outer edges. bg = „lightgrey“ specifies that our graphics gets a light gray background. By setting the Argument xpd = TRUE we ensure, that the legend is displayed within the inner frame.
To visualize the map, we can use the plot function, then we surround the postcode areas dark gray and pass our color vector as color, sorted by classes (Number of households in classes).
plot(ks_shapes,border = “darkgrey”, col = colours[classes])
The function text(coordinates(ks_shapes), labels =ks_shapes$PLZ99_N, col = “white”) adds labels to the areas in the form of postal codes.
A legend can be added with the function called legend() hinzugefügt werden.
legend(x = par(“usr”),y = par(“usr”), pch = 15, col = colours, legend = levels(classes),ncol = 1, bty = “n”, title = “Households”,
xjust = -3.4, yjust = -1.8)
With title(“Households in Kassel”), we can add a title of the plot
Integration of information from openstreetmap.de
The information about the coordinates of the stations of bike rental system Konrad come from a osm file which relates to the urban area of Kassel. The file itself is constructed in a xml structure.
To extract the Konrad coordinates from this file, there are two different ways:
osm parsen with the osmar package
The first way is to use the osmar package, which handles openstreetmap data. The commands osmsource_file and get_osm load the data and store the information into an osmar object.
src <- osmsource_file(“C:/Users/msc.EODA/workspace/geodaten_visualisierung/Kassel.osm”)
OBJ <- get_osm(complete_file(), source = src)
It should be noted that this is done for all stored information, not just for the desired Konrad stations. Accordingly, the process requires quite a lot of computing power and the resulting object is almost 40 MB large.
If you want to receive the corresponding nodes of the Konrad stations, search with find(OBJ, node(tags(v %agrep% “Konrad”))), the matching node IDs of the Konrad nodes and build an appropriate subset of the osmar object with subset(OBJ, node_ids = konrad_id) that only contains the Konrad node.
Osm parsen mit xpath
A more efficient method for our application is an xpath query. Xpath is a query language to get information from an xml object. The xml package provides a function environment in R.
With xmlparse (“Kassel.osm”) the document is read into R, next step is to send a query for the latitudes and longitudes and write the result directly into a data.frame.
stations <- data.frame(lat = xpathSApply(osm, path = “//node[tag[@v = ‘Konrad’]]/@lat”),
lon = xpathSApply(osm, path = “//node[tag[@v = ‘Konrad’]]/@lon”),
stringsAsFactors = FALSE)
We pass an xpath command as argument to path in order to search which nodes there are with a tag node which contains an attribute v, of which the value is Konrad and obtain an output value of the attribute lat from this node.
We add the Konrad stations as White dots with points(stations$lon, stations$lat, col = adjustcolor(“white”,0.4),pch = 19)
to the already existing plot and adjust the transparency of the dots with adjustcolor(“white”,0.4).
Finally, we have to add a reference.
legend(par(“usr”),par(“usr”), pch = 19, col = adjustcolor(“white”,0.4), legend = “KonradnStations”, bty = “n”,
xjust = -5.1, yjust = -4)
Finally, we have to add a reference.
mtext(“Sources:ndatamarket.comnhttp://arnulf.us/PLZnopenstreetmap.de”, side = 1, line = 1, outer = TRUE, cex = 0.7, col = “darkgrey”)
Use the mtext function to write text into the margins of a plot side = 1
describes the bottom side of the plot, line = 1
means the second line (the function begins counting at 0) outer = TRUE
writes the text to the outer margin, finally cex and col define the font size and font color.
The code and the corresponding data are available at GitHub.
The visualization of analytical results as well as the creation of high quality and interactive graphics with R is one of the subjects of the training program of the eoda R Academy.