Using R — Working with Geospatial Data (and ggplot2)


This is a follow-up blog-post to an earlier introductory post by Steven Brey: Using R: Working with Geospatial Data. In this post, we’ll learn how to plot geospatial data in ggplot2. Why might we want to do this? Well, it’s really about your personal taste. Some people are willing to forfeit the fine-grained control of base graphics in exchange for the elegance of a ggplot. The choice is entirely yours.

To get started, we'll need the ggplot2 package and some data! The dataset we'll look at is a set of shapefiles defining the watersheds in Washington state.

Loading libraries and data

# load libraries
library(ggplot2)
library(sp)
library(rgdal)
library(rgeos)

# create a local directory for the data
localDir <- "R_GIS_data"
if (!file.exists(localDir)) {
  dir.create(localDir)
}

# download and unzip the data
url <- "ftp://www.ecy.wa.gov/gis_a/hydro/wria.zip"
file <- paste(localDir, basename(url), sep='/')
if (!file.exists(file)) {
  download.file(url, file)
  unzip(file,exdir=localDir)
}

# create a layer name for the shapefiles (text before file extension)
layerName <- "WRIA_poly"

# read data into a SpatialPolygonsDataFrame object
dataProjected <- readOGR(dsn=localDir, layer=layerName)

Transforming the data

Thus far, we haven't done anything radically different than before, but in order to prepare the data for plotting with ggplot2, we'll have to make a couple of manipulations to the structure of the data. ggplot2 will only work with a data.frame object, so our object of class SpatialPolygonsDataFrame will not be appropriate for plotting. Let's write some code and discuss why this kind of transformation is necessary.

# add to data a new column termed "id" composed of the rownames of data
dataProjected@data$id <- rownames(dataProjected@data)

# create a data.frame from our spatial object
watershedPoints <- fortify(dataProjected, region = "id")

# merge the "fortified" data with the data from our spatial object
watershedDF <- merge(watershedPoints, dataProjected@data, by = "id")

# NOTE : If we so choose, we could have loaded the plyr library to use the
#      : join() function. For those familiar with SQL, this may be a more
#      : intuitive way to understand the merging of two data.frames. An
#      : equivalent SQL statement might look something like this:
#      : SELECT *
#      : FROM dataProjected@data
#      : INNER JOIN watershedPoints
#      : ON dataProjected@data$id = watershedPoints$id

# library(plyr)
# watershedDF <- join(watershedPoints, dataProjected@data, by = "id")

What does all this code mean and why do we need it? Let’s go through this line by line.

dataProjected@data$id <- rownames(dataProjected@data)

Here we are appending to the data an extra column called “id”. This column will contain the rownames so that we define an explicit relationship between the data and the polygons associated with that data.

watershedPoints <- fortify(dataProjected, region = "id")

Fortify? What does that even mean? A quick search on the internet will yield some helpful documentation (see the fortify.sp documentation). Basically, fortify takes two arguments: model, the SpatialPolygonsDataFrame object we wish to convert, and region, the name of the variable by which to split regions. If all goes according to plan, some magic happens and we get a data.frame, just like we wanted... well, not quite. If you inspect this data.frame, you'll notice it is missing some critical information: the attribute columns from the @data slot are gone. Fret not! Using the relationship we created earlier, we can merge the attributes back in, as shown below.
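If you want to see exactly what is missing before merging (a quick check that wasn't part of the original post), inspect the fortified data.frame:

# the coordinate and grouping columns are here, but the attribute columns
# from the @data slot (WRIA_NM, WRIA_AREA_, ...) are not
head(watershedPoints)
str(watershedPoints)

Now for the merge itself: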

watershedDF <- merge(watershedPoints, dataProjected@data, by = "id")

And voilà! Now that we've created a data.frame that ggplot2 likes, we can begin plotting. Before we get to that, let's take a quick look at this new data.frame.

head(watershedDF)

##   id    long     lat order  hole piece group WRIA_ID WRIA_NR WRIA_AREA_
## 1  0 2377934 1352106     1 FALSE     1   0.1       1      62     789790
## 2  0 2378018 1352109     2 FALSE     1   0.1       1      62     789790
## 3  0 2382417 1352265     3 FALSE     1   0.1       1      62     789790
## 4  0 2387199 1352434     4 FALSE     1   0.1       1      62     789790
## 5  0 2387693 1352452     5 FALSE     1   0.1       1      62     789790
## 6  0 2392524 1352623     6 FALSE     1   0.1       1      62     789790
##        WRIA_NM Shape_Leng Shape_Area
## 1 Pend Oreille     983140   3.44e+10
## 2 Pend Oreille     983140   3.44e+10
## 3 Pend Oreille     983140   3.44e+10
## 4 Pend Oreille     983140   3.44e+10
## 5 Pend Oreille     983140   3.44e+10
## 6 Pend Oreille     983140   3.44e+10

Your first ggplot

If you’re coming from base graphics, some of the syntax may appear intimidating, but’s it’s all part of the “grammar of graphics” after which ggplot2 is modeled. You’ll notice a graph is built layer by layer, beginning with the data and the mapping of data to “aesthetic attributes”. We’ll add “geoms” or geometric objects and perhaps we’ll compute some statistics. We may also want to adjust the scale or coordinate system. All this can be added in a very modular fashion; this is one of the key advantages to using ggplot2. So, enough talk, let’s make a plot!

ggWatershed <- ggplot(data = watershedDF, aes(x=long, y=lat, group = group,
                                              fill = WRIA_NM)) +
  geom_polygon()  +
  geom_path(color = "white") +
  scale_fill_hue(l = 40) +
  coord_equal() +
  theme(legend.position = "none", title = element_blank(),
        axis.text = element_blank())

print(ggWatershed)

[Figure: Washington watersheds, colored by watershed name]

Alright, so we have created our first ggplot. Looks pretty spiffy, right? We started by passing a data.frame to the function ggplot. From there, we added some aesthetic mappings: x and y are fairly self-explanatory, group = group identifies the groups of coordinates that belong to individual polygons, and fill = WRIA_NM assigns a color scale to the data based on the "WRIA_NM" column. Next, we added several "geoms", including polygons and paths. The polygons are the brightly colored shapes you see, and the path is the white outline around each shape. scale_fill_hue() changes the properties of the colors displayed, and theme() can be used to change a number of properties; in this case I chose not to display a legend or axis labels since they add very little to the plot. Lastly, coord_equal() fixes the aspect ratio between the horizontal and vertical scales.
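To make that modularity concrete, here is a small sketch (not from the original post): a ggplot object can be stored in a variable and extended one layer at a time.

# build a bare-bones plot first
ggBase <- ggplot(data = watershedDF, aes(x = long, y = lat, group = group,
                                         fill = WRIA_NM)) +
  geom_polygon()

# ...then bolt on outlines, a fixed aspect ratio, and theme tweaks later
ggStyled <- ggBase +
  geom_path(color = "white") +
  coord_equal() +
  theme(legend.position = "none", axis.text = element_blank())

print(ggStyled)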

Manipulating spatial objects

Now, let’s get some practice working with spatial objects. For this exercise, we will subset the data and observe watersheds in the Puget Sound region. For ease of access, we’ll cut out some the data we don’t particularly care about and rename some of the columns to be more descriptive.

# identify some interesting attributes
attributes <- c("WRIA_NR", "WRIA_AREA_", "WRIA_NM")

# subset the full dataset extracting only the desired attributes
dataProjectedSubset <- dataProjected[,attributes]

# assign these attributes of interest to more descriptive names
names(dataProjectedSubset) <- c("number", "area", "name")

# create a data.frame name (potentially different from layerName)
dataName <- "WRIA"

# reproject the data onto a "longlat" projection and assign it to the new name
assign(dataName,spTransform(dataProjectedSubset, CRS("+proj=longlat")))

# save the data
save(list=c(dataName),file=paste(localDir,"WAWRIAs.RData",sep="/"))

# inspect the watershed names
WRIA$name

##  [1] Pend Oreille            Upper Lake Roosevelt
##  [3] Nooksack                Kettle                 
##  [5] Okanogan                Upper Skagit           
##  [7] Methow                  San Juan               
##  [9] Colville                Sanpoil                
## [11] Lower Skagit - Samish   Middle Lake Roosevelt  
## [13] Lyre - Hoko             Chelan                 
## [15] Soleduc                 Stillaguamish          
## [17] Island                  Nespelem               
## [19] Quilcene - Snow         Elwha - Dungeness      
## [21] Foster                  Little Spokane         
## [23] Middle Spokane          Wenatchee              
## [25] Entiat                  Lower Spokane          
## [27] Lower Lake Roosevelt    Grand Coulee           
## [29] Kitsap                  Upper Crab-Wilson      
## [31] Skokomish - Dosewallips Moses Coulee           
## [33] Queets - Quinault       Hangman                
## [35] Palouse                 Upper Yakima           
## [37] Lower Chehalis          Kennedy - Goldsborough 
## [39] Lower Crab              Alkali - Squilchuck    
## [41] Chambers - Clover       Deschutes              
## [43] Naches                  Nisqually              
## [45] Upper Chehalis          Willapa                
## [47] Esquatzel Coulee        Middle Snake           
## [49] Cowlitz                 Lower Snake            
## [51] Lower Yakima            Grays - Elochoman      
## [53] Walla Walla             Klickitat              
## [55] Lewis                   Rock - Glade           
## [57] Wind - White Salmon     Salmon - Washougal     
## [59] Snohomish               Cedar - Sammamish      
## [61] Duwamish - Green        Puyallup - White       
## 62 Levels: Alkali - Squilchuck Cedar - Sammamish ... Wind - White Salmon

# save a subset including only regions surrounding the Puget Sound (as it
# turns out, this will be the first 19 entries)
PSWRIANumbers <- c(1:19)
WRIAPugetSound <- WRIA[WRIA$number %in% PSWRIANumbers,]

# plot Puget Sound watersheds to make sure this is approximately what we want
plot(WRIAPugetSound)

[Figure: quick base-graphics plot of the Puget Sound watershed subset]

# save the data
dataName <- "WRIAPugetSound"
save(list=c(dataName),file=paste(localDir,"WRIAPugetSound.RData",sep="/"))

Since the plot was simply a "sanity check" of sorts, I decided to use base graphics for a quick peek. Perhaps you should try making it into a ggplot; one way to do that is sketched below.
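This sketch (not part of the original post) reuses the id/fortify/merge pattern from above on the WRIAPugetSound object:

# prepare the Puget Sound subset for ggplot2
WRIAPugetSound@data$id <- rownames(WRIAPugetSound@data)
pugetSoundPoints <- fortify(WRIAPugetSound, region = "id")
pugetSoundDF <- merge(pugetSoundPoints, WRIAPugetSound@data, by = "id")

# these data are in a longlat projection, so coord_map() (which requires the
# mapproj package) keeps the shapes from looking squashed
ggPugetSound <- ggplot(data = pugetSoundDF, aes(x = long, y = lat,
                                                group = group, fill = name)) +
  geom_polygon() +
  geom_path(color = "white") +
  coord_map() +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text = element_blank())

print(ggPugetSound)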

Before wrapping up this exercise, let's transform the subsetted dataset so it's ready to be used by ggplot2.

# add to data a new column termed "id" composed of the rownames of data
dataProjectedSubset@data$id <- rownames(dataProjectedSubset@data)

# create a data.frame from our spatial object
watershedPointsSubset <- fortify(dataProjectedSubset, region = "id")

# merge the "fortified" data with the data from our spatial object
watershedSubsetDF <- merge(watershedPointsSubset, dataProjectedSubset@data,
                           by = "id")

More ggplots

Onto the more fun analysis. While this dataset is fairly limited in information, it does contain some numerical values that might be worth investigating. Let’s get some information regarding the area of Washington’s watersheds. You’ll find that ggplot2 makes it very easy to visualize this kind of information.

ggWatershedArea <- ggplot(data = watershedSubsetDF, aes(x=long, y=lat,
                                                        group = group, 
                                                        fill = area)) +
  geom_polygon()  +
  geom_path(color = "white") +
  scale_fill_gradient(breaks=c(500000,1000000,1500000),
                      labels=c("Low","Medium","High")) +
  coord_equal() +
  theme(axis.title = element_blank(), axis.text = element_blank()) +
  labs(title = "Area of Washington's Watersheds", fill = "Area")

print(ggWatershedArea)

[Figure: Washington watersheds colored by area]

That seemed pretty painless. All we had to do was set the fill aesthetic to the region area and add scale_fill_gradient() to define a continuous color scale; ggplot2 seemed to handle everything else for us. Notice in the previous ggplot that I set fill = WRIA_NM, the watershed name. That is a categorical variable, so the regions were filled with a discrete set of colors. In the script above, I passed fill = area, a continuous variable. If I had not added scale_fill_gradient(), I would have received the following error message: Error: Continuous value supplied to discrete scale. The default fill scale is discrete, so a continuous variable requires an explicit continuous scale such as scale_fill_gradient().
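If you would rather keep a discrete scale, one option (a sketch, not from the original post) is to bin the continuous area into categories with cut() and let ggplot2 treat the result as a factor; the break points below are arbitrary choices.

# bin area into three classes and plot with the default discrete fill scale
watershedSubsetDF$areaClass <- cut(watershedSubsetDF$area,
                                   breaks = c(0, 5e5, 1e6, Inf),
                                   labels = c("Low", "Medium", "High"))

ggWatershedAreaClass <- ggplot(data = watershedSubsetDF,
                               aes(x = long, y = lat, group = group,
                                   fill = areaClass)) +
  geom_polygon() +
  geom_path(color = "white") +
  coord_equal() +
  theme(axis.title = element_blank(), axis.text = element_blank()) +
  labs(title = "Area of Washington's Watersheds", fill = "Area")

print(ggWatershedAreaClass)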

Say, the way this ggplot colors the regions is actually kind of counterintuitive: light regions for larger areas and dark regions for smaller areas? Doesn't make much sense if you ask me. Let's do something about that. Since this is Washington, we'll don some UW spirit. Hope you like purple and gold! (Apologies to any WSU fans out there.)

ggWatershedAreaPurple <- ggWatershedArea + geom_path(color = "goldenrod1", 
                                                     size=1) +
  scale_fill_gradient(low = "plum1", high = "purple4",
                      breaks=c(500000,1000000,1500000),
                      labels=c("Low","Medium","High"))

print(ggWatershedAreaPurple)

[Figure: Washington watersheds colored by area, purple-and-gold version]

Okay, so the purple and gold might be a bit obnoxious, but you get the idea. Let’s just color the largest of the watersheds. Before we get to that, let’s inspect the data to determine exactly which one that is. You can already sort of guess based on the size and color of the regions, but let’s be sure.

# find the largest area
maxArea <- max(dataProjectedSubset$area)

# create a "mask" identifying the biggest area
biggestAreaMask <- which(dataProjectedSubset$area == maxArea)
biggestAreaName <- dataProjectedSubset$name[biggestAreaMask]
biggestAreaName

## [1] Lower Yakima
## 62 Levels: Alkali - Squilchuck Cedar - Sammamish ... Wind - White Salmon

# NOTE : The comparison dataProjectedSubset$area == maxArea produces a vector
#      : of logical values; which() then returns the integer indices of the
#      : TRUE entries. We use that index vector as a "mask" to subset the
#      : data.frame, a handy pattern when querying larger datasets.

# create a subset
biggestArea <- dataProjectedSubset[biggestAreaMask,]

So supposedly we have identified Washington's largest watershed. Let's make another ggplot to see what this watershed looks like. Remember, ggplot2 only likes data.frames, so we'll have to mess around with the data again.

# add to data a new column termed "id" composed of the rownames of data
biggestArea@data$id <- rownames(biggestArea@data)

# create a data.frame from our spatial object
biggestAreaPoints <- fortify(biggestArea, region = "id")

# merge the "fortified" data with the data from our spatial object
biggestAreaDF <- merge(biggestAreaPoints, biggestArea@data, by = "id")

If we’re going to do this kind of transformation every time we make a ggplot, perhaps we can make a method to reduce coding time. Alas, that will wait for another day. Now we plot!

ggBiggestArea <- ggplot(data = biggestAreaDF, aes(x=long, y=lat)) +
  geom_polygon(fill = "deepskyblue2")  +
  coord_equal() +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.text = element_blank()) +
  labs(title=paste(biggestAreaName, "\n Area=", maxArea, " (units)", sep=""))

print(ggBiggestArea)

[Figure: the largest watershed, Lower Yakima]

That seemed like a lot of trouble to go through just to find the largest area. Can we do that without creating an entirely new spatial object? Of course we can!

# create a "mask" identifying the biggest area
biggestAreaMask <- which(watershedSubsetDF$area == max(watershedSubsetDF$area))

# create a subset
biggestWatershed <- watershedSubsetDF[biggestAreaMask,]

ggBiggestAreaPlus <- ggplot(data = watershedDF, aes(x=long, y=lat,
                                                    group = group)) +
  geom_polygon()  +
  geom_polygon(data = biggestWatershed, fill ="deepskyblue2") +
  geom_path(color = "white") +
  coord_equal() +
  theme(legend.position = "none", title = element_blank(),
        axis.text = element_blank())

print(ggBiggestAreaPlus)

[Figure: all Washington watersheds, with the largest highlighted]

Using ggmap

The last thing I'll describe in this post is the purpose and use of ggmap. This package is used for visualizing spatial data on top of map imagery from services such as Google Maps, all within ggplot2. A quick example is provided below.

# load library
library(ggmap)

# reproject the data onto a "longlat" projection
subsetTransform <- spTransform(dataProjectedSubset, CRS("+proj=longlat"))

# determine the bounding box of the spatial object
b <- bbox(subsetTransform)

# get and plot a map
washingtonState <- ggmap(get_map(location = b, maptype = "satellite", zoom = 6))

subsetTransformFortified <- fortify(subsetTransform, region = "id")
subsetTransformFortified <- merge(subsetTransformFortified,
                                  subsetTransform@data, by = "id")

washingtonState + geom_polygon(data = subsetTransformFortified,
                               aes(x = long, y = lat, group = group,
                                   fill = name), alpha = 0.5) +
  scale_x_continuous(limits = c(b[1,1],b[1,2])) +
  scale_y_continuous(limits = c(b[2,1],b[2,2])) +
  theme(legend.position = "none", title = element_blank())

[Figure: ggmap satellite basemap with watershed polygons overlaid]

Final remarks

While these plots may look "nicer", ggplot2 has a couple of disadvantages. Perhaps the most glaring is the increase in computing time. Base graphics are built to be fast. Generally speaking, visualizing geospatial data is not the fastest process in the world, but using base graphics vs. ggplot2 can be the difference between 0.5 seconds and 10 seconds. So if you're producing graphics on-the-fly, stick with base graphics, but if you're looking to create publication-quality graphics, ggplot2 is certainly worth learning.
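If you want to see the difference on your own machine, a quick (if crude) comparison with system.time() will do; this sketch assumes the dataProjected and watershedDF objects created earlier in the post.

# time a base graphics plot of the full spatial object
system.time(plot(dataProjected))

# time the equivalent ggplot2 rendering (printing triggers the drawing)
system.time(print(ggplot(data = watershedDF,
                         aes(x = long, y = lat, group = group)) +
                    geom_polygon() +
                    coord_equal()))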

