Mazama Science has just released its first package on CRAN — MazamaSpatialUtils. Here is the description:
A suite of conversion scripts to create internally standardized spatial polygons dataframes. Utility scripts use these datasets to return values such as country, state, timezone, watershed, etc. associated with a set of longitude/latitude pairs. (They also make cool maps.)
In this post we discuss the reasons for creating this package and provide examples of its use.
At Mazama Science we often work with data that is geo-located:
- biological and chemical samples from streams
- seismic sensor data
- pollution monitoring data
- output from gridded atmospheric models
- forest management and geomorphology data
- national and state demographic and economic data
Using the sp package, all of these types of data can be plotted on maps or combined with other geospatial data to ask more detailed, more interesting questions. For an introduction to spatial data in R see Working With Geospatial Data or Working With Geopatial Data (and ggplot2).
The long term goal of the MazamaSpatialUtils package is to make it easier for us to work with GIS shapefile data we discover on the web as we create a library of interesting spatial datasets for use in R. This first release of the package addresses three specific issues:
- creating a scalable system for working with spatial data
- normalizing identifiers in spatial data
- finding spatial information based on a set of locations
Creating a scalable system
Shapefiles with high resolution features are by nature quite large. Working with the tz_world timezones dataset we see that a single timezone polygon, ‘America/Godthab’, takes up 4.1 Mb of RAM because of the highly detailed outline of the Greenland coast.
Spatial datasets are large and their conversion from shapefile to SpatialPolygonsDataFrame can be time consuming. In addition, there is little uniformity to the dataframe data found in these datasets. The MazamaSpatialUtils package addresses these issues in two ways:
- It provides a package state variable called SpatialDataDir which is used internally as the location for all spatial datasets.
- It defines a systematic process for converting shapefile data into .RData files with SpatialPolygonsDataFrames objects.
Spatial Data Directory
Users will want to maintain a directory where their .RData versions of spatial data reside. The package provides a setSpatialDataDir() function which sets a package state variable storing the location. Internally, getSpatialDataDir() is used whenever data need to be accessed. (Hat tip to Hadley Wickham’s description of Environments and package state.)
The package comes with several convert~() functions that download, convert and normalize shapefile datasets available on the web. Version 0.1 of the package has five such scripts that walk through the same basic steps with minor differences depending on the needs of the source data. With these as examples, users should be able to create their own convert~() functions to process other spatial data. Once converted and normalized, each dataset will benefit from other package utility functions that depend on the consistent availability and naming of certain columns in the @data slot of each SpatialPolygonsDataFrame.
The great thing about working with the shapefile format is that it is the defacto standard format for spatial data. We LOVE standards! Many shapefiles, but not all, also use the ISO 3166-1 alpha-2 character encoding for identifying countries, hereafter called the countryCode. However, there seems to be no agreement at all about what to call this encoding. We have seen ’ISO’, ‘ISO2′, ‘country’, ‘CC’ and many more. Even the ISOcodes package calls this column of identifiers ‘Alpha_2′ in one dataframe and ‘Country’ in another.
Of course there are many other spatial datasets that do not include a column with the countryCode. Sometimes it is because they use FIPS or ISO 3166-1 alpha-3 or some (non-standardized) version of the plain English name. Other times it is because the data are part of a national dataset and the country is assumed.
Wouldn’t it be nice if every spatial dataset you worked with was guaranteed to have a column named countryCode with the ISO 3166-1 alpha-2 encoding? We certainly think so!
The heart of spatial data ‘normalization’ in this package is the conversion of various spatial datasets into .RData files with guaranteed and uniformly named identifiers including at least:
- countryCode – ISO 3166-1 alpha-2
- stateCode – ISO 3166-2 alpha-2 (if appropriate)
Any datasets with timezone information will include:
- timezone – Olson timezone
The uniformity of identifiers in the spatial datasets makes it easy to generate maps with data from any dataset that uses standard ISO codes for countries or states.
Version 0.1 of this package is targeted to the following use case:
How can we determine the timezones associated with a set of locations?
Here is how we arrived at this question:
We are working with pollution monitoring data collected by sensors around the United States. Data are collected hourly and aggregated into a single annual dataset with a GMT time axis. So far so good. Not surprisingly, pollution levels show a strong diurnal signal so it is useful do identify measurements as being either during the daytime or nighttime. Luckily, the maptools package has a suite of ‘sun-methods’ for calculating the local sunrise and sunset if you provide a longitude, latitude and POSIXct object with the proper timezone.
Determining the timezone associated with a location is an inherently spatial question and can be addressed with a point-in-polygon query as enabled by the sp package. Once we enabled this functionality with a timezone dataset we realized that we could extract more metadata for our monitoring stations from other spatial datasets: country, state, watershed, legislative district, etc. etc. But we’re getting ahead of ourselves.
Here is an example demonstrating a search for Olson timezone identifiers:
library(MazamaSpatialUtils) # Vector of lons and lats lons <- seq(-120,-60,5) lats <- seq(20,80,5) # Get Olson timezone names timezones <- getTimezone(lons, lats) print(timezones)
 NA NA "America/Hermosillo"  "America/Denver" "America/Chicago" "America/Chicago"  "America/Toronto" "America/Toronto" NA  "America/Iqaluit" "America/Iqaluit" NA  "America/Godthab"
Additional information is available by specifying allData=TRUE:
# Get all information in the dataset timezoneDF <- getTimezone(lons, lats, allData=TRUE) print(timezoneDF)
timezone UTC_offset UTC_DST_offset countryCode longitude latitude 1 <NA> NA NA <NA> NA NA 2 <NA> NA NA <NA> NA NA 3 America/Hermosillo -7 -7 MX -110.96667 29.06778 4 America/Denver -7 -6 US -104.98417 39.74556 5 America/Chicago -6 -5 US -87.65000 41.86417 6 America/Chicago -6 -5 US -87.65000 41.86417 7 America/Toronto -5 -4 CA -79.38333 43.66083 8 America/Toronto -5 -4 CA -79.38333 43.66083 9 <NA> NA NA <NA> NA NA 10 America/Iqaluit -5 -4 CA -68.46667 63.74556 11 America/Iqaluit -5 -4 CA -68.46667 63.74556 12 <NA> NA NA <NA> NA NA 13 America/Godthab -3 -2 GL -51.73333 64.18639
They also make cool maps
Using spatial data to create location-specific metadata can be very rewarding. But it doesn’t satisfy our ever-present craving for eye candy. As long as we have all of this spatial data at our fingertips, let’s do something fun.
library(MazamaSpatialUtils) library(sp) # for spatial plotting # Remove Antarctica tz <- subset(SimpleTimezones, countryCode != 'AQ') # Assign timezone polygons an index based on UTC_offset colorIndices <- .bincode([email protected]$UTC_offset, breaks=seq(-12.5,12.5,1)) # Color timezones by UTC_offset plot(tz, col=rainbow(25)[colorIndices])
library(MazamaSpatialUtils) library(stringr) # Specify the directory for spatial data setSpatialDataDir('~/SpatialData') # Install NaturalEarthAdm1 if not already installed installSpatialData('NaturalEarthAdm1', adm=1) # Load the data loadSpatialData('NaturalEarthAdm1') # Vector of random lons and lats lons <- runif(1000,-180,180) lats <- runif(1000,-90,90) # Get country dataframe from SimpleCountries countryDF <- getCountry(lons, lats, allData=TRUE) # Determine which countries are involved (NA values indicate "over water") CC <- unique(countryDF$countryCode[!is.na(countryDF$countryCode)]) # Get state dataframe stateDF <- getState(lons, lats, countryCodes=CC , allData=TRUE) # Create Country-State names names <- paste(countryDF$countryName,'-',stateDF$stateName) # Display names that aren't "NA - NA" = "over water" names[names != 'NA - NA'][1:20]
 "Australia - Western Australia" "Antarctica - Antarctica"  "Argentina - Santa Fe" "NA - Sakha (Yakutia)"  "Iraq - Al-Anbar" "Algeria - Bordj Bou Arréridj"  "Antarctica - Antarctica" "China - Hubei"  "Yemen - Amran" "Russia - Chita"  "Antarctica - Antarctica" "Antarctica - Antarctica"  "Russia - Chita" "Russia - Omsk"  "China - Xinjiang" "China - Liaoning"  "Russia - Sakha (Yakutia)" "Russia - Krasnoyarsk"  "Antarctica - Antarctica" "Australia - Northern Territory"
Our plans for this package will depend upon project needs but we will certainly be adding convert~() functions for the administrative boundaries data from gadm.org and for some of the USGS water resources datasets.
We encourage interested parties to contribute convert~() functions for their own favorite spatial datasets. If they produce SpatialPolygonDataFrames that adhere to the package standards, we’ll include them in the next release.