This tutorial was prepared for the Ninth Annual Midwest Graduate Student Summit on Applied Economics, Regional, and Urban Studies (AERUS) on April 23rd-24th, 2016 at the University of Illinois at Urbana Champaign.
This notes illustrate the usage of
R for spatial econometric analysis. The theory is heavily borrowed from Anselin and Bera (1998) and Arbia (2014) and the practical aspect is an updated version of Anselin (2003), with some additions in visualizing spatial data on
What’s R and why use it?
R is a free, open-source, and object oriented language. Free and open-source means that anyone is free to use, redistribute and change the software in any way. Moreover, “R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.” (http://cran.r-project.org)
There are lot of software out there to do data analysis that are prettier and seem easier than R, so why should I invest learning R? There are in my opinion at least three characteristics of R that make it worthwhile learning it. First of all, it is free. Many fancier looking software used today are quite expensive, but R is free and is going to be always free. R also is a language, which means that you don’t only get to use the functions that are build in the software but you can create your own (just to get an on the of the power of the R language you can take a look Professor Koenker’s Quantile Regression package). The last reason is that R is extremely well supported. If you have a question you just can Google it, post it to StackOverflow or use R-blogger. If you are not convinced yet, just can type “why use the R language”” in Google and I think the results will speak by themselves. All these characteristics plus the fact that researchers at the frontier of the profession use
R as part of their research make
R a great tool for spatial data analysis.
Introduction to spatial analysis in R
Motivation for using spatial analysis
Probably the most important argument for taking a spatial approach is that the independence assumption between observation is no longer valid. Attributes of observation i may influence the attributes of observation j.
To illustrate a very small set of what can be achieved in
R for spatial analysis our running example will be examining violent crimes and foreclosures in the City of Chicago. The crime data as well as tracts shapefiles used here comes from Chicago Data Portal while foreclosures data come from the U.S. Department of Housing and Urban Development (HUD), all of these are public available.
R packages for spatial data analysis.
In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. As of April 2016, there were over 8,200 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package. (Wickham (2015))
Today we’ll focus on three packages
maptools (R. Bivand and Lewin-Koh (2016)),
spdep (R. Bivand and Piras (2015), R. Bivand, Hauke, and Kossowski (2013)) and
leaflet (Cheng and Xie (2015)) and use a fourth one, the
RColorBrewer package (Neuwirth (2014)) to make our plots more attractive. To install the packages, we simple have to type
install.packages("maptools", dependencies = TRUE) install.packages("spdep", dependencies = TRUE) install.packages("leaflet", dependencies = TRUE) install.packages("RColorBrewer", dependencies = TRUE)
maptools package has the functions needed to read geographic data, in particular ESRI shapefiles. The
spdep package on the other hand contains the functions that will use in the spatial data analysis. The dependencies option installs uninstalled packages which these packages depend on. To access the functions and data in the package we must first call it. The function to call a package is
## Loading required package: sp ## Loading required package: Matrix
## Checking rgeos availability: TRUE
Now that the packages are loaded we can read in our data.
Reading and Mapping spatial data in R
Spatial data comes in many “shapes” and “sizes”, the most common types of spatial data are:
- Points are the most basic form of spatial data. Denotes a single point location, such as cities, a GPS reading or any other discrete object defined in space.
- Lines are a set of ordered points, connected by straight line segments
- Polygons denote an area, and can be thought as a sequence of connected points, where the first point is the same as the last
- Grid (Raster) are a collection of points or rectangular cells, organized in a regular lattice
For more details, see R. S. Bivand, Pebesma, and Gomez-Rubio (2008). Today we’ll focus on Polygons. Spatial data usually comes in
shapefile. This type of files stores non topological geometry and attribute information for the spatial features in a data set. Moreover, they don’t require a lot of disk space and are easy to read and write. (ESRI (1998))
First we have to download and save the files in our computer from our server at http://www.econ.uiuc.edu/~lab/workshop/foreclosures/. I saved mine in my desktop. We’ve downloaded 3 files:
- Main file: foreclosures.shp
- Index file: foreclosures.shx
- dBASE table: foreclosures.dbf
The main file describes a shape with a list of its vertices. In the index file, each record contains the offset of the corresponding main file record from the beginning of the main file. The dBASE table contains feature attributes with one record per feature. (ESRI (1998))
To read data from a polygon shapefile into
R we use the function
readShapePoly that will create a
SpatialPolygonsDataFrame object. To learn more about the function, the command
?readShapePoly will give you access to the help file.
But before reading in the shapefile we first set our working directory. R is always pointing to a directory on your computer, to find which one use
getwd() (get working directory). To specify a working directory:
On windows you may have to use
\. The foreclosures folder contain the shapefiles that I downloaded before. Now we are ready to read in our shapefile.
chi.poly < - readShapePoly('foreclosures.shp')
the shapefile is now read and stored in an object called
chi.poly. To check that it is a
SpatialPolygonsDataFrame we can use the function
##  "SpatialPolygonsDataFrame" ## attr(,"package") ##  "sp"
SpatialPolygonsDataFrame objects brings together the spatial representations of the polygons with data. The identifying tags of the polygons in the slot are matched with the row names of the data frame to make sure that the correct data rows are associated with the correct spatial object. (R. S. Bivand, Pebesma, and Gomez-Rubio (2008))
This object has four “parts” or
slots: the first one is the data slot that contains the variables that will be used for our analysis; the second one is the polygon slot and contains the “shape” information. The third slot, bbox, is the bounding box drawn around the boundaries and the fourth slot is the proj4string which contains the projections. To access the data slot we can use the
slot function or the
@ symbol. For a compact look of the data slot we can use the
## 'data.frame': 897 obs. of 16 variables: ## $ SP_ID : Factor w/ 897 levels "1","10","100",..: 1 112 223 334 445 556 667 778 887 2 ... ## $ fips : Factor w/ 897 levels "17031010100",..: 1 2 3 4 5 6 7 8 9 10 ... ## $ est_fcs : int 43 129 55 21 64 56 107 43 7 51 ... ## $ est_mtgs : int 904 2122 1151 574 1427 1241 1959 830 208 928 ... ## $ est_fcs_rt: num 4.76 6.08 4.78 3.66 4.48 4.51 5.46 5.18 3.37 5.5 ... ## $ res_addr : int 2530 3947 3204 2306 5485 2994 3701 1694 443 1552 ... ## $ est_90d_va: num 12.61 12.36 10.46 5.03 8.44 ... ## $ bls_unemp : num 8.16 8.16 8.16 8.16 8.16 8.16 8.16 8.16 8.16 8.16 ... ## $ county : Factor w/ 1 level "Cook County": 1 1 1 1 1 1 1 1 1 1 ... ## $ fips_num : num 1.7e+10 1.7e+10 1.7e+10 1.7e+10 1.7e+10 ... ## $ totpop : int 5391 10706 6649 5325 10944 7178 10799 5403 1089 3634 ... ## $ tothu : int 2557 3981 3281 2464 5843 3136 3875 1768 453 1555 ... ## $ huage : int 61 53 56 60 54 58 48 57 61 48 ... ## $ oomedval : int 169900 147000 119800 151500 143600 145900 153400 170500 215900 114700 ... ## $ property : num 646 914 478 509 641 612 678 332 147 351 ... ## $ violent : num 433 421 235 159 240 266 272 146 78 84 ... ## - attr(*, "data_types")= chr "C" "C" "N" "N" ...
The first part is the
data.frame part that contains the data for our analysis. We can see the variables contained in the data portion of the file including:
- est_fcs: estimated count of foreclosure starts from Jan. 2007 through June 2008
- est_mtgs: estimated number of active mortgages from Jan. 2007 through June 2008
- est_fcs_rt: number of foreclosure starts divided by number of mortgages times 100
- bls_unemp: June 2008 place or county unemployment rate
- totpop: total population from 2000 Census
- violent: number of violent crimes reported between Jan. 2007 through December 2008
- property: number of property crimes reported between Jan. 2007 through December 2008
We can also get summary statistics of the variables of interest using the function summary. For example,
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0 49.0 103.0 169.3 226.0 1521.0
Here we accessed the data portion of the shapefile with the
@ sign and then the variable with the
$ sign. To see the other slots, we can proceed in the same fashion.
A nice feature of the class of spatial objects is that we can use the traditional plotting features of
R. The following command give’s us a plot of Chicago’s Census Tracts:
but we can go a step further and make nicer plots using the
leaflet package. This generates an interactive map that can be rendered on HTML pages.
library(leaflet) leaflet(chi.poly) %>% addPolygons(stroke = FALSE, fillOpacity = 0.5, smoothFactor = 0.5) %>% addTiles() #adds a map tile, the default is OpenStreetMap