Aggregating spatial points by clusters

[This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With ubiquitous collection devices (e.g. smartphones), having too much
data may become an increasingly common problem for spatial analysts,
even with increasingly powerful computers. This is ironic,
because a few short decades ago, too little data was a primary constraint.

This tutorial builds on the 'Attribute joins' section of the
Creating maps in R tutorial
to demonstrate how clusters can be identified
in a field of spatial points and then used as the basis of aggregation
to reduce the total number of points. It is also available
on rpubs.
As well as providing a useful way of manipulating data,
the tutorial also demonstrates the creation of a multi-layered map
with R's base graphics.

plot of chunk Aggregated data for culters

Load the data

So, starting from within the project's folder (which can be downloaded
from here),
lets start by reaffirming our starting point:
transport points in London:

library(sp)  # add spatial package to load the S4 spatial objects
load("data/stations.RData")  # load 731 station points
class(stations)  # double-check what type of data we have

## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

load("data/lnd.RData")  # load London zones for context

plot of chunk Plot of the point data

Plotting density

From the above plot it seems that these points are quite evenly distributed, with
some minor clusters. How can we identify these? A common option is to
convert the point data into a 2d continuous field of point density.
In order to do this,
and then create density contours, we will first convert them into the ppp
class. This allows us to use the powerful functions of the spatstat
package to be used on the points.

library(spatstat)  # to calculate field of point density

## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.7-28. For overview type 'help("mgcv-package")'.
## Loading required package: deldir
## deldir 0.0-22
## spatstat 1.32-0     (nickname: 'Logistical Nightmare') 
## For an introduction to spatstat, type 'beginner'

library(maptools)  # to convert to point pattern

## Checking rgeos availability: TRUE

sSp <- as(SpatialPoints(stations), "ppp")  # convert points to pp class
Dens <- density(sSp, adjust = 0.2)  # create density object
class(Dens)  # just for interest: it's got it's of pixel image class

## [1] "im"

plot(Dens)  # default plot for density

plot of chunk Density plot

The density plot illustrates that there are some areas of relatively
high density close to the centre of London. This is as we'd expect.
Another way to represent this density information is with contours of
equal value, similar to those used in topographic maps to simultaneously
represent slope (the closeness of the lines) and altitude.
spatstat enables simple plotting of countours with the sensibly named
contour function.

contour(density(sSp, adjust = 0.2), nlevels = 4)  # plot as contours - this is where we're heading

plot of chunk Contour plot

In the above image, it is the "islands" of high dot density that we are most
interested in. Having successfully visualised these, the next stage is to
save them into R's native spatial data format, before extracting the polygons.
These polygons will be used to aggregate the points.

Save the contour lines

To extract the contours, first as lines and then finally as polygons,
we need to convert the data into two more formats: a spatial grid and
then as a raster image. Both of these file formats are supported by the
basic sp package.

Dsg <- as(Dens, "SpatialGridDataFrame")  # convert to spatial grid class
Dim <- as.image.SpatialGridDataFrame(Dsg)  # convert again to an image
Dcl <- contourLines(Dim, nlevels = 8)  # create contour object - change 8 for more/fewer levels
SLDF <- ContourLines2SLDF(Dcl, CRS(proj4string(lnd)))  # convert to SpatialLinesDataFrame
plot(SLDF, col = terrain.colors(8))

plot of chunk unnamed-chunk-1

Extract the density polygons

Now the lines have been saved as the SLDF object, it is
time to convert the level we are are interested in into polygons.
This is quite a tricky task and can involve many steps using the basic
sp package (see from code chunk 60 onwards here).
Fortunately, the wonderful rgeos provides a single line solution, with the
intuitively named gPolygonize function.


## rgeos version: 0.2-19, (SVN revision 394)
##  GEOS runtime version: 3.3.8-CAPI-1.7.8 
##  Polygon checking: TRUE

Polyclust <- gPolygonize(SLDF[5, ])
gas <- gArea(Polyclust, byid = T)/10000
Polyclust <- SpatialPolygonsDataFrame(Polyclust, data = data.frame(gas), match.ID = F)

plot of chunk Polygons representing areas of high dot density

Aggregate the points within high density zones

The next task is to aggregate the points within each high density zone.
This is done automatically with base R's aggregate function. Note that
when the sp package is installed, aggregate behaves differently if it is provided
with spatial data as its input, outputting spatial data with aggregate statistics for the
specified variable (or indeed all variables).
Now summarise the data for each of the polygons.

cAg <- aggregate(stations, by = Polyclust, FUN = length)
# lb <- gBoundary(lnd)
plot(Dens, main = "")
plot(lnd, border = "grey", lwd = 2, add = T)
plot(SLDF, col = terrain.colors(8), add = T)
plot(cAg, col = "red", border = "white", add = T)
graphics::text(coordinates(cAg) + 1000, labels = cAg$CODE)

plot of chunk Aggregated data for culters

Save points inside and outside polygons

In terms of visualisation, our work here is done.
But that is not enough for most projects: we need to export the
results. Remember from the beginning that the aim was to reduce the
number of points we had to deal with? Well it is in this stage
that we find out how many points we've removed by aggregation
and save only the points lying outside the aggregation zones.

sIn <- stations[cAg, ]  # select the stations inside the clusters
sOut <- stations[!row.names(stations) %in% row.names(sIn), ]  # stations outside the clusters
plot(sOut)  # the more sparsely distributed points - notice the 'holes' of low density
plot(cAg, border = "red", lwd = 3, add = T)

plot of chunk Points outside the clusters

The plot shows that we've reduced the number of points, by removing those in
areas of high density. But how much space have saved by removing closely clustered points?

nrow(sIn)/nrow(stations)  # proportion of points in cluster

## [1] 0.3406


## [1] 0.1178

The results of the final commands show that we have squeezed out 1/3
of the points from only only of the area. This is not a huge saving
because the points are quite evenly distributed. For more highly clustered
datasets, however, the savings could be vast.


We have seen how to create polygons of high point density in R and then,
in relatively few lines of code, we have identified and aggregated the
points that fit into these high density zones. This has applications for
reducing data redundancy in highly clustered datasets. But the methods
shown here have many other applications, including:

  • Objective identification of zones to "target" based on point data
  • Visualisation of zones of interest from the point data
  • Provision of spatial data on which to base a stratified spatial sampling strategy

This final application could be of interest if one is trying to identify if the
density of certain points of pollutants (e.g. petrol stations) related to
other variables (e.g. species diversity). By creating zones of low density
and high density, and then taking measurements from each zone type, one could
test the hypothesis that the former has an impact on the latter.

Other applications could be found for the method. The take home message is
that R has powerful spatial capabilities that would be difficult to implement
in conventional GIS software. By usin scripts, we can ensure that the code to create the results of
such an analysis are reproducible and can be used by others.
To this end, all the example code and data used in this example are reproducible -
see the vignettes folder of the "Creating-maps-in-R"
GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: Robin Lovelace - R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)