Geographic clustering of UK cities

Posted on November 23, 2015 by Adventures in Data

[This article was first published on Adventures in Data, and kindly contributed to R-bloggers.]


I know I am probably late to this party, but I recently found out about DBSCAN, or “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”[^1]. In a nutshell, the algorithm visits each data point in turn and asks whether neighbouring points are density-reachable. In other words, is it possible to connect two points with a chain of points that all conform to some density criterion? This has some major advantages over other clustering algorithms that I have used before:

It can identify clusters of arbitrary shape.
The number of clusters is not an input parameter.
It's fast, as it only visits the data points rather than the space in between.
A data point with no close neighbours is labelled as noise rather than being assigned to its nearest cluster.
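
To make the density-reachable idea concrete, here is a minimal base R sketch of the algorithm. It is an illustration only, not the implementation in the dbscan package: the function name dbscan_sketch and its arguments are made up for this example, and it builds a full distance matrix, so it only suits small data.

```r
# Minimal DBSCAN sketch in base R (illustrative only; use dbscan::dbscan in practice).
# x: numeric matrix of points, eps: neighbourhood radius,
# min_pts: points (including the point itself) needed within eps to be a core point.
dbscan_sketch <- function(x, eps, min_pts = 5) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbours) >= min_pts

  cluster <- integer(n)                        # 0 = noise / not yet assigned
  current <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !is_core[i]) next
    current <- current + 1L                    # start a new cluster at an unassigned core point
    cluster[i] <- current
    queue <- neighbours[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next
      cluster[j] <- current                    # j is density-reachable from i
      if (is_core[j]) queue <- c(queue, neighbours[[j]])  # only core points extend the chain
    }
  }
  cluster
}

# Two tight blobs and one isolated point (toy data, not the UK cities)
pts <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1, 0), c(0, 0.1, 0, -0.1, 0.2)),
  cbind(c(5, 5.1, 5.2, 5.1, 5), c(5, 5.1, 5, 4.9, 5.2)),
  c(10, -10)
)
labels <- dbscan_sketch(pts, eps = 0.5, min_pts = 3)
print(labels)
```

Each blob gets its own label without the number of clusters being specified, and the isolated point keeps the label 0, which is the noise label.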

Let's have a go at clustering UK cities from library(maps). First load the packages and the data, then subset the data to get only the UK cities.

library(ggplot2)
library(dplyr)
library(maps)
library(dbscan)

data("world.cities")
UK <- world.cities %>% filter(country.etc == "UK")

Now we can run the algorithm on the latitude and longitude columns. Then we can pull the cluster assignments out of the resulting object.

EPS <- 0.15
clusters <- dbscan(select(UK, lat, long), eps = EPS)
UK$cluster <- clusters$cluster
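
One thing worth keeping in mind is that eps here is measured in degrees, and a degree of longitude covers less ground than a degree of latitude this far north. The base R sketch below puts rough numbers on that. The helper haversine_km is written for this example, and the reference point (54 degrees N, 2 degrees W) and the spherical Earth radius of 6371 km are assumptions chosen as a rough stand-in for the middle of the UK.

```r
# Great-circle distance in kilometres between points given in decimal degrees.
# Assumes a spherical Earth with a radius of 6371 km.
haversine_km <- function(lat1, long1, lat2, long2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

eps <- 0.15
north_south_km <- haversine_km(54, -2, 54 + eps, -2)  # eps degrees of latitude
east_west_km   <- haversine_km(54, -2, 54, -2 + eps)  # eps degrees of longitude at 54 degrees N
round(c(north_south = north_south_km, east_west = east_west_km), 1)
```

So an eps of 0.15 corresponds to roughly 17 km north-south but only about 10 km east-west. If that distortion matters, one option is to compute a matrix of great-circle distances and pass it to dbscan() as a dist object instead of the raw coordinates.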

Finally, we can split the original data into two according to whether dbscan has assigned a point to a cluster or labelled it as noise (cluster 0).

groups <- UK %>% filter(cluster != 0)
noise <- UK %>% filter(cluster == 0)

Now let's have a look at the results[^2].

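# theme_stripped and theme_empty are my own theme objects (see footnote 2) and
# are not defined in this post; drop those two lines or use theme_void() instead.
# Constant colour and alpha are set outside aes() so they are not treated as mappings.
ggplot(UK, aes(x = long, y = lat)) +
  geom_point(data = noise, colour = "grey", alpha = 0.5) +
  geom_point(aes(colour = as.factor(cluster)), data = groups,
             size = 3, alpha = 0.5) +
  coord_map() +
  theme_stripped +
  theme_empty +
  theme(legend.position = "none")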
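
Since theme_stripped and theme_empty are not defined in this post, the plot code will not run as written. The definitions below are hypothetical stand-ins, guessed from the description in the footnote rather than copied from the real objects, so treat them as one plausible way to get a similar look.

```r
library(ggplot2)

# Hypothetical stand-in: remove the panel background and the gridlines.
theme_stripped <- theme(
  panel.background = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank()
)

# Hypothetical stand-in: additionally remove axis titles, labels, ticks and lines.
theme_empty <- theme_stripped + theme(
  axis.title = element_blank(),
  axis.text  = element_blank(),
  axis.ticks = element_blank(),
  axis.line  = element_blank()
)

# Quick check on a toy data frame
p <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) +
  geom_point() +
  theme_stripped +
  theme_empty
```

With these two objects in the workspace, the plotting code above should run unchanged.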

I set the EPS parameter arbitrarily. How to tune it is a discussion for another time...
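
As a pointer in the meantime, the DBSCAN paper suggests sorting every point's distance to its k-th nearest neighbour and looking for the bend in the resulting curve. The sketch below does this in base R on made-up toy data, not on the UK cities. The helper knn_distance is written for this example, and k = 4 is chosen to pair with the default minPts = 5 of dbscan(), which counts the point itself.

```r
# k-th nearest-neighbour distance for every row of x.
# Base R only; it builds a full distance matrix, so it suits small data sets.
knn_distance <- function(x, k) {
  d <- as.matrix(dist(x))
  # After sorting a row, position 1 is the point itself (distance 0), so take k + 1.
  apply(d, 1, function(row) sort(row)[k + 1])
}

set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),  # dense blob of 50 points
  matrix(runif(20, min = -3, max = 3), ncol = 2)     # 10 scattered points
)

kd <- sort(knn_distance(toy, k = 4))
plot(kd, type = "l",
     xlab = "Points sorted by distance", ylab = "4-NN distance")
```

A candidate eps is the distance at which the curve turns sharply upwards: points to the right of the bend are the ones that would end up as noise. The dbscan package provides kNNdistplot() for drawing the same kind of plot directly, for example kNNdistplot(select(UK, lat, long), k = 4).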

[^1]: I recommend reading the paper, which is quite accessible. Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

[^2]: I am stripping out some of the ggplot defaults with two objects, theme_stripped and theme_empty, which I use routinely either to remove the background and gridlines or to remove everything including the axes.
