Geographic clustering of UK cities

[This article was first published on Adventures in Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I know I am probably late to this party but I recently found out about DBSCAN or “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”[^1]. In a nutshell, the algorithm visits successive data point and asks whether neighbouring points are density-reachable. In other words is it possible to connect two points with a chain of points all conforming to some density criteria. This has some major advantages over other clustering algorithms that I have used before.

  • It can identify clusters of arbitrary shape.
  • Number of clusters is not an input parameter.
  • It's fast as it only visits the data points rather than the space in between.
  • A data point with no close neighbours is assigned noise rather than its nearest cluster.

Let have a go at clustering uk cities from library(maps). First load the packages and the data, then subset the data to get only the UK cities.

library(ggplot2)
library(dplyr)
library(maps)
library(dbscan)

data("world.cities")
UK <- world.cities %>% filter(country.etc == "UK")

Now we can run the algorithm on the latitude and longitude collumns. Then we can pull the cluster assignments out of the resulting object.

EPS <- 0.15
clusters <- dbscan(select(UK, lat, long), eps = EPS)
UK$cluster <- clusters$cluster

Finally we can split the original data into two according to whether dbscan has assigned or cluster or noise.

groups  <- UK %>% filter(cluster != 0)
noise  <- UK %>% filter(cluster == 0)

Now lets have a look at the results[^2].

ggplot(UK, aes(x = long, y = lat, alpha = 0.5)) + 
  geom_point(aes(fill = "grey"), noise) +
  geom_point(aes(colour = as.factor(cluster)), groups,
             size = 3) +
  coord_map() +
  theme_stripped +
  theme_empty +
  theme(legend.position = "none")

plot of chunk unnamed-chunk-5

I arbitrarily set the EPS parameter. How to tune it? Discussion for another time...

[^1]: I recommend reading the paper which is quite accesible. Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

[^2]: I am stripping out some of the ggplot defaults with two objects theme_stripped and theme_empty which I use routinely to either remove the background and gridlines or to remove everything including axes.

To leave a comment for the author, please follow the link and comment on their blog: Adventures in Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)