Geographic clustering of UK cities

November 23, 2015
By

(This article was first published on Adventures in Data, and kindly contributed to R-bloggers)

I know I am probably late to this party but I recently found out about DBSCAN or "A
Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise"[^1]. In a nutshell, the algorithm visits successive data point and asks whether neighbouring points are density-reachable. In other words is it possible to connect two points with a chain of points all conforming to some density criteria. This has some major advantages over other clustering algorithms that I have used before.

  • It can identify clusters of arbitrary shape.
  • Number of clusters is not an input parameter.
  • It's fast as it only visits the data points rather than the space in between.
  • A data point with no close neighbours is assigned noise rather than its nearest cluster.

Let have a go at clustering uk cities from library(maps). First load the packages and the data, then subset the data to get only the UK cities.

library(ggplot2)
library(dplyr)
library(maps)
library(dbscan)

data("world.cities")
UK <- world.cities %>% filter(country.etc == "UK")

Now we can run the algorithm on the latitude and longitude collumns. Then we can pull the cluster assignments out of the resulting object.

EPS <- 0.15
clusters <- dbscan(select(UK, lat, long), eps = EPS)
UK$cluster <- clusters$cluster

Finally we can split the original data into two according to whether dbscan has assigned or cluster or noise.

groups  <- UK %>% filter(cluster != 0)
noise  <- UK %>% filter(cluster == 0)

Now lets have a look at the results[^2].

ggplot(UK, aes(x = long, y = lat, alpha = 0.5)) + 
  geom_point(aes(fill = "grey"), noise) +
  geom_point(aes(colour = as.factor(cluster)), groups,
             size = 3) +
  coord_map() +
  theme_stripped +
  theme_empty +
  theme(legend.position = "none")

plot of chunk unnamed-chunk-5

I arbitrarily set the EPS parameter. How to tune it? Discussion for another time…

[^1]: I recommend reading the paper which is quite accesible. Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

[^2]: I am stripping out some of the ggplot defaults with two objects theme_stripped and theme_empty which I use routinely to either remove the background and gridlines or to remove everything including axes.

To leave a comment for the author, please follow the link and comment on their blog: Adventures in Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)