Clustering French Cities (based on Temperatures)

February 11, 2016

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

In order to illustrate hierarchical clustering techniques and k-means, I did borrow François Husson‘s dataset, with monthly average temperature in several French cities.

> temp=read.table(
+ "",
+ header=TRUE,dec=",")

We have 15 cities, with monthly observations

> X=temp[,1:12]
> boxplot(X)

Since the variance seems to be rather stable, we will not ‘normalize’ the variables here,

> apply(X,2,sd)
    Janv     Fevr     Mars     Avri 
2.007296 1.868409 1.529083 1.414820 
     Mai     Juin     juil     Aout 
1.504596 1.793507 2.128939 2.011988 
    Sept     Octo     Nove     Dece 
1.848114 1.829988 1.803753 1.958449

In order to get a hierarchical cluster analysis, use for instance

> h <- hclust(dist(X), method = "ward")
> plot(h, labels = rownames(X), sub = "")

An alternative is to use

> library(FactoMineR)
> h2=HCPC(X)
> plot(h2)

Here, we visualise observations with a principal components analysis. We have here also an automatic selection of the number of classes, here 3. We can get the description of the groups using

> h2$desc.ind

or directly

> cah=hclust(dist(X))
> groups.3 <- cutree(cah,3)

We can also visualise those classes by ourselves,

> acp=PCA(X,scale.unit=FALSE)
> plot(acp$ind$coord[,1:2],col="white")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=groups.3)

It is possible to plot the centroïds of those clusters

> PT=aggregate(acp$ind$coord,list(groups.3),mean)
> points(PT$Dim.1,PT$Dim.2,pch=19)

If we add Voroid sets around those centroïds, here we do not see them (actually, we see the point – in the middle – that is exactly at the intersection of the three regions),

> library(tripack)
> V <- voronoi.mosaic(PT$Dim.1,PT$Dim.2)
> plot(V,add=TRUE)

To visualize those regions, use

> p=function(x,y){
+   which.min((PT$Dim.1-x)^2+(PT$Dim.2-y)^2)
+ }
> vx=seq(-10,12,length=251)
> vy=seq(-6,8,length=251)
> z=outer(vx,vy,Vectorize(p))
> image(vx,vy,z,col=c(rgb(1,0,0,.2),
+ rgb(0,1,0,.2),rgb(0,0,1,.2)))
> CL=c("red","black","blue")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=CL[groups.3])

Actually, those three groups (and those three regions) are also the ones we obtain using a k-mean algorithm,

> km=kmeans(acp$ind$coord[,1:2],3)
> km
K-means clustering 
with 3 clusters of sizes 3, 7, 5

(etc). But actually, since again we have some spatial data, it is possible to visualize them on a map

> library(maps)
> map("france")
> points(temp$Long,temp$Lati,col=groups.3,pch=19)

or, to visualize the regions, use e.g.

> library(car)
> for(i in 1:3) 
+ dataEllipse(temp$Long[groups.3==i],
+ temp$Lati[groups.3==i], levels=.7,add=TRUE,
+ col=i+1,fill=TRUE)

Those three regions actually make sense, geographically speaking.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)