Clustering French Cities (based on Temperatures)

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In order to illustrate hierarchical clustering techniques and k-means, I did borrow François Husson‘s dataset, with monthly average temperature in several French cities.

> temp=read.table(
+ "http://freakonometrics.free.fr/FR_temp.txt",
+ header=TRUE,dec=",")

We have 15 cities, with monthly observations

> X=temp[,1:12]
> boxplot(X)

Since the variance seems to be rather stable, we will not ‘normalize’ the variables here,

> apply(X,2,sd)
    Janv     Fevr     Mars     Avri 
2.007296 1.868409 1.529083 1.414820 
     Mai     Juin     juil     Aout 
1.504596 1.793507 2.128939 2.011988 
    Sept     Octo     Nove     Dece 
1.848114 1.829988 1.803753 1.958449

In order to get a hierarchical cluster analysis, use for instance

> h <- hclust(dist(X), method = "ward")
> plot(h, labels = rownames(X), sub = "")

An alternative is to use

> library(FactoMineR)
> h2=HCPC(X)
> plot(h2)

Here, we visualise observations with a principal components analysis. We have here also an automatic selection of the number of classes, here 3. We can get the description of the groups using

> h2$desc.ind

or directly

> cah=hclust(dist(X))
> groups.3 <- cutree(cah,3)

We can also visualise those classes by ourselves,

> acp=PCA(X,scale.unit=FALSE)
> plot(acp$ind$coord[,1:2],col="white")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=groups.3)

It is possible to plot the centroïds of those clusters

> PT=aggregate(acp$ind$coord,list(groups.3),mean)
> points(PT$Dim.1,PT$Dim.2,pch=19)

If we add Voroid sets around those centroïds, here we do not see them (actually, we see the point – in the middle – that is exactly at the intersection of the three regions),

> library(tripack)
> V <- voronoi.mosaic(PT$Dim.1,PT$Dim.2)
> plot(V,add=TRUE)

To visualize those regions, use

> p=function(x,y){
+   which.min((PT$Dim.1-x)^2+(PT$Dim.2-y)^2)
+ }
> vx=seq(-10,12,length=251)
> vy=seq(-6,8,length=251)
> z=outer(vx,vy,Vectorize(p))
> image(vx,vy,z,col=c(rgb(1,0,0,.2),
+ rgb(0,1,0,.2),rgb(0,0,1,.2)))
> CL=c("red","black","blue")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=CL[groups.3])

Actually, those three groups (and those three regions) are also the ones we obtain using a k-mean algorithm,

> km=kmeans(acp$ind$coord[,1:2],3)
> km
K-means clustering 
with 3 clusters of sizes 3, 7, 5

(etc). But actually, since again we have some spatial data, it is possible to visualize them on a map

> library(maps)
> map("france")
> points(temp$Long,temp$Lati,col=groups.3,pch=19)

or, to visualize the regions, use e.g.

> library(car)
> for(i in 1:3) 
+ dataEllipse(temp$Long[groups.3==i],
+ temp$Lati[groups.3==i], levels=.7,add=TRUE,
+ col=i+1,fill=TRUE)

Those three regions actually make sense, geographically speaking.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)