Clustering and sector strength

January 21, 2013

(This article was first published on Portfolio Probe » R language, and kindly contributed to R-bloggers)

An exploration of the usefulness of sectors.


This subject was discussed in “S&P 500 sector strengths”.


Stocks are put into groups based on the sector that the company is considered to be in.  Cluster analysis is a statistical technique that finds groups.  If sectors really move together, then clustering should recover sectors.  Will it?


The data were the 2012 returns of the set of stocks with full data for the year and known sector from the US market portraits.  That came to 453 US large-capitalization equities.

Table 1 gives the number of stocks in each sector for this set of equities.

Table 1: The sectors and the number of stocks in each.

Sector                         Number of stocks
Telecommunications Services     7
Materials                      30
Utilities                      31
Consumer Staples               33
Energy                         34
Health Care                    46
Industrials                    59
Information Technology         63
Consumer Discretionary         74
Financials                     76

Clustering — the traditional variety at least — uses “distances” between objects.  The distance between two stocks was taken to be one minus the correlation of the 2012 daily returns.   The distribution of correlations is shown in Figure 1.
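As a minimal sketch of that distance, assuming simulated returns in place of the real data (the matrix `ret` and its ticker names are made up for illustration):

```r
# Simulated daily returns: 250 days for 5 hypothetical stocks
set.seed(42)
ret <- matrix(rnorm(250 * 5, sd = 0.01), nrow = 250,
              dimnames = list(NULL, paste0("STK", 1:5)))
# Give the first two stocks a shared component so that they correlate
common <- rnorm(250, sd = 0.02)
ret[, 1:2] <- ret[, 1:2] + common

cormat <- cor(ret)       # correlations lie in [-1, 1]
distmat <- 1 - cormat    # so the distances lie in [0, 2]
round(distmat, 2)
```

Highly correlated pairs get distances near zero, and a stock is at distance zero from itself.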

Figure 1: Distribution of estimated correlations of 2012 daily returns among the 453 stocks.


Figure 2 shows the fraction of stocks within each sector that fall into the same clusters.

Figure 2: Agreement of sector and cluster classifications.

The sectors are sorted in the plot by the maximum fraction of stocks that fall into a single cluster.  So “Energy” had the biggest percentage of its stocks in one cluster (about 88%) and had about 6% in each of two other clusters; that is, 30 stocks in one cluster and 2 stocks in each of two other clusters.

This plot has the same objective as Figure 1 in “S&P 500 sector strengths”.  “Energy”, “Utilities”, “Industrials” and “Information Technology” have strong showings in both cases.  There is not so much agreement with the other sectors, except that the two consumer sectors are weak in both cases.

This analysis is about individual correlations while the previous one is about mean correlation.  A sector with some very high correlations and some weak correlations will probably look better in the previous analysis than the current one.

What can go wrong?

Sectors that have a large number of stocks in them are going to be more prone to be split up in the clustering than smaller sectors.  Let’s see how much the order of the size of the sectors matches the order of the clustering — this can be done with the rank (or Spearman) correlation.  Figure 3 shows the distribution of the rank correlation from randomly permuting the sector allocations, along with the correlation using the actual sector allocation.
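The rank correlation is simply the ordinary (Pearson) correlation computed on ranks.  Here is a small sketch using the sector sizes from Table 1; the `maxFrac` values are random placeholders, not the fractions behind Figure 2.

```r
# Sector sizes from Table 1
secSize <- c(Telecom = 7, Materials = 30, Utilities = 31, Staples = 33,
             Energy = 34, HealthCare = 46, Industrials = 59,
             InfoTech = 63, Discretionary = 74, Financials = 76)
# Placeholder "maximum fraction in one cluster" values (random, not real)
set.seed(3)
maxFrac <- runif(length(secSize), min = 0.3, max = 0.9)

spear <- cor(secSize, maxFrac, method = "spearman")
# The same number from the Pearson correlation of the ranks
byRank <- cor(rank(secSize), rank(maxFrac))
c(spearman = spear, viaRanks = byRank)
```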

Figure 3: Spearman correlation between sector size and maximum allocation to a cluster, randomly permuted sector allocation (blue), actual allocation (gold).

This plot tells us two things:

  • we are very unlikely to learn much from this correlation
  • there’s not even a hint that our clustering is anything but random

Or put another way, the plot pretty much only informs us of our ignorance.


Care should be taken with sectors (and industries and countries).  Some sectors act as expected, but some probably act more like a random collection.


Whose woods these are I think I know

from “Stopping by Woods on a Snowy Evening” by Robert Frost

Appendix R

Here is an outline of the R commands that produced the analysis.


The command that created the correlation matrix was:

corsamp12 <- cor(retMat12[, !])


There are lots of clustering strategies.  Several of them can be done via the hclust function.  If you are unfamiliar with a statistical technique, you'll usually do pretty well by just taking the default settings of the R functions.  That’s not so true with clustering.

In this case many clustering methods put almost all the stocks into one category.  That’s another way things can go wrong — if all stocks were in one cluster, then the sector allocations would look perfect.
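One way to check for this is to tabulate the cluster sizes for several linkage methods.  The sketch below uses simulated returns with three built-in blocks (all names and sizes here are hypothetical); a method that lumps nearly everything together would show one large count and a few tiny ones.

```r
set.seed(1)
nday <- 250
# Hypothetical returns: 3 blocks of 20 stocks, each block sharing a factor
ret <- do.call(cbind, lapply(1:3, function(b) {
  fac <- rnorm(nday, sd = 0.01)
  matrix(rnorm(nday * 20, sd = 0.01), nrow = nday) + fac
}))
colnames(ret) <- paste0("STK", seq_len(ncol(ret)))
d <- as.dist(1 - cor(ret))

# Cluster sizes (largest first) when each tree is cut into 3 groups
methods <- c("single", "average", "complete", "ward.D")
sizes <- sapply(methods, function(m) {
  grp <- cutree(hclust(d, method = m), k = 3)
  sort(tabulate(grp, nbins = 3), decreasing = TRUE)
})
sizes
```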

The default method (complete) didn’t create one big cluster.  Even so, Ward’s method was the one used here:

hclus12sampward <- hclust(as.dist(1 - corsamp12), 
     method="ward.D")  # Ward's method; spelled "ward" before R 3.1.0
grp12sampward <- cutree(hclus12sampward, k=10)

The first command does the clustering — note the use of as.dist to turn what is logically a distance matrix into a distance object that hclust is expecting.  The second command uses the clustering object to decide how to form 10 clusters (groups).
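The two steps can be tried on a toy problem.  This sketch assumes a hand-made correlation matrix for four hypothetical stocks, where A and B move together and so do C and D; cutting the tree into two groups should put each pair in its own cluster.

```r
# Hand-made correlation matrix for four hypothetical stocks
cormat <- matrix(c(1.0, 0.8, 0.1, 0.2,
                   0.8, 1.0, 0.2, 0.1,
                   0.1, 0.2, 1.0, 0.7,
                   0.2, 0.1, 0.7, 1.0), nrow = 4, ncol = 4,
                 dimnames = list(LETTERS[1:4], LETTERS[1:4]))

d <- as.dist(1 - cormat)     # object of class "dist" (lower triangle only)
tree <- hclust(d, method = "ward.D")
grp <- cutree(tree, k = 2)   # named integer vector: stock -> cluster number
grp
```

The names on the result are what allow the clusters to be lined up against the sector labels later.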

clusters and sectors

The function to match up clusters and sectors is:

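pp.secclust <- function(groups, sectors=sector12sub)
{
  # placed in public domain 2013 by Burns Statistics

  # the two vectors must describe the same stocks in the same order
  stopifnot(all(names(groups) == names(sectors)))
  # counts of stocks: clusters in rows, sectors in columns
  tab <- table(groups, sectors)
  # divide each column by its total to get within-sector fractions
  sweep(tab, 2, colSums(tab), FUN=`/`)
}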
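To see what the table-and-sweep step produces, here is a self-contained toy version.  It assumes a sector shaped like the “Energy” description above (34 stocks split 30, 2 and 2 across three clusters) plus a made-up second sector, "Other", of 10 stocks that all land in one cluster.

```r
# Toy cluster numbers and sector labels (hypothetical stocks)
groups  <- c(rep(1, 30), rep(2, 2), rep(3, 2), rep(2, 10))
sectors <- c(rep("Energy", 34), rep("Other", 10))
names(groups) <- names(sectors) <- paste0("STK", seq_along(groups))

tab <- table(groups, sectors)                    # counts
frac <- sweep(tab, 2, colSums(tab), FUN = "/")   # each column sums to 1
round(frac, 2)
```

The "Energy" column holds 30/34 (about 0.88) and two entries of 2/34 (about 0.06), matching the percentages quoted for Figure 2.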


The function to do the plotting is:

pp.plotSecclust <- function(tabscale, labels=NULL,
                            col="steelblue", ...)
{
  # placed in public domain 2013 by Burns Statistics

  # one panel per sector, stacked vertically
  oldpar <- par(mfcol=c(ncol(tabscale), 1), 
     mar=c(0,12,0,0) + .1)
  on.exit(par(oldpar))
  # order the sectors by their largest single-cluster fraction
  tabscale <- tabscale[, rev(order(apply(tabscale, 2, 
     max)))]
  if(!length(labels)) {
    labels <- colnames(tabscale)
    names(labels) <- labels
  }
  for(i in 1:ncol(tabscale)) {
    thistab <- sort(tabscale[,i], decreasing=TRUE)
    barplot(thistab, ylim=c(0,1), col=col, axes=FALSE,
            names.arg=rep("", nrow(tabscale)), ...)
    mtext(side=2, labels[colnames(tabscale)[i]], las=1)
  }
}

This is used like:

pp.plotSecclust(pp.secclust(grp12sampward))

rank correlation

The function to do permutation tests of the Spearman correlation on sector size is:

pp.secSpearPerm <- function(groups, sectors=sector12sub, 
                            trials=1000)
{
  # placed in public domain 2013 by Burns Statistics
  # the default for 'trials' is an assumption (missing from this listing)

  secTab <- table(sectors)
  real <- cor(apply(pp.secclust(groups, sectors=sectors),
       2, max), secTab, method="spearman")
  perms <- numeric(trials)
  thisSec <- sectors
  for(i in 1:trials) {
    # shuffle the sector labels while keeping the stock names in place
    thisSec[] <- sample(sectors)
    perms[i] <- cor(apply(pp.secclust(groups, 
        sectors=thisSec),  2, max), 
        secTab, method="spearman")
  }
  list(realCor=real, permCor=perms)
}

The command using this function was:

secSpear <- pp.secSpearPerm(grp12sampward)
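A result like this can also be summarized as a permutation p-value.  The sketch below is self-contained and is not the code used for Figure 3: `groups` and `sectors` are simulated, `maxFracCor` is a hypothetical helper, and the choice of 500 permutations is arbitrary.

```r
set.seed(7)
# Simulated stand-ins: 60 stocks in 4 sectors of unequal size, 4 clusters
sectors <- rep(c("A", "B", "C", "D"), times = c(7, 11, 19, 23))
groups  <- sample(1:4, length(sectors), replace = TRUE)

# Spearman correlation between sector size and the largest
# within-sector cluster fraction
maxFracCor <- function(groups, sectors) {
  tab <- table(groups, sectors)
  frac <- sweep(tab, 2, colSums(tab), FUN = "/")
  cor(apply(frac, 2, max), as.vector(table(sectors)),
      method = "spearman")
}

real  <- maxFracCor(groups, sectors)
perms <- replicate(500, maxFracCor(groups, sample(sectors)))
# Two-sided p-value; the +1 counts the observed value itself
pval <- (sum(abs(perms) >= abs(real)) + 1) / (length(perms) + 1)
pval
```

A p-value that is not small says the actual allocation is indistinguishable from the shuffled ones, which is the situation Figure 3 depicts.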
