Miles of iles

December 24, 2012
By

(This article was first published on Portfolio Probe » R language, and kindly contributed to R-bloggers)

An explanation of quartiles, quintiles deciles, and boxplots.

Previously

“Again with variability of long-short decile tests” and its predecessor discusses using deciles but doesn’t say what they are.

The *iles

These are concepts that have to do with approximately equally sized groups created from sorted data. There are 4 groups with quartiles, 5 with quintiles and 10 with deciles.

But it isn’t quite as easy as perhaps it should be.  These words are used for two different concepts:

  • the data in a group
  • the dividing line between groups

So the top decile can be either the 10% of the data that are biggest, or a point that divides that group of biggest values from the next smaller group.

There is one fewer dividing line than groups.

Boxplot

The premier graph that uses quartiles is the boxplot.

Figure 1 is an example boxplot — it shows the daily log returns during 2012 (so far) for a particular stock (MMM).

Figure 1: Boxplot of daily log returns of MMM in 2012 year-to-date. boxoneFigure 2 is an explanation of the elements of Figure 1.

Figure 2: Annotated boxplot. boxexplan

The middle half of the data is in the box.  The line inside the box is the median.

The interquartile range (IQR) is the 3rd quartile minus the 1st quartile.  This is one of the statistics given in the market portraits.

The whiskers end at a data point, but can be no longer than some length.  In R the default is that whiskers can be no longer than 1.5 times the IQR.

Points beyond the whiskers are sometimes referred to as “outliers”.  That is not particularly good nomenclature.  For there to be an outlier, there really needs to be a model.

Boxplots are useful for a single variable, but they really shine when you have multiple variables to compare.  Figure 3 is a boxplot of daily returns of a few stocks.

Figure 3: Boxplot of daily log returns of a few stocks in 2012 year-to-date. boxmult

Epilogue

Some are watching it from the wings
Some are standing in the center

–from “People’s Parties” by Joni Mitchell

Appendix R

R is a good environment for whatever *ile you are interested in.

multiple boxplots

There are three likely ways of getting multiple boxplots.

The first is to give the boxplot function a list.  In particular, a data frame is a list.  So we can do:

boxplot(data.frame(retMat[,1:5])) # Figure 4

This extracts the first 5 columns of the return matrix and then changes it into a data frame.

Figure 4: Boxplot from a list (data frame). boxlist

The second way is via a formula.  An R formula involves the ~ operator.  boxplot expects the formula to be of the form:

values ~ groups

where values and groups are vectors of the same length.

It is often the case that you’ll give a data frame as the data argument, and the columns of that data frame hold the variables that appear in the formula.  However, that is not mandatory:

boxplot(retMat[,1] ~ substr(rownames(retMat), 6,7))
# Figure 5

Here we use the first column of the return matrix as the values part of the formula, and extract the month of the year (2012) from the row names to use as the groups.  Hence we have a separate boxplot for each month of 2012.  The result is Figure 5.

Figure 5: Daily returns of MMM by month of 2012 (base). boxgrp

All of the graphics above use the base graphics in R.  An alternative is the ggplot2 package.  We can do the same thing as in Figure 5 with the ggplot2 command:

qplot(substr(rownames(retMat), 6,7), retMat[,1], 
    geom='boxplot') # Figure 6

Figure 6: Daily returns of MMM by month of 2012 (ggplot2). boxgg

annotated boxplot

The function that created Figure 2 was:

function (filename = "boxexplan.png") 
{
        if(length(filename)) {
                png(file=filename, width=512)
                par(mar=c(5,4, 0, 2) + .1)
        }
        bp <- boxplot(retMat[,1] * 100, 
                ylab="Daily log returns (%)", 
                col="gold", xlim=c(.5, 3))
        bps <- bp$stats
        loc <- c(mean(bps[1:2]), bps[2:4], 
             mean(bps[4:5]), -2.5, 2.5)
        arrows(2, loc, c(1.1, 1.3, 1.3, 1.3, 
             1.1, 1.1, 1.1), loc)
        text(2.03, loc, adj=0, c("bottom whisker", 
                "1st quartile",
                "2nd quartile (median)", 
                "3rd quartile", "top whisker",
                "extreme data points", 
                "extreme data points"))

        if(length(filename)) {
                dev.off()
        }
}

computing dividing lines

The quantile function will provide values that divide the groups.  For example deciles can be computed as:

quantile(x, probs=seq(.1, .9, by=.1))

or possibly:

quantile(x, probs=seq(0, 1, by=.1))

finding groups

We’ll go from simplistic to less simple.

1. do-it-yourself

The do-it-yourself approach is:

cut(x, quantile(x,(0:10)/10), labels=FALSE,
     include.lowest=TRUE)
2. cut2

You can use the cut2 function in the Hmisc package.  If you want deciles, then you would do something along the lines of:

cut2(x, m=length(x)/10)
3. quantcut

The quantcut function in gtools takes care of a problem that can arise in the solutions above.  If there are a lot of repeated values, then the dividing points may not be unique.  quantcut checks for that and if it occurs, then it reduces the number of groups returned.

4. fixer-upper

You may really want the same number of groups that you requested.  Hence you could end up with observations that have the same value but are in different groups.

In general, the number of data points is not divisible by the number of groups.  A nicety is to make groups in the center have an additional observation, rather than the first (smallest) groups.

Here is a function to do that:

ntile <-
function (x, ngroups, na.rm=FALSE) 
{
        # function to get ntile 
        # (quartile, quintile, decile, etc.)
        # groups from a vector

        # placed in the public domain 2012 
        # by Burns Statistics

        # testing status: 
        # seems to work

        stopifnot(is.numeric(ngroups), 
                length(ngroups) == 1,
                ngroups > 0)
        if(na.rm) {
                x <- x[!is.na(x)]
        } else if(nas <- sum(is.na(x))) {
                stop(nas, " missing values present")
        }
        nx <- length(x)
        if(nx < ngroups) {
                stop("more groups (", ngroups, 
                        ") than observations (", 
                        nx, ")")
        }
        basenum <- nx %/% ngroups
        extra <- nx %% ngroups
        repnum <- rep(basenum, ngroups)
        if(extra) {
                eloc <- seq(floor((ngroups - extra)/2 
                    + 1), length=extra)
                repnum[eloc] <- repnum[eloc] + 1
        }
        split(sort(x), rep(1:ngroups, repnum))
}

This can be used like:

ntile(rnorm(63), 10) # deciles
ntile(rnorm(63), 5)  # quintiles
ntile(rnorm(63), 4)  # quartiles

 
Anyone see any problems with this function?

To leave a comment for the author, please follow the link and comment on his blog: Portfolio Probe » R language.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.