(This article was first published on **Portfolio Probe » R language**, and kindly contributed to R-bloggers)

An explanation of quartiles, quintiles, deciles, and boxplots.

## Previously

“Again with variability of long-short decile tests” and its predecessor discuss using deciles but don’t say what they are.

## The *iles

These are concepts that have to do with approximately equally sized groups created from sorted data. There are 4 groups with quartiles, 5 with quintiles and 10 with deciles.

But it isn’t quite as easy as perhaps it should be. These words are used for two different concepts:

- the data in a group
- the dividing line between groups

So the top decile can be either the 10% of the data that are biggest, or a point that divides that group of biggest values from the next smaller group.

There is one fewer dividing line than groups.
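To make the two meanings concrete, here is a minimal R sketch on simulated data. The vector `x` and the seed are illustrative assumptions, not data from this post. It first computes the nine dividing lines for deciles, then the ten groups that those lines define.

```r
# Simulated data: 200 draws standing in for any numeric variable
set.seed(42)
x <- rnorm(200)

# Meaning 2: the dividing lines. Ten groups need nine interior cut points.
dividers <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
length(dividers)   # number of dividing lines

# Meaning 1: the data in each group. Add the minimum and maximum as outer
# edges so every observation falls into exactly one of the ten groups.
edges <- c(min(x), dividers, max(x))
group <- cut(x, breaks = edges, labels = FALSE, include.lowest = TRUE)
table(group)       # group sizes

# The "top decile" as data: the observations in group 10
top_decile_data <- x[group == 10]
# The "top decile" as a dividing line: the 90% point
top_decile_line <- dividers["90%"]
```

Every value in `top_decile_data` lies above `top_decile_line`, which is how the two uses of the word fit together.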

## Boxplot

The premier graph that uses quartiles is the boxplot.

Figure 1 is an example boxplot — it shows the daily log returns during 2012 (so far) for a particular stock (MMM).

Figure 1: Boxplot of daily log returns of MMM in 2012 year-to-date.

Figure 2 is an explanation of the elements of Figure 1.

The middle half of the data is in the box. The line inside the box is the median.

The *interquartile range* (IQR) is the 3rd quartile minus the 1st quartile. This is one of the statistics given in the market portraits.

The whiskers end at a data point, but can be no longer than some length. In R the default is that whiskers can be no longer than 1.5 times the IQR.

Points beyond the whiskers are sometimes referred to as “outliers”. That is not particularly good nomenclature. For there to be an outlier, there really needs to be a model.
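The numbers behind a boxplot can be inspected without drawing anything. The sketch below uses `boxplot.stats` from base R on simulated returns (the data and the name `ret` are assumptions for illustration, not the MMM series). One detail worth knowing: `boxplot` draws the box at the hinges from `fivenum`, which are close to the quartiles but can differ slightly from what `quantile` and `IQR` report.

```r
# Simulated daily log returns in percent (illustrative, not real MMM data)
set.seed(1)
ret <- rnorm(120, mean = 0.05, sd = 1.2)

# boxplot.stats() returns the five numbers that boxplot() draws:
# lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs <- boxplot.stats(ret, coef = 1.5)
bs$stats

# Height of the box (hinge spread) next to the IQR from quantile()
box_height <- bs$stats[4] - bs$stats[2]
c(box_height = box_height, IQR = IQR(ret))

# Each whisker ends at an actual data point ...
bs$stats[c(1, 5)] %in% ret
# ... that lies no more than 1.5 box heights beyond the box
bs$stats[5] <= bs$stats[4] + 1.5 * box_height
bs$stats[1] >= bs$stats[2] - 1.5 * box_height

# Points beyond the whiskers, which the plot draws individually
bs$out
```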

Boxplots are useful for a single variable, but they really shine when you have multiple variables to compare. Figure 3 is a boxplot of daily returns of a few stocks.

Figure 3: Boxplot of daily log returns of a few stocks in 2012 year-to-date.

## Epilogue

*Some are watching it from the wings*

*Some are standing in the center*

–from “People’s Parties” by Joni Mitchell

## Appendix R

R is a good environment for whatever *ile you are interested in.

#### multiple boxplots

There are three likely ways of getting multiple boxplots.

The first is to give the boxplot function a list. In particular, a data frame is a list. So we can do:

    boxplot(data.frame(retMat[,1:5])) # Figure 4

This extracts the first 5 columns of the return matrix and then changes it into a data frame.

Figure 4: Boxplot from a list (data frame).
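The post does not show how `retMat` was built, so the command cannot be run as is. Here is a stand-in under stated assumptions: a matrix of simulated daily log returns with dates of the form "2012-01-03" as row names and one column per stock. The tickers other than MMM are made-up placeholders, and none of the numbers are real market data.

```r
# Stand-in for the post's return matrix (simulated, NOT real market data).
# Assumed layout: one row per trading day, row names like "2012-01-03",
# one column per stock. Tickers other than MMM are made-up placeholders.
set.seed(2012)
days <- seq(as.Date("2012-01-02"), as.Date("2012-06-29"), by = "day")
wd <- as.POSIXlt(days)$wday
days <- days[wd != 0 & wd != 6]            # keep Monday to Friday
tickers <- c("MMM", "AAA", "BBB", "CCC", "DDD", "EEE")
retMat <- matrix(rnorm(length(days) * length(tickers), sd = 0.012),
                 nrow = length(days),
                 dimnames = list(format(days), tickers))

# A data frame is a list of columns, so boxplot() draws one box per column
retDF <- data.frame(retMat[, 1:5])
is.list(retDF)
bp <- boxplot(retDF, ylab = "Daily log returns")
bp$names   # one label per column
```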

The second way is via a formula. An R formula involves the ~ operator. boxplot expects the formula to be of the form:

    values ~ groups

where `values` and `groups` are vectors of the same length.

It is often the case that you’ll give a data frame as the `data` argument, and the columns of that data frame hold the variables that appear in the formula. However, that is not mandatory:

    boxplot(retMat[,1] ~ substr(rownames(retMat), 6,7)) # Figure 5

Here we use the first column of the return matrix as the values part of the formula, and extract the month of the year (2012) from the row names to use as the groups. Hence we have a separate boxplot for each month of 2012. The result is Figure 5.

Figure 5: Daily returns of MMM by month of 2012 (base).
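For comparison, the more common form with a `data` argument might look like the sketch below. The data frame `retDF` and its columns `ret` and `month` are illustrative names holding simulated values, not the post's data.

```r
# Simulated stand-in for one stock's daily log returns (not real data)
set.seed(7)
days <- seq(as.Date("2012-01-02"), as.Date("2012-06-29"), by = "day")
wd <- as.POSIXlt(days)$wday
days <- days[wd != 0 & wd != 6]            # keep Monday to Friday

# Put the values and the grouping variable in one data frame
retDF <- data.frame(ret   = rnorm(length(days), sd = 0.012),
                    month = format(days, "%m"))   # "01", "02", ...

# The formula names columns of the data frame given as `data`
bp <- boxplot(ret ~ month, data = retDF,
              xlab = "Month of 2012", ylab = "Daily log returns")
bp$names   # one box per month
bp$n       # number of trading days behind each box
```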

All of the graphics above use the base graphics in R. An alternative is the `ggplot2` package. We can do the same thing as in Figure 5 with the `ggplot2` command:

    qplot(substr(rownames(retMat), 6,7), retMat[,1], geom='boxplot') # Figure 6

Figure 6: Daily returns of MMM by month of 2012 (ggplot2).
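Newer releases of `ggplot2` (3.4.0 onward) mark `qplot` as deprecated, so on a current installation the `ggplot()` form may be preferable. The sketch below assumes `ggplot2` is installed and uses a simulated data frame. The names `retDF`, `ret` and `month` are illustrative.

```r
# Simulated stand-in data (not real returns)
set.seed(7)
days <- seq(as.Date("2012-01-02"), as.Date("2012-06-29"), by = "day")
wd <- as.POSIXlt(days)$wday
days <- days[wd != 0 & wd != 6]            # keep Monday to Friday
retDF <- data.frame(ret   = rnorm(length(days), sd = 0.012),
                    month = format(days, "%m"))

p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # A discrete x variable gives one box per month
  p <- ggplot2::ggplot(retDF, ggplot2::aes(x = month, y = ret)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x = "Month of 2012", y = "Daily log returns")
  print(p)
}
```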

#### annotated boxplot

The function that created Figure 2 was:

    function (filename = "boxexplan.png")
    {
        # open a PNG device unless filename is NULL or empty
        if(length(filename)) {
            png(file=filename, width=512)
            par(mar=c(5,4, 0, 2) + .1)
        }
        # draw the boxplot, leaving room on the right for the labels
        bp <- boxplot(retMat[,1] * 100, ylab="Daily log returns (%)",
            col="gold", xlim=c(.5, 3))
        # rows of bp$stats: whisker, hinge, median, hinge, whisker
        bps <- bp$stats
        # heights of the arrows and labels; the last two values
        # (-2.5, 2.5) are hard-coded for this data
        loc <- c(mean(bps[1:2]), bps[2:4], mean(bps[4:5]), -2.5, 2.5)
        arrows(2, loc, c(1.1, 1.3, 1.3, 1.3, 1.1, 1.1, 1.1), loc)
        text(2.03, loc, adj=0, c("bottom whisker", "1st quartile",
            "2nd quartile (median)", "3rd quartile", "top whisker",
            "extreme data points", "extreme data points"))
        if(length(filename)) {
            dev.off()
        }
    }

#### computing dividing lines

The `quantile` function will provide values that divide the groups. For example deciles can be computed as:

    quantile(x, probs=seq(.1, .9, by=.1))

or possibly:

    quantile(x, probs=seq(0, 1, by=.1))
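The difference between the two calls is just the end points: the second one adds the minimum (0%) and the maximum (100%). A small sketch on simulated data, where `x`, `interior` and `with_ends` are illustrative names:

```r
set.seed(3)
x <- rnorm(63)

interior  <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))  # 9 dividing lines
with_ends <- quantile(x, probs = seq(0, 1, by = 0.1))      # 11 values

names(interior)
# The 0% and 100% entries are simply the minimum and maximum
unname(with_ends[c(1, 11)])
range(x)

# Quartiles and quintiles follow the same pattern
quantile(x, probs = c(0.25, 0.5, 0.75))
quantile(x, probs = seq(0.2, 0.8, by = 0.2))

# quantile() offers nine estimation rules via `type` (default 7); they
# can give slightly different dividing lines on small samples
quantile(x, probs = 0.1, type = 7)
quantile(x, probs = 0.1, type = 6)
```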

#### finding groups

We’ll go from simplistic to less simple.

##### 1. do-it-yourself

The do-it-yourself approach is:

    cut(x, quantile(x,(0:10)/10), labels=FALSE, include.lowest=TRUE)
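The `include.lowest=TRUE` part matters. By default `cut` builds intervals that are open on the left, so the smallest observation would sit exactly on the first break and be left out. A short sketch on simulated data (the names are illustrative):

```r
set.seed(3)
x <- rnorm(63)
breaks <- quantile(x, (0:10) / 10)

# As in the text: ten groups labelled 1 to 10
grp <- cut(x, breaks, labels = FALSE, include.lowest = TRUE)
table(grp)   # sizes are not all equal because 63 is not a multiple of 10

# Without include.lowest = TRUE the intervals are (a, b], so the minimum
# of x falls outside the first interval and is returned as NA
grp_open <- cut(x, breaks, labels = FALSE)
sum(is.na(grp_open))
x[is.na(grp_open)] == min(x)
```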

##### 2. cut2

You can use the `cut2` function in the `Hmisc` package. If you want deciles, then you would do something along the lines of:

    cut2(x, m=length(x)/10)
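`cut2` also has a `g` argument that asks for a number of quantile groups directly, which may read more naturally than working out a group size for `m`. The sketch assumes `Hmisc` is installed and uses simulated data. It is one possible usage, not necessarily how the author called it.

```r
set.seed(3)
x <- rnorm(200)

grp <- NULL
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # m is a target minimum number of observations per group, as in the text
  grp_m <- Hmisc::cut2(x, m = length(x) / 10)
  # g asks for a number of quantile groups directly
  grp <- Hmisc::cut2(x, g = 10)
  print(table(grp))   # the levels are interval labels
}
```

Unlike the do-it-yourself version with `labels=FALSE`, the result is a factor whose levels describe the intervals.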

##### 3. quantcut

The `quantcut` function in `gtools` takes care of a problem that can arise in the solutions above. If there are a lot of repeated values, then the dividing points may not be unique. `quantcut` checks for that and if it occurs, then it reduces the number of groups returned.
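The tie problem is easy to reproduce in base R. The sketch below uses made-up data with many zeros, shows the do-it-yourself approach failing, and then falls back to fewer groups by dropping repeated break points. That fallback mirrors the behaviour described for `quantcut`, but it is a simplified stand-in and not the `gtools` code.

```r
# Heavily tied data: more than half the values are 0 (illustrative only)
x <- c(rep(0, 60), 1:40)
breaks <- quantile(x, (0:10) / 10)
breaks                      # several deciles coincide at 0
anyDuplicated(breaks) > 0

# The do-it-yourself approach fails because cut() needs unique breaks;
# tryCatch() captures the error message instead of stopping
res <- tryCatch(cut(x, breaks, labels = FALSE, include.lowest = TRUE),
                error = function(e) conditionMessage(e))
res

# One simple fallback (a sketch, not the gtools implementation):
# drop the repeated break points and accept fewer groups
ubreaks <- unique(breaks)
grp <- cut(x, ubreaks, labels = FALSE, include.lowest = TRUE)
table(grp)
length(ubreaks) - 1         # number of groups actually formed
```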

##### 4. fixer-upper

You may really want the same number of groups that you requested. Hence you could end up with observations that have the same value but are in different groups.

In general, the number of data points is not divisible by the number of groups. A nicety is to make groups in the center have an additional observation, rather than the first (smallest) groups.

Here is a function to do that:

    ntile <- function (x, ngroups, na.rm=FALSE)
    {
        # function to get ntile
        # (quartile, quintile, decile, etc.)
        # groups from a vector
        # placed in the public domain 2012
        # by Burns Statistics
        # testing status:
        # seems to work
        stopifnot(is.numeric(ngroups), length(ngroups) == 1,
            ngroups > 0)
        if(na.rm) {
            x <- x[!is.na(x)]
        } else if(nas <- sum(is.na(x))) {
            stop(nas, " missing values present")
        }
        nx <- length(x)
        if(nx < ngroups) {
            stop("more groups (", ngroups,
                ") than observations (", nx, ")")
        }
        basenum <- nx %/% ngroups
        extra <- nx %% ngroups
        repnum <- rep(basenum, ngroups)
        if(extra) {
            eloc <- seq(floor((ngroups - extra)/2 + 1),
                length=extra)
            repnum[eloc] <- repnum[eloc] + 1
        }
        split(sort(x), rep(1:ngroups, repnum))
    }

This can be used like:

    ntile(rnorm(63), 10) # deciles
    ntile(rnorm(63), 5) # quintiles
    ntile(rnorm(63), 4) # quartiles
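To see where the extra observations land, the size arithmetic can be pulled out on its own. The helper `ntile_sizes` below is a hypothetical name introduced for illustration. It repeats only the allocation step of the function above and returns the group sizes.

```r
# Hypothetical helper mirroring only the size arithmetic of ntile() above
ntile_sizes <- function(nx, ngroups) {
  basenum <- nx %/% ngroups            # size that every group gets
  extra   <- nx %% ngroups             # leftover observations
  sizes   <- rep(basenum, ngroups)
  if (extra > 0) {
    # starting position chosen so the larger groups sit near the middle
    eloc <- seq(floor((ngroups - extra) / 2 + 1), length.out = extra)
    sizes[eloc] <- sizes[eloc] + 1
  }
  sizes
}

ntile_sizes(63, 10)   # deciles:   6 6 6 7 7 7 6 6 6 6
ntile_sizes(63, 5)    # quintiles: 12 13 13 13 12
ntile_sizes(63, 4)    # quartiles: 16 16 16 15
```

With 63 observations the deciles and quintiles come out centered, while the quartile case puts the three larger groups first because only one group is left at base size.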

Anyone see any problems with this function?
