[This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Percentile calculations can be more tricky than at first meets the eye. A percentile indicates the value below which a percentage of observations fall. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceivingly simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and visualises the third quartile.

set.seed(1969)
test.data <- rnorm(n=10000, mean=100, sd=15)
library(ggplot2)
ggplot(as.data.frame(test.data), aes(test.data)) +
geom_histogram(binwidth=1, aes(y=..density..), fill="dodgerblue") +
geom_line(stat="function", fun=dnorm, args=list(mean=100, sd=15), colour="red", size=1) +
geom_area(stat="function", fun=dnorm, args=list(mean=100, sd=15),
colour="red", fill="red", alpha=0.5, xlim=quantile(test.data, c(0.5, 0.75))) +
theme(text=element_text(size=16))


The quantile default function and the 95th percentile give the following results:

> quantile(test.data)
0%       25%       50%       75%      100%
39.91964  89.68041 100.16437 110.01910 153.50195

> quantile(test.data, probs=0.95)
95%
124.7775


## Methods of percentile calculations

The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods.

The differences between the nine available methods only matter in skewed distributions, such as water quality data. For the normal distribution simulated above the outcome for all methods is exactly the same, as illustrated by the following code.

> sapply(1:9, function(m) quantile(test.data, 0.95, type=m))

95%      95%      95%      95%      95%      95%      95%      95%      95%
124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775


## Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing as this is not a name used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name $latex Q_8 p &s=1. Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data? In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6). weibull.quantile <- function(x, p) { # Order Samples from large to small x <- x[order(x, decreasing=FALSE)] # Determine ranking of percentile according to Weibull (1939) r <- p * (length(x) + 1) # Linear interpolation rfrac <- (r - floor(r)) return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1]) }  ## Turbidity Data Example Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The range of weekly turbidity measurements is between 0.,05 NTU and 0.8 NTU, well below the aesthetic limits. Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53). When we calculate the percentiles for all nine methods available in the base-R function we see that the so-called Weibull method generally provides the most conservative result. ZoneR1R2R3R4R5R6R7R8R9 Bealiba0.3000.3000.2000.2400.2900.3000.2450.3000.300 Dunolly0.400000.400000.300000.340000.390000.435000.345000.405000.40125 Laanecoorie0.500000.500000.400000.440000.490000.535000.445000.505000.50125 Tarnagulla0.40.40.40.40.40.40.40.40.4 The graph and the table were created with the following code snippet: ggplot(turbidity, aes(Result)) + geom_histogram(binwidth=.05, fill="dodgerblue", aes(y=..density..)) + facet_wrap(~Zone) + theme(text=element_text(size=16)) tapply(turbidity$Result, turbidity\$Zone,
function(x) sapply(1:9, function(m) quantile(x, 0.95, type=m)))


The post Percentile Calculations in Water Quality Regulations appeared first on The Devil is in the Data.