Transformations for compositional data by @ellis2013nz
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In engaging with this Twitter thread four months ago, I discovered that there was a whole set of statistical methods that I knew nothing about – transforming data that is in the form of a simplex. Common examples of this sort of data would include soil composition (which the Twitter thread was about), chemical composition, time use composition – basically anything where by its very structure, each observation is constrained to add up to a constant number (most often 1). I now think this was a material gap in my skillset; a well-rounded applied statistician needs to know about this.
The original question was in essence “can I take these observations of the proportions of samples that are silt, clay and sand as points in three-dimensional space and just calculate distances between them?”. My first answer was “yes so long as the units are the same.” Then when it was pointed out to me that each observation was constrained to add up to one I thought “hmm, perhaps use just two dimensions as the third is redundant, and maybe it doesn’t matter which two”.
Turns out this is wrong – not disastrously wrong like “sort each column in your data independently before you do a regression to get higher correlations” is disastrously wrong; but at least not the best practice in dealing with data of this sort. To my credit, I did mention that I knew nothing about soil science.
People who do know about it, particularly Morgan Williams and Dylan Beaudette, luckily chipped in and mentioned that this is a known problem and there are a bunch of standard ways to deal with it.
I don’t have time to explore all the things mentioned in that Twitter thread so I’m going to focus on one of the more fundamental – the idea of working with “isometric log ratios” and calculating the distance between them, rather than the original dimensions.
Simulated data and a naive method
First let’s look at the first thing that made me uneasily feel my Twitter-brain hadn’t thought this through, my hope that you could measure distances between points using any 2 of 3 dimensions. I simulated some data from a zero-inflated 3 dimensional multivariate log normal distribution then constrained each observation to add up to 1.
Although the first raw cut of the data had positive correlations between the variables (which in my mental model was related to the ‘size’ of each sample that observations were being taken on), once you turn them into composition proportions they naturally are strongly negatively correlated. Of course, if one element is taking up 90% of the composition, the other two elements are going to be small. So a pairs plot of the data shows an interesting triangle shape.
There’s also a skewed univariate distribution for each dimension considered by itself, which is more a product of my simulation process (which gave the underlying multivariate normal data a common mean) than an essential part of the data structure:
This was done with this code:
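A minimal sketch of this kind of simulation, using base R only, might look like the following. It is not the original listing: the sample size of 100 comes from the text, but the seed, the common log-scale mean of 0, the pairwise correlation of 0.5 and the object names (`sigma`, `normal_draws`, `comp`) are all assumptions, and the zero-inflation step is left out for simplicity.

```r
set.seed(123)
n <- 100

# Assumed parameters: common mean of 0 on the log scale, unit variances and
# a positive correlation of 0.5 between each pair of dimensions.
sigma <- matrix(0.5, nrow = 3, ncol = 3)
diag(sigma) <- 1

# Correlated multivariate normal draws via the Cholesky factor of sigma
normal_draws <- matrix(rnorm(n * 3), ncol = 3) %*% chol(sigma)

# Exponentiate to get log-normal "amounts", then close each row to sum to 1
amounts <- exp(normal_draws)
comp <- amounts / rowSums(amounts)
colnames(comp) <- c("x", "y", "z")

# Pairs plot of the compositions and their correlation matrix
pairs(comp)
round(cor(comp), 2)
```

Dividing each row by its own total is what turns positively correlated amounts into proportions that compete with each other.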
Let’s look at all the 4,950 pairwise distances between those 100 observations, using x and y, x and z, y and z and all three of x, y and z.
We can easily see that my naive hope that you could just pick any arbitrary two of the three dimensions and get the same result is wrong. In fact, some high school maths would have told us this - substituting in z = (1 - x - y) for y in a calculation of the distance between two points (sqrt((x1 - x2)^2 + (y1 - y2)^2)) is going to get you different results.
The different two-dimensional distances are correlated with each other of course, and even more so with the three-dimensional distances, but the correlations are noticeably below 1.
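As a rough illustration of that comparison, here is a sketch that builds the four sets of pairwise distances with `dist()` and correlates them. The stand-in data and the column names `d_xy`, `d_xz` and `d_yz` are assumptions for this example; only `d_xyz` is a name used in the plots.

```r
set.seed(123)
raw <- matrix(rlnorm(300), ncol = 3)       # assumed stand-in data, 100 rows
comp <- raw / rowSums(raw)                 # close each row to sum to 1
colnames(comp) <- c("x", "y", "z")

# dist() returns the lower triangle of the distance matrix,
# so each column holds choose(100, 2) pairwise distances
distances <- data.frame(
  d_xy  = as.vector(dist(comp[, c("x", "y")])),
  d_xz  = as.vector(dist(comp[, c("x", "z")])),
  d_yz  = as.vector(dist(comp[, c("y", "z")])),
  d_xyz = as.vector(dist(comp))
)

nrow(distances)             # 4950
round(cor(distances), 3)
```

One way to see why the three-dimensional distances correlate best with everything else: the squared three-dimensional distance is exactly half the sum of the three squared two-dimensional distances, so it averages over the arbitrary choice of which dimension to drop.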
Code for this step will be shown a bit later because, for efficiency’s sake, I did some of the calculations in the next section at the same time and I want to talk about them first.
Isometric logarithm (or Box-Cox) centered ratios
So let’s look at the proper way to do it. My first thought on seeing the name of this technique was: ratios of what? And how do they get to be isometric?
Luckily I found this comprehensive answer by the always-impressive whuber on Cross Validated, the Stack Exchange statistics forum. It’s brilliant, clear, well-explained and with reproducible code; I nearly didn’t write this blog due to lack of any obvious value-add from me. So do yourself a favour if you’re interested in this - and in how to craft a useful answer on Cross Validated or Stack Overflow - and give it a read. There’s a couple of tiny bugs in his code so that can be my value-add, but the real value of writing this blog post is forcing me to think through the process myself.
So recall that we are dealing with an n x k matrix where each of the n rows is an observation, and the k columns represent observations of some proportions that are constrained to add up to 1. So you could work out the values of any one of the columns by subtracting the other columns’ values from one.
ILR turns out (as I understand it at the moment - bear in mind I’m self-learning here so may have got some terminology wrong) to be a three step process:
- transform the ratios (i.e. the proportions that make up individual numbers of the simplex) to make them less skewed. The ‘L’ in ILR stands for a logarithm transform at this point, but doesn’t seem to have any theoretical necessity so it makes sense to instead use a more general Box-Cox transformation (of which a logarithm is a special case)
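- center the results by subtracting each observation’s mean of the transformed values (with a log transform this is equivalent to dividing the original proportions by their geometric mean)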
- rotate them in such a way that the data becomes two dimensional
Of that last step, whuber explains:
…the hyperplane is rotated (or reflected) to coincide with the plane with vanishing kth coordinate and one uses the first k−1 coordinates. (Because rotations and reflections preserve distance they are isometries, whence the name of this procedure.)
OK… so the end result with my 3 dimensional original data will be 2 dimensions that are a transformed but still full-information version of the original.
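In symbols, for a single observation with parts x_1, ..., x_k that sum to 1, the three steps can be sketched as below. The notation is mine rather than whuber's: λ is the Box-Cox parameter (called p elsewhere in this post), and V is assumed to be any k by (k−1) matrix whose columns are orthonormal and each orthogonal to the vector of ones, for example a suitably scaled Helmert contrast matrix.

```latex
\begin{aligned}
y_i &= \begin{cases}
         \dfrac{x_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex]
         \log x_i, & \lambda = 0
       \end{cases}
    && \text{(transform)} \\
c_i &= y_i - \frac{1}{k} \sum_{j=1}^{k} y_j
    && \text{(center)} \\
z &= V^{\top} c, \qquad V^{\top} V = I_{k-1}, \quad V^{\top} \mathbf{1} = 0
    && \text{(rotate)}
\end{aligned}
```

Because the centered vector c sums to zero, it already lies in the hyperplane spanned by the columns of V, so z has the same length as c and distances between centered observations are unchanged by the rotation.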
Center and rotate with no transformation
An interesting thing about this is that if we skip the transformation of the original data (or, equivalently, transform with a Box-Cox transformation with parameter p = 1, which means the transformation is just subtracting one from it), then this final transformed version should be just a simple mean-subtraction and rotation of the original. Which would mean that calculating distances from the two transformed dimensions should get the same results as (or at worst a constant multiple of) the distances from the original three-dimensional data!
Let’s check that out in the first instance. Like the previous plot, each point in the image below represents one of the pairwise distances between the original 100 observations:
In the facet in the bottom row, second from right, we see a straight line of points. This is the perfect correlation of 1.0 between the distances calculated from our rotated data (d_ilr) and those on the original (d_xyz).
To make that perfectly clear - if you skip the ‘transform’ step of the ILR process, you end up with distances between points (from data rotated to need just two dimensions) that are perfectly correlated with the distances between points in the original three-dimensional space.
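A small numerical check of this claim, written out by hand for the three-part case, is sketched below. The stand-in data and the explicit two-row basis `H` are assumptions chosen for the illustration; any orthonormal basis of the plane orthogonal to (1, 1, 1) would do the same job.

```r
set.seed(42)
raw <- matrix(rexp(300), ncol = 3)      # assumed stand-in data, 100 rows
comp <- raw / rowSums(raw)              # each row sums to 1

# Box-Cox with p = 1 is just x - 1; then centre each row on its own mean
shifted <- comp - 1
centred <- shifted - rowMeans(shifted)

# Orthonormal basis (as rows) for the plane orthogonal to (1, 1, 1)
H <- rbind(c(-1,  1, 0) / sqrt(2),
           c(-1, -1, 2) / sqrt(6))
rotated <- centred %*% t(H)             # 100 rows, 2 columns

d_xyz <- as.vector(dist(comp))          # distances in the original 3 dimensions
d_ilr <- as.vector(dist(rotated))       # distances in the 2 rotated dimensions

all.equal(d_xyz, d_ilr)                 # TRUE
```

In this sketch the two sets of distances are not merely perfectly correlated but equal up to floating point error, because shifting and rotating never change the distance between two points.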
Center and rotate after logarithm transform
OK, so let’s put the transform back in - after all it is pretty fundamental to the concept of ILR. Starting with a logarithm (as per the ‘L’ in ILR), which is equivalent to Box-Cox with parameter lambda = 0, here is what we see:
As expected, there is no longer a simple linear correlation between the pairwise distances of points after ILR and any of the untransformed distances - whether three dimensional or the three possible sets of two dimensional distances.
But there’s something interesting here which is that the distribution of the distances after transformation and rotation has a long, thin right tail - it’s more skewed than is the distribution of distances on the original scale. It’s not nice for exploratory data analysis, where we typically look to transform data to be roughly symmetrical.
What might be going on here? Well, remember we’re looking at a plot of the pairwise distances between points after log-transform, centering and rotation. Let’s simplify things a little by just looking at the log-transformed original data:
We can see from here what an experienced data analyst would think of as the data having been transformed too much. The original simplex data has had its right-skew fixed, but overcompensated for - so we now have left-skewed data. This is the sort of situation where a Box-Cox transformation can be handy, giving us a broader range of transformations than just the logarithm.
Center and rotate after a more generalized transform
To give an idea why, here is the original data but this time with a square root transform:
Now the data is nice! In fact, for the simulated data in whuber’s example on Cross Validated, he uses a Box-Cox transformation with a parameter of 0.5 - which is very similar to taking the square root - and as he points out it works “beautifully” with his Dirichlet distribution data.
By the way, this is how those plots of transformed data were produced:
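A minimal sketch of this sort of comparison follows. It is an illustration under assumptions rather than the original plotting code: `box_cox()` and `skewness()` are hypothetical helpers written for this example, and the data is a stand-in simulation.

```r
# Box-Cox transform; lambda = 0 is defined as the logarithm
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

set.seed(123)
raw <- matrix(rlnorm(300), ncol = 3)    # assumed stand-in data, 100 rows
comp <- raw / rowSums(raw)
colnames(comp) <- c("x", "y", "z")

# Skewness of each dimension under three choices of lambda
lambdas <- c(original = 1, box_cox_half = 0.5, logarithm = 0)
skew_table <- sapply(lambdas, function(lambda) {
  apply(box_cox(comp, lambda), 2, skewness)
})
round(skew_table, 2)

# Histogram of one dimension after the lambda = 0.5 transform
hist(box_cox(comp[, "x"], 0.5), main = "x after Box-Cox, lambda = 0.5")
```

With lambda = 1 the transform only shifts the data, so that column of the table is the skewness of the untransformed proportions; positive values mean right skew and negative values mean left skew.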
So this leads to my final version of this generalised ILR procedure, this time using a Box-Cox transformation with lambda = 0.5.
So this transformation, centering and rotation is a beginning, not an ending. Whether the purpose is exploratory data analysis or more formal modelling, we would use the transformed data for that purpose.
Following the original line of thought on Twitter, I have been focusing on pair-wise distances between observations, which can be used in any number of ways from classification to multi-dimensional scaling, but that would take me beyond the scope of an already too-long blog.
Efficient transform-center-rotate and calculation of distances
Here’s the code that does the transformations and calculates pair-wise distances. As I abstracted the core tasks out into functions it made sense to have all this code at once at the end of the blog rather than interspersed with all the individual plots above.
The ilr() function below is very lightly adapted from whuber’s original on Cross Validated.
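A minimal sketch of such a function and the distance calculation, reconstructed from the three steps described earlier rather than copied from either original, could look like this. The Helmert-based rotation follows the general approach described in whuber's answer, but the argument name `p`, the helper `pairwise()`, the output column names and the stand-in data are assumptions made for this example.

```r
# Generalised ILR: Box-Cox transform (p = 0 is the log), centre, rotate
ilr <- function(x, p = 0) {
  x <- as.matrix(x)
  # Step 1: transform
  y <- if (p == 0) log(x) else (x^p - 1) / p
  # Step 2: centre each observation (row) on the mean of its transformed values
  y <- y - rowMeans(y)
  # Step 3: rotate onto k - 1 coordinates using Helmert contrasts,
  # with each row of the basis scaled to unit length
  k <- ncol(y)
  basis <- t(contr.helmert(k)) / sqrt((2:k) * (1:(k - 1)))   # (k - 1) by k
  z <- y %*% t(basis)
  colnames(z) <- paste0("ilr", seq_len(k - 1))
  z
}

# All pairwise Euclidean distances between the rows of a matrix, as a vector
pairwise <- function(m) as.vector(dist(m))

set.seed(123)
raw <- matrix(rlnorm(300), ncol = 3)    # assumed stand-in data, 100 rows
comp <- raw / rowSums(raw)
colnames(comp) <- c("x", "y", "z")

distances <- data.frame(
  d_xyz      = pairwise(comp),
  d_ilr_p1   = pairwise(ilr(comp, p = 1)),    # no real transformation
  d_ilr_log  = pairwise(ilr(comp, p = 0)),    # classic log version
  d_ilr_half = pairwise(ilr(comp, p = 0.5))   # close to a square root
)
round(cor(distances), 3)
```

Keeping the transform, centering and rotation inside one function means the only thing that changes between the three versions compared in this post is the value of `p`.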