Summarising multivariate palaeoenvironmental data
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The horseshoe effect is a well known and discussed issue with principal component analysis (PCA) (e.g. Goodall 1954; Swan 1970; NoyMeir & Austin 1970). Similar geometric artefacts also affect correspondence analysis (CA). In part 1 of this series I looked at the implications of these “artefacts” for the recovery of temporal or single dominant gradients from multivariate palaeoecological data. In part 2, I introduce the topic of principal curves (Hastie & Stuetzle 1989).
A principal curve (PC) is a smooth, onedimensional curve fitted through a data set in ( m ) dimensions in such as way that the curve “fits” the data best, for some definition of “best”, which we will take as meaning that the distances between the observations and the curve are minimised in some way. PCs can be thought of as a combination of principal components and splines and as you’ll see in part 3 of this series, this concept is not just superficial. In the figure below, data from the equation
[ y = 0.9x + 2x^2 + 1.4x^3 + , N(= 0, = 0.05) ]
are plotted and the fitted relationship between (x) and (y) is shown as produced by four methods indicated by the red line in each panel. The four methods used were
 Ordinary Least Squares (OLS)
 Principal Components Analysis (PCA)
 Cubic smoothing spline (CSS), and
 Principal Curves (PCs)
For our purposes, think of (x) and (y) as being two species. In the OLS solution, the sum of squared errors in (y) are minimised by the fitted line and hence (x) and (y) play different asymmetric roles in the model; uncertainty in (x) is considered to be zero and all the random variation is in (y). Contrast this with the way errors are minimised in PCA; the error is now with respect to both (x) and (y) and hence the distance from a point to it’s location on the principal component line is orthogonal (or at 90°) to the line. In both cases however, the fitted line, the “model”, is linear, a straight line.
Splines of various sorts have been used to provide non or semiparametric fits between (x) and (y). In panel (c) of the plot above, a cubic smoothing spline is fitted to the sample data. As with OLS, the CSS minimises squared errors in (y) only; it is a regression method afterall. The difference is that the fitted model can be a smooth curve rather than a straight line.
Prinicpal curves combine the features of the PCA approach with those of the spline. The corresponding principal curve for the sample data is shown in panel (d) and is a smooth, nonlinear fit but this time the fit has minimised the error in (x) and (y). The errors are orthogonal to the principal curve and join the observation with the point on the curve to which it projects, in other words is closest to. I’ll look at exactly how that is done later in this post.
The algorithm developed by Hastie & Stuetzle (1989) to fit a PC has two essential steps
 a projection step, and
 a local averaging step.
These two steps are iterated to convergence and can be arranged in any order; you can start with either the projection step or the local averaging step.
In the projection step, the sample locations are projected onto the current PC by finding the location along the curve closest to each point. Closest is defined in the same way as for PCA, namely the length of the line orthogonal to the curve that joins the sample point and its projection point on the curve.
For the local averaging step, one end of the curve is chosen arbitrarily and the distance along the curve from this end is determined. These distances are then used as the predictor variable in a smooth regression model to predict the response values of a single species. A model is fitted in turn to each species. This step has the effect of bending the curve towards the data. The type of smoother used for the models is a plugin to the PC algorithm; smoothing splines were initially used, but any smooth regression model could be employed, such as a kernel smoother or regression splines, or a generalized additive model.
Convergence is declared when the fitted curve is sufficiently close to that of the previous iteration, the meaning of sufficiently close being controlled by a threshold parameter that can be varied by the user to allow for more or less strict convergence.
To illustrate the fitting process, I turn to a synthetic data set from Podani & Miklós (2002), one that is similar in spirit to the data I used from Legendre & Legendre (2012 482) in the earlier post. The data set is somewhat tedious to create hence I have written a couple of functions that can generate data sets from the paper (Podani & Miklós 2002) that are available on Github. To read this code into R and recreate the data from Figure 1 (Podani & Miklós 2002) run
<span class="c1">## load Podani and Miklos datasetgenerating function from github</span>
tmp <span class="o"><</span> tempfile<span class="p">()</span>
download.file<span class="p">(</span><span class="s">"https://github.com/gavinsimpson/random_code/raw/master/podani.R"</span><span class="p">,</span>
tmp<span class="p">,</span> method <span class="o">=</span> <span class="s">"wget"</span><span class="p">)</span>
source<span class="p">(</span>tmp<span class="p">)</span>
<span class="c1">## generate data set 1 of Podani & Miklos</span>
p1 <span class="o"><</span> podani1<span class="p">()</span>
As with the data in the earlier post, when ordinated using PCA a very strong horseshoe is apparent in the solution. Projecting points on to the first PCA axis for example, the true ordering of samples along the gradient is lost at the ends of the axis where the horseshoe bends back on itself.
<span class="c1">## ordinate</span>
library<span class="p">(</span><span class="s">"vegan"</span><span class="p">)</span>
pca1 <span class="o"><</span> rda<span class="p">(</span>p1<span class="p">)</span>
<span class="c1">## plot data and PCA</span>
layout<span class="p">(</span>matrix<span class="p">(</span>c<span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">2</span><span class="p">),</span> nrow <span class="o">=</span> <span class="m">3</span><span class="p">))</span>
matplot<span class="p">(</span>p1<span class="p">,</span> type <span class="o">=</span> <span class="s">"o"</span><span class="p">,</span> pch <span class="o">=</span> <span class="m">19</span><span class="p">,</span> ylab <span class="o">=</span> <span class="s">"Abundance"</span><span class="p">,</span> xlab <span class="o">=</span> <span class="s">"Gradient"</span><span class="p">)</span>
ordipointlabel<span class="p">(</span>pca1<span class="p">,</span> display <span class="o">=</span> <span class="s">"sites"</span><span class="p">)</span>
layout<span class="p">(</span><span class="m">1</span><span class="p">)</span>
There are several R packages that fit PCs as described above, including princurve, which was converted from the S original implementation of Trevor Hastie, and pcurve, Glenn De’ath’s port of Hastie’s code and given an ecological makeover as described in (De’ath 1999) and which I now maintain. However, I will use the interface I wrote for the analogue package, which is a wrapper to the principal.curve()
function of princurve, designed specifically for working with palaeo data. To fit a PC to the example data, use
<span class="c1">## Load analogue</span>
library<span class="p">(</span><span class="s">"analogue"</span><span class="p">)</span>
prc1 <span class="o"><</span> prcurve<span class="p">(</span>p1<span class="p">,</span> trace <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> plotit <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> thresh <span class="o">=</span> <span class="m">0.0005</span><span class="p">,</span> maxit <span class="o">=</span> <span class="m">50</span><span class="p">)</span>
Because I used trace = TRUE
, information on the fitting process is printed to the screen. Note that I increased the maximum number of iterations to 50 (maxit = 50
) and reduced the default convergence tolerance (thresh = 0.0005
) to give a closer fit and to make a more interesting animation showing how the curve in sequentially updated. The plotit
argument will draw a plot of the current iteration’s PC in a PCA of the data. The figure below shows this updating of the curve as an animation
Although I haven’t actually checked, I suspect the reason the fitted curve moves away from the samples at the top of the figure in later iterations is that it is being bent closer to samples in other dimensions than those shown on the PCA. If you were to use the default tolerance, the algorithm converges after five iterations and is much closer to the samples at the top of the figure and has a slightly better fit than the one shown in the figure, however, it is not selfconsistent given the tighter tolerance of thresh = 0.0005
. A curve is said to be selfconsistent if for any point on the curve, the average of the points that project there coincides with the point on the curve. Hastie & Stuetzle (1989) note that they are “unable to prove… that each step guarantees a decrease in the criterion [sum of squared errors]”
, which fits with our observations here. prcurve()
spat the following information to the screen during fitting
<span class="go">R> prc1 < prcurve(p1, trace = TRUE, plotit = TRUE, thresh = 0.0005,</span>
<span class="gp">+ </span> maxit <span class="o">=</span> <span class="m">50</span><span class="p">)</span>
<span class="go"></span>
<span class="go">Initial curve: d.sq: 415.4000</span>
<span class="go">Iteration 1: d.sq: 225.0781</span>
<span class="go">Iteration 2: d.sq: 226.6742</span>
<span class="go">Iteration 3: d.sq: 227.8931</span>
<span class="go">Iteration 4: d.sq: 229.2305</span>
<span class="go">Iteration 5: d.sq: 229.0200</span>
<span class="go">Iteration 6: d.sq: 229.4651</span>
<span class="go">Iteration 7: d.sq: 230.2869</span>
<span class="go">Iteration 8: d.sq: 231.3375</span>
<span class="go">Iteration 9: d.sq: 232.8548</span>
<span class="go">Iteration 10: d.sq: 234.2305</span>
<span class="go">Iteration 11: d.sq: 235.1783</span>
<span class="go">Iteration 12: d.sq: 235.4467</span>
<span class="go">Iteration 13: d.sq: 235.1242</span>
<span class="go">Iteration 14: d.sq: 236.7539</span>
<span class="go">Iteration 15: d.sq: 242.6807</span>
<span class="go">Iteration 16: d.sq: 247.1509</span>
<span class="go">Iteration 17: d.sq: 248.8073</span>
<span class="go">Iteration 18: d.sq: 249.5422</span>
<span class="go">Iteration 19: d.sq: 250.2597</span>
<span class="go">Iteration 20: d.sq: 250.9525</span>
<span class="go">Iteration 21: d.sq: 251.5495</span>
<span class="go">Iteration 22: d.sq: 252.0903</span>
<span class="go">Iteration 23: d.sq: 252.5108</span>
<span class="go">Iteration 24: d.sq: 252.8692</span>
<span class="go">Iteration 25: d.sq: 253.1439</span>
<span class="go">Iteration 26: d.sq: 253.4109</span>
<span class="go">Iteration 27: d.sq: 253.6296</span>
<span class="go">Iteration 28: d.sq: 253.8369</span>
<span class="go">Iteration 29: d.sq: 254.0217</span>
<span class="go">Iteration 30: d.sq: 254.1867</span>
<span class="go">Iteration 31: d.sq: 254.3728</span>
<span class="go">Iteration 32: d.sq: 254.5084</span>
<span class="go">Iteration 33: d.sq: 254.6286</span>
<span class="go"></span>
<span class="go">PC Converged in 33 iterations.</span>
<span class="go"></span>
33 iterations were required before convergence in this instance. Printing the object returned by prcurve()
yields further information on the fitted curve
<span class="go">R> prc1</span>
<span class="go"> Principal Curve Fitting</span>
<span class="go">Call: prcurve(X = p1, maxit = 50, trace = TRUE, thresh = 5e04, plotit</span>
<span class="go">= TRUE)</span>
<span class="go">Algorithm converged after 33 iterations</span>
<span class="go"> SumSq Proportion</span>
<span class="go">Total 415 1.00</span>
<span class="go">Explained 161 0.39</span>
<span class="go">Residual 255 0.61</span>
<span class="go">Fitted curve uses 336.433 degrees of freedom.</span>
The curve explains 39% of the variance in the example data set and whilst there is a significant amount of unexplained variance, the PC is a significant improvement over the % fit that the PCA affords these data (the value displayed below is a proportion) although it does use a significantly higher number of degrees of freedom in doing so (~336 compared with 42 for the PCA — the first principal component is a linear combination of 42 species scores.).
<span class="go">R> varExpl(pca1)</span>
<span class="go"> PC1 </span>
<span class="go">0.05963</span>
There’s not a whole lot more to principal curves, except a bit of fiddling and plenty of details on how those smoothers are fitted. I don’t want this post to turn into an even larger missive than it already is, so I’ll postpone those details to another post. In the meantime, I’ll show a principal curve fitted to the Abernethy Forest data I introduced at the end of the previous post. I’ll quickly run through the code without much explanation as I’ll cover the details in part three of this series.
<span class="c1">## Load the Abernethy Forest data</span>
data<span class="p">(</span>abernethy<span class="p">)</span>
<span class="c1">## Remove the Depth and Age variables</span>
abernethy2 <span class="o"><</span> abernethy<span class="p">[,</span> <span class="o"></span><span class="p">(</span><span class="m">37</span><span class="o">:</span><span class="m">38</span><span class="p">)]</span>
<span class="c1">## Fit the principal curve varying complexity for each species</span>
aber.pc <span class="o"><</span> prcurve<span class="p">(</span>abernethy2<span class="p">,</span> trace <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> vary <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> penalty <span class="o">=</span> <span class="m">1.4</span><span class="p">)</span>
aber.pc
<span class="c1">## Plot</span>
op <span class="o"><</span> par<span class="p">(</span>mar <span class="o">=</span> c<span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">2</span><span class="p">)</span> <span class="o">+</span> <span class="m">0.1</span><span class="p">)</span>
plot<span class="p">(</span>aber.pc<span class="p">)</span>
par<span class="p">(</span>op<span class="p">)</span>
<span class="c1">## 3d plot</span>
plot3d<span class="p">(</span>aber.pc<span class="p">)</span>
The plot below (code not shown for this) shows how the responses of the main taxa in the data set vary as the algorithm iterates to convergence in six iterations. The major changes happen in the first 2 iterations.
The fitted PC is shown below
and explains 96% of the variance in the data
R<span class="o">></span> aber.pc
Principal Curve Fitting
Call<span class="o">:</span> prcurve<span class="p">(</span>X <span class="o">=</span> abernethy2<span class="p">,</span> vary <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> trace <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span> penalty <span class="o">=</span> <span class="m">1.4</span><span class="p">)</span>
Algorithm converged after <span class="m">6</span> iterations
SumSq Proportion
Total <span class="m">103234</span> <span class="m">1.00</span>
Explained <span class="m">98864</span> <span class="m">0.96</span>
Residual <span class="m">4370</span> <span class="m">0.04</span>
Fitted curve uses <span class="m">218.339</span> degrees of freedom.
and we’d need 4 or 5 PCA axes to do as well as this, and even then we’d be in the unenviable position of having to visualize that 5d space.
R<span class="o">></span> varExpl<span class="p">(</span>rda<span class="p">(</span>abernethy2<span class="p">),</span> axes <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">,</span> cumulative <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
PC1 PC2 PC3 PC4 PC5
<span class="m">0.4650</span> <span class="m">0.8022</span> <span class="m">0.9059</span> <span class="m">0.9439</span> <span class="m">0.9703</span>
The principal curve, whilst using a larger number of degrees of freedom (~220 vs ~180 for a 5axis PCA solution), embeds the same amount of information in a single variable, the arc length distance along the principal curve, which may be used as one would a PCA axis score.
By way of some eye candy to close^{1}, the output of plot3d(aber.pc)
is an rgl device window which contains a spinable, zoomable 3D representation of the data and the fitted curve. The axes of the plot are the first three principal components, the points are where the samples rfom the core are located in this reduced space and the orange line is the fitted principal curve. Rather than show a static image, I used the rgl package to spin the display and wrapped this up in a video, displayed below. The video can be streamed but weighs in at 11Mb (! — I struggled to keep the text legible), just so you know.
Seeing the curve (and the data) in three dimensions rather than the usual two is quite illuminating!
That’s it for part 2. Next time I’ll take a closer look at the prcurve()
function, what the options for fitting principal curves are, among other things.
References
De’ath G. et al. (1999) Principal Curves: a new technique for indirect and direct gradient analysis. Ecology 80, 2237–2253.
Goodall D.W. et al. (1954) Objective methods for the classification of vegetation. III. An essay in the use of factor analysis. Australian Journal of Botany 2, 304–324.
Hastie T. & Stuetzle W. et al. (1989) Principal Curves. Journal of the American Statistical Association 84, 502–516.
Legendre P. & Legendre L. et al. (2012) Numerical Ecology, 3rd edn. Elsevier.
NoyMeir I. & Austin M.P. et al. (1970) Principal Component Ordination and Simulated Vegetational Data. Ecology 51, 551–552.
Podani J. & Miklós I. et al. (2002) Resemblance Coefficients and the Horseshoe Effect in Principal Coordinates Analysis. Ecology 83, 3331–3343.
Swan J.M.A. et al. (1970) An Examination of Some Ordination Problems By Use of Simulated Vegetational Data. Ecology 51, 89–102.

Well, not that much candy; I’m still getting to grips with rgl and creating videos from individual PNG frames….↩
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.