**Climate Change Ecology » R**, and kindly contributed to R-bloggers)

Piecewise regression comes about when you have ‘breakpoints’, where there are clearly two different linear relationships in the data with a sudden, sharp change in directionality. This crops up occasionally in ecology when dealing with, for example, species richness of understory plants and forest age. There is initially a rapid drop as succession progresses (rapid colonizing plants die as light becomes more limiting), then a sudden breakpoint and a more moderately paced rise in species richness as more shade tolerant plants and seedlings of older trees begin to colonize the mature forest.

If you Google ‘R piecewise regression’, you may get a variety of methods and advice on how to run a piecewise regression. Essentially, you can do it manually or use a canned package to run the regression. If you’re like me, you might be wondering: Are these methods are the same? If not, which is best? How do I actually use them? Well, hopefully I’ll answer some of these questions.

There are essentially two methods I’ll walk through: brute force iterative approaches (as in ‘The R Book’ by Crawley) and the ‘segmented’ package. Using these approaches allows you to statistically estimate the breakpoint, which is better than just eyeballing it and fitting two models around what you *think* is the breakpoint. Let statistics do the work for you objectively.

First, let’s generate some segmented data:

x <- c(1:10, 13:22) y <- numeric(20) ## Create first segment y[1:10] <- 20:11 + rnorm(10, 0, 1.5) ## Create second segment y[11:20] <- seq(11, 15, len=10) + rnorm(10, 0, 1.5) ## Plot it par(mar=c(4,4,1,1)+0.2) plot(x,y, ylim=c(5, 20), pch=16)

You should get a V-shaped plot, where the right side is a little shallower sloped.

**METHOD 1: ITERATIVE SEARCHING**

First, we’ll model the data using the iterative search procedure described by Crawley. We’re going to use a model that looks like this:

In this case, *c* is the breakpoint. *I*(*x < c*) and *I*(*x > c*) are essentially dummy variables. *I*(*x < c*) is 1 if x is less than the breakpoint and 0 if it is above. *I*(*x > c*) is a dummy variable that is 1 if x is above the breakpoint and 0 if it is below. It should be obvious that there are two sets of parameters being modeled, depending on the value of x.

The complicated bit is choosing the breakpoint. We can eyeball the data and say that the breakpoint is somewhere between 9 and 17. Choose a wider range than you might think, just to be safe. Create a variable called breaks to hold these breakpoints:

breaks <- x[which(x >= 9 & x <= 17)]

Now we’re going to iteratively search these breakpoints for the model that has the lowest residual MSE, using that as our criteria for the best model. Create an empty container for MSE values from each model, and use a for( ) loop to run a linear regression for each possible breakpoint. Formulate the linear model exactly like the above formula.

mse <- numeric(length(breaks)) for(i in 1:length(breaks)){ piecewise1 <- lm(y ~ x*(x < breaks[i]) + x*(x>=breaks[i])) mse[i] <- summary(piecewise1)[6] } mse <- as.numeric(mse)

If we plot MSE by breakpoints, we can visually estimate the breakpoint as the lowest point on the curve:

It could easily be either 13 or 15, so I just pick the breakpoint with the lowest error:

breaks[which(mse==min(mse))]

This tells me that the breakpoint in my simulation was 15. I run my final model with my newly estimated breakpoint:

piecewise2 <- lm(y ~ x*(x < 15) + x*(x > 15)) summary(piecewise2)

The summary looks like this:

This is a pretty nasty output. Read it like this: (Intercept) is more or less useless on its own. The intercept for the line when x < 15 is (Intercept) + x < 15TRUE, or 3.3133 + 16.6352 = 19.9485. The slope of the line when x < 15 is x + x:x<15TRUE, 0.5843 – 1.3025 = -0.7182. So, when x is less than 15, the formula is 19.9485 – 0.7182x.

When x is greater than 15, the intercept is 3.3133 – 0.9116 = 2.4017 and the slope is just the variable x, 0.5843. So when x is greater than 15, the line has the equation 2.4017 + 0.5843x. Notice this table excludes the coefficients with NA estimates. Those arise from singularities and can be ignored.

plot(x,y, ylim=c(5, 20), pch=16) curve((3.3133 + 16.6352) + (0.5843-1.3025)*x, add=T, from=1, to=15) curve((3.3133 - 0.9116) + 0.5843*x, add=T, from=15, to=max(x)) abline(v=15, lty=3)

Notice that the segments **were not** constrained to be touching or continuous. This is inherent in the algorithm that we used.

**METHOD 2: USE ‘SEGMENTED’ PACKAGE**

To use the ‘segmented’ package you will, of course, need to install and load it. The procedure of ‘segmented’ uses maximum likelihood to fit a somewhat different parameterization of the model:

*I*(*x > c*) is a dummy variable as above, so when x < c, the model is essentially:

The *γ* term is simply a measure of the distance between the end of the first segment and the beginning of the next. The model converges when *γ *is minimized, thus this method **constrains the segments to be (nearly) continuous**. This is a *major* difference from the iterative approach in Method 1 above.

To use this method, you first fit a generic linear model. You then use the segmented( ) function to fit the piecewise regression. The segmented( ) function takes for its arguments the generic linear model, seg.Z which is a one sided formula describing the predictor with a segment (we only have one predictor, x, which has the segment), and psi, which is a starting value of the breakpoint (as in nls( ), you need to supply a best-guess estimate). More complicated models are a bit more complicated in terms of arguments, but this is a good starting example.

In our case, x is the predictor with a segment (it’s the only predictor) and based on my very first scatterplot (the first graph on the page), I’m going to guess the breakpoint is 14.

lin.mod <- lm(y~x) segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

Using the summary(segmented.mod) command, you will get the estimated breakpoint +/- some error (12.07 for me), the intercept and slope for the first segment, and U1.x, which is the slope of the second segment.

**UPDATE:**

**U1.x is not the slope of the second segment. It is the difference in slopes between the first and second segment. So if your coefficients are x = -0.9699 and U1.x = 1.4163, then the slope of the second segment is -0.9699+1.4163 = 0.4464. Using the slope(segmented.mod) command will give you the slopes of each segment, which should match the above calculation.**

Plotting it is simple:

plot(x,y, pch=16, ylim=c(5,20)) plot(segmented.mod, add=T)

There you go: two methods for piecewise regression. Each is relatively simple to implement, but they do very different things. If you are dead set on piecewise regression, make sure you choose the one that makes the most sense for your question: do the segments need to be continuous or can they be discontinuous? Make sure you have a rationale for either.

Also note that these models are **not**** nested, **so you can’t used likelihood ratio tests for model selection. You can try to use AIC for model selection, but I advise you to think logically about the continuity of the segments instead.

**leave a comment**for the author, please follow the link and comment on their blog:

**Climate Change Ecology » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...