**Statistical Research » R**, and kindly contributed to R-bloggers)

I had someone ask me the other day how to take a scatterplot and draw something other than a straight line through the graph using Excel. Yes, it can be done in Excel and it’s really quite simple, but there are some limitations when using the stock Excel dialog screens. So it is probably in one’s best interest to use a higher quality statistical software package.

There are several ways to add a curve to a scatter plot. This could be something as simple as adding a polynomial to the model or something a little more complex like a kernel curve, LOESS, or splines.

**Linear vs. Non-Linear**

First, a difference needs to be made between linear and non-linear regression. Linear regression can also include drawing “lines” that, when plotted on a graph, actually have a curved shape. Polynomial regression is a type linear regression that produces curves and is referred to a curvilinear regression. Specifically, if you can take the partial derivative of the equation and the result is a function that still includes the unknown parameters then the model is considered nonlinear. Polynomials continue remain a linear combination. Take the following linear model equation.

The partial derivatives, as seen below, are no longer functions of . This means that even though it is a polynomial this is a linear model.

**Example Data**

An example that I often use is the speed of a car and the fuel economy. This often shows a strong curvilinear relationship. A lot of other data including biological data and data relating to speed and weight can often have a form of curved relationship. This graph is from the 2012 Fuel Economy Guide.

Data can also be simulated that is curvilinear. In this case I added a cubic component and show the straight line regression.

The example I’ll use is the *cars* dataset that is shipped with R in the *datasets *library. It has two variables and is the speed of the car and the stopping distance.

**Polynomials**

Polynomials can be a very good choice because a straight forward closed-form solution can easily be obtained. Additionally, polynomial models are fairly easy to implement and the models can be easily compared to other predictive models.

**Smoothing Spline**

Like other smoothers the *spline* uses a range of the *x *value to determine its smoothness. This is referred to as the *knot**. *The more knots the tighter the fit of the model. Fewer knots produce a smoother curve. Here the natural spline (green) and the smoothing spline (blue) are fairly similar.

**Kernel Smoothing**

Kernel smoothing uses a weighted average of the corresponding x/y point. The sensitivity is based on a predefined value called *bandwidth*. The greater the bandwidth the smoother the curve will be as it dampened down by the additional data points but it doesn’t handle localized variation very well. A smaller bandwidth can be very sensitive to local variation and will allow some very dramatic changes in the curve. For example here is a comparison of a bandwidth of 1.0 and 5.0. The smoother curve is, of course, the larger bandwidth.

**LOESS and LOWESS**

This is a non-parametric locally weighted regression using a nearest neighbor approach. It too uses a value to control the smoothing. In R this is referred to as *span* but can also be referred to as *bandwidth, *similar to kernel smoothing. I nice feature the comes with LOESS is it’s ability to produce uncertainly around the prediction. This means that confidence intervals can easily be produced and plotted.

**Closing Thoughts**

It can be a little troublesome trying to determine which is best model or whether the curve is over/under fit. It may also depend on how sensitive to local variation you want to allow. There are options available to help determine. For example the MGCV package in R has some feature to help with this. As can be seen in this example all the approach result in very similar solutions. If all you need is some sort of smoother through a scatterplot then it may be best to use the simplest approach. However, if one is trying to make predictions then a lot more care needs to be taken to fit the model properly.

**Example Code**

[sourecode language=”css”]

library(graphics)

library(splines) # Used for the ns() function — (natural cubic splines)

R = matrix(cbind(1,.99, .99,1),nrow=2)

U = t(chol(R))

nvars = dim(U)[1]

numobs = 1000

set.seed(1)

random.normal = matrix(rnorm(nvars*numobs,10,1), nrow=nvars, ncol=numobs);

X = U %*% random.normal

newX = t(X)

raw = as.data.frame(newX)

orig.raw = as.data.frame(t(random.normal))

names(raw) = c(“response”,”predictor1″)

raw$predictor1.3 = raw$predictor1^3

raw$predictor1.2 = raw$predictor1^2

fit = lm(raw$response ~ raw$predictor1.3)

plot(raw$response ~ raw$predictor1.3, pch=16, cex=.4, xlab=”Predictor”, ylab=”Response”, main=”Simulated Data with Slight Curve”)

abline(fit)

x <- with(cars, speed)

y <- with(cars, dist)

eval.length = 50

?cars

# This LOESS shows two different R function arriving at the same solution.

# Careful using the LOESS defaults as they differ and will produce different solutions.

fit.loess = loess.smooth(x, y, evaluation = eval.length,

family=”gaussian”, span=.75, degree=1)

fit.loess2= loess(y ~ x, family=”gaussian”,

span=.75, degree=1)

## Set a simple 95% CI on the fit.loess model

new.x = seq(min(x),max(x), length.out=eval.length)

ci = cbind(

predict(fit.loess2, data.frame(x=new.x)),

predict(fit.loess2, data.frame(x=new.x))+

predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2),

predict(fit.loess2, data.frame(x=new.x))-

predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2)

)

## Linear Model

fit = lm(y ~ x )

## Polynomial

fit.3 = lm(y ~ poly(x,3) )

## Natural Spline

fit.ns.3 = lm(y ~ ns(x, 3) )

## Smoothing Spline

fit.sp = smooth.spline(y ~ x, nknots=15)

plot(x,y, xlim=c(min(x),max(x)), ylim=c(min(y),max(y)), pch=16, cex=.5,

ylab = “Stopping Distance (feet)”, xlab= “Speed (MPH)”, main=”Comparison of Models”

, sub=”Splines”)

## Add additional models on top of graph. It can get cluttered with all the models.

## LOESS with Confidence Intervals

matplot(new.x, ci, lty = c(1,2,2), col=c(1,2,2), type = “l”, add=T)

## Linear

lines(new.x, predict(fit, data.frame(x=new.x)), col=’orange’, lty=3)

## Polynomial

lines(new.x, predict(fit.3, data.frame(x=new.x)), col=’light blue’, lty=4)

## Natural Spline

lines(new.x, predict(fit.ns.3, data.frame(x=new.x)), col=’green’, lty=5)

## Smoothing Spline

lines(fit.sp, col=’blue’, lty=6)

## Kernel Curve

lines(ksmooth(x, y, “normal”, bandwidth = 5), col = ‘purple’, lty=7)

legend(“topleft”,c(“Linear”,”Polynomial”,”Natural Spline”,”Smoothing Spline”,”Kernel”),

col=c(‘black’,’light blue’,’green’,’blue’,’purple’), lty=c(3,4,5,6,7), lwd=2)

[/sourcecode]

**leave a comment**for the author, please follow the link and comment on his blog:

**Statistical Research » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...