Optimal regularization for smoothing splines

December 16, 2011

(This article was first published on R snippets, and kindly contributed to R-bloggers)

In the smooth.spline procedure one can use either the df or the spar parameter to control the smoothing level. Usually they are not set manually, but recently I was asked which of them is a better measure of the regularization level. Hastie et al. (2009) discuss the advantages of df, but I thought of a simple graphical illustration of this fact.
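As a quick illustration of how the two parameters relate, the sketch below fits smooth.spline with a fixed df for two sample sizes and prints the spar value the fit ends up with. The helper name fit.at.df, the noisy test function on [-2, 2] and the sample sizes are assumptions chosen for illustration only.

```r
# Illustrative sketch: fit.at.df, the test function and the sample sizes
# are assumptions, not part of the simulation discussed in the article.
set.seed(1)

fit.at.df <- function(n, df) {
  x <- runif(n, -2, 2)
  y <- x^2 / 2 + sin(4 * x) + rnorm(n)
  smooth.spline(x, y, df = df)
}

# Request the same df for two sample sizes and inspect the spar
# that smooth.spline selected to reach it.
for (n in c(100, 1000)) {
  fit <- fit.at.df(n, df = 8)
  cat("n =", n, " df =", round(fit$df, 2),
      " spar =", round(fit$spar, 3), "\n")
}
```

The fitted object stores both quantities (fit$df and fit$spar), so either one can be read back regardless of which was used to specify the fit.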
I used the following criterion to judge whether a quantity measures the roughness penalty well: increasing the training sample size should influence the optimal value of the roughness penalty level in a monotonic way.
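The criterion can be stated as a small check on the sequence of optimal penalty values, ordered by training sample size. The helper is.monotonic is hypothetical, and the numbers passed to it are made-up placeholders rather than simulation results.

```r
# Hypothetical helper: TRUE if v is entirely non-decreasing
# or entirely non-increasing.
is.monotonic <- function(v) {
  d <- diff(v)
  all(d >= 0) || all(d <= 0)
}

# Placeholder inputs (not simulation output), ordered by sample size.
is.monotonic(c(6.1, 7.4, 8.0, 9.3))
is.monotonic(c(0.9, 0.7, 0.8, 0.6))
```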
I compared the performance of the df and spar parametrizations using a sample function (defined in the gen.data function) for different training sample sizes. Here is the code for the df parametrization:

gen.data <- function(n) {
      # x is uniform on [-2, 2]; y is a smooth signal plus standard normal noise
      x <- runif(n, -2, 2)
      y <- x ^ 2 / 2 + sin(4 * x) + rnorm(n)
      return(data.frame(x, y))
}

df.levels <- seq(5, 15, length.out = 100)
n.train <- (3 ^ (0 : 5)) * (2 ^ (6 : 1))
cols <- 1
reps <- 100
valid <- gen.data(100000)
plot(NULL, xlab = "df", ylab = "mse",
     xlim = c(5, 18), ylim = c(1, 1.3))
for (n in n.train) {
      # accumulate validation MSE for each df level over reps training samples
      mse <- rep(0, length(df.levels))
      for (j in 1 : reps) {
            train <- gen.data(n)
            for (i in seq(along.with = df.levels)) {
                  ss <- smooth.spline(train, df = df.levels[i])
                  ss.y <- predict(ss, valid$x)$y
                  mse[i] <- mse[i] + mean((ss.y - valid$y) ^ 2)
            }
      }
      mse <- mse / reps
      # draw the curve, mark its minimum and label it with the sample size
      lines(df.levels, mse, col = cols, lwd = 2)
      points(df.levels[which.min(mse)], min(mse),
             col = cols, pch = 19)
      text(15, mse[length(mse)], paste("n =", n),
           col = cols, pos = 4)
      cols <- cols + 1
}

It produces the following result:

The plot shows the desired property: the optimal df changes monotonically with the training sample size. A similar plot can be obtained for the spar parameter by a simple modification of the code.
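A minimal sketch of what such a modification might look like is given here. It is not the code used to produce the plots: the spar grid, the helper name mse.by.spar and the reduced validation size and number of replications are all assumptions, chosen so that the example runs quickly.

```r
# Sketch of a spar variant. The grid, the helper mse.by.spar and the
# reduced sizes are assumptions made to keep the run time short.
gen.data <- function(n) {
  x <- runif(n, -2, 2)
  y <- x^2 / 2 + sin(4 * x) + rnorm(n)
  data.frame(x, y)
}

# Average validation MSE over `reps` training samples for each spar value.
mse.by.spar <- function(n, spar.levels, valid, reps) {
  mse <- rep(0, length(spar.levels))
  for (j in seq_len(reps)) {
    train <- gen.data(n)
    for (i in seq_along(spar.levels)) {
      ss <- smooth.spline(train, spar = spar.levels[i])
      ss.y <- predict(ss, valid$x)$y
      mse[i] <- mse[i] + mean((ss.y - valid$y)^2)
    }
  }
  mse / reps
}

set.seed(1)
spar.levels <- seq(0.4, 1.2, length.out = 20)  # assumed grid
valid <- gen.data(10000)
mse <- mse.by.spar(n = 256, spar.levels, valid, reps = 10)

# spar value with the lowest average validation MSE for this n
spar.levels[which.min(mse)]
```

Calling mse.by.spar for each value in n.train and drawing the result with lines(spar.levels, mse) would give the spar counterpart of the df plot; with settings as large as those in the df code (100 grid values, 100 replications, 100000 validation points) the run takes considerably longer.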

In the spar plot it is easy to see that the optimal values of spar do not change in a monotonic way as the number of observations increases.

This comparison shows that df is a better measure of the regularization level than spar.

Additionally, one can notice that the curves for different training sample sizes intersect under the spar parametrization, which is unexpected. It might be due only to the randomness of the data generation process, but I have run the simulation several times and the curves always intersected. Unfortunately, I do not have a proof of what should happen when the validation data set size and the reps parameter both tend to infinity.
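One rough way to gauge how much of such an effect could be simulation noise is to keep the validation MSE of every replication instead of only their sum, and then look at the Monte Carlo standard error of the average. The sketch below does this for a single sample size and a single spar value; the helper mse.reps and all the numeric settings are assumptions for illustration.

```r
# Sketch: Monte Carlo standard error of the average validation MSE.
# mse.reps and the settings below are assumptions, not the article's code.
gen.data <- function(n) {
  x <- runif(n, -2, 2)
  y <- x^2 / 2 + sin(4 * x) + rnorm(n)
  data.frame(x, y)
}

# Validation MSE of each of `reps` independent training samples
# at one fixed spar value.
mse.reps <- function(n, spar, valid, reps) {
  vapply(seq_len(reps), function(j) {
    ss <- smooth.spline(gen.data(n), spar = spar)
    mean((predict(ss, valid$x)$y - valid$y)^2)
  }, numeric(1))
}

set.seed(1)
valid <- gen.data(10000)
m <- mse.reps(n = 128, spar = 0.8, valid, reps = 30)

# Average MSE and its Monte Carlo standard error across replications
c(mean = mean(m), se = sd(m) / sqrt(length(m)))
```

If two curves differ at some spar value by much more than a few such standard errors, replication noise alone is an unlikely explanation. This does not account for the noise in the validation set itself, which is shared by all curves.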
