(This article was first published on **R snippets**, and kindly contributed to R-bloggers)

In the smooth.spline procedure one can use either the df or the spar parameter to control the smoothing level. Usually they are not set manually, but recently I was asked which of them is the better measure of regularization level. Hastie et al. (2009) discuss the advantages of df, but I thought of a simple graphical illustration of this fact.
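As a quick reminder of how the two parameters are passed, here is a minimal sketch. The sample size and the values df = 8 and spar = 0.8 are arbitrary assumptions chosen for illustration; the test function is the one used in the simulation in this post.

```r
# Toy data; sample size is an arbitrary choice for this illustration
set.seed(1)
x <- runif(200, -2, 2)
y <- x^2 / 2 + sin(4 * x) + rnorm(200)

# Fix the smoothing level through equivalent degrees of freedom
fit.df <- smooth.spline(x, y, df = 8)
# Fix the smoothing level through the spar parameter
fit.spar <- smooth.spline(x, y, spar = 0.8)

# Each fitted object reports both quantities, whichever one was specified
c(df = fit.df$df, spar = fit.df$spar)
c(df = fit.spar$df, spar = fit.spar$spar)
```

Comparing the two printed vectors shows that either parameter determines the other for a given data set, which is why the question of which one to treat as the measure of regularization arises at all.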

I use the following criterion to judge whether a quantity measures the roughness penalty well: *Increasing the training sample size should influence the optimal roughness penalty level in a monotonic way.*
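The criterion can be stated as a small check on the sequence of optimal parameter values, ordered by training sample size. The helper name is.monotonic and the input vectors below are hypothetical, made up only to show the behaviour of the check; they are not simulation results.

```r
# Hypothetical helper: TRUE when a numeric vector never changes direction
is.monotonic <- function(v) {
    d <- diff(v)
    all(d >= 0) || all(d <= 0)
}

# Made-up sequences of "optimal" values, ordered by increasing sample size
is.monotonic(c(6.1, 7.4, 8.0, 9.3))   # TRUE
is.monotonic(c(0.9, 0.7, 0.8, 0.6))   # FALSE
```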

I compared the performance of the df and spar parametrizations using a sample function (defined in the gen.data function) for different training sample sizes. Here is the code for the df parametrization:

    set.seed(1)

    # Sample function: quadratic trend plus sine wave plus standard normal noise
    gen.data <- function(n) {
        x <- runif(n, -2, 2)
        y <- x ^ 2 / 2 + sin(4 * x) + rnorm(n)
        return(data.frame(x, y))
    }

    df.levels <- seq(5, 15, length.out = 100)
    n.train <- (3 ^ (0:5)) * (2 ^ (6:1))
    cols <- 1
    reps <- 100
    valid <- gen.data(100000)

    plot(NULL, xlab = "df", ylab = "mse",
         xlim = c(5, 18), ylim = c(1, 1.3))
    for (n in n.train) {
        # Accumulate validation MSE over reps training samples of size n
        mse <- rep(0, length(df.levels))
        for (j in 1:reps) {
            train <- gen.data(n)
            for (i in seq(along.with = df.levels)) {
                ss <- smooth.spline(train, df = df.levels[i])
                ss.y <- predict(ss, valid$x)$y
                mse[i] <- mse[i] + mean((ss.y - valid$y) ^ 2)
            }
        }
        mse <- mse / reps
        # One curve per sample size, with its minimum marked by a dot
        lines(df.levels, mse, col = cols, lwd = 2)
        points(df.levels[which.min(mse)], min(mse),
               col = cols, pch = 19)
        text(15, mse[length(mse)], paste("n =", n),
             col = cols, pos = 4)
        cols <- cols + 1
    }

It produces the following result:

The plot shows the desired property. A similar plot can be obtained for the spar parameter by a simple modification of the code.
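A rough sketch of what such a modification might look like is given below. It is not the original code: the grid spar.levels is an assumption, reps and the validation set size are reduced to keep the run short, and the plotting calls are left out so that only the MSE-minimizing spar for each sample size is recorded.

```r
set.seed(1)

gen.data <- function(n) {
    x <- runif(n, -2, 2)
    y <- x^2 / 2 + sin(4 * x) + rnorm(n)
    data.frame(x, y)
}

# Assumed grid of spar values; not taken from the original simulation
spar.levels <- seq(0.2, 1.2, length.out = 25)
n.train <- (3^(0:5)) * (2^(6:1))
reps <- 10                   # reduced from 100 to keep the sketch fast
valid <- gen.data(10000)     # reduced from 100000 for the same reason

best.spar <- numeric(length(n.train))
for (k in seq_along(n.train)) {
    mse <- rep(0, length(spar.levels))
    for (j in 1:reps) {
        train <- gen.data(n.train[k])
        for (i in seq_along(spar.levels)) {
            # The only substantive change: smooth by spar instead of df
            ss <- smooth.spline(train, spar = spar.levels[i])
            ss.y <- predict(ss, valid$x)$y
            mse[i] <- mse[i] + mean((ss.y - valid$y)^2)
        }
    }
    # Summing over reps is enough here, as dividing would not move the minimum
    best.spar[k] <- spar.levels[which.min(mse)]
}

# Optimal spar on the assumed grid for each training sample size
data.frame(n = n.train, best.spar)
```

With so few repetitions the located minima are noisy, so this table should be read only as a template for the full simulation.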

It is easy to notice that the optimal values of spar do not change in a monotonic way as the number of observations increases.

This comparison shows that df is a better measure of regularization level than spar.

Additionally, one can notice that the curves for different training sample sizes *intersect* under the spar parametrization, which is unexpected. It might be due only to the randomness of the data generation process, but I have run the simulation several times and the curves always intersected. Unfortunately, I do not have a proof of what should happen when the size of the valid data set and the reps parameter both tend to infinity.
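One way to get a feeling for how much of such an effect could be simulation noise is to keep the per-repetition MSE values and look at their standard error. The sketch below does this for a single, arbitrarily chosen combination (n = 64, spar = 0.8, 20 repetitions, smaller validation set); all of these values are assumptions, and the standard error reflects only the variability across training samples, since the validation set is held fixed.

```r
set.seed(1)

gen.data <- function(n) {
    x <- runif(n, -2, 2)
    y <- x^2 / 2 + sin(4 * x) + rnorm(n)
    data.frame(x, y)
}

valid <- gen.data(10000)   # assumed size, smaller than in the post
reps <- 20                 # assumed number of repetitions
n <- 64                    # one training sample size from n.train
spar <- 0.8                # arbitrary spar value for illustration

# Validation MSE for each independently generated training sample
mse.rep <- replicate(reps, {
    train <- gen.data(n)
    ss <- smooth.spline(train, spar = spar)
    mean((predict(ss, valid$x)$y - valid$y)^2)
})

# Monte Carlo estimate of the mean MSE and its standard error
c(mean = mean(mse.rep), se = sd(mse.rep) / sqrt(reps))
```

If the gap between two curves near a crossing point is small relative to such standard errors, the crossing may not be distinguishable from noise at that number of repetitions.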
