Recently I ran an exam in which the following question caused many problems for students (I give a shortened formulation here). You are given the data generating process *y* = 10*x* + *e*, where *e* is an error term. Fit a linear regression using lm, neural nets using nnet with size equal to 2 and 10, and a regression tree using rpart. What can be said about the distribution of the prediction error of these four modeling techniques?

Here is the code that generates the required comparison, assuming that *x* ~ U(0, 1) and *e* ~ N(0, 1), for two example training sample sizes: 20 and 200.

    library(rpart)
    library(nnet)

    # Fit all four models on a training sample of size n and return, for each,
    # the sum of squared errors against the true mean 10 * x on a dense grid
    run <- function(n) {
        x <- runif(n)
        y <- 10 * x + rnorm(n)
        new.x <- data.frame(x = seq(0, 1, len = 10000))
        models <- list(linear = lm(y ~ x),
                       tree   = rpart(y ~ x),
                       nnet2  = nnet(y ~ x, size = 2,
                                     trace = FALSE, linout = TRUE),
                       nnet10 = nnet(y ~ x, size = 10,
                                     trace = FALSE, linout = TRUE))
        sapply(models, function(model) {
            pred <- predict(model, newdata = new.x)
            sum((pred - 10 * new.x$x) ^ 2)
        })
    }

    set.seed(1)
    # 100 replications for each training sample size
    for (n in c(20, 200)) {
        cat("--- n =", n, "---\n")
        print(summary(t(replicate(100, run(n)))))
    }

    # --- n = 20 ---
    #      linear             tree           nnet2             nnet10
    #  Min.   :  17.32   Min.   :21046   Min.   :  322.9   Min.   :    566
    #  1st Qu.: 247.25   1st Qu.:22562   1st Qu.: 1753.1   1st Qu.:   5759
    #  Median : 725.22   Median :24537   Median : 3419.2   Median :  10961
    #  Mean   :1071.07   Mean   :25644   Mean   : 7221.4   Mean   :  87200
    #  3rd Qu.:1651.43   3rd Qu.:27559   3rd Qu.: 6877.1   3rd Qu.:  22494
    #  Max.   :6614.57   Max.   :40742   Max.   :84169.8   Max.   :4309641

    # --- n = 200 ---
    #      linear             tree          nnet2              nnet10
    #  Min.   :  1.107   Min.   :1976   Min.   :   32.62   Min.   :  119.7
    #  1st Qu.: 25.939   1st Qu.:2851   1st Qu.:  183.82   1st Qu.:  313.4
    #  Median : 76.533   Median :3366   Median :  293.65   Median :  531.5
    #  Mean   :112.766   Mean   :3490   Mean   : 2008.36   Mean   : 2211.1
    #  3rd Qu.:160.217   3rd Qu.:3921   3rd Qu.:  479.10   3rd Qu.:  742.3
    #  Max.   :568.374   Max.   :6502   Max.   :83603.10   Max.   :83444.6

It is clear that linear regression performs best, as it is correctly specified. Judging by the median error, it is generally followed by the neural net with size 2, then the neural net with size 10, and finally the regression tree. The neural nets use S-shaped transformations and have effectively more parameters than are needed to fit the relationship. The regression tree is simply not well suited to modeling linear relationships between variables.
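To see why the tree struggles here, it may help to look at the shape of its predictions. The sketch below is my own illustration, not part of the exam solution: it fits `rpart` and `lm` once on a sample of size 200 (the seed and sample size are arbitrary assumptions) and counts how many distinct values each model predicts on the evaluation grid.

```r
library(rpart)

set.seed(1)                       # arbitrary seed, chosen only for reproducibility
n <- 200
x <- runif(n)
y <- 10 * x + rnorm(n)
new.x <- data.frame(x = seq(0, 1, len = 10000))

tree <- rpart(y ~ x)
lin  <- lm(y ~ x)

tree.pred <- predict(tree, newdata = new.x)
lin.pred  <- predict(lin, newdata = new.x)

# A regression tree predicts one constant per leaf, so on a fine grid
# it can only produce as many distinct values as it has leaves
n.levels <- length(unique(tree.pred))
n.leaves <- sum(tree$frame$var == "<leaf>")

cat("distinct tree predictions:  ", n.levels, "\n")
cat("leaves in the fitted tree:  ", n.leaves, "\n")
cat("distinct linear predictions:", length(unique(lin.pred)), "\n")
```

The tree approximates the straight line by a step function with a handful of steps, so within each step the prediction error grows as *x* moves away from the point where the step crosses the line.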

However, neural nets are initialized with random parameters, and sometimes the BFGS optimization fails and a very poor fit results. This can be seen in the large values of Max. for nnet2 and nnet10. The median of the results is largely unaffected by this, but the estimate of the mean error is very unstable because of the outliers (more than 100 replications are needed to get more reliable estimates).
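One way to isolate the effect of random initialization is to keep the training data fixed and refit the net several times. The following is a minimal sketch under my own assumptions: `fit.error` is a hypothetical helper, and the number of refits (20) and the seed are arbitrary. Because the data do not change, any spread in the errors comes only from the starting weights.

```r
library(nnet)

set.seed(1)                       # arbitrary seed
n <- 200
x <- runif(n)
y <- 10 * x + rnorm(n)
new.x <- data.frame(x = seq(0, 1, len = 10000))

# Hypothetical helper: refit on the SAME data; only the random
# starting weights differ between calls
fit.error <- function(size) {
    model <- nnet(y ~ x, size = size, trace = FALSE, linout = TRUE)
    pred <- predict(model, newdata = new.x)
    sum((pred - 10 * new.x$x) ^ 2)
}

errors <- replicate(20, fit.error(2))
print(summary(errors))

# If some refits end in a poor solution, the mean moves much more
# than the median
cat("mean:  ", mean(errors), "\n")
cat("median:", median(errors), "\n")
```

Whether a poor fit actually shows up in 20 refits depends on the seed, so I would not read much into a single run of this snippet.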

Of course, by changing the rpart or nnet settings one can get somewhat different results, but the general conclusions will be similar.
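As an illustration of what such modifications could look like, here is a sketch that compares the default fits with variants using `rpart.control` (smaller `minsplit` and `cp`) and the `decay` and `maxit` arguments of `nnet`. The particular values are arbitrary assumptions on my part, not tuned recommendations, and the comparison uses a single training sample.

```r
library(rpart)
library(nnet)

set.seed(1)                       # arbitrary seed
n <- 200
x <- runif(n)
y <- 10 * x + rnorm(n)
new.x <- data.frame(x = seq(0, 1, len = 10000))

models <- list(
    tree.default   = rpart(y ~ x),
    # allow smaller nodes and more splits (illustrative values)
    tree.deep      = rpart(y ~ x,
                           control = rpart.control(minsplit = 5, cp = 0.001)),
    nnet10.default = nnet(y ~ x, size = 10, trace = FALSE, linout = TRUE),
    # weight decay and a larger iteration limit (illustrative values)
    nnet10.decay   = nnet(y ~ x, size = 10, decay = 0.01, maxit = 1000,
                          trace = FALSE, linout = TRUE))

# Same error measure as in the main comparison
errors <- sapply(models, function(model) {
    pred <- predict(model, newdata = new.x)
    sum((pred - 10 * new.x$x) ^ 2)
})
print(round(errors, 1))
```

To draw any conclusion about these settings, the comparison would have to be replicated many times, as in the main experiment.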


To **leave a comment** for the author, please follow the link and comment on his blog: **R snippets**.
