In the smooth.spline procedure one can use either the df or the spar parameter to control the level of smoothing. Usually they are not set manually, but recently I was asked which one of them is a better measure of regularizatio...
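As a minimal sketch of the two controls, the snippet below fits the same simulated data once with `df` fixed and once with `spar` fixed. The data, the seed and the chosen values (`df = 6`, `spar = 0.8`) are illustrative assumptions, not taken from the post.

```r
# Simulated data (an assumption made only for this illustration)
set.seed(1)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

# Control smoothing through equivalent degrees of freedom
fit_df <- smooth.spline(x, y, df = 6)

# Control smoothing through the spar parameter
fit_spar <- smooth.spline(x, y, spar = 0.8)

# Whichever parameter is set, the fitted object reports both quantities
print(c(df = fit_df$df, spar = fit_df$spar))
print(c(df = fit_spar$df, spar = fit_spar$spar))
```

Comparing the reported `df` and `spar` of both fits shows how the two scales map onto each other for one particular data set.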

Classification trees are known to be unstable with respect to training data. I recently read an article on the stability of classification trees by Briand et al. (2009). They propose a quantitative similarity measure between two trees. The method is i...

The standard textbook analysis of model selection methods, such as cross-validation or a validation sample, focuses on their ability to estimate in-sample, conditional or expected test error. However, another interesting question is to compare the...

This week I was running computations that transform input files into output files. The problem was that the process had to be repeated: whenever new input files were generated or old ones were updated, I needed to calculate new output files. The transformation ...

In my last post I plotted the randu dataset to show that all its points lie on 15 parallel planes. But I was not fully satisfied with the solution and decided to show this numerically. It can be done in four steps: identifying four points lying...
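A different and shorter numerical check, which is not the four-step method outlined above, relies on the known RANDU recurrence: for consecutive triples (x, y, z) the value 9x - 6y + z should be an integer. The sketch below assumes the columns of `randu` hold such consecutive triples, rounded to six decimals.

```r
data(randu, package = "datasets")

# For RANDU, x[k+2] = 6 * x[k+1] - 9 * x[k] (mod 1 on the unit scale),
# so 9x - 6y + z should be an integer up to rounding error in the data
k <- with(randu, 9 * x - 6 * y + z)

# Largest deviation from the nearest integer
print(max(abs(k - round(k))))

# Each distinct integer corresponds to one plane; at most 15 are expected
print(table(round(k)))
```

Each distinct rounded value identifies one plane of the form 9x - 6y + z = const, so the table gives the number of points falling on each plane.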

Recently I stumbled on the help page for the randu data from the datasets package. It contains pseudorandom numbers that are flawed. The help says that "In three dimensional displays it is evident that the triples fall on 15 paralle...

A very typical task in data analysis is calculating summary statistics for each variable in a data frame. The standard lapply and sapply functions work very well for this but apply only a single function. The problem is that I o...
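One possible way to apply several functions per variable is to nest two sapply calls. This is only a sketch: the helper name `multi_summary` is hypothetical, and the use of the numeric columns of `iris` is an assumption made for illustration.

```r
# Hypothetical helper: apply every function in `funs` to every column of `df`
multi_summary <- function(df, funs) {
  sapply(df, function(col) sapply(funs, function(f) f(col)))
}

# A named list of summary functions; the names become row names
stats <- list(mean = mean, sd = sd, median = median)

# Rows are statistics, columns are variables
res <- multi_summary(iris[1:4], stats)
print(res)
```

The result is a numeric matrix, so individual entries can be accessed as `res["sd", "Petal.Width"]`.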

Recently on R-bloggers I found a post from the chem-bla-ics blog concerning the conversion of factors to integer vectors. At the end it posed the problem of converting a factor variable to a class-membership matrix. In the comments several nice solutions were p...
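For illustration, here are two base R ways to build a class-membership (indicator) matrix from a factor. They are not necessarily the solutions proposed in the comments, and the example factor `f` is made up.

```r
# Example factor (an assumption made only for this illustration)
f <- factor(c("a", "b", "a", "c"))

# Variant 1: one 0/1 column per level, built by comparing against each level
m <- sapply(levels(f), function(l) as.integer(f == l))

# Variant 2: a model matrix without intercept; columns are named fa, fb, fc
mm <- model.matrix(~ f - 1)

print(m)
```

Both variants return a matrix with one row per observation and exactly one 1 in each row.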

A gain chart is a popular method to visually inspect model performance in binary prediction. It presents the percentage of captured positive responses as a function of the selected percentage of a sample. It is easy to obtain it using the ROCR package plott...
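To make the definition concrete, the sketch below computes the gain curve by hand in base R instead of using ROCR. The simulated scores and responses are assumptions made only for this illustration.

```r
# Simulated scores and binary responses (illustrative assumption)
set.seed(1)
n <- 1000
score <- runif(n)
y <- rbinom(n, 1, score)

# Sort observations from the highest to the lowest score
ord <- order(score, decreasing = TRUE)

# Share of the sample selected and share of positives captured so far
pct_sample <- seq_len(n) / n
pct_captured <- cumsum(y[ord]) / sum(y)

plot(pct_sample, pct_captured, type = "l",
     xlab = "Selected share of sample",
     ylab = "Captured share of positive responses")
abline(0, 1, lty = 2)  # reference line for a random ordering

# With ROCR installed, the analogous curve should be obtainable via
# performance(prediction(score, y), "tpr", "rpp")
```

The further the curve lies above the dashed diagonal, the better the scores concentrate positive responses at the top of the ranking.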