This afternoon I went to Max Kuhn’s tutorial on his caret package. caret stands for Classification And REgression Training. It provides a consistent interface to nearly 150 different models in R, in much the same way as the plyr package provides a consistent interface to the apply functions.
The basic workflow is to first split your data into training and test sets.
my_data <- split(my_data, runif(nrow(my_data)) > p) # for some value of p
names(my_data) <- c("training", "testing")
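It is easy to lose track of which half ends up as the training set, so here is a minimal sketch of that split in base R. The built-in mtcars data, the seed and the value p = 0.75 are my own choices for illustration, not anything from the tutorial.

```r
set.seed(42)   # arbitrary seed, only so the split is reproducible
p <- 0.75      # assumed training fraction
n <- nrow(mtcars)

# runif(n) > p is FALSE with probability p, and split() orders the
# groups FALSE then TRUE, so the first element is the training set.
parts <- split(mtcars, runif(n) > p)
names(parts) <- c("training", "testing")

# Row counts per set; roughly p and 1 - p of n, but random
sapply(parts, nrow)
```

Because the split is random, the training set holds about a fraction p of the rows rather than exactly that fraction.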
Then train a model on your training set.
training_model <- train( response ~ ., data = my_data$training, method = "a type of model!")
Then make predictions on the test set using
predictions <- predict(training_model, my_data$testing)
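To make those three steps concrete, here is a small end-to-end sketch. It assumes the caret package is installed, and the dataset (mtcars), the formula and the choice of method = "lm" are mine, purely for illustration. I have also swapped the random split for caret's createDataPartition helper, which tries to keep the distribution of the response similar in both sets.

```r
library(caret)

set.seed(1)  # arbitrary seed for reproducibility

# createDataPartition() returns the row indices of the training set
in_train <- createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
my_data <- list(
  training = mtcars[in_train, ],
  testing  = mtcars[-in_train, ]
)

# Fit a plain linear model through caret's common interface
training_model <- train(mpg ~ wt + hp, data = my_data$training, method = "lm")

# Predict on rows the model has never seen
predictions <- predict(training_model, newdata = my_data$testing)

# RMSE, R-squared and MAE on the held-out rows
postResample(pred = predictions, obs = my_data$testing$mpg)
```

Swapping in a different model should only mean changing the method string, although many methods need their own underlying package installed.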
So the basic usage is very simple. The devil is of course in the statistical details. You still have to choose (at least one) type of model, and there are many options for how those models should be fit.
Max suggested that a good strategy for modelling is to begin with a powerful black box method (boosting, random forests or support vector machines) since they can usually provide excellent fits. The next step is to use a simple, understandable model (some form of regression perhaps) and see how much predictive power you are losing.
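As a rough sketch of that strategy, the code below fits a random forest and a linear regression to the same training data and compares them on the held-out rows. This is my own illustration rather than code from the tutorial: it assumes the caret and randomForest packages are installed, and the dataset, formulas and 5-fold cross-validation are arbitrary choices.

```r
library(caret)

set.seed(1)  # arbitrary seed for reproducibility
in_train <- createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[in_train, ]
testing  <- mtcars[-in_train, ]

# Use the same resampling scheme for both models
ctrl <- trainControl(method = "cv", number = 5)

# Black box first: a random forest (needs the randomForest package)
black_box <- train(mpg ~ ., data = training, method = "rf", trControl = ctrl)

# Then a simple, interpretable model
simple <- train(mpg ~ wt + hp, data = training, method = "lm", trControl = ctrl)

# Held-out RMSE, R-squared and MAE, one row per model
comparison <- rbind(
  random_forest = postResample(predict(black_box, testing), testing$mpg),
  linear_model  = postResample(predict(simple, testing), testing$mpg)
)
comparison
```

The gap between the two rows gives a feel for how much predictive power the simpler model gives up, though on a dataset this small the numbers will be noisy.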
I suspect that in order to get the full benefit of caret, I’ll need to read Max’s book: Applied Predictive Modeling.