Model Segmentation with Cubist

March 18, 2015
(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

Cubist is a tree-based model with an OLS regression attached to each terminal node, and it is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data, using the log of the median home value as the response.
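
Before the cubist() fit itself, the idea of "one regression per segment" can be sketched by hand in base R. This is only an illustration and not how Cubist chooses its splits or coefficients: the split on rm at 6.232 is borrowed from the output shown further down, and the two-predictor formula and the object names (is_large, fit_small, fit_large, err) are assumptions made for the example.

```r
library(MASS)  # provides the Boston data set

data(Boston)
Boston$log_medv <- log(Boston$medv)

# Hand-picked split, used here purely for illustration
is_large <- Boston$rm > 6.232

# Fit a separate OLS model inside each segment
fit_small <- lm(log_medv ~ lstat + crim, data = Boston[!is_large, ])
fit_large <- lm(log_medv ~ lstat + crim, data = Boston[is_large, ])

# Route each case to the model of the segment it falls in
pred <- numeric(nrow(Boston))
pred[!is_large] <- predict(fit_small, newdata = Boston[!is_large, ])
pred[is_large]  <- predict(fit_large, newdata = Boston[is_large, ])

# Compare in-sample RMSE against a single pooled OLS fit
fit_pooled <- lm(log_medv ~ lstat + crim, data = Boston)
err <- c(segmented = sqrt(mean((Boston$log_medv - pred)^2)),
         pooled    = sqrt(mean((Boston$log_medv - fitted(fit_pooled))^2)))
err
```

Because each segment gets its own coefficients, the segmented in-sample RMSE cannot be worse than the pooled one. Cubist automates the search for the segments and the choice of predictors within each.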

# Load the required packages
pkgs <- c('MASS', 'Cubist', 'caret')
lapply(pkgs, require, character.only = TRUE)

data(Boston)
# Columns 1-13 are the predictors; column 14 is medv, modeled on the log scale
X <- Boston[, 1:13]
Y <- log(Boston[, 14])

### TRAIN THE MODEL ###
mdl <- cubist(x = X, y = Y,
              control = cubistControl(unbiased = TRUE, label = "log_medv",
                                      seed = 2015, rules = 5))
summary(mdl)
#  Rule 1: [94 cases, mean 2.568824, range 1.609438 to 3.314186, est err 0.180985]
#
#    if
#	nox > 0.671
#    then
#	log_medv = 1.107315 + 0.588 dis + 2.92 nox - 0.0287 lstat - 0.2 rm
#	           - 0.0065 crim
#
#  Rule 2: [39 cases, mean 2.701933, range 1.94591 to 3.314186, est err 0.202473]
#
#    if
#	nox <= 0.671
#	lstat > 19.01
#    then
#	log_medv = 3.935974 - 1.68 nox - 0.0076 lstat + 0.0035 rad - 0.00017 tax
#	           - 0.013 dis - 0.0029 crim + 0.034 rm - 0.011 ptratio
#	           + 0.00015 black + 0.0003 zn
#
#  Rule 3: [200 cases, mean 2.951007, range 2.116256 to 3.589059, est err 0.100825]
#
#    if
#	rm <= 6.232
#	dis > 1.8773
#    then
#	log_medv = 2.791381 + 0.152 rm - 0.0147 lstat + 0.00085 black
#	           - 0.031 dis - 0.027 ptratio - 0.0017 age + 0.0031 rad
#	           - 0.00013 tax - 0.0025 crim - 0.12 nox + 0.0002 zn
#
#  Rule 4: [37 cases, mean 3.038195, range 2.341806 to 3.912023, est err 0.184200]
#
#    if
#	dis <= 1.8773
#	lstat <= 19.01
#    then
#	log_medv = 5.668421 - 1.187 dis - 0.0469 lstat - 0.0122 crim
#
#  Rule 5: [220 cases, mean 3.292121, range 2.261763 to 3.912023, est err 0.093716]
#
#    if
#	rm > 6.232
#	lstat <= 19.01
#    then
#	log_medv = 2.419507 - 0.033 lstat + 0.238 rm - 0.0089 crim + 0.0082 rad
#	           - 0.029 dis - 0.00035 tax + 0.0006 black - 0.024 ptratio
#	           - 0.0006 age - 0.12 nox + 0.0002 zn
#
# Evaluation on training data (506 cases):
#
#    Average  |error|           0.100444
#    Relative |error|               0.33
#    Correlation coefficient        0.94
#
#	Attribute usage:
#	  Conds  Model
#
#	   71%    94%    rm
#	   50%   100%    lstat
#	   40%   100%    dis
#	   23%    94%    nox
#	         100%    crim
#	          78%    zn
#	          78%    rad
#	          78%    tax
#	          78%    ptratio
#	          78%    black
#	          71%    age
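
Each rule in the summary pairs a set of conditions with a linear equation. As a rough sketch of how to read one, Rule 4 can be applied by hand to a made-up observation. The helper rule4() and the values in obs are hypothetical and not part of Cubist. The coefficients are copied as printed, so they are rounded. The rules also overlap (the five case counts sum to 590 against 506 training cases), so the result need not match what predict(mdl, ...) returns for the same inputs.

```r
# Coefficients copied from Rule 4 in the summary output (rounded as printed)
rule4 <- function(dis, lstat, crim) {
  5.668421 - 1.187 * dis - 0.0469 * lstat - 0.0122 * crim
}

# Hypothetical observation, not a row of the Boston data
obs <- list(dis = 1.5, lstat = 10, crim = 5)

# Rule 4 applies only when both of its conditions hold
rule4_applies <- obs$dis <= 1.8773 && obs$lstat <= 19.01

log_medv_hat <- if (rule4_applies) {
  rule4(obs$dis, obs$lstat, obs$crim)
} else {
  NA_real_
}

# The model was fit on log(medv), so exponentiate to get back to the medv scale
medv_hat <- exp(log_medv_hat)
c(log_medv = log_medv_hat, medv = medv_hat)
```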

### VARIABLE IMPORTANCE ###
varImp(mdl)
#        Overall
# rm         82.5
# lstat      75.0
# dis        70.0
# nox        58.5
# crim       50.0
# zn         39.0
# rad        39.0
# tax        39.0
# ptratio    39.0
# black      39.0
# age        35.5
# indus       0.0
# chas        0.0
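
The Overall scores above line up with the simple average of the Conds and Model usage percentages in the summary, for example rm: (71 + 94) / 2 = 82.5. The check below uses numbers typed in by hand from the two outputs. The usage data frame is an illustration rather than an object returned by Cubist, and blank or missing usage entries are assumed to be 0.

```r
# Usage percentages copied by hand from the "Attribute usage" table;
# blank Conds entries and the unlisted indus/chas are treated as 0
usage <- data.frame(
  variable = c("rm", "lstat", "dis", "nox", "crim", "zn", "rad", "tax",
               "ptratio", "black", "age", "indus", "chas"),
  conds = c(71, 50, 40, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  model = c(94, 100, 100, 94, 100, 78, 78, 78, 78, 78, 71, 0, 0)
)

# Average the two usage columns
usage$overall <- (usage$conds + usage$model) / 2

# Scores printed by varImp(mdl), in the same variable order
reported <- c(82.5, 75, 70, 58.5, 50, 39, 39, 39, 39, 39, 35.5, 0, 0)
matches <- all.equal(usage$overall, reported)
matches
```

Read this way, a variable such as crim scores 50 because it enters every regression equation but never appears in a rule condition, while indus and chas score 0 because the model does not use them at all.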
