Model Segmentation with Cubist
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers.]
Cubist is a tree-based model with an OLS regression attached to each terminal node, and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data.
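Before the Cubist run itself, the "segment first, then fit a regression per segment" idea can be sketched in base R. This is only an illustration and not Cubist's algorithm: the mtcars data, the hand-picked cut point wt <= 3, and the mpg ~ hp model are all arbitrary assumptions made for the example, whereas Cubist chooses its own splits and predictors.

```r
# Illustration only: one hand-picked split with a separate OLS fit per segment.
# The data set (mtcars) and the cut point (wt <= 3) are arbitrary assumptions.
data(mtcars)

# Assign every car to a segment, as a one-split tree would
segment <- ifelse(mtcars$wt <= 3, "light", "heavy")

# Fit one OLS regression per segment
fits <- lapply(split(mtcars, segment), function(d) lm(mpg ~ hp, data = d))

# The segments end up with different intercepts and slopes
print(sapply(fits, coef))

# Predict by routing each row to the model of its own segment
pred <- numeric(nrow(mtcars))
for (s in names(fits)) {
  idx <- segment == s
  pred[idx] <- predict(fits[[s]], newdata = mtcars[idx, ])
}
```

Each rule in the Cubist output further down plays the role of one such segment, with its own linear model.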
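# load the packages; MASS supplies the Boston data
pkgs <- c('MASS', 'Cubist', 'caret')
invisible(lapply(pkgs, require, character.only = TRUE))
data(Boston)
X <- Boston[, 1:13]       # the 13 predictors
Y <- log(Boston[, 14])    # log of medv, the response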
### TRAIN THE MODEL ###
mdl <- cubist(x = X, y = Y,
              control = cubistControl(unbiased = TRUE, label = "log_medv",
                                      seed = 2015, rules = 5))  # at most 5 rules
summary(mdl)
# Rule 1: [94 cases, mean 2.568824, range 1.609438 to 3.314186, est err 0.180985]
#
#   if
#     nox > 0.671
#   then
#     log_medv = 1.107315 + 0.588 dis + 2.92 nox - 0.0287 lstat - 0.2 rm
#                - 0.0065 crim
#
# Rule 2: [39 cases, mean 2.701933, range 1.94591 to 3.314186, est err 0.202473]
#
#   if
#     nox <= 0.671
#     lstat > 19.01
#   then
#     log_medv = 3.935974 - 1.68 nox - 0.0076 lstat + 0.0035 rad - 0.00017 tax
#                - 0.013 dis - 0.0029 crim + 0.034 rm - 0.011 ptratio
#                + 0.00015 black + 0.0003 zn
#
# Rule 3: [200 cases, mean 2.951007, range 2.116256 to 3.589059, est err 0.100825]
#
#   if
#     rm <= 6.232
#     dis > 1.8773
#   then
#     log_medv = 2.791381 + 0.152 rm - 0.0147 lstat + 0.00085 black
#                - 0.031 dis - 0.027 ptratio - 0.0017 age + 0.0031 rad
#                - 0.00013 tax - 0.0025 crim - 0.12 nox + 0.0002 zn
#
# Rule 4: [37 cases, mean 3.038195, range 2.341806 to 3.912023, est err 0.184200]
#
#   if
#     dis <= 1.8773
#     lstat <= 19.01
#   then
#     log_medv = 5.668421 - 1.187 dis - 0.0469 lstat - 0.0122 crim
#
# Rule 5: [220 cases, mean 3.292121, range 2.261763 to 3.912023, est err 0.093716]
#
#   if
#     rm > 6.232
#     lstat <= 19.01
#   then
#     log_medv = 2.419507 - 0.033 lstat + 0.238 rm - 0.0089 crim + 0.0082 rad
#                - 0.029 dis - 0.00035 tax + 0.0006 black - 0.024 ptratio
#                - 0.0006 age - 0.12 nox + 0.0002 zn
#
# Evaluation on training data (506 cases):
#
# Average |error| 0.100444
# Relative |error| 0.33
# Correlation coefficient 0.94
#
# Attribute usage:
#   Conds  Model
#
#    71%    94%    rm
#    50%   100%    lstat
#    40%   100%    dis
#    23%    94%    nox
#          100%    crim
#           78%    zn
#           78%    rad
#           78%    tax
#           78%    ptratio
#           78%    black
#           71%    age
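To read a rule from the summary, it can help to evaluate one by hand. The sketch below applies Rule 4 to a single made-up observation; the values in obs are hypothetical and not a row of the Boston data, and they are chosen so that only Rule 4's conditions hold. The coefficients are the rounded ones printed above, so the result may differ slightly from what predict(mdl, ...) would return.

```r
# Hypothetical observation (invented values, not taken from Boston)
obs <- data.frame(crim = 1, nox = 0.6, rm = 6.0, dis = 1.5, lstat = 10)

# Rule 4 conditions as printed: dis <= 1.8773 and lstat <= 19.01
rule4_applies <- with(obs, dis <= 1.8773 & lstat <= 19.01)

# Rule 4 linear model, using the printed (rounded) coefficients
rule4_log_medv <- with(obs, 5.668421 - 1.187 * dis - 0.0469 * lstat - 0.0122 * crim)

# The model was fit on log(medv), so exponentiate to get back to the medv scale
rule4_medv <- exp(rule4_log_medv)

print(c(applies = rule4_applies, log_medv = rule4_log_medv, medv = rule4_medv))
```

Because the response was log-transformed before fitting, every prediction from this model needs the same exp() step to be comparable with medv.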
### VARIABLE IMPORTANCE ###
varImp(mdl)
# Overall
# rm 82.5
# lstat 75.0
# dis 70.0
# nox 58.5
# crim 50.0
# zn 39.0
# rad 39.0
# tax 39.0
# ptratio 39.0
# black 39.0
# age 35.5
# indus 0.0
# chas 0.0
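The Overall scores above line up with a plain equal-weight average of the Conds and Model percentages from the attribute usage table. The sketch below recomputes them from the numbers printed in the summary; the 0.5/0.5 weighting is inferred from those figures here rather than taken from the original post, and the usage data frame is typed in by hand, with blank Conds entries and the unused indus and chas entered as 0.

```r
# Attribute usage copied from the summary output (percentages)
usage <- data.frame(
  var   = c("rm", "lstat", "dis", "nox", "crim", "zn", "rad", "tax",
            "ptratio", "black", "age", "indus", "chas"),
  conds = c(71, 50, 40, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  model = c(94, 100, 100, 94, 100, 78, 78, 78, 78, 78, 71, 0, 0)
)

# Equal-weight average of usage in rule conditions and in the linear models
usage$overall <- 0.5 * usage$conds + 0.5 * usage$model

print(usage[, c("var", "overall")])
```

Read this way, rm ranks first because it is used both in splits and in most of the linear models, while crim appears in every model but in no split condition.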