Modeling multi-category Outcomes With vtreat

October 1, 2018
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Let’s work a specific example: trying to model multi-class y as a function of x1 and x2.

library("vtreat")
# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_len(length(sym_bonuses3)))
n_row <- 1000
d <- data.frame(
  x1 = rnorm(n_row),
  x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
  x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
  y = "NoInfo",
  stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] > 
      pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] > 
      pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"

knitr::kable(head(d))
|         x1|x2 |x3 |y      |
|----------:|:--|:--|:------|
|  0.8178292|b  |3  |Large2 |
|  0.5867139|b  |3  |Large2 |
| -0.6711920|a  |3  |Large2 |
|  0.1033166|c  |2  |NoInfo |
| -0.3182176|b  |1  |NoInfo |
| -0.5914308|c  |2  |NoInfo |

We define the problem controls and use mkCrossFrameMExperiment() to build both a cross-frame and a treatment plan.

# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"

# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)

The cross-frame is the safest data to train on (unless you have made a separate data split for the treatment design step). It uses cross-validation to reduce nested model bias. Some notes on this issue are available here, and here.
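
To make the nested-bias point concrete, here is a minimal base-R sketch (not vtreat's actual implementation) that encodes a many-level categorical variable against a pure-noise outcome in two ways: naively, using all rows, and out-of-fold, where each row's code is computed only from rows in other folds. The variable names, the 50 levels, and the five-fold split are assumptions made for illustration.

```r
# Illustration only: naive vs. out-of-fold encoding of a categorical variable.
set.seed(1)
n <- 200
x <- sample(paste0("lev", 1:50), n, replace = TRUE)  # many levels, few rows each
y <- as.numeric(rbinom(n, 1, 0.5))                   # outcome is pure noise

# Naive code: mean of y within each level, computed from all rows
# (each row's own outcome leaks into its own code).
naive_code <- ave(y, x, FUN = mean)

# Out-of-fold code: each row is encoded using only rows from other folds.
n_folds <- 5
fold <- sample(rep(seq_len(n_folds), length.out = n))
cross_code <- numeric(n)
for (k in seq_len(n_folds)) {
  held_out <- fold == k
  level_means <- tapply(y[!held_out], x[!held_out], mean)
  code <- as.numeric(level_means)[match(x[held_out], names(level_means))]
  # Levels not seen in the other folds fall back to the overall mean.
  code[is.na(code)] <- mean(y[!held_out])
  cross_code[held_out] <- code
}

naive_cor <- cor(naive_code, y)
cross_cor <- cor(cross_code, y)
print(c(naive = naive_cor, cross = cross_cor))
```

On noise data like this, the naive code typically shows a sizable spurious correlation with the outcome, while the out-of-fold code does not. That leakage is what training on the cross-frame is meant to avoid.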

# look at the data we would train models on
str(cfe_m$cross_frame)
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.818 0.587 -0.671 0.103 -0.318 ...
##  $ x2_catP       : num  0.313 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.333 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  1 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  1 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  -11.23 -11.2 1.25 -11.41 -11.27 ...
##  $ Large1_x3_catB: num  -11.356 -11.239 -11.239 0.379 0.431 ...
##  $ Large2_x2_catB: num  0.0862 0.1446 -0.0243 -0.1268 0.0862 ...
##  $ Large2_x3_catB: num  4.98 6.09 4.69 -3.11 -13.86 ...
##  $ NoInfo_x2_catB: num  -0.0537 0.1084 -0.2827 0.2859 0.1084 ...
##  $ NoInfo_x3_catB: num  -4.82 -5.24 -4.83 2.13 2.53 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...

prepare() can apply the designed treatments to new data. Here we are simulating new data by re-using our design data.

# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
##  $ x2_catP       : num  0.0005 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.0005 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  0 -11.6 1.2 -11.8 -11.6 ...
##  $ Large1_x3_catB: num  0 -11.702 -11.702 0.411 0.436 ...
##  $ Large2_x2_catB: num  0 0.133 -0.0215 -0.0999 0.133 ...
##  $ Large2_x3_catB: num  0 5.1 5.1 -3.54 -14.29 ...
##  $ NoInfo_x2_catB: num  0 0.0206 -0.2829 0.2536 0.0206 ...
##  $ NoInfo_x3_catB: num  0 -4.95 -4.95 2.11 2.34 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...
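
As a sketch of the downstream step, the treated frames are all-numeric and can be passed to a multinomial modeler. The example below assumes the nnet package is installed and uses nnet::multinom(), which is chosen here for illustration and is not prescribed by vtreat. The synthetic data and the names d_demo, bonus, and is_train are likewise hypothetical.

```r
library(vtreat)
library(nnet)  # assumption: nnet is available; any multinomial modeler could be used

# Hypothetical example data: a three-level outcome driven by x1 and x2.
set.seed(2018)
n <- 600
bonus <- c(a = 1.5, b = 0, c = -1.5)  # made-up per-level effects
d_demo <- data.frame(
  x1 = rnorm(n),
  x2 = sample(names(bonus), n, replace = TRUE),
  stringsAsFactors = FALSE)
s <- unname(bonus[d_demo$x2]) + d_demo$x1 + rnorm(n, sd = 0.5)
d_demo$y <- ifelse(s > 1, "Large1", ifelse(s < -1, "Large2", "NoInfo"))

# Design treatments on the training rows only.
is_train <- seq_len(n) <= 400
cfe <- mkCrossFrameMExperiment(d_demo[is_train, ], c("x1", "x2"), "y")

# Train on the cross-frame, not on prepare() applied to the training rows.
train <- cfe$cross_frame
train$y <- factor(train$y)
model <- nnet::multinom(y ~ ., data = train, trace = FALSE)

# Treat the held-out rows with the treatment plan, then predict.
test_treated <- prepare(cfe$treat_m, d_demo[!is_train, ])
pred <- predict(model, newdata = test_treated, type = "class")
print(table(truth = d_demo$y[!is_train], prediction = as.character(pred)))
```

The point of the split is that the model is fit on cross_frame while new rows go through prepare() with treat_m, mirroring the two objects returned by mkCrossFrameMExperiment().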

We can easily estimate per-outcome variable importance and per-variable variable importance.

knitr::kable(
  cfe_m$score_frame[, 
                    c("varName", "rsq", "sig", "outcome_level"), 
                    drop = FALSE])
|varName        |       rsq|       sig|outcome_level |
|:--------------|---------:|---------:|:-------------|
|x1_clean       | 0.0558908| 0.0000382|Large1        |
|x2_catP        | 0.0275238| 0.0038536|Large1        |
|x2_lev_x_a     | 0.2680953| 0.0000000|Large1        |
|x2_lev_x_b     | 0.0885021| 0.0000002|Large1        |
|x2_lev_x_c     | 0.1060407| 0.0000000|Large1        |
|x3_catP        | 0.0000346| 0.9183445|Large1        |
|x3_lev_x_1     | 0.0141504| 0.0382554|Large1        |
|x3_lev_x_2     | 0.0140364| 0.0390420|Large1        |
|x3_lev_x_3     | 0.0955004| 0.0000001|Large1        |
|x1_clean       | 0.0015382| 0.1615618|Large2        |
|x2_catP        | 0.0013055| 0.1971725|Large2        |
|x2_lev_x_a     | 0.0000387| 0.8242956|Large2        |
|x2_lev_x_b     | 0.0014571| 0.1730603|Large2        |
|x2_lev_x_c     | 0.0009604| 0.2686774|Large2        |
|x3_catP        | 0.0007725| 0.3211959|Large2        |
|x3_lev_x_1     | 0.2602002| 0.0000000|Large2        |
|x3_lev_x_2     | 0.2483708| 0.0000000|Large2        |
|x3_lev_x_3     | 0.9197595| 0.0000000|Large2        |
|x1_clean       | 0.0064771| 0.0034947|NoInfo        |
|x2_catP        | 0.0040540| 0.0208595|NoInfo        |
|x2_lev_x_a     | 0.0071709| 0.0021196|NoInfo        |
|x2_lev_x_b     | 0.0000340| 0.8323647|NoInfo        |
|x2_lev_x_c     | 0.0060493| 0.0047665|NoInfo        |
|x3_catP        | 0.0006576| 0.3520950|NoInfo        |
|x3_lev_x_1     | 0.1838759| 0.0000000|NoInfo        |
|x3_lev_x_2     | 0.1857824| 0.0000000|NoInfo        |
|x3_lev_x_3     | 0.7372570| 0.0000000|NoInfo        |
|Large1_x2_catB | 0.2675964| 0.0000000|Large1        |
|Large1_x3_catB | 0.0946910| 0.0000001|Large1        |
|Large2_x2_catB | 0.0000291| 0.8472707|Large2        |
|Large2_x3_catB | 0.9239860| 0.0000000|Large2        |
|NoInfo_x2_catB | 0.0068238| 0.0027207|NoInfo        |
|NoInfo_x3_catB | 0.7326682| 0.0000000|NoInfo        |

One can relate these per-target and per-treatment performances back to original columns by aggregating.

tapply(cfe_m$score_frame$rsq, 
       cfe_m$score_frame$origName, 
       max)
##         x1         x2         x3 
## 0.05589076 0.26809534 0.92398602
tapply(cfe_m$score_frame$sig, 
       cfe_m$score_frame$origName, 
       min)
##            x1            x2            x3 
##  3.819834e-05  1.892838e-19 5.746904e-258

Obvious remaining issues include computing variable importance, and the blow-up and co-dependency of the produced columns. We leave these for the next modeling step to deal with (this is our philosophy for most issues that involve joint distributions of variables).
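
One simple way a later step might tame the column blow-up is to keep only treated variables that look significant for at least one outcome level. The base-R sketch below assumes a small excerpt of the score frame (the nine sig values are copied from the table above for three of the variables) and an arbitrary cutoff of 0.05. Both the excerpt and the cutoff are illustrative choices, not a vtreat recommendation.

```r
# Excerpt of the score frame: three treated variables, one row per outcome level.
score_excerpt <- data.frame(
  varName = rep(c("x1_clean", "x2_lev_x_b", "x3_catP"), times = 3),
  sig = c(0.0000382, 0.0000002, 0.9183445,   # Large1
          0.1615618, 0.1730603, 0.3211959,   # Large2
          0.0034947, 0.8323647, 0.3520950),  # NoInfo
  outcome_level = rep(c("Large1", "Large2", "NoInfo"), each = 3),
  stringsAsFactors = FALSE)

# Best (smallest) significance each variable achieves across outcome levels.
best_sig <- tapply(score_excerpt$sig, score_excerpt$varName, min)

# Keep variables that clear the (hypothetical) cutoff for some outcome level.
sig_cutoff <- 0.05
selected <- names(best_sig)[as.numeric(best_sig) < sig_cutoff]
print(selected)
```

A filter like this looks at variables one at a time, so it does nothing about co-dependency between the surviving columns. That part still belongs to the modeling step, for example through regularization.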
