Modeling Multi-category Outcomes With vtreat

Posted on October 1, 2018 by John Mount in R bloggers

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.

Let’s work a specific example: trying to model multi-class y as a function of x1, x2, and x3.

library("vtreat")

# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_len(length(sym_bonuses3)))
n_row <- 1000
d <- data.frame(
  x1 = rnorm(n_row),
  x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
  x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
  y = "NoInfo",
  stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] > 
      pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] > 
      pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"

knitr::kable(head(d))

x1	x2	x3	y
0.8178292	b	3	Large2
0.5867139	b	3	Large2
-0.6711920	a	3	Large2
0.1033166	c	2	NoInfo
-0.3182176	b	1	NoInfo
-0.5914308	c	2	NoInfo

We define the problem controls and use mkCrossFrameMExperiment() to build both a cross-frame and a treatment plan.

# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"

# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)

The cross-frame is the safest data to train on (unless you have set aside a separate data split for the treatment design step). Nested model bias arises when the same rows are used both to design the variable treatments and to fit the model; the cross-frame uses cross-validation to reduce it. Some notes on this issue are available here, and here.
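To make the cross-validation idea concrete, here is a minimal base-R sketch that encodes a categorical variable by the rate of one outcome level, estimated only from rows outside the current fold. It is a simplified illustration under our own assumptions, not vtreat's actual implementation: the toy data and the names toy, fold, and cross_code are hypothetical.

```r
# Simplified illustration of the cross-frame idea (NOT vtreat's own code):
# each row is encoded using outcome rates estimated from the other folds.
set.seed(2018)
n <- 200
toy <- data.frame(
  x = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE)
# hypothetical outcome: level "a" is often "Large1", the others never are
toy$y <- ifelse(toy$x == "a" & runif(n) < 0.7, "Large1", "NoInfo")

# assign each row to one of three folds
n_fold <- 3
fold <- sample(rep(seq_len(n_fold), length.out = n))

cross_code <- function(x, is_target, fold) {
  coded <- numeric(length(x))
  for (k in sort(unique(fold))) {
    held_out <- fold == k
    # per-level rates use only rows outside fold k
    rates <- vapply(
      split(is_target[!held_out], x[!held_out]),
      mean,
      numeric(1))
    est <- unname(rates[x[held_out]])
    # levels not seen outside the fold fall back to the overall rate
    est[is.na(est)] <- mean(is_target[!held_out])
    coded[held_out] <- est
  }
  coded
}

toy$x_code <- cross_code(toy$x, toy$y == "Large1", fold)
head(toy)
```

Because no row contributes to its own encoding, a model trained on x_code does not get to see an encoding that has already memorized that row's outcome.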

# look at the data we would train models on
str(cfe_m$cross_frame)

## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.818 0.587 -0.671 0.103 -0.318 ...
##  $ x2_catP       : num  0.313 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.333 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  1 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  1 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  -11.23 -11.2 1.25 -11.41 -11.27 ...
##  $ Large1_x3_catB: num  -11.356 -11.239 -11.239 0.379 0.431 ...
##  $ Large2_x2_catB: num  0.0862 0.1446 -0.0243 -0.1268 0.0862 ...
##  $ Large2_x3_catB: num  4.98 6.09 4.69 -3.11 -13.86 ...
##  $ NoInfo_x2_catB: num  -0.0537 0.1084 -0.2827 0.2859 0.1084 ...
##  $ NoInfo_x3_catB: num  -4.82 -5.24 -4.83 2.13 2.53 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...

prepare() can apply the designed treatments to new data. Here we are simulating new data by re-using our design data.

# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))

## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
##  $ x2_catP       : num  0.0005 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.0005 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  0 -11.6 1.2 -11.8 -11.6 ...
##  $ Large1_x3_catB: num  0 -11.702 -11.702 0.411 0.436 ...
##  $ Large2_x2_catB: num  0 0.133 -0.0215 -0.0999 0.133 ...
##  $ Large2_x3_catB: num  0 5.1 5.1 -3.54 -14.29 ...
##  $ NoInfo_x2_catB: num  0 0.0206 -0.2829 0.2536 0.0206 ...
##  $ NoInfo_x3_catB: num  0 -4.95 -4.95 2.11 2.34 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...
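In the first row of the output above, the NA in x1 has been replaced by a number, while the indicator and catB columns for that row are zero. The following base-R sketch shows the numeric case, assuming the fill value is the mean of the variable in the design data. That is our reading of the 0.0205 shown above rather than something stated in this article, and the names train_x1, fill_value, and new_x1 are illustrative.

```r
# Sketch of mean-filling a numeric variable (an assumed reading of x1_clean).
# train_x1 holds the first six x1 values printed earlier in this article.
train_x1 <- c(0.8178292, 0.5867139, -0.6711920,
              0.1033166, -0.3182176, -0.5914308)

# the fill value is learned once, from the design data only
fill_value <- mean(train_x1)

# new data with a missing entry in the first position
new_x1 <- c(NA, 0.5867139, -0.6711920)
new_x1_clean <- ifelse(is.na(new_x1), fill_value, new_x1)
new_x1_clean
```

The point of the design is that the fill value is fixed at treatment-design time, so new data is always processed the same way regardless of what it contains.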

We can easily estimate variable importance, both per outcome level and per variable.

knitr::kable(
  cfe_m$score_frame[, 
                    c("varName", "rsq", "sig", "outcome_level"), 
                    drop = FALSE])

varName	rsq	sig	outcome_level
x1_clean	0.0558908	0.0000382	Large1
x2_catP	0.0275238	0.0038536	Large1
x2_lev_x_a	0.2680953	0.0000000	Large1
x2_lev_x_b	0.0885021	0.0000002	Large1
x2_lev_x_c	0.1060407	0.0000000	Large1
x3_catP	0.0000346	0.9183445	Large1
x3_lev_x_1	0.0141504	0.0382554	Large1
x3_lev_x_2	0.0140364	0.0390420	Large1
x3_lev_x_3	0.0955004	0.0000001	Large1
x1_clean	0.0015382	0.1615618	Large2
x2_catP	0.0013055	0.1971725	Large2
x2_lev_x_a	0.0000387	0.8242956	Large2
x2_lev_x_b	0.0014571	0.1730603	Large2
x2_lev_x_c	0.0009604	0.2686774	Large2
x3_catP	0.0007725	0.3211959	Large2
x3_lev_x_1	0.2602002	0.0000000	Large2
x3_lev_x_2	0.2483708	0.0000000	Large2
x3_lev_x_3	0.9197595	0.0000000	Large2
x1_clean	0.0064771	0.0034947	NoInfo
x2_catP	0.0040540	0.0208595	NoInfo
x2_lev_x_a	0.0071709	0.0021196	NoInfo
x2_lev_x_b	0.0000340	0.8323647	NoInfo
x2_lev_x_c	0.0060493	0.0047665	NoInfo
x3_catP	0.0006576	0.3520950	NoInfo
x3_lev_x_1	0.1838759	0.0000000	NoInfo
x3_lev_x_2	0.1857824	0.0000000	NoInfo
x3_lev_x_3	0.7372570	0.0000000	NoInfo
Large1_x2_catB	0.2675964	0.0000000	Large1
Large1_x3_catB	0.0946910	0.0000001	Large1
Large2_x2_catB	0.0000291	0.8472707	Large2
Large2_x3_catB	0.9239860	0.0000000	Large2
NoInfo_x2_catB	0.0068238	0.0027207	NoInfo
NoInfo_x3_catB	0.7326682	0.0000000	NoInfo

One can relate these per-target and per-treatment performances back to original columns by aggregating.

tapply(cfe_m$score_frame$rsq, 
       cfe_m$score_frame$origName, 
       max)

##         x1         x2         x3 
## 0.05589076 0.26809534 0.92398602

tapply(cfe_m$score_frame$sig, 
       cfe_m$score_frame$origName, 
       min)

##            x1            x2            x3 
##  3.819834e-05  1.892838e-19 5.746904e-258
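The same kind of aggregation can drive a simple variable filter. The sketch below is one possible approach, not a vtreat recommendation: it keeps a derived variable if it is significant for at least one outcome level. The 0.05 cutoff and the names sig_threshold, best_sig, and selected are our own assumptions; the rows are copied from the score frame printed above.

```r
# A few rows copied from the score frame shown earlier
score_rows <- data.frame(
  varName = c("x1_clean", "x1_clean", "x1_clean",
              "x3_catP", "x3_catP", "x3_catP",
              "Large2_x2_catB"),
  sig = c(0.0000382, 0.1615618, 0.0034947,
          0.9183445, 0.3211959, 0.3520950,
          0.8472707),
  outcome_level = c("Large1", "Large2", "NoInfo",
                    "Large1", "Large2", "NoInfo",
                    "Large2"),
  stringsAsFactors = FALSE)

# hypothetical cutoff; choose one appropriate to your problem
sig_threshold <- 0.05

# best (smallest) significance each variable achieves over the outcome levels
best_sig <- tapply(score_rows$sig, score_rows$varName, min)

# keep variables that look useful for at least one outcome level
selected <- names(best_sig)[best_sig < sig_threshold]
selected
```

Taking the minimum over outcome levels looks at several tests per variable, so a stricter threshold may be appropriate when there are many variables or many outcome levels.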

Obvious remaining issues include computing variable importance, and the blow-up in the number of produced columns and the co-dependency among them. We leave these for the next modeling step to deal with (this is our philosophy with most issues that involve joint distributions of variables).
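As a small illustration of the co-dependency issue, the base-R sketch below rebuilds indicator columns for the first six x2 values of the example data and checks that they always sum to one, which makes them linearly dependent with an intercept. The names ind and design are illustrative, and the column naming merely imitates the x2_lev_x_* columns shown above.

```r
# first six x2 values from the example data printed earlier
x2 <- c("b", "b", "a", "c", "b", "c")

# one indicator column per level, named like the treated columns
ind <- sapply(c("a", "b", "c"), function(lev) as.numeric(x2 == lev))
colnames(ind) <- paste0("x2_lev_x_", colnames(ind))

# every row has exactly one indicator set
rowSums(ind)

# with an intercept, the four columns carry only three independent directions
design <- cbind(intercept = 1, ind)
qr(design)$rank
ncol(design)
```

Regularized or tree-based modeling methods tend to tolerate this kind of redundancy; an unregularized linear model would need one of the columns dropped.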
