WoE and IV Variable Screening with {Information} in R

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers].

A short note on information-theoretic variable screening in R with {Information}.

Variable screening is an important step in
contemporary EDA for predictive modeling: what can we tell about the
nature of the relationships between a set of predictors and the
dependent variable before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we
start rolling them into a predictive model? In this blog post I will
discuss two information-theoretic measures that are common in variable
screening for binary classification and regression models in the credit
risk arena (which in no way prevents them from doing you some good in
any other application of predictive modeling as well). I will first introduce the Weight of Evidence (WoE) and Information Value
(IV) of a variable with respect to a binary outcome. Then I will
illustrate their computation (it’s fairly easy, in fact) with the
{Information} package in R.

Weight of Evidence

Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):

$$K = \frac{P(D \mid M_1)}{P(D \mid M_2)}$$

and assume your models M1, M2 of the world*
are simply the two discrete possible states of a binary variable Y, while
the data are given by the discrete distribution of some predictor X (or X
stands for a binned continuous distribution); for every category j in X, j = 1, 2, ..., n, take the log:

$$WoE_j = \log \frac{P(x_j \mid Y = 1)}{P(x_j \mid Y = 0)}$$

and you will get a simple measure of evidence in favor of M1 against M2 that Good has described as the Weight of Evidence
(WoE). In theory, any monotonic transformation of the odds would do,
but the logarithm brings the intuitive advantage of yielding a negative
WoE when the odds are less than one and a positive one when they are
higher than one. In the simplified setting where the analysis under
consideration encompasses a binary dependent Y and a discrete (or binned
continuous) predictor X, we are simply inspecting the conditional distribution of X given Y:

$$P(x_j \mid Y = y) = \frac{f(x_j, Y = y)}{\sum_{k=1}^{n} f(x_k, Y = y)}, \quad y \in \{0, 1\}$$

where f denotes counts.
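
As a quick illustration of the computation described above, here is a minimal sketch in base R on a made-up table of counts for a three-bin predictor. The numbers and the names `counts`, `f0`, `f1` are invented for illustration and are not taken from the dataset used below.

```r
# Hypothetical counts f(x_j, Y = y) for a predictor X with three bins
counts <- data.frame(bin = c("low", "mid", "high"),
                     f0  = c(400, 300, 300),   # counts where Y = 0
                     f1  = c(20, 30, 50))      # counts where Y = 1

# Conditional distributions P(x_j | Y = y): counts divided by column totals
p0 <- counts$f0 / sum(counts$f0)
p1 <- counts$f1 / sum(counts$f1)

# WoE per bin: negative where Y = 1 is under-represented relative to Y = 0,
# zero where both conditional probabilities agree, positive otherwise
counts$WOE <- log(p1 / p0)
counts
```

In this toy table the "mid" bin carries the same share of cases under both outcomes, so its WoE is zero, while the "low" and "high" bins get a negative and a positive WoE respectively.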

Let’s illustrate the computation of WoE in this setting for a variable from a well-known dataset**. We have one categorical, binary dependent:

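library(dplyr)  # provides recode()

dataSet <- read.table("bank-additional-full.csv",
                      header = TRUE,
                      strip.white = FALSE,
                      stringsAsFactors = FALSE,
                      sep = ";")

# -- recode the binary dependent: 'yes' -> 1, 'no' -> 0
dataSet$y <- recode(dataSet$y,
                    "yes" = 1,
                    "no" = 0)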

and we want to compute the WoE for, say, the age variable. Here it goes:

# -- compute WOE for: dataSet$age
bins <- 10

# -- decile cut points of age
q <- quantile(dataSet$age,
              probs = c(1:(bins - 1) / bins),
              na.rm = TRUE,
              type = 3)

cuts <- unique(q)

# -- cross-tabulate the age bins against the binary dependent y
aggAge <- table(findInterval(dataSet$age,
                             vec = cuts,
                             rightmost.closed = FALSE),
                dataSet$y)

aggAge <- as.data.frame.matrix(aggAge)
aggAge$N <- rowSums(aggAge)
aggAge$WOE <- log((aggAge$`1` * sum(aggAge$`0`)) / (aggAge$`0` * sum(aggAge$`1`)))

In the previous example I have used exactly the approach to bin X (age, in this case) that is used in the R package {Information}, whose application I want to illustrate later. The table()
call provides the counts behind the conditional distributions of X
given Y discussed above. The computation of WoE is then straightforward, as
exemplified in the last line. However, you want to spare yourself from
computing the WoE in this way for many variables in the dataset, and
that’s where {Information} in R comes in handy; for the same dataset:

# -- Information value: all variables
library(Information)

infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = TRUE)

# -- WOE tables:
infoTables$Tables

with the respective data frames in infoTables$Tables standing for the variables in the dataset.
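
To get a feel for the structure of these tables without downloading the dataset, the following sketch runs create_infotables() on simulated data. The data frame `toy`, its columns `x1` and `x2`, and the simulation parameters are assumptions made up for this example; only create_infotables() and the `Tables` and `Summary` components come from the package as used in this post.

```r
library(Information)

# Simulated data (an assumption for illustration, not the bank dataset)
set.seed(42)
n <- 2000
toy <- data.frame(x1 = rnorm(n),
                  x2 = runif(n))

# The dependent depends on x1 only; it is stored as a numeric 0/1 column
toy$y <- as.numeric(rbinom(n, size = 1, prob = plogis(-1 + 1.5 * toy$x1)))

toyTables <- create_infotables(data = toy,
                               y = "y",
                               bins = 5,
                               parallel = FALSE)

# One data frame per predictor, holding the bins and their WOE values
names(toyTables$Tables)
toyTables$Tables$x1

# Overall IV per predictor
toyTables$Summary
```

Setting parallel = FALSE here simply keeps the toy run on a single core; for a dataset with many variables the parallel option is the more sensible choice.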

Information Value

A straightforward definition of the Information Value (IV) of a variable is provided in the {Information} package vignette:

$$IV = \sum_{j=1}^{n} \left( P(x_j \mid Y = 1) - P(x_j \mid Y = 0) \right) \cdot WoE_j$$

In effect, this means that we are summing across the individual WoE values (i.e. for each bin j of X) and weighting them by the respective differences between P(xj|Y=1) and P(xj|Y=0). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have a low predictive power.
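
A minimal base R sketch of that weighted sum, on invented counts (hypothetical numbers, not the bank dataset), assuming WoE is taken as the log of P(xj|Y=1) over P(xj|Y=0) as in the manual computation earlier:

```r
# Hypothetical counts per bin of X, split by the binary outcome
f0 <- c(400, 300, 300)   # counts where Y = 0
f1 <- c(20, 30, 50)      # counts where Y = 1

p0 <- f0 / sum(f0)       # P(x_j | Y = 0)
p1 <- f1 / sum(f1)       # P(x_j | Y = 1)

woe <- log(p1 / p0)              # WoE per bin
iv  <- sum((p1 - p0) * woe)      # Information Value of X

iv
```

Each term of the sum is non-negative, since the difference p1 - p0 and the log-ratio always share the same sign, so the IV can never fall below zero. Note that a bin with a zero count in either outcome would make the log-ratio infinite; this sketch does not handle that case.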

Using {Information} in R, for the dataset under consideration:

# -- Information value: all variables
library(Information)
library(ggplot2)

infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = TRUE)

# -- Plot IV
plotFrame <- infoTables$Summary[order(-infoTables$Summary$IV), ]
plotFrame$Variable <- factor(plotFrame$Variable,
                             levels = plotFrame$Variable[order(-plotFrame$IV)])

ggplot(plotFrame, aes(x = Variable, y = IV)) +
  geom_bar(width = .35, stat = "identity", color = "darkblue", fill = "white") +
  ggtitle("Information Value") +
  theme_bw() +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.text.x = element_text(angle = 90))

You may have noted the usage of parallel = T in the create_infotables()
call; the {Information} package will try to use all available cores to
speed up the computations by default. Besides the basic package
functionality that I have illustrated, the package provides a natural
way of dealing with uplift models, where the computation of the IVs for
the control vs. treatment designs is nicely automated. Cross-validation
procedures are also built-in.

However, now that we know that we have a nice,
working package for WoE and IV estimation in R, let’s restrain ourselves
from using it to perform automatic feature
selection for models like binary logistic regression. While the
information-theoretic measures discussed here truly assess the
predictive power of a single predictor in binary classification, building a
model that encompasses multiple terms is another story. Do not be
disappointed if you find that the AICs for the full models
are still lower than those for the nested models obtained by feature
selection based on the IV values; although they can provide useful
guidelines, WoE and IV are not even meant to be used that way (I’ve
tried… once with the dataset used in the previous examples, and then
with the two {Information} built-in datasets; not too much of a success,
as you may have guessed).
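
For readers who want to run this kind of check themselves, here is a minimal sketch of comparing the AIC of a full logistic regression with that of a nested one. The data are simulated, and the reduced predictor set is picked by hand purely for illustration; in an actual screening exercise it would come from an IV cutoff. The names `sim`, `fullModel` and `nestedModel` are hypothetical.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
n <- 1000
sim <- data.frame(x1 = rnorm(n),
                  x2 = rnorm(n),
                  x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * sim$x1 + 0.3 * sim$x2))

# Full model vs. a nested model that keeps only a hand-picked 'screened' subset
fullModel   <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial)
nestedModel <- glm(y ~ x1, data = sim, family = binomial)

# Lower AIC indicates the preferred model
AIC(fullModel, nestedModel)
```

Which model comes out ahead depends entirely on the data at hand; the sketch only shows the mechanics of the comparison.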


* For parametric models, you would need to integrate over the full
parameter space, of course; taking the MLEs would result in obtaining
the standard LR test.

** The dataset is considered in S. Moro, P. Cortez and P. Rita
(2014). A Data-Driven Approach to Predict the Success of Bank
Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. I
have obtained it from: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
(N.B. https://archive.ics.uci.edu/ml/machine-learning-databases/00222/,
file: bank-additional.zip); a nice description of the dataset is found

Goran S. Milovanović, PhD

Data Science Consultant, SmartCat
