First published on **The Exactness of Mind**, and kindly contributed to R-bloggers.

A short note on information-theoretic variable screening in R with {Information}.

Variable screening is an important step in contemporary EDA for predictive modeling: what can we tell about the nature of the relationships between a set of predictors and the dependent variable before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we start rolling them into a predictive model? In this blog post I will discuss two information-theoretic measures that are common in variable screening for binary classification and regression models in the credit risk arena (which in no way prevents them from doing you some good in any other application of predictive modeling).

I will first introduce the Weight of Evidence (WoE) and Information Value (IV) of a variable with respect to a binary outcome. Then I will illustrate their computation (it's fairly easy, in fact) with the {Information} package in R.

**Weight of Evidence**

Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):

$$K = \frac{P(D \mid M_1)}{P(D \mid M_2)}$$

and assume your models M1, M2 of the world* are simply the two possible states of a binary variable Y, while the data are given by discrete distributions of some predictor X (or X stands for a binned continuous distribution); for every category j in X, j = 1, 2, ..., n, take the log:

$$WoE_j = \log \frac{P(x_j \mid Y = 1)}{P(x_j \mid Y = 0)}$$

and you get a simple measure of evidence in favor of M1 against M2 that Good has described as the Weight of Evidence (WoE). In theory, any monotonic transformation of the odds would do, but the logarithm brings the intuitive advantage of a negative WoE when the odds are less than one and a positive WoE when they are greater than one. In the setting considered here, with a binary dependent Y and a discrete (or binned continuous) predictor X, we are simply inspecting the conditional distribution of X given Y:

$$WoE_j = \log \frac{f(x_j \mid Y = 1) \,/\, \sum_{k=1}^{n} f(x_k \mid Y = 1)}{f(x_j \mid Y = 0) \,/\, \sum_{k=1}^{n} f(x_k \mid Y = 0)}$$

where f denotes counts.
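To make the counts-based definition concrete, here is a minimal sketch in R. The bin labels and counts are made up for illustration only (they do not come from the dataset used below), and the column names n0 and n1 are my own shorthand for the counts of X within Y = 0 and Y = 1.

```r
# Hypothetical counts for a predictor X with three bins (illustration only)
counts <- data.frame(
  bin = c("low", "mid", "high"),
  n0  = c(400, 350, 250),  # f(x_j | Y = 0)
  n1  = c(20, 50, 30)      # f(x_j | Y = 1)
)

# Conditional distributions of X given Y
p1 <- counts$n1 / sum(counts$n1)  # P(x_j | Y = 1)
p0 <- counts$n0 / sum(counts$n0)  # P(x_j | Y = 0)

# Weight of Evidence per bin: log of the ratio of the two conditionals
counts$WOE <- log(p1 / p0)

print(counts)
```

In this toy table the "low" bin holds a smaller share of the Y = 1 cases than of the Y = 0 cases, so its WoE comes out negative, while the other two bins get a positive WoE.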

Let’s illustrate the computation of WoE in this setting for a variable from a well-known dataset**. We have one categorical, binary dependent:

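```r
library(dplyr)  # provides recode()

dataSet <- read.table('bank-additional-full.csv',
                      header = TRUE,
                      strip.white = FALSE,
                      sep = ";")
str(dataSet)
table(dataSet$y)

# recode the binary dependent: 'yes' -> 1, 'no' -> 0
dataSet$y <- recode(dataSet$y,
                    'yes' = 1,
                    'no' = 0)
```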

and we want to compute the WoE for, say, the age variable. Here it goes:

```r
# - compute WOE for: dataSet$age
bins <- 10
# decile cut points of age
q <- quantile(dataSet$age,
              probs = c(1:(bins - 1)/bins),
              na.rm = TRUE,
              type = 3)
cuts <- unique(q)
# cross-tabulate the age bins against the binary dependent
aggAge <- table(findInterval(dataSet$age,
                             vec = cuts,
                             rightmost.closed = FALSE),
                dataSet$y)
aggAge <- as.data.frame.matrix(aggAge)
aggAge$N <- rowSums(aggAge)
aggAge$WOE <- log((aggAge$`1`*sum(aggAge$`0`))/(aggAge$`0`*sum(aggAge$`1`)))
```

In the previous example I have used exactly the approach to binning X (age, in this case) that is used in the R package {Information}, whose application I want to illustrate later. The table() call provides the conditional distributions like the ones shown in the table above. The computation of WoE is then straightforward, as exemplified in the last line. However, you will want to spare yourself from computing the WoE in this way for many variables in the dataset, and that's where {Information} in R comes in handy; for the same dataset:

```r
library(Information)

# - Information value: all variables
infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = T)

# - WOE table:
infoTables$Tables$age$WOE
```

with the respective data frames in infoTables$Tables standing for the variables in the dataset.

**Information Value**

A straightforward definition of the Information Value (IV) of a variable is provided in the {Information} package vignette:

$$IV = \sum_{j=1}^{n} \left( P(x_j \mid Y = 1) - P(x_j \mid Y = 0) \right) \times WoE_j$$

In effect, this means that we are summing across the individual WoE values (i.e. for each bin j of X) and weighting them by the respective differences between P(xj|Y=1) and P(xj|Y=0). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have low predictive power.
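The weighted sum just described can be sketched in a few lines of R. The counts are again hypothetical (three bins, chosen only for illustration), and the names n0, n1, woe and ivBins are my own, not part of the {Information} package.

```r
# Hypothetical counts per bin of X (illustration only)
n0 <- c(400, 350, 250)  # counts of x_j among Y = 0
n1 <- c(20, 50, 30)     # counts of x_j among Y = 1

# Conditional distributions of X given Y
p1 <- n1 / sum(n1)      # P(x_j | Y = 1)
p0 <- n0 / sum(n0)      # P(x_j | Y = 0)

# WoE per bin, then each bin's contribution to the IV
woe    <- log(p1 / p0)
ivBins <- (p1 - p0) * woe

# Information Value of X: the sum of the per-bin contributions
iv <- sum(ivBins)
print(iv)
```

Note that every term in the sum is non-negative: the difference p1 - p0 and the logarithm of p1 / p0 always share the same sign, so bins that deviate in either direction add to the IV.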

Using {Information} in R, for the dataset under consideration:

```r
library(Information)
library(ggplot2)

# - Information value: all variables
infoTables <- create_infotables(data = dataSet,
                                y = "y",
                                bins = 10,
                                parallel = T)

# - Plot IV
plotFrame <- infoTables$Summary[order(-infoTables$Summary$IV), ]
plotFrame$Variable <- factor(plotFrame$Variable,
                             levels = plotFrame$Variable[order(-plotFrame$IV)])
ggplot(plotFrame, aes(x = Variable, y = IV)) +
  geom_bar(width = .35, stat = "identity", color = "darkblue", fill = "white") +
  ggtitle("Information Value") +
  theme_bw() +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.text.x = element_text(angle = 90))
```

You may have noted the usage of parallel = T in the create_infotables() call; by default, the {Information} package will try to use all available cores to speed up the computations. Besides the basic package functionality that I have illustrated, the package provides a natural way of dealing with uplift models, where the computation of the IVs for control vs. treatment designs is nicely automated. Cross-validation procedures are also built in.

However, now that we know that we have a nice, working package for WoE and IV estimation in R, let's restrain ourselves from using it to perform automatic feature selection for models like binary logistic regression. While the information-theoretic measures discussed here truly assess the predictive power of a single predictor in binary classification, building a model that encompasses multiple terms is another story. Do not be disappointed if you find that the AICs for the full models are still lower than those for the nested models obtained by feature selection based on the IV values; although they can provide useful guidelines, WoE and IV are not even meant to be used that way (I've tried... once with the dataset used in the previous examples, and then with the two {Information} built-in datasets; not too much of a success, as you may have guessed).
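For readers who want to run this kind of check themselves, the comparison of a full model against a nested one can be sketched on simulated data. Everything here is an assumption made for illustration: the data are synthetic, the names simData, x1, x2 and x3 are hypothetical, and x3 is dropped as if an IV screen had flagged it as weak. Which model ends up with the lower AIC depends on the data at hand.

```r
set.seed(42)

# Simulated predictors and a binary outcome (illustration only)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)  # deliberately given a small effect below
y  <- rbinom(n, size = 1, prob = plogis(-1 + 1.0 * x1 + 0.5 * x2 + 0.1 * x3))
simData <- data.frame(y, x1, x2, x3)

# Full model vs. a nested model without the "screened out" predictor
fullModel    <- glm(y ~ x1 + x2 + x3, family = binomial, data = simData)
reducedModel <- glm(y ~ x1 + x2, family = binomial, data = simData)

# Compare the two fits: AIC side by side, plus a likelihood ratio test
aicTable <- AIC(fullModel, reducedModel)
print(aicTable)
print(anova(reducedModel, fullModel, test = "Chisq"))
```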

**References**

* For parametric models, you would need to integrate over the full parameter space, of course; taking the MLEs would result in obtaining the standard LR test.

** The dataset is considered in S. Moro, P. Cortez and P. Rita (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. I have obtained it from: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (N.B. https://archive.ics.uci.edu/ml/machine-learning-databases/00222/, file: bank-additional.zip); a nice description of the dataset is found at: http://www2.1010data.com/documentationcenter/beta/Tutorials/MachineLearningExamples/BankMarketingDataSet.html

Goran S. Milovanović, PhD

Data Science Consultant, SmartCat
