
In credit scoring, Information Value (IV) is frequently used to compare predictive power among variables. When developing new scorecards using logistic regression, variables are often binned and recoded using the Weight of Evidence (WoE) concept. The riv package will help you assess the predictive power of variables, examine WoE patterns, and recode raw variables to WoE.

## Introduction

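I assume that the reader has some basic experience in credit scoring. One of our goals when binning variables is to maximize Information Value. Weight of Evidence (WoE) for a single bin $i$ is defined as:

$$\mathrm{WoE}_i = \ln\left(\frac{\%\mathrm{Good}_i}{\%\mathrm{Bad}_i}\right)$$

where $\%\mathrm{Good}_i$ is the share of all good observations that fall into bin $i$, and $\%\mathrm{Bad}_i$ is the corresponding share of all bad observations. Information Value for a variable is defined as:

$$\mathrm{IV} = \sum_{i=1}^{n} \left(\%\mathrm{Good}_i - \%\mathrm{Bad}_i\right)\,\mathrm{WoE}_i$$

where n is the number of bins.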

To illustrate the concept, here is an example for the variable age from the German credit scoring dataset:

Class         Good  Bad  %Good   %Bad  Odds    WoE   MIV
(;25.5)        110   80  15.7%  26.7%  0.59  -0.53  0.06
<25.5;27.5)     74   27  10.6%   9.0%  1.17   0.16  0.00
<27.5;34.5)    172   85  24.6%  28.3%  0.87  -0.14  0.01
<34.5;38.5)    108   24  15.4%   8.0%  1.93   0.66  0.05
<38.5;)        236   84  33.7%  28.0%  1.20   0.19  0.01
IV: 0.13

A total Information Value of 0.13 indicates medium predictive power.
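The age example can be checked by hand in base R. The sketch below uses only the Good and Bad counts from the table; the variable names are mine and nothing here comes from the riv package.

```r
# Counts per age bin, copied from the table above
good <- c(110, 74, 172, 108, 236)
bad  <- c(80, 27, 85, 24, 84)

pct_good <- good / sum(good)   # share of all goods falling into each bin
pct_bad  <- bad / sum(bad)     # share of all bads falling into each bin

woe <- log(pct_good / pct_bad)       # Weight of Evidence per bin
miv <- (pct_good - pct_bad) * woe    # contribution of each bin to IV
iv  <- sum(miv)                      # Information Value of the variable

print(round(data.frame(good, bad, pct_good, pct_bad, woe, miv), 2))
print(round(iv, 2))   # 0.13
```

Every term in the sum is non-negative, because the difference of shares and the logarithm of their ratio always have the same sign.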

## riv Package

The riv package helps you analyze WoE patterns and Information Value for a whole modeling dataset. Its main features are:

• calculate Information Value for variable(s)
• recode original variables to WoE
• plot WoE patterns for variable(s)
• plot Information Value for variable(s)

One of the best features of the riv package is automated binning of numeric variables. It uses the rpart package and lets the user pass specific rpart.control() values. The “German Credit Data” dataset, which is also part of the package, is used for testing.

### Install package

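riv is hosted on GitHub, and I prefer to use devtools for installation. Note that the repository is called riv, but the package it installs is loaded as woe:

library(devtools)
# install from GitHub; current devtools expects the "user/repo" form
install_github("tomasgreif/riv")
library(woe)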

### Calculate Information Value

We can use the function iv.mult() to calculate Information Value for all variables in a data frame:

iv.mult(german_data,"gb",TRUE)

This will print the following table:

                    Variable InformationValue Bins ZeroBins    Strength
1                  ca_status      0.666011503    4        0 Very strong
2             credit_history      0.293233547    5        0      Strong
3                   duration      0.259146834    5        0      Strong
4              credit_amount      0.207970035    5        0      Strong
5                    savings      0.196009557    5        0     Average
6                    purpose      0.169195066   10        0     Average
7                        age      0.125210683    5        0     Average
8                   property      0.112638262    4        0     Average
9   present_employment_since      0.086433631    5        0        Weak
10                   housing      0.083293434    3        0        Weak
11         other_installment      0.057614542    3        0        Weak
12                status_sex      0.044670678    5        1        Weak
13            foreign_worker      0.043877412    2        0        Weak
14             other_debtors      0.032019322    3        0        Weak
15   installment_rate_income      0.023858552    2        0        Weak
16          existing_credits      0.010083557    2        0   Wery weak
17                       job      0.008762766    4        0   Wery weak
18                 telephone      0.006377605    2        0   Wery weak
19 liable_maintenance_people      0.000000000    1        0   Wery weak
20   present_residence_since      0.000000000    1        0   Wery weak

The output has five columns: variable name, Information Value, number of bins, number of bins where the count of either good or bad observations is zero, and an overall assessment of predictive strength. The variables duration, credit_amount and age are numeric, so riv fitted an rpart model to choose their bins.
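The Strength column can be approximated with cut(). The cutoffs below (0.02, 0.1, 0.2, 0.5) are an assumption chosen only to agree with the table above, not values read from the package source, and iv_strength() is a hypothetical helper.

```r
# Assumed cutoffs, chosen only to be consistent with the printed table
iv_strength <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.2, 0.5, Inf),
      labels = c("Very weak", "Weak", "Average", "Strong", "Very strong"),
      right  = FALSE)   # intervals are closed on the left: [0.02, 0.1) etc.
}

# A few Information Values taken from the table above (rounded)
iv <- c(ca_status = 0.666, credit_amount = 0.208, savings = 0.196,
        housing = 0.083, job = 0.009)
print(data.frame(InformationValue = iv, Strength = iv_strength(iv)))
```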

### Plot Results

There is a simple function, iv.plot.summary(), that we will use to plot the results of iv.mult():

iv.plot.summary(iv.mult(german_data,"gb",TRUE))

This will result in:

[Figure: Information Value summary plot produced by iv.plot.summary()]

### Analyze individual variables

In scorecard development it is important for WoE to follow a logical trend across bins. With riv you can analyze WoE patterns for one or more variables. If you need only specific variables, you can use the vars parameter:

options(digits=2)
iv.mult(german_data,"gb",vars=c("housing","duration"))

This will result in:

[[1]]
  variable    class outcome_0 outcome_1 pct_1 pct_0 odds   woe   miv
1  housing     rent       109        70  0.23 0.156 0.67 -0.40 0.031
2  housing      own       527       186  0.62 0.753 1.21  0.19 0.026

[[2]]
  variable       class outcome_0 outcome_1 pct_1 pct_0 odds    woe     miv
1 duration     (;11.5)       153        27  0.09  0.22 2.43  0.887 1.1e-01
2 duration <11.5;15.5)       189        62  0.21  0.27 1.31  0.267 1.7e-02
3 duration   <15.5;19)        72        43  0.14  0.10 0.72 -0.332 1.3e-02
4 duration   <19;34.5)       198        86  0.29  0.28 0.99 -0.013 5.1e-05
5 duration     <34.5;)        88        82  0.27  0.13 0.46 -0.777 1.1e-01

Column descriptions:

• variable - variable name
• class - name of bin (interval from rpart tree for numeric variables, variable value otherwise)
• outcome_0 - number of good observations
• outcome_1 - number of bad observations
• pct_1 - bad observations in bin / total bad observations
• pct_0 - good observations in bin / total good observations
• odds - pct_0/pct_1
• woe - Weight of Evidence - calculated as ln(odds)
• miv - Marginal Information Value - calculated as ln(odds) * (pct_0 - pct_1)
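As a cross-check on how the columns relate, the duration table can be rebuilt from its two count columns alone. This is a base R sketch rather than the package's own code; it takes pct_0 as the share of good observations and pct_1 as the share of bad observations, which is what the printed numbers imply (for the first bin, 153/700 ≈ 0.22 and 27/300 = 0.09).

```r
# Counts for duration, copied from the output above
tab <- data.frame(
  class     = c("(;11.5)", "<11.5;15.5)", "<15.5;19)", "<19;34.5)", "<34.5;)"),
  outcome_0 = c(153, 189, 72, 198, 88),   # good observations
  outcome_1 = c(27, 62, 43, 86, 82)       # bad observations
)

tab$pct_1 <- tab$outcome_1 / sum(tab$outcome_1)   # share of all bads in the bin
tab$pct_0 <- tab$outcome_0 / sum(tab$outcome_0)   # share of all goods in the bin
tab$odds  <- tab$pct_0 / tab$pct_1
tab$woe   <- log(tab$odds)
tab$miv   <- tab$woe * (tab$pct_0 - tab$pct_1)

print(tab, digits = 2)
print(sum(tab$miv))   # Information Value for duration
```

The sum of the miv column agrees with the Information Value reported for duration in the summary table (about 0.259).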

You can also plot WoE patterns with the iv.plot.woe() function:

iv.plot.woe(iv.mult(german_data,"gb",vars=c("housing","duration"),summary=FALSE))

### Control rpart parameters

For numeric variables you can pass your own rpart.control(). I will illustrate this for the variable duration and the complexity parameter cp:

library(rpart)  # makes rpart.control() available
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.02))
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.005))
iv.num(german_data,"duration","gb",rcontrol=rpart.control(cp=.001))

This is the result of the previous commands. Note how the number of leaves increases as cp decreases:

  variable   class outcome_0 outcome_1 pct_1 pct_0 odds   woe   miv
1 duration (;34.5)       612       218  0.73  0.87 1.20  0.18 0.027
2 duration <34.5;)        88        82  0.27  0.13 0.46 -0.78 0.115

  variable       class outcome_0 outcome_1 pct_1 pct_0 odds    woe     miv
1 duration     (;11.5)       153        27  0.09  0.22 2.43  0.887 0.11408
2 duration <11.5;34.5)       459       191  0.64  0.66 1.03  0.029 0.00056
3 duration     <34.5;)        88        82  0.27  0.13 0.46 -0.777 0.11465

   variable       class outcome_0 outcome_1  pct_1 pct_0 odds    woe     miv
1  duration      (;8.5)        84        10 0.0333 0.120 3.60  1.281 1.1e-01
2  duration   <8.5;9.5)        35        14 0.0467 0.050 1.07  0.069 2.3e-04
3  duration  <9.5;11.5)        34         3 0.0100 0.049 4.86  1.580 6.1e-02
4  duration <11.5;12.5)       130        49 0.1633 0.186 1.14  0.128 2.9e-03
5  duration <12.5;15.5)        59        13 0.0433 0.084 1.95  0.665 2.7e-02
6  duration   <15.5;19)        72        43 0.1433 0.103 0.72 -0.332 1.3e-02
7  duration   <19;20.5)         7         1 0.0033 0.010 3.00  1.099 7.3e-03
8  duration <20.5;34.5)       191        85 0.2833 0.273 0.96 -0.038 3.9e-04
9  duration <34.5;37.5)        46        37 0.1233 0.066 0.53 -0.630 3.6e-02
10 duration <37.5;43.5)        12         5 0.0167 0.017 1.03  0.028 1.3e-05
11 duration     <43.5;)        30        40 0.1333 0.043 0.32 -1.135 1.0e-01
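The effect of cp can also be seen with rpart alone, without the woe package. The sketch below fits a one-variable classification tree on simulated data and counts terminal nodes for the same three cp values. The data, the assumed link between duration and the bad rate, and the helper count_leaves() are all made up for illustration; this is not the package's internal binning code.

```r
library(rpart)

# Simulated data: the bad rate rises with duration (made-up relationship)
set.seed(1)
n        <- 1000
duration <- sample(4:72, n, replace = TRUE)
p_bad    <- plogis(-2 + 0.04 * duration)
toy      <- data.frame(
  duration = duration,
  gb       = factor(ifelse(runif(n) < p_bad, "bad", "good"))
)

# Hypothetical helper: number of terminal nodes (bins) for a given cp
count_leaves <- function(cp) {
  fit <- rpart(gb ~ duration, data = toy, method = "class",
               control = rpart.control(cp = cp))
  sum(fit$frame$var == "<leaf>")
}

cps    <- c(0.02, 0.005, 0.001)
leaves <- sapply(cps, count_leaves)
names(leaves) <- cps
print(leaves)
```

Each leaf corresponds to one bin, so a lower cp can only keep or add bins, never remove them. Very small values tend to produce bins with few observations, such as the bin <19;20.5) above with only 8 cases.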

### Recoding Variables

Before running a logistic regression model, we would like to recode variables to WoE. For this task, we use the function iv.replace.woe(). I will use a smaller dataset to illustrate this:

> german_data_small <- german_data[c("duration","ca_status","credit_amount","gb")]
> str(german_data_small)
'data.frame':	1000 obs. of  4 variables:
 $ duration     : int  6 48 12 42 24 36 24 36 12 30 ...
 $ ca_status    : Factor w/ 4 levels "(;0DM)","<0DM;200DM)",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ credit_amount: int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ gb           : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
> german_data_small_woe <- iv.replace.woe(german_data_small,iv=iv.mult(german_data_small,"gb"))
> str(german_data_small_woe)
'data.frame':	1000 obs. of  7 variables:
 $ duration         : int  6 6 6 6 6 6 6 6 6 6 ...
 $ ca_status        : Factor w/ 4 levels "(;0DM)","<0DM;200DM)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ credit_amount    : int  338 343 428 448 609 662 666 860 1169 1198 ...
 $ gb               : Factor w/ 2 levels "bad","good": 2 2 2 1 2 2 2 2 2 1 ...
 $ duration_woe     : num  0.887 0.887 0.887 0.887 0.887 ...
 $ ca_status_woe    : num  -0.818 -0.818 -0.818 -0.818 -0.818 ...
 $ credit_amount_woe: num  -0.076 -0.076 -0.076 -0.076 -0.076 ...

You can see that the function iv.replace.woe() added three columns: duration_woe, ca_status_woe and credit_amount_woe.
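To make the recoding step concrete, here is a minimal base R sketch of the same idea for a single factor variable: compute WoE per level, then look it up for every row and append it as a new column. The toy data and the helper add_woe_column() are hypothetical and only mimic the naming of the output above; they are not how iv.replace.woe() is implemented.

```r
# Toy data; values are made up for illustration only
toy <- data.frame(
  housing = factor(c("rent", "own", "own", "rent", "own", "free", "own", "free")),
  gb      = factor(c("bad", "good", "good", "good", "bad", "good", "good", "bad"))
)

# Hypothetical helper: append a <variable>_woe column for one factor variable
add_woe_column <- function(df, var, target, good = "good") {
  is_good  <- df[[target]] == good
  pct_good <- tapply(is_good, df[[var]], sum) / sum(is_good)
  pct_bad  <- tapply(!is_good, df[[var]], sum) / sum(!is_good)
  woe      <- log(pct_good / pct_bad)   # one WoE value per level
  # look up each row's level in the WoE vector
  df[[paste0(var, "_woe")]] <- unname(woe[as.character(df[[var]])])
  df
}

toy_woe <- add_woe_column(toy, "housing", "gb")
str(toy_woe)
```

A level with zero good or zero bad observations would give an infinite WoE in this sketch, which is why the ZeroBins column in the iv.mult() summary is worth checking before recoding.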

### Using help

Because woe is a standard R package, there is documentation for every function. This is the complete list of available functions:

• iv.num  - calculate WoE/IV for numeric variables
• iv.str - calculate WoE/IV for character/factor variables
• iv.mult - calculate WoE/IV, summary IV for one or more variables
• iv.plot.summary - plot IV summary
• iv.plot.woe - plot WoE patterns for one or more variables
• iv.replace.woe - recode original variables to WoE (adds new columns)

### Final thoughts

I created this package mainly for learning purposes. It was fun learning how to use GitHub, devtools and RStudio to create a package. In another post there is a short tutorial on how to start your own package. You can also fork riv on GitHub and improve the package on your own, or send changes to my repository. I appreciate any feedback or comments.