A Prototype of Monotonic Binning Algorithm with R

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been asked many time if I have a piece of R code implementing the monotonic binning algorithm, similar to the one that I developed with SAS (http://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and with Python (http://statcompute.wordpress.com/2012/12/08/monotonic-binning-with-python). Today, I finally had time to draft a quick prototype with 20 lines of R code, which is however barely useable without the further polish. But it is still a little surprising to me how efficient it can be to use R in algorithm prototyping, much sleeker than SAS macro.

library(sas7bdat)
library(Hmisc)

bin <- function(x, y){
  n <- min(50, length(unique(x)))
  repeat {
    n   <- n - 1
    d1  <- data.frame(x, y, bin = cut2(x, g = n)) 
    d2  <- aggregate(d1[-3], d1[3], mean)
    cor <- cor(d2[-1], method = "spearman")
    if(abs(cor[1, 2]) == 1) break
  }
  d2[2] <- NULL
  colnames(d2) <- c('LEVEL', 'RATE')
  head <- paste(toupper(substitute(y)), " RATE by ", toupper(substitute(x)), sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  cat("| ", head, ' |\n', sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  print(d2)
  cat("\n")
}

data <- read.sas7bdat("C:\\Users\\liuwensui\\Downloads\\accepts.sas7bdat")
attach(data)

bin(bureau_score, bad)
bin(age_oldest_tr, bad)
bin(tot_income, bad)
bin(tot_tr, bad)

R output:

+--------------------------+
| BAD RATE by BUREAU_SCORE |
+--------------------------+
       LEVEL       RATE
1  [443,618) 0.44639376
2  [618,643) 0.38446602
3  [643,658) 0.31835938
4  [658,673) 0.23819302
5  [673,686) 0.19838057
6  [686,702) 0.17850288
7  [702,715) 0.14168378
8  [715,731) 0.09815951
9  [731,752) 0.07212476
10 [752,776) 0.05487805
11 [776,848] 0.02605210

+---------------------------+
| BAD RATE by AGE_OLDEST_TR |
+---------------------------+
       LEVEL       RATE
1  [  1, 34) 0.33333333
2  [ 34, 62) 0.30560928
3  [ 62, 87) 0.25145068
4  [ 87,113) 0.23346304
5  [113,130) 0.21616162
6  [130,149) 0.20036101
7  [149,168) 0.19361702
8  [168,198) 0.15530303
9  [198,245) 0.11111111
10 [245,308) 0.10700389
11 [308,588] 0.08730159

+------------------------+
| BAD RATE by TOT_INCOME |
+------------------------+
           LEVEL      RATE
1 [   0,   2570) 0.2498715
2 [2570,   4510) 0.2034068
3 [4510,8147167] 0.1602327

+--------------------+
| BAD RATE by TOT_TR |
+--------------------+
    LEVEL      RATE
1 [ 0,12) 0.2672370
2 [12,22) 0.1827676
3 [22,77] 0.1422764

To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)