(This article was first published on **S+/R – Yet Another Blog in Statistical Computing**, and kindly contributed to R-bloggers)

In previous posts (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package) and (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I’ve developed two different algorithms for monotonic binning. The first tends to generate bins with equal densities, while the second defines finer bins based on isotonic regression.

The code snippet below illustrates a third approach, which generates bins with a roughly equal number of bads. Once again, for the reporting layer, I leveraged the flexible smbinning::smbinning.custom() function with a small tweak.
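To see the idea in isolation, here is a minimal sketch on simulated data. The data, the variable names, and the fixed choice of 4 bins are assumptions made purely for illustration; it uses base R `quantile()` rather than the `Hmisc::cut2()` call in the full function.

```r
set.seed(2017)

# Simulated data (hypothetical): continuous score, bads more likely at low scores
n     <- 5000
score <- rnorm(n, mean = 650, sd = 50)
bad   <- rbinom(n, 1, plogis((620 - score) / 40))

# Cut points are quantiles of the score among bads only
nbin <- 4
cuts <- quantile(score[bad == 1], probs = seq(0, 1, length.out = nbin + 1))

# Apply those cut points to every record; records outside the bads' range get NA
bins <- cut(score, breaks = cuts, include.lowest = TRUE)

bads_per_bin    <- tapply(bad, bins, sum)
records_per_bin <- tapply(bad, bins, length)
cbind(records = records_per_bin, bads = bads_per_bin)
```

Because the cut points come from the bads alone, the bad counts per bin should be nearly identical, while the record counts are free to differ from bin to bin.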

    df <- sas7bdat::read.sas7bdat("Downloads/accepts.sas7bdat")

    monobin <- function(df, x, y) {
      # Capture the unquoted column names passed as x and y
      yname <- deparse(substitute(y))
      xname <- deparse(substitute(x))
      d1 <- df[c(yname, xname)]
      # Keep only the bads (y == 1); cut points are derived from these records
      d2 <- d1[which(d1[[yname]] == 1), ]
      # Starting number of bins, limited by the largest share of bads at a single value of x
      nbin <- round(1 / max(table(d2[[xname]]) / sum(table(d2[[xname]]))))
      repeat {
        cuts <- Hmisc::cut2(d2[[xname]], g = nbin, onlycuts = TRUE)
        d1$cut <- cut(d1[[xname]], breaks = cuts, include.lowest = TRUE)
        d3 <- Reduce(rbind,
                     Map(function(x) data.frame(xmean = mean(x[[xname]], na.rm = TRUE),
                                                ymean = mean(x[[yname]])),
                         split(d1, d1$cut)))
        # Stop once the bad rate is monotonic across bins or only 2 bins remain
        if (abs(cor(d3$xmean, d3$ymean, method = "spearman")) == 1 || nrow(d3) == 2) {
          break
        }
        nbin <- nbin - 1
      }
      df$good <- 1 - d1[[yname]]
      return(smbinning::smbinning.custom(df, "good", xname, cuts = cuts[c(-1, -length(cuts))]))
    }
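Two pieces of the function may be easier to follow on toy numbers: the starting bin count and the stopping rule. The sketch below uses made-up values, and `x_bad` and `is_monotonic()` are hypothetical names introduced only for this illustration.

```r
# Toy data (hypothetical): score values of the bad records only
x_bad <- c(600, 600, 600, 610, 620, 630, 640, 650, 660, 670)

# Starting number of bins: the most frequent value holds 3 of the 10 bads,
# so the search starts from round(1 / 0.3) bins
nbin <- round(1 / max(prop.table(table(x_bad))))
nbin

# Stopping rule: a Spearman correlation of +1 or -1 between the per-bin means
# of x and y means the bad rate moves in one direction across bins
is_monotonic <- function(xmean, ymean) {
  abs(cor(xmean, ymean, method = "spearman")) == 1
}

is_monotonic(c(590, 630, 680), c(0.45, 0.30, 0.10))  # bad rate falls steadily
is_monotonic(c(590, 630, 680), c(0.45, 0.10, 0.30))  # bad rate dips, then rises
```

When the check fails, the function in the post lowers `nbin` by one and tries again, which is why coarser bins result whenever the finer ones are not monotonic.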

As shown in the output, the number of bads in each bin, with the exception of the missing bin, is similar and varies within a small range. However, the number of records tends to increase from bin to bin in order to ensure the monotonicity of bad rates across all bins.

    monobin(df, bureau_score, bad)
    #    Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
    #  1   <= 602    268     136    132       268        136       132 0.0459   0.5075  0.4925  1.0303 0.0299 -1.3261 0.1075
    #  2   <= 621    311     185    126       579        321       258 0.0533   0.5949  0.4051  1.4683 0.3841 -0.9719 0.0636
    #  3   <= 636    302     186    116       881        507       374 0.0517   0.6159  0.3841  1.6034 0.4722 -0.8838 0.0503
    #  4   <= 649    392     259    133      1273        766       507 0.0672   0.6607  0.3393  1.9474 0.6665 -0.6895 0.0382
    #  5   <= 661    387     268    119      1660       1034       626 0.0663   0.6925  0.3075  2.2521 0.8119 -0.5441 0.0227
    #  6   <= 676    529     415    114      2189       1449       740 0.0906   0.7845  0.2155  3.6404 1.2921 -0.0639 0.0004
    #  7   <= 693    606     491    115      2795       1940       855 0.1038   0.8102  0.1898  4.2696 1.4515  0.0956 0.0009
    #  8   <= 717
    #  9    > 717   1883    1775    108      5522       4431      1091 0.3226   0.9426  0.0574 16.4352 2.7994  1.4435 0.4217
    # 10  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
    # 11    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7508
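As a sanity check on how to read the Odds, WoE and IV columns, the sketch below recomputes them for the first bin from the counts in the output. The formulas are an assumption on my part, chosen because they are consistent with the figures shown, not a statement of how smbinning computes them internally; the variable names are mine.

```r
# Counts copied from the first row and the Total row of the output
good_bin <- 136;  bad_bin <- 132
good_all <- 4641; bad_all <- 1196

# Odds of good to bad within the bin
odds_bin <- good_bin / bad_bin

# WoE as the bin log-odds minus the overall log-odds
woe <- log(odds_bin) - log(good_all / bad_all)

# IV contribution: (share of goods - share of bads) times WoE
iv <- (good_bin / good_all - bad_bin / bad_all) * woe

round(c(Odds = odds_bin, WoE = woe, IV = iv), 4)
```

Under these assumptions the rounded values should agree with the first row of the table, and the per-bin IV contributions add up to the figure in the Total row.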
