Improving Binning by Bootstrap Bumping

In a previous post (https://statcompute.wordpress.com/2018/11/23/more-robust-monotonic-binning-based-on-isotonic-regression), a more robust version of monotonic binning based on isotonic regression was introduced. Nonetheless, due to the loss of granularity, some predictive power was compromised, which reflects a typical dilemma in data science. On one hand, we don't want a learning algorithm that is so greedy that it over-fits the data at the cost of simplicity and generality. On the other hand, we'd also like to extract the most predictive power out of our data for better business results.

It is worth mentioning that, although there is a consensus that advanced ensemble algorithms can significantly improve prediction outcomes, both bagging and boosting would destroy the simple structure of binning outputs and therefore are not directly applicable in this simple case.

In light of the above considerations, the bumping (Bootstrap Umbrella of Model Parameters) procedure, detailed in "Model Search and Inference by Bootstrap Bumping" by Tibshirani and Knight (1997), serves both purposes. First of all, since each candidate binning structure is derived by isotonic regression on a bootstrap sample rather than on the training data itself, the concern about over-fitting the original training data is addressed. Secondly, by searching across all bootstrap samples, a closer-to-optimal solution is likely to be found. It is worth noting that, since the original sample is always included in the bumping procedure, the binning outcome with bumping is guaranteed to be at least as good as the one without.
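
In symbols, bumping fits a candidate model on each bootstrap sample but scores every candidate on the original data before picking the winner. A sketch of the selection rule, where the specific loss $L$ for the binning case (e.g., the negative information value) is my assumption rather than anything prescribed by the paper:

$$\hat{b} = \underset{b \in \{0, 1, \ldots, B\}}{\arg\min} \sum_{i=1}^{n} L\left(y_i, f_{\hat{\theta}^{(b)}}(x_i)\right)$$

Here $\hat{\theta}^{(b)}$ is the model fitted on the $b$-th bootstrap sample, $b = 0$ denotes the original sample, and the loss is evaluated on the original $n$ training observations.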

The R function bump_bin() is my implementation of the bumping procedure on top of the monotonic binning function based on isotonic regression. Because binning across bootstrap samples is mutually independent, bumping is a perfect use case for parallelism for the purpose of faster execution, as sketched below.
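
Below is a minimal sketch of what such a wrapper might look like. The helpers mono_bin() (returning the cut points of a monotonic binning fitted by isotonic regression) and eval_iv() (computing the information value of a given set of cut points on a data set) are hypothetical stand-ins for the functions from the earlier post; the sketch illustrates the structure of the procedure rather than the exact implementation.

library(parallel)

bump_bin <- function(df, n) {
  # reproducible RNG streams across forked workers
  RNGkind("L'Ecuyer-CMRG")
  set.seed(1)
  bmp <- function(b) {
    # b = 0 is the original sample, which guarantees that the bumping
    # outcome is at least as good as the binning without bumping
    smp <- if (b == 0) df else df[sample(nrow(df), nrow(df), replace = TRUE), ]
    cut <- mono_bin(smp)      # derive cut points from the bootstrap sample
    iv  <- eval_iv(df, cut)   # ... but score them on the ORIGINAL data
    list(cut = cut, iv = iv)
  }
  # binning across bootstrap samples is mutually independent -> parallelize
  # (on Windows, substitute parLapply() or set mc.cores = 1)
  out <- mclapply(0:n, bmp, mc.cores = detectCores())
  # keep the candidate binning with the highest information value
  out[[which.max(sapply(out, `[[`, "iv"))]]
}

The key design choice, following Tibshirani and Knight, is that each candidate binning is fitted on its own bootstrap sample but scored on the original data, which prevents the search from simply rewarding over-fit candidates.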

The output below shows the bumping result based on 20 bootstrap samples. There is a small improvement in the information value (0.8055 vs 0.8021 without bumping), achieved with a simpler binning structure (12 bins vs 20 bins), suggesting that bumping can buy a little extra predictability at a lower risk of over-fitting.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 565     92      41     51        92         41        51 0.0158   0.4457  0.5543  0.8039 -0.2183 -1.5742 0.0532
2    <= 620    470     269    201       562        310       252 0.0805   0.5723  0.4277  1.3383  0.2914 -1.0645 0.1172
3    <= 653    831     531    300      1393        841       552 0.1424   0.6390  0.3610  1.7700  0.5710 -0.7850 0.1071
4    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
5    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
6    <= 675    366     290     76      2154       1421       733 0.0627   0.7923  0.2077  3.8158  1.3391 -0.0168 0.0000
7    <= 699    805     649    156      2959       2070       889 0.1379   0.8062  0.1938  4.1603  1.4256  0.0696 0.0007
8    <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
9    <= 716    321     278     43      3592       2616       976 0.0550   0.8660  0.1340  6.4651  1.8664  0.5105 0.0122
10   <= 721    181     159     22      3773       2775       998 0.0310   0.8785  0.1215  7.2273  1.9779  0.6219 0.0099
11   <= 755    851     789     62      4624       3564      1060 0.1458   0.9271  0.0729 12.7258  2.5436  1.1877 0.1403
12      755    898     867     31      5522       4431      1091 0.1538   0.9655  0.0345 27.9677  3.3311  1.9751 0.3178
13  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
14    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8055

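For reference, the WoE and IV columns above follow the standard definitions based on the distributions of goods and bads. For instance, the first bin (Good = 41, Bad = 51, with column totals 4641 and 1196 from the Total row) reproduces as:

good <- 41; bad <- 51
woe <- log((good / 4641) / (bad / 1196))
# [1] -1.5742  (matches the WoE column; also equals LnOdds - total LnOdds)
iv <- (good / 4641 - bad / 1196) * woe
# [1] 0.0532   (matches the IV column; the Total IV of 0.8055 is the sum over bins)
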
The output below is based on bumping with 200 bootstrap samples. The information value has improved by roughly 2% (0.8174 vs 0.8021), again with a lower risk of over-fitting (14 bins vs 20 bins).

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 559     79      34     45        79         34        45 0.0135   0.4304  0.5696  0.7556 -0.2803 -1.6362 0.0496
2    <= 633    735     428    307       814        462       352 0.1259   0.5823  0.4177  1.3941  0.3323 -1.0237 0.1684
3    <= 637     86      53     33       900        515       385 0.0147   0.6163  0.3837  1.6061  0.4738 -0.8822 0.0143
4    <= 653    493     326    167      1393        841       552 0.0845   0.6613  0.3387  1.9521  0.6689 -0.6870 0.0477
5    <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
6    <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
7    <= 679    504     397    107      2292       1528       764 0.0863   0.7877  0.2123  3.7103  1.3111 -0.0448 0.0002
8    <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
9    <= 699    507     413     94      2959       2070       889 0.0869   0.8146  0.1854  4.3936  1.4802  0.1242 0.0013
10   <= 716    633     546     87      3592       2616       976 0.1084   0.8626  0.1374  6.2759  1.8367  0.4808 0.0216
11   <= 722    202     178     24      3794       2794      1000 0.0346   0.8812  0.1188  7.4167  2.0037  0.6478 0.0118
12   <= 746    619     573     46      4413       3367      1046 0.1060   0.9257  0.0743 12.4565  2.5222  1.1663 0.0991
13   <= 761    344     322     22      4757       3689      1068 0.0589   0.9360  0.0640 14.6364  2.6835  1.3276 0.0677
14      761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609  3.4739  2.1179 0.2979
15  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
16    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8174
