Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In credit risk modeling and binary classification applications, analysts employ gains tables (also known as KS tables) as a fundamental tool for measuring and quantifying model performance. This tutorial dives into the construction and interpretion of gains tables using R. ## Theoretical Foundation: Understanding Gains Tables
A gains table systematically discretizes the population (typically a validation or test dataset) into groups based on the model’s output predictions (probability scores, log odds, or risk scores). Each group conventionally represents 10% of the total population (deciles), though alternative binning strategies may be employed. The output presents summary statistics for each group and analyzes the cumulative distributions of events (defaults) and non-events to quantify the model’s discriminatory performance.
< section id="package-dependencies" class="level2">Package Dependencies
# Load required packages library(dplyr) library(magrittr) library(knitr) library(scales)
Dataset Preparation
This tutorial utilizes a sample from the Lending Club dataset, which contains comprehensive loan information and associated outcomes suitable for credit risk modeling applications.
# Load the sample data sample <- read.csv("https://bit.ly/42ypcnJ") # Check dimensions dim(sample)
[1] 10000 153
Target Definition
The initial step requires the creation of a binary target variable for modeling purposes. In this credit risk application, we identify borrowers who defaulted on their loan obligations.
# Check unique loan statuses unique(sample$loan_status)
[1] "Fully Paid" [2] "Current" [3] "Charged Off" [4] "Late (31-120 days)" [5] "Late (16-30 days)" [6] "In Grace Period" [7] "Does not meet the credit policy. Status:Fully Paid" [8] "Does not meet the credit policy. Status:Charged Off"
# Define "bad" loans as those that are charged off codes <- c("Charged Off", "Does not meet the credit policy. Status:Charged Off") # Create a binary flag for defaults sample %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0)) # Check overall event rates sample %>% summarise(events = sum(bad_flag == 1), non_events = sum(bad_flag == 0)) %>% mutate( event_rate = events/(events + non_events))
events non_events event_rate 1 1162 8838 0.1162
Model Development
Subsequently, we develop a logistic regression model to generate predictions that will serve as the foundation for gains table construction.
# Replace NA values with a default value sample[is.na(sample)] <- -1 # Clean the data sample %<>% # Remove cases where home ownership and payment plan are not reported filter(!home_ownership %in% c("", "NONE"), pymnt_plan != "") %>% # Convert categorical variables to factors mutate(home_ownership = factor(home_ownership), pymnt_plan = factor(pymnt_plan)) # Train-test split (70-30) idx <- sample(1:nrow(sample), size = 0.7 * nrow(sample), replace = FALSE) train <- sample[idx,] test <- sample[-idx,]
# Build a logistic regression model mdl <- glm( formula = bad_flag ~ loan_amnt + term + mths_since_last_delinq + total_pymnt + home_ownership + acc_now_delinq + inq_last_6mths + delinq_amnt + mths_since_last_record + mths_since_recent_revol_delinq + mths_since_last_major_derog + mths_since_recent_inq + mths_since_recent_bc + num_accts_ever_120_pd, family = "binomial", data = train ) # Generate predictions on the test set test$pred <- predict(mdl, newdata = test)
Gains Table Construction
The following section demonstrates the step-by-step construction of a comprehensive gains table through systematic binning and statistical analysis.
< section id="population-discretization-into-bins" class="level3">Population Discretization into Bins
# Create deciles based on model predictions q <- quantile(test$pred, probs = seq(0, 1, length.out = 11)) # Add bins to test dataset test$bins <- cut(test$pred, breaks = q, include.lowest = TRUE, right = TRUE, ordered_result = TRUE) # Check the bin levels (note they're in increasing order) levels(test$bins)
[1] "[-5.11,-3.34]" "(-3.34,-2.89]" "(-2.89,-2.64]" "(-2.64,-2.41]" [5] "(-2.41,-2.23]" "(-2.23,-2.03]" "(-2.03,-1.83]" "(-1.83,-1.6]" [9] "(-1.6,-1.26]" "(-1.26,2.51]"
Basic Statistical Measures by Segment
# Create initial gains table with counts gains_table <- test %>% group_by(bins) %>% summarise(total = n(), events = sum(bad_flag == 1), non_events = sum(bad_flag == 0)) # Add event rate column gains_table %<>% mutate(event_rate = percent(events / total, 0.1, 100)) # Display the table kable(gains_table)
bins | total | events | non_events | event_rate |
---|---|---|---|---|
[-5.11,-3.34] | 300 | 3 | 297 | 1.0% |
(-3.34,-2.89] | 300 | 8 | 292 | 2.7% |
(-2.89,-2.64] | 300 | 10 | 290 | 3.3% |
(-2.64,-2.41] | 300 | 14 | 286 | 4.7% |
(-2.41,-2.23] | 300 | 26 | 274 | 8.7% |
(-2.23,-2.03] | 300 | 38 | 262 | 12.7% |
(-2.03,-1.83] | 300 | 46 | 254 | 15.3% |
(-1.83,-1.6] | 300 | 49 | 251 | 16.3% |
(-1.6,-1.26] | 300 | 58 | 242 | 19.3% |
(-1.26,2.51] | 300 | 74 | 226 | 24.7% |
Cumulative Distribution
# Add population percentage and cumulative distributions gains_table %<>% mutate(pop_pct = percent(total/sum(total), 0.1, 100), # Calculate cumulative percentages c.events_pct = cumsum(events) / sum(events), c.non_events_pct = cumsum(non_events) / sum(non_events)) # Display the updated table kable(gains_table)
bins | total | events | non_events | event_rate | pop_pct | c.events_pct | c.non_events_pct |
---|---|---|---|---|---|---|---|
[-5.11,-3.34] | 300 | 3 | 297 | 1.0% | 10.0% | 0.0092025 | 0.1110696 |
(-3.34,-2.89] | 300 | 8 | 292 | 2.7% | 10.0% | 0.0337423 | 0.2202693 |
(-2.89,-2.64] | 300 | 10 | 290 | 3.3% | 10.0% | 0.0644172 | 0.3287210 |
(-2.64,-2.41] | 300 | 14 | 286 | 4.7% | 10.0% | 0.1073620 | 0.4356769 |
(-2.41,-2.23] | 300 | 26 | 274 | 8.7% | 10.0% | 0.1871166 | 0.5381451 |
(-2.23,-2.03] | 300 | 38 | 262 | 12.7% | 10.0% | 0.3036810 | 0.6361257 |
(-2.03,-1.83] | 300 | 46 | 254 | 15.3% | 10.0% | 0.4447853 | 0.7311144 |
(-1.83,-1.6] | 300 | 49 | 251 | 16.3% | 10.0% | 0.5950920 | 0.8249813 |
(-1.6,-1.26] | 300 | 58 | 242 | 19.3% | 10.0% | 0.7730061 | 0.9154824 |
(-1.26,2.51] | 300 | 74 | 226 | 24.7% | 10.0% | 1.0000000 | 1.0000000 |
Performance Metrics
# Add KS statistic, capture rate, and cumulative event rate gains_table %<>% mutate( # KS statistic (difference between cumulative distributions) ks = round(abs(c.events_pct - c.non_events_pct), 2), # Capture rate (percentage of total events captured) cap_rate = percent(cumsum(events)/sum(events), 1, 100), # Cumulative event rate c_event_rate = percent(cumsum(events)/cumsum(total), 0.1, 100), # Format percentage columns c.events_pct = percent(c.events_pct, 0.1, 100), c.non_events_pct = percent(c.non_events_pct, 0.1, 100)) # Display the final table kable(gains_table)
bins | total | events | non_events | event_rate | pop_pct | c.events_pct | c.non_events_pct | ks | cap_rate | c_event_rate |
---|---|---|---|---|---|---|---|---|---|---|
[-5.11,-3.34] | 300 | 3 | 297 | 1.0% | 10.0% | 0.9% | 11.1% | 0.10 | 1% | 1.0% |
(-3.34,-2.89] | 300 | 8 | 292 | 2.7% | 10.0% | 3.4% | 22.0% | 0.19 | 3% | 1.8% |
(-2.89,-2.64] | 300 | 10 | 290 | 3.3% | 10.0% | 6.4% | 32.9% | 0.26 | 6% | 2.3% |
(-2.64,-2.41] | 300 | 14 | 286 | 4.7% | 10.0% | 10.7% | 43.6% | 0.33 | 11% | 2.9% |
(-2.41,-2.23] | 300 | 26 | 274 | 8.7% | 10.0% | 18.7% | 53.8% | 0.35 | 19% | 4.1% |
(-2.23,-2.03] | 300 | 38 | 262 | 12.7% | 10.0% | 30.4% | 63.6% | 0.33 | 30% | 5.5% |
(-2.03,-1.83] | 300 | 46 | 254 | 15.3% | 10.0% | 44.5% | 73.1% | 0.29 | 44% | 6.9% |
(-1.83,-1.6] | 300 | 49 | 251 | 16.3% | 10.0% | 59.5% | 82.5% | 0.23 | 60% | 8.1% |
(-1.6,-1.26] | 300 | 58 | 242 | 19.3% | 10.0% | 77.3% | 91.5% | 0.14 | 77% | 9.3% |
(-1.26,2.51] | 300 | 74 | 226 | 24.7% | 10.0% | 100.0% | 100.0% | 0.00 | 100% | 10.9% |
Reusable Function
The following implementation encapsulates the gains table construction process within a comprehensive, reusable function suitable for any binary classification model evaluation:
gains_table <- function(act, pred, increasing = TRUE, nBins = 10) { # Create bins based on predictions q <- quantile(pred, probs = seq(0, 1, length.out = nBins + 1)) bins <- cut(pred, breaks = q, include.lowest = TRUE, right = TRUE, ordered_result = TRUE) df <- data.frame(act, pred, bins) df %>% # Group by bins and calculate statistics group_by(bins) %>% summarise(total = n(), events = sum(act == 1), non_events = sum(act == 0)) %>% mutate(event_rate = percent(events / total, 0.1, 100)) %>% # Sort the table based on the 'increasing' parameter {if(increasing == TRUE) { arrange(., bins) } else { arrange(., desc(bins)) }} %>% # Add all performance metrics mutate(pop_pct = percent(total/sum(total), 0.1, 100), c.events_pct = cumsum(events) / sum(events), c.non_events_pct = cumsum(non_events) / sum(non_events), ks = round(abs(c.events_pct - c.non_events_pct), 2), cap_rate = percent(cumsum(events)/sum(events), 1, 100), c_event_rate = percent(cumsum(events)/cumsum(total), 0.1, 100), c.events_pct = percent(c.events_pct, 0.1, 100), c.non_events_pct = percent(c.non_events_pct, 0.1, 100)) }
Function Implementation
# Generate a gains table with bins in descending order tab <- gains_table(test$bad_flag, test$pred, FALSE, 10) kable(tab)
bins | total | events | non_events | event_rate | pop_pct | c.events_pct | c.non_events_pct | ks | cap_rate | c_event_rate |
---|---|---|---|---|---|---|---|---|---|---|
(-1.26,2.51] | 300 | 74 | 226 | 24.7% | 10.0% | 22.7% | 8.5% | 0.14 | 23% | 24.7% |
(-1.6,-1.26] | 300 | 58 | 242 | 19.3% | 10.0% | 40.5% | 17.5% | 0.23 | 40% | 22.0% |
(-1.83,-1.6] | 300 | 49 | 251 | 16.3% | 10.0% | 55.5% | 26.9% | 0.29 | 56% | 20.1% |
(-2.03,-1.83] | 300 | 46 | 254 | 15.3% | 10.0% | 69.6% | 36.4% | 0.33 | 70% | 18.9% |
(-2.23,-2.03] | 300 | 38 | 262 | 12.7% | 10.0% | 81.3% | 46.2% | 0.35 | 81% | 17.7% |
(-2.41,-2.23] | 300 | 26 | 274 | 8.7% | 10.0% | 89.3% | 56.4% | 0.33 | 89% | 16.2% |
(-2.64,-2.41] | 300 | 14 | 286 | 4.7% | 10.0% | 93.6% | 67.1% | 0.26 | 94% | 14.5% |
(-2.89,-2.64] | 300 | 10 | 290 | 3.3% | 10.0% | 96.6% | 78.0% | 0.19 | 97% | 13.1% |
(-3.34,-2.89] | 300 | 8 | 292 | 2.7% | 10.0% | 99.1% | 88.9% | 0.10 | 99% | 12.0% |
[-5.11,-3.34] | 300 | 3 | 297 | 1.0% | 10.0% | 100.0% | 100.0% | 0.00 | 100% | 10.9% |
Interpretation
A properly constructed gains table provides multiple critical insights into model performance characteristics:
Monotonicity Assessment: Event rates should demonstrate consistent increases (or decreases) across bins, confirming the model’s effectiveness in rank-ordering risk levels.
Population Distribution: Consistent bin sizes (ideally ~10% each) indicate appropriate score distribution. Inconsistent sizes suggest score clustering, which may complicate threshold determination.
Kolmogorov-Smirnov (KS) Statistic: The maximum KS value represents the model’s discriminatory power. Higher values (approaching 1.0) indicate superior separation between positive and negative cases.
Capture Rate: Demonstrates the percentage of total events captured at each threshold, essential for operational decision-making.
Cumulative Event Rate: Indicates the event rate among all cases up to each bin, facilitating approval threshold establishment.
Applications in Credit Risk Analytics
Gains tables serve multiple critical functions in credit risk management environments:
Threshold Optimization: Identification of appropriate score thresholds for automated approval or rejection decisions.
Tiered Strategy Development: Construction of multi-tier decision strategies (approve, manual review, decline) based on quantified risk levels.
Model Performance Monitoring: Longitudinal tracking of model performance through comparison of actual versus expected distributions.
Comparative Model Evaluation: Systematic comparison of alternative models through KS statistics and capture rate analysis.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.