Introduction
Credit scoring models typically perform best in the central regions of the score distribution, but become less reliable at the extremes, where data are sparse. This tutorial walks through using bootstrap resampling to quantify prediction variability across different score ranges, helping practitioners identify where their models are most dependable.
< section id="theoretical-foundation-understanding-estimation-variance" class="level2">Theoretical Foundation: Understanding Estimation Variance
Reduced sample sizes inherently result in increased variance in statistical estimates, particularly for extreme values within a distribution. While central tendency measures such as means demonstrate relative stability when estimated from limited data, tail percentiles (95th, 99th) exhibit substantially greater volatility. This phenomenon is of critical importance in credit scoring applications, where extremely high and low scores correspond to these unstable tail regions of the distribution.
# Number of samples to be drawn from a probability distribution
n_samples <- 1000

# Number of times sampling should be repeated
repeats <- 100

# Mean and std-dev of the normal distribution being sampled
mu <- 10
std_dev <- 1

# Sample
samples <- rnorm(n_samples * repeats, mean = mu, sd = std_dev)

# Fit into a matrix-like object with `n_samples` number of rows
# and `repeats` number of columns
samples <- matrix(samples, nrow = n_samples, ncol = repeats)

# Compute the mean across each row
sample_means <- apply(samples, 1, mean)

# Similarly, compute the 75%, 95% and 99% quantiles across each row
sample_75_quantile <- apply(samples, 1, quantile, probs = 0.75)
sample_95_quantile <- apply(samples, 1, quantile, probs = 0.95)
sample_99_quantile <- apply(samples, 1, quantile, probs = 0.99)

# Compare coefficient of variation
sd(sample_means) / mean(sample_means)
[1] 0.0100664
sd(sample_75_quantile)/mean(sample_75_quantile)
[1] 0.01270263
sd(sample_95_quantile) / mean(sample_95_quantile)
[1] 0.01888391
# Plot the distributions
combined_vec <- c(sample_means, sample_75_quantile,
                  sample_95_quantile, sample_99_quantile)

plot(density(sample_means),
     col = "#6F69AC", lwd = 3,
     main = "Estimating the mean vs tail quantiles",
     xlab = "",
     xlim = c(min(combined_vec), max(combined_vec)))
lines(density(sample_75_quantile), col = "#95DAC1", lwd = 3)
lines(density(sample_95_quantile), col = "#FFEBA1", lwd = 3)
lines(density(sample_99_quantile), col = "#FD6F96", lwd = 3)
grid()
legend("topright",
       fill = c("#6F69AC", "#95DAC1", "#FFEBA1", "#FD6F96"),
       legend = c("Mean", "75% Quantile", "95% Quantile", "99% Quantile"),
       cex = 0.7)
The visualization demonstrates the substantial increase in uncertainty when estimating extreme values compared to central tendencies. The distribution corresponding to the mean (purple) exhibits considerably narrower dispersion than that of the 99th percentile (pink). This statistical principle directly applies to credit scoring applications, where scores at the extremes of the distribution inherently possess greater uncertainty.
# Load required packages
library(dplyr)
library(magrittr)
library(rsample)
library(ggplot2)
Data Acquisition and Preprocessing
In this tutorial, we will utilize a sample from the Lending Club dataset. Loans classified as “Charged Off” will be designated as defaults. The observed class imbalance represents a typical characteristic of credit portfolios and contributes significantly to prediction challenges at distribution extremes.
# Load sample data (sample of the lending club data)
sample <- read.csv("http://bit.ly/42ypcnJ")

# Mark which loan status will be tagged as default
codes <- c("Charged Off",
           "Does not meet the credit policy. Status:Charged Off")

# Apply above codes and create target
sample %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))

# Replace missing values with a default value
sample[is.na(sample)] <- -1

# Get summary tally
table(sample$bad_flag)
   0    1
8838 1162
Implementing Bootstrap Resampling Strategy
This section demonstrates the creation of 100 bootstrap samples to quantify the variation in model predictions across different score ranges. Bootstrap resampling is a statistical technique that generates multiple simulated datasets to measure prediction uncertainty without requiring additional data collection.
# Create 100 samples
boot_sample <- bootstraps(data = sample, times = 100)

head(boot_sample, 3)
# A tibble: 3 × 2
  splits               id
  <list>               <chr>
1 <split [10000/3700]> Bootstrap001
2 <split [10000/3630]> Bootstrap002
3 <split [10000/3639]> Bootstrap003
# Each row represents a separate bootstrapped sample
# with an analysis set and an assessment set
boot_sample$splits[[1]]
<Analysis/Assess/Total>
<10000/3700/10000>
# Show the first 5 rows and 5 columns of the first sample
analysis(boot_sample$splits[[1]]) %>% .[1:5, 1:5]
     V1        id member_id loan_amnt funded_amnt
1 70066 143521620        -1     18000       18000
2 74725 143608122        -1     15000       15000
3 64235 122327224        -1     10000       10000
4 26743  70312012        -1     35000       35000
5 80520  68407752        -1     11000       11000
Each bootstrap sample consists of random draws with replacement from the original dataset, creating controlled variations that effectively reveal model sensitivity to different data compositions.
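As a quick, purely illustrative check of the "with replacement" behaviour (not part of the original workflow), we can count how many distinct loans end up in a single bootstrap sample. Assuming the `id` column uniquely identifies a loan, roughly 63% of the original rows typically appear at least once, and the rows never drawn form the assessment (out-of-bag) set.

# Illustrative check: how many distinct loans appear in the first bootstrap sample?
train <- analysis(boot_sample$splits[[1]])

# Same number of rows as the original data...
nrow(train)

# ...but only a subset of distinct loans (assuming `id` is unique per loan), typically ~63%
length(unique(train$id)) / nrow(sample)

# Loans never drawn make up the assessment (out-of-bag) set
nrow(assessment(boot_sample$splits[[1]]))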
< section id="developing-the-predictive-model-framework" class="level2">Developing the Predictive Model Framework
This tutorial employs logistic regression as the predictive modeling technique. Logistic regression represents the industry standard for credit risk modeling due to its interpretability and regulatory acceptance. The model specification incorporates typical credit variables including loan amount, income, and credit history metrics.
glm_model <- function(df){

  # Fit a simple model with a set specification
  mdl <- glm(bad_flag ~ loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = df)

  # Return fitted values
  return(predict(mdl))
}

# Test the function
# Retrieve a data frame
train <- analysis(boot_sample$splits[[1]])

# Predict
pred <- glm_model(train)

# Check output
range(pred)  # Output is on log odds scale
[1] -30.25449 1.57891
The function returns predictions in log-odds format, which will subsequently be transformed to a more intuitive credit score scale in later steps.
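For a quick sanity check on the probability scale (not required for the rest of the workflow), the log-odds can be pushed through the logistic function; base R's `plogis()` does exactly this.

# Optional sanity check: convert log odds to predicted default probabilities
probs <- plogis(pred)   # equivalent to exp(pred) / (1 + exp(pred))
range(probs)            # values lie strictly between 0 and 1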
< section id="iterative-model-training-and-prediction-collection" class="level2">Iterative Model Training and Prediction Collection
# First apply the glm fitting function to each of the samples
# Note the use of lapply
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train)
  return(pred)
})

# Collate all predictions into a vector
boot_preds <- do.call(c, output)

range(boot_preds)
[1] -130.406408 4.125475
# Get outliers
q_high <- quantile(boot_preds, 0.99)
q_low <- quantile(boot_preds, 0.01)

# Truncate the overall distribution to within the lower 1% and upper 1% quantiles
# Doing this since it creates issues later on when scaling the output
boot_preds[boot_preds > q_high] <- q_high
boot_preds[boot_preds < q_low] <- q_low

range(boot_preds)
[1] -5.0603991 -0.2283903
# Convert to a data frame
boot_preds <- data.frame(pred = boot_preds,
                         id = rep(1:length(boot_sample$splits),
                                  each = nrow(sample)))

head(boot_preds)
       pred id
1 -1.460522  1
2 -1.644074  1
3 -3.123883  1
4 -4.236837  1
5 -3.197778  1
6 -3.123942  1
In this step, we apply the logistic regression model to each bootstrap sample and systematically collect the resulting predictions. Subsequently, we truncate extreme values (beyond the 1st and 99th percentiles) to remove outliers—a procedure analogous to capping techniques commonly employed in production credit models.
< section id="transforming-predictions-to-credit-score-scale" class="level2">Transforming Predictions to Credit Score Scale
This section converts the log-odds predictions to a recognizable credit score format using a points-to-double-the-odds (PDO) style scaling. Using parameters in line with real-world scorecards (PDO = 30, anchor score = 700), the model predictions are transformed into intuitive scores where higher values indicate lower credit risk.
scaling_func <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700){
  beta <- PDO / log(2)
  alpha <- Anchor - PDO * OddsAtAnchor

  # Simple linear scaling of the log odds
  scr <- alpha - beta * vec

  # Round off
  return(round(scr, 0))
}

boot_preds$scores <- scaling_func(boot_preds$pred, 30, 2, 700)

# Chart the distribution of predictions across all the samples
ggplot(boot_preds, aes(x = scores, color = factor(id))) +
  geom_density() +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_grey() +
  labs(title = "Predictions from bootstrapped samples",
       subtitle = "Density function",
       x = "Scaled scores",
       y = "Density")
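Because the scaling is linear in the log odds, a handy sanity check is that halving the odds of default should move the score up by exactly PDO points, irrespective of the anchor. A minimal check using the function above:

# Halving the default odds (a drop of log(2) in log-odds) should add PDO = 30 points
scaling_func(0, PDO = 30, OddsAtAnchor = 2, Anchor = 700) -
  scaling_func(log(2), PDO = 30, OddsAtAnchor = 2, Anchor = 700)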
Quantifying Prediction Uncertainty Across Score Ranges
This section provides a methodology to directly measure prediction reliability variation across different score ranges through the calculation of standard deviation within each score bin. This approach enables precise quantification of uncertainty at different score levels.
# Create bins using quantiles
breaks <- quantile(boot_preds$scores, probs = seq(0, 1, length.out = 20))

boot_preds$bins <- cut(boot_preds$scores,
                       breaks = unique(breaks),
                       include.lowest = TRUE,
                       right = TRUE)

# Chart standard deviation of model predictions across each score bin
boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores)) %>%
  ggplot(aes(x = bins, y = std_dev)) +
  geom_col(color = "black", fill = "#90AACB") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  labs(title = "Variability in model predictions across samples",
       subtitle = "(measured using standard deviation)",
       x = "Score Range",
       y = "Standard Deviation")
As anticipated, the model's predictions are most reliable in the middle of the score distribution, while the lowest and highest score buckets exhibit noticeably more variability.

The visualization reveals the characteristic "U-shaped" pattern of prediction variability, a well-documented phenomenon in credit risk modeling: uncertainty is highest at the score extremes and lowest in the middle of the range (roughly 600-800). This finding provides direct guidance for credit policy development: scores in the middle range can be relied upon most, while decisions at the extremes should incorporate additional caution due to elevated uncertainty.
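The same conclusion can be read off programmatically by ranking the bins on their standard deviation (exact values will differ between runs since no random seed is set):

# Rank score bins from most to least stable
bin_stability <- boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores), .groups = "drop") %>%
  arrange(std_dev)

head(bin_stability, 3)   # most stable bins (typically mid-range scores)
tail(bin_stability, 3)   # least stable bins (typically the extremes)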
< section id="practical-business-applications" class="level3">Practical Business Applications
These findings yield direct business applications (a rough code sketch follows the list):
- High Score Management: For extremely high scores, implement additional verification steps before automated approval
- Low Score Management: For very low scores, consider manual review procedures rather than automatic rejection
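A minimal sketch of how such a policy could be wired up, reusing the bin-level standard deviations from the previous section; the 10-point threshold and the routing labels are arbitrary choices for illustration, not recommendations:

# Illustrative routing rule: flag score bins with unstable predictions for extra scrutiny
review_threshold <- 10   # arbitrary cut-off on the bin-level standard deviation

routing <- boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores), .groups = "drop") %>%
  mutate(route = ifelse(std_dev > review_threshold,
                        "manual review / extra verification",
                        "automated decision"))

head(routing)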
Advanced Methodology: Isolating Training Data Effects
Credit: Richard Warnung
For enhanced analytical control, this section presents an alternative approach where models are trained on bootstrap samples while being evaluated on a consistent validation set. This methodology effectively isolates the impact of training data variation on model predictions.
Vs <- function(boot_split){

  # Train model on the bootstrapped data
  train <- analysis(boot_split)

  # Fit model
  mdl <- glm(bad_flag ~ loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = train)

  # Apply to a common validation set (`validate_set`, assumed to be defined beforehand)
  validate_preds <- predict(mdl, newdata = validate_set)

  # Return predictions
  return(validate_preds)
}
This methodology provides enhanced insight into how variations in training data composition affect model predictions, which proves valuable when evaluating model updates in production environments.
# Create overall training and testing datasets
id <- sample(1:nrow(sample), size = nrow(sample) * 0.8, replace = FALSE)
train_data <- sample[id, ]
test_data <- sample[-id, ]

# Bootstrapped samples are now pulled only from the overall training dataset
boot_sample <- bootstraps(data = train_data, times = 80)

# Using the same function from before but predicting on the same test dataset
glm_model <- function(train, test){
  mdl <- glm(bad_flag ~ loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = train)

  # Return fitted values on the test dataset
  return(predict(mdl, newdata = test))
}

# Apply the glm fitting function to each of the samples
# But predict on the same test dataset
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train, test_data)
  return(pred)
})
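The predictions collected above can then be summarised per loan. As a sketch (not part of the original post), the spread of predictions across the 80 bootstrap-trained models shows which test-set loans are most sensitive to the composition of the training data:

# One column per bootstrap-trained model, one row per test-set loan
test_preds <- do.call(cbind, output)
dim(test_preds)

# Per-loan spread of the log-odds predictions across models
pred_sd <- apply(test_preds, 1, sd)
summary(pred_sd)

# Test-set loans whose predictions move around the most
head(order(pred_sd, decreasing = TRUE), 10)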
Key Takeaways
- Prediction reliability varies significantly across score ranges, with highest uncertainty at distribution extremes
- Bootstrap methodology provides a practical framework for measuring model uncertainty without additional data requirements
- Risk management strategies should be adjusted based on score reliability, with enhanced verification for extreme scores
- The PDO transformation enables intuitive score interpretation while preserving the underlying risk relationships
These techniques can be applied beyond credit scoring to any prediction problem where understanding model uncertainty is critical for decision-making.