# Optimising approval rates by creating segmented models

*This article was originally published on **R'tichoke** and kindly contributed to R-bloggers.*


Here’s a short post on why one should consider developing a segmented model when building credit risk scorecards to help optimize approval rates. While segmentation offers a variety of benefits, this post aims to offer a different perspective.

I will be reusing code from a previous post; feel free to navigate to that post for additional information. Note that the sample dataset is available for download (here).

## Packages

```r
# install.packages("pacman") ## Install if needed
library(pacman)

# p_load automatically installs packages if needed
p_load(dplyr, magrittr, scales, pROC)
```

## Sample data

```r
smple <- read.csv("../../../download/credit_sample.csv")

# Define target
codes <- c("Charged Off",
           "Does not meet the credit policy. Status:Charged Off")

smple %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))

# Some basic data cleaning
smple[is.na(smple)] <- -1

smple %<>%
  # Remove cases where home ownership and payment plan are not reported
  filter(!home_ownership %in% c("", "NONE"),
         pymnt_plan != "") %>%
  # Convert these two variables into factors
  mutate(home_ownership = factor(home_ownership),
         pymnt_plan = factor(pymnt_plan))
```

## Segments

Typically, credit risk scorecards would have segments like `known goods`, `known bads`, `ever charged off`, etc. For simplicity, I will use the available `FICO grade` variable in the dataset to segment `known goods` and `known bads`. For the purposes of this post, let’s focus on the `known goods` customers (mostly because policy filters would remove the known bad customers).

```r
smple %>%
  group_by(grade) %>%
  summarise(total = n(),
            bads = sum(bad_flag == 1)) %>%
  mutate(event_rate = percent(bads / total))
## # A tibble: 7 × 4
##   grade total  bads event_rate
##   <chr> <int> <int> <chr>
## 1 A      1882    68 3.6%
## 2 B      3056   239 7.8%
## 3 C      2805   373 13.3%
## 4 D      1435   257 17.9%
## 5 E       590   144 24.4%
## 6 F       182    65 35.7%
## 7 G        49    16 32.7%
```

Since the event rates for the lowest grades are significantly higher, let’s keep borrowers who have a `FICO grade` of `"E", "F" & "G"` in the `known bads` bucket and the rest in the `known goods` bucket.

```r
smple %<>% mutate(segment = ifelse(grade %in% c("E", "F", "G"), "KB", "KG"))

smple %>% group_by(segment) %>% tally()
## # A tibble: 2 × 2
##   segment     n
##   <chr>   <int>
## 1 KB        821
## 2 KG       9178
```

## Model training

Let’s train two separate models like so:

- One on the entire population
- And one only on the `known goods`
```r
# Create a formula object to be used across models
form <- as.formula(
  "bad_flag ~ mths_since_last_delinq + total_pymnt + acc_now_delinq +
   inq_last_6mths + delinq_amnt + mths_since_last_record +
   mths_since_recent_revol_delinq + mths_since_last_major_derog +
   mths_since_recent_inq + mths_since_recent_bc + num_accts_ever_120_pd"
)
```

### Model on the entire population

```r
set.seed(1234)

# Train-test split
idx <- sample(1:nrow(smple), size = 0.7 * nrow(smple), replace = F)
train <- smple[idx, ]
test <- smple[-idx, ]

# Using a GLM model for simplicity
mdl_pop <- glm(
  formula = form,
  family = "binomial",
  data = train
)

# Get performance on the test sample
auc(test$bad_flag, predict(mdl_pop, test))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6602
```

Now that we have a model, let’s assume an `expected default propensity` target of `2%`. That is, I can only approve those customers who I believe have an expected default propensity of `2%` or less. Given this, what would my expected `approval rate` be? Note that I need to maximise my approval rate (to better utilise my applicant funnel).

```r
# Output scaling function
scaling_func <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700){
  beta <- PDO / log(2)
  alpha <- Anchor - PDO * OddsAtAnchor

  # Simple linear scaling of the log odds
  scr <- alpha - beta * vec

  # Round off
  return(round(scr, 0))
}

# Find the number of customers that can be approved such that
# the cumulative bad rate is <= target
smple %>%
  filter(segment == "KG") %>%                   ## Filter on known goods only
  mutate(pred = predict(mdl_pop, newdata = .),  ## Generate predictions
         score = scaling_func(pred),            ## Scale output
         total = n()) %>%
  arrange(pred) %>%
  mutate(rn = row_number(),
         c_bad_rate = cumsum(bad_flag) / rn) %>%
  filter(c_bad_rate <= 0.02) %>%
  summarise(approve_count = n(),
            total = mean(total),
            score_cutoff = min(score),
            bad_rate = sum(bad_flag) / n()) %>%
  mutate(approval_rate = approve_count / total)
##   approve_count total score_cutoff   bad_rate approval_rate
## 1           818  9178          689 0.01833741    0.08912617
```
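As a quick sanity check on the scaling (a sketch, not part of the original analysis): since `beta = PDO / log(2)`, adding `log(2)` to the log-odds, i.e. doubling the odds of default, should lower the score by exactly `PDO` points. That is the usual "points to double the odds" interpretation.

```r
# Same linear scaling as scaling_func above, without the rounding,
# so the points-to-double-odds property holds exactly
scaling_raw <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700) {
  beta  <- PDO / log(2)
  alpha <- Anchor - PDO * OddsAtAnchor
  alpha - beta * vec
}

log_odds <- -3  # an arbitrary model output (log-odds of bad)

# Doubling the odds costs exactly PDO = 30 points
s1 <- scaling_raw(log_odds)
s2 <- scaling_raw(log_odds + log(2))
s1 - s2
```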

Based on the `unsegmented` model, if we choose a score cutoff of `>=689`, we should expect a bad rate of `~2%` at an approval rate of `~9%`.

### Model on the known good population

```r
set.seed(1234)

# Filter on KG
sample_kg <- smple %>% filter(segment == "KG")

idx <- sample(1:nrow(sample_kg), size = 0.7 * nrow(sample_kg), replace = F)
train <- sample_kg[idx, ]
test <- sample_kg[-idx, ]

mdl_kg <- glm(
  formula = form,
  family = "binomial",
  data = train
)

# Get performance on the test sample
auc(test$bad_flag, predict(mdl_kg, test))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6496
```

## Simulations

To better evaluate the through-the-door population, let’s run some simulations. Let’s assume that only `known good` customers will be assessed through the risk scorecard.

```r
# Function to generate simulations
generate_sim <- function(mdl, nSim = 500, sample_size = 5000, cutoff = 650){

  # Vectors to store output
  bad_rates <- c()
  approval_rates <- c()

  for(i in 1:nSim){
    out <- smple %>%
      # Filter on KG segment
      filter(segment == "KG") %>%
      # Randomly sample with replacement
      sample_n(sample_size, replace = T) %>%
      # Generate model output
      mutate(pred = predict(mdl, newdata = .),
             score = scaling_func(pred)) %>%
      filter(score >= cutoff) %>%
      summarise(bad_rate = sum(bad_flag) / n(),
                app_rate = n() / sample_size)

    # Store output
    bad_rates[i] <- out$bad_rate
    approval_rates[i] <- out$app_rate
  }

  # Plot output
  par(mfrow = c(1, 2))
  plot(density(bad_rates), main = "Event Rate")
  abline(v = mean(bad_rates))
  plot(density(approval_rates), main = "Approval Rate")
  abline(v = mean(approval_rates))
  par(mfrow = c(1, 1))
}

# Simulate using the model built on the entire population
generate_sim(mdl_pop, cutoff = 689)
```

Based on the above simulations, when using the model trained on the entire population with a threshold of `>=689`, the average simulated event rates and approval rates are close to the expected rates from before. But what if we use the model developed only on the `known good` population?

```r
smple %>%
  filter(segment == "KG") %>%                   ## Filter on known goods only
  mutate(pred = predict(mdl_kg, newdata = .),   ## Generate predictions
         score = scaling_func(pred),            ## Scale output
         total = n()) %>%
  arrange(pred) %>%
  mutate(rn = row_number(),
         c_bad_rate = cumsum(bad_flag) / rn) %>%
  filter(c_bad_rate <= 0.02) %>%
  summarise(approve_count = n(),
            total = mean(total),
            score_cutoff = min(score),
            bad_rate = sum(bad_flag) / n()) %>%
  mutate(approval_rate = approve_count / total)
##   approve_count total score_cutoff  bad_rate approval_rate
## 1           904  9178          696 0.0199115     0.0984964

# Simulate using the model built only on the known good population
generate_sim(mdl_kg, cutoff = 696)
```

When using the known-good model, we can achieve a slightly higher approval rate while keeping the event rate more or less the same. While the difference is not large, a ~1% higher approval rate on a large funnel could be significant (psst, monthly targets anyone?).
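To put that ~1% in context, here is a back-of-the-envelope sketch. The approval rates are rounded from the two cutoff analyses above; the funnel size is a purely hypothetical assumption, not a figure from the dataset.

```r
# Approval rates from the two cutoff analyses above (rounded)
rate_pop <- 0.0891  # unsegmented model, cutoff >= 689
rate_kg  <- 0.0985  # known-good model, cutoff >= 696

applicants <- 100000  # assumed monthly applicant funnel (hypothetical)

# Additional approvals from the segmented model at a similar bad rate
extra_approvals <- round(applicants * (rate_kg - rate_pop))
extra_approvals
```

On a funnel of that size, the segmented model approves roughly 900 extra customers per month without breaching the 2% target.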

While most modellers/data scientists understand this intuitively, explaining the need for segmented models to a business owner is easier when the impact can be linked to business outcomes 😉.

*Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!*
