# Undersampling Will Change the Base Rates of Your Model’s Predictions

This article was first published on **rstats on Bryan Shalloway's Blog** and kindly contributed to R-bloggers.

**TLDR:** In classification problems, under- and over-sampling^{1} techniques shift the distribution of predicted probabilities toward the minority class. If your problem requires accurate probabilities, you will need to adjust your predictions during post-processing (or at another step) to account for this^{2}.

People new to predictive modeling may rush into using sampling procedures without understanding what these procedures are doing. They then sometimes get confused when their predictions appear far from what the base rates in their data would suggest. I decided to write this vignette to briefly walk through an example of how under- or over-sampling procedures affect the base rates of predictions^{3}.

My examples will appear obvious to individuals with experience in predictive modeling with imbalanced classes. The code is pulled largely from a few emails I sent in early to mid 2018^{4} to individuals new to data science. Like my other posts, you can view the source code on GitHub.

Note that this post is not about *what* resampling procedures are or *why* you might want to use them^{5}; it is meant *only* to demonstrate that such procedures change the base rates of your predictions (unless adjusted for).

*The proportion of TRUE to FALSE cases of the target in binary classification problems largely determines the base rate of the predictions produced by the model. Therefore, if you use sampling techniques that change this proportion (e.g. to go from 5-95 to 50-50 TRUE-FALSE ratios), there is a good chance you will want to rescale / calibrate^{6} your predictions before using them in the wild (if you care about things other than simply ranking your observations^{7}).*
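To make the size of this shift concrete, here is a short derivation. It is a sketch under an assumption the post does not spell out: every positive case is kept, and each negative case is kept independently with probability $\beta$, regardless of its features. The symbols $p$, $p_s$, $s$ and $\beta$ are introduced here purely for illustration ($s = 1$ means the observation made it into the resampled training set).

```latex
\begin{aligned}
p   &= P(y = 1 \mid x), \qquad p_s = P(y = 1 \mid x,\ s = 1) \\[4pt]
p_s &= \frac{P(s = 1 \mid y = 1)\, p}
            {P(s = 1 \mid y = 1)\, p + P(s = 1 \mid y = 0)\,(1 - p)}
     = \frac{p}{p + \beta\,(1 - p)} \\[4pt]
\frac{p_s}{1 - p_s} &= \frac{1}{\beta} \cdot \frac{p}{1 - p}
\quad\Longrightarrow\quad
\operatorname{logit}(p_s) = \operatorname{logit}(p) - \log \beta
\end{aligned}
```

Going from 5-95 to 50-50 corresponds to $\beta = 5/95 \approx 0.053$, so every log-odds value is shifted up by $-\log \beta = \log 19 \approx 2.94$, and an observation whose true probability is 0.05 is scored at 0.50. Because the shift is the same constant for every observation, the ranking of observations is unaffected; only the probabilities change.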

    library(tidyverse)
    library(modelr)
    library(ggplot2)
    library(gridExtra)
    library(purrr)

    theme_set(theme_bw())

# Create Data

Generate classification data with substantial class imbalance^{8}.

    # convert log odds to probability
    convert_lodds <- function(log_odds) exp(log_odds) / (1 + exp(log_odds))

    set.seed(123)

    minority_data <- tibble(rand_lodds = rnorm(1000, log(.03 / (1 - .03)), sd = 1),
                            rand_probs = convert_lodds(rand_lodds)) %>%
      # draw 100 Bernoulli outcomes for each of the 1000 probabilities
      # (note: rbernoulli() is deprecated as of purrr 1.0.0)
      mutate(target = map(.x = rand_probs, ~rbernoulli(100, p = .x))) %>%
      unnest(cols = target) %>%
      mutate(id = row_number())

    # Change the names of some of the variables to make the dataset more
    # intuitive to follow.
    example <- minority_data %>%
      select(id, target, feature = rand_lodds)

*In this dataset we have a class imbalance where our target is composed of ~5% positive (TRUE) cases and ~95% negative (FALSE) cases.*

    example %>%
      count(target) %>%
      mutate(proportion = round(n / sum(n), 3)) %>%
      knitr::kable()

| target | n | proportion |
|---|---|---|
| FALSE | 95409 | 0.954 |
| TRUE | 4591 | 0.046 |

Make an 80-20 train-test split^{9}.

    set.seed(123)
    train <- example %>%
      sample_frac(0.80)

    test <- example %>%
      anti_join(train, by = "id")

# Association of ‘feature’ and ‘target’

We have one important input to our model, named `feature`^{10}.

    train %>%
      ggplot(aes(feature, fill = target)) +
      geom_histogram() +
      labs(title = "Distribution of values of 'feature'",
           subtitle = "Greater values of 'feature' associate with higher likelihood 'target' = TRUE")

## Resample

Make a new sample `train_downsamp` that keeps all positive cases in the training set and an equal number of randomly sampled negative cases, so that the split is no longer 5-95 but becomes 50-50.

    minority_class_size <- sum(train$target)

    set.seed(1234)

    # sample the same number of rows from each class
    train_downsamp <- train %>%
      group_by(target) %>%
      sample_n(minority_class_size) %>%
      ungroup()

See below for what the distribution of `feature` looks like in the down-sampled dataset.

    train_downsamp %>%
      ggplot(aes(feature, fill = target)) +
      geom_histogram() +
      labs(title = "Distribution of values of 'feature' (down-sampled)",
           subtitle = "Greater values of 'feature' associate with higher likelihood 'target' = TRUE")

## Build Models

Train a logistic regression model to predict positive cases for `target` based on `feature`, using the training dataset without any changes in the sample (i.e. with the roughly 5-95 class imbalance).

    mod_5_95 <- glm(target ~ feature, family = binomial("logit"), data = train)

Train a model with the down-sampled (i.e. 50-50) dataset.

    mod_50_50 <- glm(target ~ feature, family = binomial("logit"), data = train_downsamp)

Add the predictions from each of these models^{11} onto our test set (and convert the log-odds predictions to probabilities).

    test_with_preds <- test %>%
      gather_predictions(mod_5_95, mod_50_50) %>%
      mutate(pred_prob = convert_lodds(pred))

Visualize the distributions of predicted probabilities for positive and negative cases under each model.

    test_with_preds %>%
      ggplot(aes(x = pred_prob, fill = target)) +
      geom_histogram() +
      facet_wrap(~model, ncol = 1)

The predicted probabilities from the model built with the down-sampled 50-50 dataset are much higher than those from the model built with the original 5-95 dataset. At any given decision threshold, the model built from the undersampled dataset would produce far more predictions of `TRUE` than the model built from the original training dataset^{12}.

In their current form the predictions from the down-sampled model are not accurate predicted probabilities. For many applications they would need to be rescaled to reflect the *actual* underlying distribution of positive cases associated with the given inputs.
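One way to check this numerically is to compare the mean predicted probability with the observed positive rate on held-out data. The sketch below uses base R and freshly simulated data (the names `dat`, `train_down`, `mod_orig` and `mod_down` are illustrative, not the objects built in this post), so the exact numbers will differ from the plots above.

```r
set.seed(42)

# Simulate imbalanced data: log-odds centred on a 3% rate (an assumption
# chosen to resemble the post's setup)
n <- 20000
feature <- rnorm(n, mean = qlogis(0.03), sd = 1)
target <- rbinom(n, size = 1, prob = plogis(feature))
dat <- data.frame(feature = feature, target = target)

# 80-20 train-test split
train_idx <- sample(n, size = 0.8 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Down-sample negatives so the training classes are 50-50
pos <- train[train$target == 1, ]
neg <- train[train$target == 0, ]
train_down <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

mod_orig <- glm(target ~ feature, family = binomial, data = train)
mod_down <- glm(target ~ feature, family = binomial, data = train_down)

p_orig <- predict(mod_orig, newdata = test, type = "response")
p_down <- predict(mod_down, newdata = test, type = "response")

# Compare mean predicted probability with the observed positive rate,
# and count how many test cases exceed a 0.5 threshold under each model
observed_rate <- mean(test$target)
summary_tbl <- data.frame(
  model = c("original 5-95", "down-sampled 50-50"),
  mean_pred_prob = c(mean(p_orig), mean(p_down)),
  n_above_half = c(sum(p_orig > 0.5), sum(p_down > 0.5))
)
print(observed_rate)
print(summary_tbl)
```

If the down-sampled model's mean predicted probability sits far above the observed rate while the original model's mean sits close to it, the down-sampled predictions should not be read as probabilities without rescaling.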

## Rescale Predictions to Predicted Probabilities

Isotonic regression^{13} or Platt scaling^{14} could be used. Such methods are used to calibrate a model's output so that it aligns with *actual* probabilities. Recalibration techniques are typically used when you have models that may not output well-calibrated probabilities^{15}. However, these methods can also be used to rescale your outputs (as in this case). (In the case of linear models, there are also simpler approaches available^{16}.)
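As a rough illustration of the isotonic option, the sketch below fits a monotone step function that maps a down-sampled model's scores to observed outcomes on the original (imbalanced) training data, using `isoreg()` and `as.stepfun()` from base R's `stats` package. The data are simulated for this sketch, and the names `score_train`, `iso_fun` and `p_iso` are illustrative rather than taken from the post.

```r
set.seed(7)

# Simulated imbalanced data (an assumption, not the post's dataset)
n <- 20000
feature <- rnorm(n, mean = qlogis(0.03), sd = 1)
target <- rbinom(n, size = 1, prob = plogis(feature))
dat <- data.frame(feature = feature, target = target)

train_idx <- sample(n, size = 0.8 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Model trained on a 50-50 down-sampled training set
pos <- train[train$target == 1, ]
neg <- train[train$target == 0, ]
train_down <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
mod_down <- glm(target ~ feature, family = binomial, data = train_down)

# Score the ORIGINAL training data, then fit a monotone map from
# score to observed outcome
score_train <- predict(mod_down, newdata = train, type = "response")
iso_fit <- isoreg(score_train, train$target)
iso_fun <- as.stepfun(iso_fit)

# Apply the fitted step function to test-set scores
score_test <- predict(mod_down, newdata = test, type = "response")
p_iso <- iso_fun(score_test)

print(c(observed = mean(test$target),
        unadjusted = mean(score_test),
        isotonic = mean(p_iso)))
```

The post itself proceeds with the logistic-regression (Platt-style) rescaling.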

    # Fit a logistic regression of the target on the down-sampled model's
    # log-odds predictions, using the original (5-95) training data
    mod_50_50_rescaled_calibrated <- train %>%
      add_predictions(mod_50_50) %>%
      glm(target ~ pred, family = binomial("logit"), data = .)

    test_with_preds_adjusted <- test %>%
      spread_predictions(mod_5_95, mod_50_50) %>%
      rename(pred = mod_50_50) %>%
      spread_predictions(mod_50_50_rescaled_calibrated) %>%
      select(-pred) %>%
      gather(mod_5_95, mod_50_50_rescaled_calibrated, key = "model", value = "pred") %>%
      mutate(pred_prob = convert_lodds(pred))

    test_with_preds_adjusted %>%
      ggplot(aes(x = pred_prob, fill = target)) +
      geom_histogram() +
      facet_wrap(~model, ncol = 1)

Now that the predictions have been calibrated according to their underlying base rate, you can see the distributions of the predictions between the models are essentially the same.
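For a logistic regression specifically, the simpler route of adjusting the intercept can be sketched as follows. It assumes all positives are kept and negatives are kept at random with probability `beta`; under that assumption the down-sampled model's log-odds are shifted up by the constant `-log(beta)`, so adding `log(beta)` back undoes the shift. The example uses base R on simulated data, and `beta`, `lodds_down` and `p_adj` are illustrative names, not objects from this post.

```r
set.seed(2018)

# Simulated imbalanced data (an assumption, not the post's dataset)
n <- 20000
feature <- rnorm(n, mean = qlogis(0.03), sd = 1)
target <- rbinom(n, size = 1, prob = plogis(feature))
dat <- data.frame(feature = feature, target = target)

train_idx <- sample(n, size = 0.8 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Keep all positives and an equal number of randomly chosen negatives
pos <- train[train$target == 1, ]
neg <- train[train$target == 0, ]
train_down <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

mod_down <- glm(target ~ feature, family = binomial, data = train_down)

# Fraction of negatives retained by the down-sampling step
beta <- nrow(pos) / nrow(neg)

# Shift the log-odds by log(beta) (a negative number), then convert
# back to probabilities
lodds_down <- predict(mod_down, newdata = test, type = "link")
p_adj <- plogis(lodds_down + log(beta))

print(c(observed = mean(test$target),
        unadjusted = mean(plogis(lodds_down)),
        adjusted = mean(p_adj)))
```

Adding `log(beta)` to every prediction is equivalent to adding it to the model's intercept, `coef(mod_down)[1]`. Unlike the calibration approach above, this shortcut relies on the model producing log-odds and on the sampling being random within each class.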

# Appendix

## Density Plots

Rebuilding the plots using density distributions by class (rather than histograms based on counts).

    test_with_preds %>%
      ggplot(aes(x = pred_prob, fill = target)) +
      geom_density(alpha = 0.3) +
      facet_wrap(~model, ncol = 1)

    test_with_preds_adjusted %>%
      ggplot(aes(x = pred_prob, fill = target)) +
      geom_density(alpha = 0.3) +
      facet_wrap(~model, ncol = 1)

## Lift Plot

    test_with_preds_adjusted %>%
      mutate(target = factor(target, c("TRUE", "FALSE"))) %>%
      filter(model == "mod_5_95") %>%
      yardstick::lift_curve(target, pred) %>%
      autoplot()

1. In the title I mention only undersampling for brevity's sake. Upsampling and downsampling are synonyms you may hear as well.
2. I expect the audience for this post may be rather limited.
3. I wrote this example after having conversations related to this a few times (with participants not grasping points that would become clear with a demonstration).
4. Before I started using tidymodels.
5. Or any of a myriad of topics related to this.
6. There are often pretty easy built-in ways to accommodate this.
7. There are also other reasons you may not want to rescale your predictions… but in many cases you will want to.
8. Could have been more precise here…
9. No need for validation for this example.
10. The higher incidence of TRUE values in the target at higher scores demonstrates the feature's predictive value.
11. One built with the 5-95 split, the other with a downsampled 50-50 split.
12. Of course, you could just use different decision thresholds for the predictions as well.
13. Non-parametric, step-function based approach.
14. Logistic regression based approach.
15. E.g. when using Support Vector Machines.
16. In this case we are starting with a linear model, hence we could also have just changed the intercept value to get the same effect. Rescaling methods act on the *predictions* rather than the model parameters. Hence these scaling methods have the advantage of being generalizable, as they are agnostic to model type.
