Classification: Get the Balance Right

[This article was first published on R on datawookie, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

For classification problems the positive class (which is what you’re normally trying to predict) is often sparsely represented in the data. Unless you do something to address this imbalance then your classifier is likely to be rather underwhelming.

Achieving a reasonable balance in the proportions of the target classes is seldom emphasised. Perhaps it’s not very sexy. But it can have a massive effect on a model.

You’ve got to get the balance right!

The Data

To illustrate we’ll use some medical appointment no-show data.

First load some indispensable packages.

library(dplyr)
library(stringr)
library(lubridate)
library(readr)
library(caret)

Then grab the data.

appointments <- read_csv("medical-appointments.csv") %>%
  # Normalise names.
  setNames(names(.) %>% str_to_lower() %>% str_replace("[.-]", "_")) %>%
  # Correct spelling.
  dplyr::rename(
    hypertension = hipertension,
    handicap = handcap
  )

Convert a few features into factors.

appointments <- appointments %>%
    mutate_at(vars(gender, neighbourhood:no_show), factor)

Neither the patientid nor the appointmentid fields can have any predictive value, so remove them both. The neighbourhood feature has too many levels to be useful as a categorical feature, so remove it too.

appointments <- appointments %>% select(-patientid, -appointmentid, -neighbourhood)

Create some derived features from the date and time columns.

appointments <- appointments %>% mutate(
    scheduleddow = wday(scheduledday) %>% factor(),
    hour = hour(scheduledday) + (minute(scheduledday) + second(scheduledday) / 60) / 60,
    appointmentdow = wday(appointmentday) %>% factor(),
    #
    # How long before the appointment was it scheduled?
    #
    advance = difftime(scheduledday, appointmentday, units = "hours") %>% as.numeric()
) %>% select(-scheduledday, -appointmentday)

Finally change order of levels in the target variable. This is important because the caret package considers the first level to be the positive class.

appointments <- appointments %>% mutate(
  no_show = relevel(no_show, "Yes")
) %>% select(no_show, everything())

Let’s take a look at the resulting data.

# A tibble: 20 x 13
   no_show gender   age scholarship hypertension diabetes alcoholism handicap sms_received scheduleddow      hour appointmentdow    advance
    <fctr> <fctr> <int>      <fctr>       <fctr>   <fctr>     <fctr>   <fctr>       <fctr>       <fctr>     <dbl>         <fctr>      <dbl>
 1      No      F    62           0            1        0          0        0            0            6 18.635556              6  18.635556
 2      No      M    56           0            0        0          0        0            0            6 16.140833              6  16.140833
 3      No      F    62           0            0        0          0        0            0            6 16.317778              6  16.317778
 4      No      F     8           0            0        0          0        0            0            6 17.491944              6  17.491944
 5      No      F    56           0            1        1          0        0            0            6 16.123056              6  16.123056
 6      No      F    76           0            1        0          0        0            0            4  8.614167              6 -39.385833
 7     Yes      F    23           0            0        0          0        0            0            4 15.086667              6 -32.913333
 8     Yes      F    39           0            0        0          0        0            0            4 15.666111              6 -32.333889
 9      No      F    21           0            0        0          0        0            0            6  8.037778              6   8.037778
10      No      F    19           0            0        0          0        0            0            4 12.806944              6 -35.193056
11      No      F    30           0            0        0          0        0            0            4 14.969722              6 -33.030278
12     Yes      M    29           0            0        0          0        0            1            3  8.736667              6 -63.263333
13      No      F    22           1            0        0          0        0            0            5 11.564167              6 -12.435833
14      No      M    28           0            0        0          0        0            0            5 14.868611              6  -9.131389
15      No      F    54           0            0        0          0        0            0            5 10.106667              6 -13.893333
16      No      F    15           0            0        0          0        0            1            3  8.790833              6 -63.209167
17      No      M    50           0            0        0          0        0            0            5  8.863056              6 -15.136944
18     Yes      F    40           1            0        0          0        0            0            5  9.482500              6 -14.517500
19      No      F    30           1            0        0          0        0            1            3 10.905000              6 -61.095000
20      No      F    46           0            0        0          0        0            0            6 10.720556              6  10.720556

Classifier with Unbalanced Data

At this point it looks like we are ready to build a classifier. We’ll partition the data into training and testing sets in 80:20 proportion.

set.seed(13)
train_index <- createDataPartition(appointments$no_show, p = 0.8)[[1]]

train <- appointments[train_index,]
test  <- appointments[-train_index,]

We will use the caret package to build the classifier. Set up some parameters for the model training process.

control = trainControl(
  method = 'cv',
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

By default the classifier will be optimised to achieve maximal accuracy. This is not an ideal metric, especially if we are primarily interested in predicting no show events (the positive outcome). So instead we will aim to achieve optimal sensitivity.

xgb <- train(no_show ~ .,
             data = train,
             method = "xgbTree",
             metric = "Sens",
             trControl = control
)

Let’s see how that performs on the test data.

confusionMatrix(predict(xgb, test), test$no_show)
Confusion Matrix and Statistics

          Reference
Prediction   Yes    No
       Yes   194   215
       No   4269 17426
                                          
               Accuracy : 0.7971          
                 95% CI : (0.7918, 0.8024)
    No Information Rate : 0.7981          
    P-Value [Acc > NIR] : 0.6412          
                                          
                  Kappa : 0.0473          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.043469        
            Specificity : 0.987812        
         Pos Pred Value : 0.474328        
         Neg Pred Value : 0.803227        
             Prevalence : 0.201909        
         Detection Rate : 0.008777        
   Detection Prevalence : 0.018503        
      Balanced Accuracy : 0.515641        
                                          
       'Positive' Class : Yes

Overall that looks pretty decent, with an accuracy of close to 80%. But if we look a little closer then we see that the classifier is good at predicting the negative class (with a specificity of nearly 99%), but not very effective at identifying the positive class (with a sensitivity of just over 4%).

Balancing the Data

Let’s investigate why the model performs so poorly.

table(train$no_show)
  Yes    No 
17856 70567

The data is strongly biased in favour of the negative class: only 20% of the records represent the situation we’re actually trying to predict.

In the confusion matrix above we see that the vast majority of predictions are being assigned the negative class. Since the negative class is far more common in the data, this prediction is correct more often than not. The model naturally ends up being better at predicting the class that is most prevalent in the data.

There are a number of ways that one can address the class imbalance. We’re going to use SMOTE() from the DMwR package. SMOTE (“Synthetic Minority Over-sampling Technique”) is an algorithm which generates balanced data by under-sampling the majority class and over-sampling the minority class. Under-sampling is easy. Over-sampling is more subtle. SMOTE uses nearest neighbour information to synthetically generate new (but representative!) data for the minority class.

library(DMwR)

train_smote <- SMOTE(no_show ~ ., train, perc.over = 100, perc.under = 200)

Unfortunately SMOTE() does not currently work with a tibble, so you need to convert the training data into a data.frame before running the above.

table(train_smote$no_show)
  Yes    No 
35712 35712

Overall there are fewer records, but the two outcome classes are now perfectly in balance.

Classifier with Balanced Data

We’ll train the same classifier but using the balanced data.

xgb_smote <- train(no_show ~ .,
                   data = train_smote,
                   method = "xgbTree",
                   metric = "Sens",
                   trControl = control
)

Let’s see how well that performs on the test data.

confusionMatrix(predict(xgb_smote, test), test$no_show)
Confusion Matrix and Statistics

          Reference
Prediction   Yes    No
       Yes  3128  6815
       No   1335 10826
                                          
               Accuracy : 0.6313          
                 95% CI : (0.6249, 0.6377)
    No Information Rate : 0.7981          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2157          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.7009          
            Specificity : 0.6137          
         Pos Pred Value : 0.3146          
         Neg Pred Value : 0.8902          
             Prevalence : 0.2019          
         Detection Rate : 0.1415          
   Detection Prevalence : 0.4498          
      Balanced Accuracy : 0.6573          
                                          
       'Positive' Class : Yes

Significantly better! The overall accuracy of the classifier has dropped, but the sensitivity has increased dramatically. The specificity has also declined, but that was inevitable: there’s always going to be a compromise between sensitivity and specificity!

Moral: unbalanced data can yield sub-optimal models, but simple rebalancing can improve model performance appreciably.

To leave a comment for the author, please follow the link and comment on their blog: R on datawookie.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)