Class Imbalance: Handling Imbalanced Data in R

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers.]

Class imbalance refers to a classification predictive modeling problem in which the number of observations in the training dataset differs substantially across classes.

In other words, the class distribution is not equal (or even close to equal) and is skewed toward one particular class. A model trained on such data will tend to be accurate for the majority class, so if we want to predict the minority class, that model will not be appropriate.

Imbalance may arise from biased sampling methods, measurement errors, or simple unavailability of observations for one of the classes.

Let’s look at one such dataset and see how to handle this problem in R.


Load Library

library(ROSE)
library(randomForest)
library(caret)
library(e1071)
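All four packages are available from CRAN; if any of them are missing, a one-time install takes care of it:

install.packages(c("ROSE", "randomForest", "caret", "e1071"))  # run once if not yet installed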

Getting Data

data <- read.csv("D:/RStudio/ClassImBalance/binary.csv", header = TRUE) 
str(data) 

You can access the dataset from here. It contains a total of 400 observations and 4 variables.

'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 …
$ gre : int 380 660 800 640 520 760 560 400 540 700 …
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 …
$ rank : int 3 3 1 4 4 2 1 2 3 2 …

Let’s convert the admit variable into a factor for further analysis.

data$admit <- as.factor(data$admit) 
summary(data)
 admit        gre             gpa             rank      
  0:273   Min.   :220.0   Min.   :2.260   Min.   :1.000  
  1:127   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
          Median :580.0   Median :3.395   Median :2.000  
          Mean   :587.7   Mean   :3.390   Mean   :2.485  
          3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
          Max.   :800.0   Max.   :4.000   Max.   :4.000  

Based on the summary, 273 observations pertain to students not admitted and 127 to students admitted to the program.


Class Imbalance

barplot(prop.table(table(data$admit)),
        col = rainbow(2),
        ylim = c(0, 0.7),
        main = "Class Distribution")

From the plot it is clearly evident that roughly 70% of the data falls in one class and only about 30% in the other.

That is a big difference in the amount of data available per class. If we build a model on this dataset, accuracy in predicting students who are not admitted will be higher than for students who are admitted.
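For the exact numbers behind the plot, the same prop.table() call used for the bar heights can be printed directly:

prop.table(table(data$admit))
     0      1 
0.6825 0.3175 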

Data Partition

Let’s partition the dataset into a train set and a test set, using set.seed() so the split is reproducible.

set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
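As an aside, caret (already loaded above) provides a stratified alternative that preserves the class ratio in both splits; a minimal sketch, with train2 and test2 as illustrative names:

set.seed(123)
# createDataPartition() samples within each level of admit,
# so both splits keep roughly the same 0/1 ratio
idx    <- createDataPartition(data$admit, p = 0.7, list = FALSE)
train2 <- data[idx, ]
test2  <- data[-idx, ]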

Predictive Model Data

Let’s create a model based on the training dataset, starting with a look at the class counts in the training data.

table(train$admit)
  0   1
188  97

You can see 188 observations in class 0 and 97 in class 1.


prop.table(table(train$admit))

Based on the proportion table, about 66% of the training data is in class 0 and about 34% in class 1.

summary(train)
admit        gre             gpa             rank      
  0:188   Min.   :220.0   Min.   :2.260   Min.   :1.000  
  1: 97   1st Qu.:500.0   1st Qu.:3.120   1st Qu.:2.000  
          Median :580.0   Median :3.400   Median :2.000  
          Mean   :582.4   Mean   :3.383   Mean   :2.502  
          3rd Qu.:660.0   3rd Qu.:3.640   3rd Qu.:3.000  
          Max.   :800.0   Max.   :4.000   Max.   :4.000

Predictive Model

A random forest model is used for prediction purposes.

rftrain <- randomForest(admit~., data = train)

Predictive Model Evaluation with test data

Let’s evaluate the model on the test data. In this case, we pass positive = '1' so that the reported statistics treat class 1 as the positive class.

confusionMatrix(predict(rftrain, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 69 22
         1 16  8
               Accuracy : 0.6696         
                 95% CI : (0.5757, 0.7544)
    No Information Rate : 0.7391         
    P-Value [Acc > NIR] : 0.9619         
                  Kappa : 0.0839         
Mcnemar's Test P-Value : 0.4173         
            Sensitivity : 0.26667        
            Specificity : 0.81176        
         Pos Pred Value : 0.33333        
         Neg Pred Value : 0.75824        
            Prevalence : 0.26087        
         Detection Rate : 0.06957        
   Detection Prevalence : 0.20870        
      Balanced Accuracy : 0.53922        
       'Positive' Class : 1  

Now you can see that the model’s accuracy is about 67%, and the 95% confidence interval puts it between roughly 58% and 75%.


Sensitivity, however, is only about 27%, which clearly shows that one class dominates the other. If you are interested in class 0, this model is quite good; if you want to predict class 1, the model needs to be improved.
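One lightweight alternative, before reaching for resampling, is to keep the same model and lower the probability cutoff for class 1 (the post itself does not do this; the 0.3 cutoff below is purely illustrative):

# predict class-1 probabilities and apply a custom cutoff instead of the default 0.5
prob1 <- predict(rftrain, test, type = "prob")[, "1"]
pred  <- factor(ifelse(prob1 > 0.3, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred, test$admit, positive = '1')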

Over Sampling

over <- ovun.sample(admit~., data = train, method = "over", N = 376)$data
table(over$admit)
  0   1
188 188

This oversamples the minority class by resampling it with replacement, and now both classes are equal in size.
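The N = 376 above is simply twice the majority-class count (2 × 188), so it can be derived from the training data rather than hard-coded; an equivalent call:

# oversample the minority class up to the majority-class count
N_over <- 2 * max(table(train$admit))   # 2 * 188 = 376
over   <- ovun.sample(admit~., data = train, method = "over", N = N_over)$data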

summary(over)
admit        gre             gpa             rank     
 0:188   Min.   :220.0   Min.   :2.260   Min.   :1.000 
 1:188   1st Qu.:520.0   1st Qu.:3.167   1st Qu.:2.000 
         Median :580.0   Median :3.450   Median :2.000 
         Mean   :589.8   Mean   :3.417   Mean   :2.441 
         3rd Qu.:665.0   3rd Qu.:3.650   3rd Qu.:3.000 
         Max.   :800.0   Max.   :4.000   Max.   :4.000

Random Forest Model

rfover <- randomForest(admit~., data = over)
confusionMatrix(predict(rfover, test), test$admit, positive = '1')

Confusion Matrix and Statistics

Rank order analysis in R

          Reference
Prediction  0  1
         0 54 14
         1 31 16
               Accuracy : 0.6087         
                 95% CI : (0.5133, 0.6984)
    No Information Rate : 0.7391         
    P-Value [Acc > NIR] : 0.99922        
                  Kappa : 0.1425         
 Mcnemar's Test P-Value : 0.01707        
            Sensitivity : 0.5333         
            Specificity : 0.6353         
         Pos Pred Value : 0.3404         
         Neg Pred Value : 0.7941         
             Prevalence : 0.2609         
         Detection Rate : 0.1391         
   Detection Prevalence : 0.4087         
      Balanced Accuracy : 0.5843                
       'Positive' Class : 1 

Now the accuracy is about 61% and sensitivity has increased to about 53%. If our interest is in predicting class 1, this model is much better than the previous one.

Under Sampling

under <- ovun.sample(admit~., data=train, method = "under", N = 194)$data
table(under$admit)
 0  1
97 97

Instead of using all observations, under-sampling keeps only as many class-0 observations as there are in class 1.

Since class 1 contains 97 observations, we take only 97 observations from class 0.
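As with oversampling, N = 194 is just twice the minority-class count (2 × 97) and can be derived the same way:

# undersample the majority class down to the minority-class count
N_under <- 2 * min(table(train$admit))  # 2 * 97 = 194
under   <- ovun.sample(admit~., data = train, method = "under", N = N_under)$data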

rfunder <- randomForest(admit~., data=under)
confusionMatrix(predict(rfunder, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 48 11
         1 37 19
               Accuracy : 0.5826        
                 95% CI : (0.487, 0.6739)
    No Information Rate : 0.7391        
    P-Value [Acc > NIR] : 0.999911      
                  Kappa : 0.1547        
 Mcnemar's Test P-Value : 0.000308      
            Sensitivity : 0.6333        
            Specificity : 0.5647        
         Pos Pred Value : 0.3393        
         Neg Pred Value : 0.8136        
             Prevalence : 0.2609        
         Detection Rate : 0.1652        
   Detection Prevalence : 0.4870        
      Balanced Accuracy : 0.5990        
       'Positive' Class : 1

Now you can see that accuracy dropped to about 58%, while sensitivity increased to about 63%.

Under-sampling is generally not recommended here because it discards data points, which reduces the overall accuracy of the model.


Both (Over & Under)

both <- ovun.sample(admit~., data=train, method = "both",
                    p = 0.5,
                    seed = 222,
                    N = 285)$data
table(both$admit)
  0   1
134 151

The classes are not exactly equal this time; in fact, unlike the original data, class 1 (151) is now slightly larger than class 0 (134). Note that N = 285 matches the size of the original training set, so method = "both" rebalances within it by over-sampling the minority class (with replacement) and under-sampling the majority class (without replacement).

rfboth <-randomForest(admit~., data=both)
confusionMatrix(predict(rfboth, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 40  9
         1 45 21
               Accuracy : 0.5304         
                 95% CI : (0.4351, 0.6241)
    No Information Rate : 0.7391         
    P-Value [Acc > NIR] : 1               
                  Kappa : 0.1229         
 Mcnemar's Test P-Value : 1.908e-06      
            Sensitivity : 0.7000         
            Specificity : 0.4706         
         Pos Pred Value : 0.3182         
         Neg Pred Value : 0.8163         
             Prevalence : 0.2609         
         Detection Rate : 0.1826         
   Detection Prevalence : 0.5739         
      Balanced Accuracy : 0.5853         
       'Positive' Class : 1  

Model accuracy is about 53%, and sensitivity has increased to 70%, the highest of the models so far.


ROSE Function

rose <- ROSE(admit~., data = train, N = 500, seed=111)$data
table(rose$admit)
  0   1
234 266
summary(rose)

When we use the ROSE function, closely watch the minimum and maximum values of each variable.

 admit        gre             gpa             rank        
  0:234   Min.   :130.0   Min.   :2.186   Min.   :-0.6079  
  1:266   1st Qu.:502.9   1st Qu.:3.127   1st Qu.: 1.5553  
          Median :587.0   Median :3.401   Median : 2.3204  
          Mean   :589.7   Mean   :3.389   Mean   : 2.3655  
          3rd Qu.:684.2   3rd Qu.:3.673   3rd Qu.: 3.1457  
          Max.   :887.2   Max.   :4.595   Max.   : 4.9871  

Recall that in the summary of the original data the maximum gre value was 800; after applying the ROSE function it has increased to about 887. This happens because ROSE generates synthetic observations around the existing ones rather than simply resampling them.

This won’t necessarily change the modelling conclusions, but we need to be aware of these changes when using this function for further analysis.
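If the out-of-range synthetic values matter for a given application, one option (my own suggestion, not something the original analysis does) is to clip them back to the observed range:

# clip synthetic values back to the range seen in the real data (illustrative only)
rose$gre  <- pmin(pmax(rose$gre, min(data$gre)), max(data$gre))
rose$gpa  <- pmin(pmax(rose$gpa, min(data$gpa)), max(data$gpa))
rose$rank <- pmin(pmax(round(rose$rank), 1), 4)  # rank is an integer score from 1 to 4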


rfrose <- randomForest(admit~., data=rose)
confusionMatrix(predict(rfrose, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 36 12
         1 49 18
               Accuracy : 0.4696         
                 95% CI : (0.3759, 0.5649)
    No Information Rate : 0.7391         
    P-Value [Acc > NIR] : 1              
                  Kappa : 0.0168         
 Mcnemar's Test P-Value : 4.04e-06       
            Sensitivity : 0.6000         
            Specificity : 0.4235         
         Pos Pred Value : 0.2687         
         Neg Pred Value : 0.7500         
             Prevalence : 0.2609         
         Detection Rate : 0.1565         
   Detection Prevalence : 0.5826         
      Balanced Accuracy : 0.5118         
       'Positive' Class : 1

With this model, accuracy has come down and sensitivity has also dropped compared to the oversampling and both-sampling models.

On other datasets, however, ROSE-based models can perform better. To get reproducible results every time, make use of set.seed() (or the seed argument) everywhere.

Conclusions

The closer sensitivity is to 100%, the better; handled this way, class imbalance problems can be dealt with efficiently and smartly.
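Beyond sensitivity at a fixed cutoff, the ROSE package's roc.curve() function offers a threshold-free comparison; a sketch, assuming the fitted models from the sections above are still in memory:

# compare models on AUC using class-1 probabilities on the test set
roc.curve(test$admit, predict(rftrain, test, type = "prob")[, "1"])
roc.curve(test$admit, predict(rfover,  test, type = "prob")[, "1"], add.roc = TRUE, col = 2)
roc.curve(test$admit, predict(rfunder, test, type = "prob")[, "1"], add.roc = TRUE, col = 3)

The model with the largest area under the curve separates the two classes best, regardless of the cutoff chosen.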

