
Class imbalance refers to a classification predictive modeling problem in which the number of observations in the training dataset is not balanced across the classes.

In other words, the class distribution is not equal or close to equal; it is skewed toward one particular class. A model trained on such data tends to be accurate for the over-represented class, so if we want to predict the other class, the model won't be appropriate.
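A minimal sketch of why accuracy alone can mislead on skewed data. It uses the class counts of the admissions dataset analysed in this post (273 versus 127) and a hypothetical "model" that always predicts the majority class; the variable names are illustrative only.

```r
# Class counts taken from the admissions data used in this post
counts <- c("0" = 273, "1" = 127)

# A hypothetical "model" that always predicts the majority class
majority <- names(which.max(counts))

# Accuracy: share of observations that belong to the majority class
accuracy <- max(counts) / sum(counts)

# Sensitivity for class 1: share of class 1 cases predicted as class 1
sensitivity <- if (majority == "1") 1 else 0

accuracy     # 0.6825
sensitivity  # 0
```

Such a model is right about 68% of the time while never identifying a single admitted student.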

Imbalance may be due to biased sampling methods, measurement errors, or the unavailability of data for some classes.

Let’s look at an example dataset and how to handle its imbalance in R.


library(ROSE)
library(randomForest)
library(caret)
library(e1071)

### Getting Data

str(data)

The dataset can be accessed from here. It contains 400 observations and 4 variables.

'data.frame':	400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

Let’s convert the admit variable into a factor variable for further analysis.

data$admit <- as.factor(data$admit)
summary(data)
 admit        gre             gpa             rank
 0:273   Min.   :220.0   Min.   :2.260   Min.   :1.000
 1:127   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000
         Median :580.0   Median :3.395   Median :2.000
         Mean   :587.7   Mean   :3.390   Mean   :2.485
         3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000
         Max.   :800.0   Max.   :4.000   Max.   :4.000

According to the summary, 273 observations are students who were not admitted and 127 observations are students who were admitted to the program.


barplot(prop.table(table(data$admit)), col = rainbow(2), ylim = c(0, 0.7), main = "Class Distribution")

The plot makes it evident that about 68% of the data falls in one class and the remaining 32% in the other, so there is a big difference in the amount of data available for each class. If we build a model on this dataset, accuracy in predicting students who were not admitted will be higher than for students who were admitted.

### Data Partition

Let’s partition the dataset into a training set and a test set, calling set.seed first so that the split is reproducible.

set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]

### Predictive Model Data

Let’s create a model based on the training dataset. First, look at the class counts in the training dataset.

table(train$admit)
0  1
188 97

You can see that there are 188 observations in class 0 and 97 observations in class 1.


prop.table(table(train$admit))

Based on the proportion table, about 66% of the training data is in one class and 34% in the other.

summary(train)
 admit        gre             gpa             rank
 0:188   Min.   :220.0   Min.   :2.260   Min.   :1.000
 1: 97   1st Qu.:500.0   1st Qu.:3.120   1st Qu.:2.000
         Median :580.0   Median :3.400   Median :2.000
         Mean   :582.4   Mean   :3.383   Mean   :2.502
         3rd Qu.:660.0   3rd Qu.:3.640   3rd Qu.:3.000
         Max.   :800.0   Max.   :4.000   Max.   :4.000

### Predictive Model

A Random Forest model is used for prediction.

rftrain <- randomForest(admit~., data = train)

### Predictive Model Evaluation with test data

Let’s evaluate the model on the test data. Here we specify positive = '1' so that sensitivity is reported for class 1 (admitted).

confusionMatrix(predict(rftrain, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 69 22
         1 16  8
Accuracy : 0.6696
95% CI : (0.5757, 0.7544)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.9619
Kappa : 0.0839
Mcnemar's Test P-Value : 0.4173
Sensitivity : 0.26667
Specificity : 0.81176
Pos Pred Value : 0.33333
Neg Pred Value : 0.75824
Prevalence : 0.26087
Detection Rate : 0.06957
Detection Prevalence : 0.20870
Balanced Accuracy : 0.53922
'Positive' Class : 1

The model’s accuracy is around 67%, and based on the 95% confidence interval, accuracy lies between about 58% and 75%.


Sensitivity is only about 27%, which clearly shows that one class dominates the other. If you are interested in class 0, this model is quite a good one; if you want to predict class 1, the model needs to be improved.
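As a check on how these figures arise, the sketch below recomputes them by hand from the four cell counts of the confusion matrix printed above. The variable names are illustrative; the counts are copied from the output.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(69, 16, 22, 8), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["1", "1"]  # predicted 1, actually 1
fn <- cm["0", "1"]  # predicted 0, actually 1
tn <- cm["0", "0"]  # predicted 0, actually 0
fp <- cm["1", "0"]  # predicted 1, actually 0

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced), 4)
```

Only 8 of the 30 admitted students in the test set are identified, which is where the low sensitivity comes from.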

#### Over Sampling

over <- ovun.sample(admit~., data = train, method = "over", N = 376)$data
table(over$admit)
0   1
188 188

Over-sampling resamples the minority class with replacement, and now both classes have the same number of observations.
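To make the idea concrete, here is a rough base-R sketch of random over-sampling on a small hypothetical data frame. It is not ROSE's implementation; `toy`, `over_toy` and the other names are made up for illustration.

```r
set.seed(1)

# Toy imbalanced data (hypothetical): 8 rows of class 0, 3 rows of class 1
toy <- data.frame(y = factor(c(rep(0, 8), rep(1, 3))),
                  x = 1:11)

minority <- which(toy$y == "1")
majority <- which(toy$y == "0")

# Draw extra minority rows with replacement until the classes are equal
extra <- minority[sample.int(length(minority),
                             length(majority) - length(minority),
                             replace = TRUE)]
over_toy <- toy[c(majority, minority, extra), ]

table(over_toy$y)
```

The balanced data contains no new information: the extra class 1 rows are exact copies of existing ones.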

summary(over)
 admit        gre             gpa             rank
 0:188   Min.   :220.0   Min.   :2.260   Min.   :1.000
 1:188   1st Qu.:520.0   1st Qu.:3.167   1st Qu.:2.000
         Median :580.0   Median :3.450   Median :2.000
         Mean   :589.8   Mean   :3.417   Mean   :2.441
         3rd Qu.:665.0   3rd Qu.:3.650   3rd Qu.:3.000
         Max.   :800.0   Max.   :4.000   Max.   :4.000

#### Random Forest Model

rfover <- randomForest(admit~., data = over)
confusionMatrix(predict(rfover, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 54 14
         1 31 16
Accuracy : 0.6087
95% CI : (0.5133, 0.6984)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.99922
Kappa : 0.1425
Mcnemar's Test P-Value : 0.01707
Sensitivity : 0.5333
Specificity : 0.6353
Pos Pred Value : 0.3404
Neg Pred Value : 0.7941
Prevalence : 0.2609
Detection Rate : 0.1391
Detection Prevalence : 0.4087
Balanced Accuracy : 0.5843
'Positive' Class : 1

Now accuracy is 61% and sensitivity has increased to 53%. If our interest is in predicting class 1, this model is much better than the previous one.

#### Under Sampling

under <- ovun.sample(admit~., data=train, method = "under", N = 194)$data
table(under$admit)
0  1
97 97

Instead of using all observations, under-sampling keeps only as many class 0 observations as there are class 1 observations. Class 1 contains 97 observations, so we take only 97 observations from class 0.

rfunder <- randomForest(admit~., data=under)
confusionMatrix(predict(rfunder, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 48 11
         1 37 19
Accuracy : 0.5826
95% CI : (0.487, 0.6739)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 0.999911
Kappa : 0.1547
Mcnemar's Test P-Value : 0.000308
Sensitivity : 0.6333
Specificity : 0.5647
Pos Pred Value : 0.3393
Neg Pred Value : 0.8136
Prevalence : 0.2609
Detection Rate : 0.1652
Detection Prevalence : 0.4870
Balanced Accuracy : 0.5990
'Positive' Class : 1

Now you can see that accuracy has dropped to 58% and sensitivity has increased to 63%.

Under-sampling is not recommended here because it leaves fewer data points for the model to learn from and reduces the overall accuracy.


#### Both (Over & Under)

both <- ovun.sample(admit~., data=train, method = "both",
                    p = 0.5,
                    seed = 222,
                    N = 285)$data
table(both$admit)
0   1
134 151

The two classes are not exactly equal, and unlike in the original training data, class 1 now has more observations than class 0.
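The unequal counts are consistent with how the ROSE documentation describes method = "both": the minority class size is a single binomial draw with size N and probability p, and the majority class fills the remainder. The base-R sketch below illustrates that idea only. It is an assumption-laden approximation, it will not reproduce the exact counts above, and `idx0`, `idx1`, `keep0` and `keep1` are hypothetical stand-ins for row indices.

```r
set.seed(222)  # illustrative seed; does not reproduce ovun.sample's draw

N <- 285  # total sample size requested
p <- 0.5  # probability of drawing from the minority class

# Minority class size: one binomial draw; majority class gets the rest
n_minority <- rbinom(1, size = N, prob = p)
n_majority <- N - n_minority

# Stand-ins for the row indices of each class in the training data
idx0 <- seq_len(188)        # class 0 (majority)
idx1 <- 188 + seq_len(97)   # class 1 (minority)

# Majority is under-sampled without replacement,
# minority is over-sampled with replacement
keep0 <- idx0[sample.int(length(idx0), n_majority, replace = FALSE)]
keep1 <- idx1[sample.int(length(idx1), n_minority, replace = TRUE)]

c(class0 = length(keep0), class1 = length(keep1))
```

Because the split is random, the two classes only come out roughly equal when p = 0.5.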

rfboth <- randomForest(admit~., data=both)
confusionMatrix(predict(rfboth, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 40  9
         1 45 21
Accuracy : 0.5304
95% CI : (0.4351, 0.6241)
No Information Rate : 0.7391
P-Value [Acc > NIR] : 1
Kappa : 0.1229
Mcnemar's Test P-Value : 1.908e-06
Sensitivity : 0.7000
Specificity : 0.4706
Pos Pred Value : 0.3182
Neg Pred Value : 0.8163
Prevalence : 0.2609
Detection Rate : 0.1826
Detection Prevalence : 0.5739
Balanced Accuracy : 0.5853
'Positive' Class : 1

Model accuracy is 53%, and sensitivity has increased to 70% compared with the previous model.

#### ROSE Function

rose <- ROSE(admit~., data = train, N = 500, seed=111)$data
table(rose$admit)
0   1
234 266

When using the ROSE function, watch the minimum and maximum values of each variable closely.

summary(rose)
 admit        gre             gpa             rank
 0:234   Min.   :130.0   Min.   :2.186   Min.   :-0.6079
 1:266   1st Qu.:502.9   1st Qu.:3.127   1st Qu.: 1.5553
         Median :587.0   Median :3.401   Median : 2.3204
         Mean   :589.7   Mean   :3.389   Mean   : 2.3655
         3rd Qu.:684.2   3rd Qu.:3.673   3rd Qu.: 3.1457
         Max.   :887.2   Max.   :4.595   Max.   : 4.9871

Recall that in the summary of the original data the maximum gre value was 800; with the ROSE function it has increased to 887. These synthetic values don’t change how the model is built, but we need to be aware of them, and we can make use of this function for further analysis.

rfrose <- randomForest(admit~., data=rose)
confusionMatrix(predict(rfrose, test), test$admit, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 36 12
         1 49 18
Accuracy : 0.4696
95% CI : (0.3759, 0.5649)
No Information Rate : 0.7391

P-Value [Acc > NIR] : 1
Kappa : 0.0168
Mcnemar's Test P-Value : 4.04e-06
Sensitivity : 0.6000
Specificity : 0.4235
Pos Pred Value : 0.2687
Neg Pred Value : 0.7500
Prevalence : 0.2609
Detection Rate : 0.1565
Detection Prevalence : 0.5826
Balanced Accuracy : 0.5118
'Positive' Class : 1

With this model, accuracy has come down to 47% and sensitivity has also dropped, to 60%.
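The out-of-range values in the ROSE summary (a gre maximum above 800, for instance) follow from the way ROSE creates data: it generates synthetic observations with a smoothed bootstrap rather than copying existing rows. A minimal sketch of that idea, using hypothetical scores and an assumed bandwidth `h` that is not necessarily the one ROSE uses:

```r
set.seed(111)

# Hypothetical scores capped at 800, standing in for the gre variable
gre <- c(520, 580, 640, 700, 760, 800, 800)

# Assumed bandwidth: a simple rule of thumb chosen for illustration only
h <- 0.5 * sd(gre)

# Smoothed bootstrap: resample observed values, then add Gaussian noise
synthetic <- sample(gre, 1000, replace = TRUE) + rnorm(1000, mean = 0, sd = h)

range(gre)        # observed values stay within 520..800
range(synthetic)  # synthetic values can fall outside that range
```

The added noise is what pushes some synthetic values past the limits of the observed data.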

On other datasets, however, these models may perform better. To get the same results every time, set a seed before every step that involves randomness.
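A small illustration of the effect of seeding, using the same sampling call as the data partition step; `first` and `second` are illustrative names.

```r
# Same seed, same random draw
set.seed(123)
first <- sample(2, 10, replace = TRUE, prob = c(0.7, 0.3))

set.seed(123)
second <- sample(2, 10, replace = TRUE, prob = c(0.7, 0.3))

identical(first, second)  # TRUE
```

The same applies to model fitting: calling set.seed directly before randomForest makes the fitted forest repeatable between runs.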

## Conclusions

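The closer sensitivity is to 100%, the better the model identifies the positive class, although in this example every gain in sensitivity came with lower overall accuracy. The resampling methods shown here are ways to handle class imbalance problems efficiently and smartly.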
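To compare the five models side by side, the sketch below collects the accuracy, sensitivity and specificity figures reported in the confusionMatrix outputs of this post into one data frame. The column and model names are illustrative.

```r
# Figures copied from the confusionMatrix outputs in this post
results <- data.frame(
  model       = c("original", "over", "under", "both", "rose"),
  accuracy    = c(0.6696, 0.6087, 0.5826, 0.5304, 0.4696),
  sensitivity = c(0.2667, 0.5333, 0.6333, 0.7000, 0.6000),
  specificity = c(0.8118, 0.6353, 0.5647, 0.4706, 0.4235)
)

# Balanced accuracy is the mean of sensitivity and specificity
results$balanced_accuracy <- (results$sensitivity + results$specificity) / 2

# Order by sensitivity, highest first
results[order(-results$sensitivity), ]
```

On these figures, the "both" model has the highest sensitivity and the under-sampled model has the highest balanced accuracy, while the model trained on the original data keeps the highest plain accuracy.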


The post Class Imbalance-Handling Imbalanced Data in R appeared first on finnstats.