Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When we are dealing with unbalanced classes in Machine Learning projects there are many approaches that you can follow. Just to main some of them:

• Undersampling: We try to reduce the observations from the majority class so that the final dataset to be balanced
• Oversampling: We try to generate more observations from the minority class usually by replicating the samples from the minority class so that the final dataset to be balanced.
• Synthetic Data Generation (SMOTE): We generate artificial data using bootstrapping and k-Nearest Neighbors algorithms.

## Generate the Unbalanced Data

The scenario is that we are dealing with 3 email campaigns that have different CTRs and we want to apply undersampling to normalize the CTR by the campaign so that to avoid any skewness and biased when we will build the Machine Learning model. The hypothetical dataset is the following:

• Campaign A: 5000 Observations with 10% CTR (approx)
• Campaign B: 10000 Observations with 20% CTR (approx)
• Campaign C: 1000 Observations with 30% CTR (approx)

Let’s try to generate this random sample in R.

library(tidyverse)

set.seed(5)
df = rbind(data.frame(Campaign = "A", Click = rbinom(n=5000, size=1, prob=0.1)),
data.frame(Campaign = "B", Click = rbinom(n=10000, size=1, prob=0.2)),
data.frame(Campaign = "C", Click = rbinom(n=1000, size=1, prob=0.3)))



Output:

  Campaign Click
1        A     0
2        A     0
3        A     1
4        A     0
5        A     0
6        A     0

Let’s get the CTR by Campaign

df%>%group_by(Campaign)%>%
summarise(CTR=mean(Click))



Output:

# A tibble: 3 x 2
Campaign   CTR
<chr>    <dbl>
1 A        0.106
2 B        0.198
3 C        0.302

As we can see the A campaign has 10.6% CTR, the B 19.8% and the C 30.2%. Let’s add also a random column called attribute which takes the values “X”, “Y”, “Z” since we will deal with datasets with more than two columns.

df$Attribute<-sample(c("X","Y", "Z"), size = dim(df), replace = TRUE, prob = c(0.2, 0.6, 0.2)) head(df) Campaign Click Attribute 1 A 0 Y 2 A 0 Y 3 A 1 Z 4 A 0 Y 5 A 0 Z 6 A 0 Z Now, our goal is to apply undersampling so that each campaign will have around 50% CTR ## Undersampling by Group We will use the map2 function from the purrr package which belongs to the tidyverse family: campaign_summary <- df %>% group_by(Campaign)%>% summarize(rr=sum(Click)/n(), pos= sum(Click)) df_neg_sample<- df %>% filter(Click==0) %>% group_by(Campaign) %>% nest() %>% #group all data by campaign name ungroup() %>% inner_join(campaign_summary, by="Campaign") sampled_df_neg<-df_neg_sample %>% mutate(samp = map2(data, pos, sample_n, replace = FALSE)) %>%# sample based on the campaing summary select(-data) %>% #remove original nested data unnest(samp) %>% select(c(-"rr",-"pos")) df_pos <- df %>% filter(Click==1) #positive samples new_df <- rbind(df_pos,sampled_df_neg) #balanced set positive negative within each campaign head(new_df) Campaign Click Attribute 1 A 1 Z 2 A 1 Y 3 A 1 Y 4 A 1 Y 5 A 1 Y 6 A 1 Y tail(new_df) Campaign Click Attribute 5613 C 0 Y 5614 C 0 Y 5615 C 0 Z 5616 C 0 Y 5617 C 0 Y 5618 C 0 X Let’s check if the new_df is balanced by campaign. We will group by campaign and we will show the CTR and the number of observations: new_df%>%group_by(Campaign)%>%summarise(CTR=mean(Click), Observations=n()) Campaign CTR Observations <chr> <dbl> <int> 1 A 0.5 1064 2 B 0.5 3950 3 C 0.5 604 As we can see, we sacrificed a sample but we have a balanced number of classes for every campaign (50-50). ## Undersampling by Group using the ROSE Package We can use also the ROSE package. Below we will apply a for loop by campaign so that to get a balanced sample using the undersampling technique. library(ROSE) balanced_sample = NULL for (c in unique(df$Campaign)) {
tmp_df = df%>%filter(Campaign==c)
tmp<-ovun.sample(Click ~ ., data = tmp_df, method = "under", p = 0.5, seed = 5)\$data
balanced_sample<-rbind(balanced_sample, tmp)
}


Let’s check if the balanced_sample is actually balanced.

balanced_sample%>%group_by(Campaign)%>%summarise(CTR=mean(Click), Observations=n())



Output:

  Campaign   CTR Observations
<chr>    <dbl>        <int>
1 A        0.504         1056
2 B        0.496         3978
3 C        0.510          592

Awesome. We showed two different approaches of how you can apply undersampling by group.