
In many cases, there is a need to split a userbase into 2 or more buckets. For example:

• UCG: Many companies that run promotional campaigns create a Universal Control Group (UCG), a random sample of the userbase that receives no offer or message, in order to quantify and evaluate the performance of the campaigns.
• Bucketize: For testing purposes, it is common to split the userbase into buckets so that they can be compared over the long term.
• Samples for Machine Learning: A userbase can become too large for a machine learning model to run on, so it is common to work with random samples.

## The requirements

For the cases that we mentioned above, the splitting algorithm must satisfy the following two requirements:

1. There should be a mapping function that assigns an existing user to the same group every time we encounter them. For instance, if the UserID 152514 was initially assigned to the UCG, it will always be assigned to the UCG.
2. The same mapping function should assign every new user to a group.

We can fulfil the requirements above by applying the modulo operation.
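Both requirements follow from the fact that the remainder depends only on the ID itself. A minimal sketch, assuming plain numeric UserIDs and 100 buckets (both chosen here for readability; the helper name `assign_group` and its parameters are hypothetical):

```r
# Hypothetical helper: maps a numeric user ID to a group via modulo.
# buckets and ucg_share are illustrative assumptions, not values from the article.
assign_group <- function(user_id, buckets = 100, ucg_share = 20) {
  ifelse(user_id %% buckets < ucg_share, "UCG", "Control")
}

assign_group(152514)             # 152514 %% 100 = 14, below 20 -> "UCG"
assign_group(c(152514, 152599))  # "UCG" "Control"
```

Calling the function again with the same ID always returns the same group, and an ID that has never been seen before is assigned without any lookup table.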

## Example of Splitting the Userbase with Modulo

Let’s see how we can split the userbase into two buckets. Let’s say that we want 20% of the users to be in the UCG and the remaining 80% to be Control. UserIDs are usually hashed for GDPR compliance. Below we generate some random data:

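library(tidyverse)
library(digest)
library(Rmpfr)
set.seed(5)

# One row per simulated user
df <- tibble(Row_Number = seq(1, 100000))

# For each row: MD5-hash a random 10-letter string and draw a random timestamp (Unix seconds)
df <- df %>%
  rowwise() %>%
  mutate(
    Hash_Name = digest(paste(sample(LETTERS, 10, replace = TRUE), collapse = ""),
                       algo = "md5", serialize = FALSE),
    Event_Date = lubridate::as_datetime(runif(1, 1546290000, 1577739600))
  )

head(df)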

Output:

# A tibble: 6 x 3
# Rowwise:
Row_Number Hash_Name                        Event_Date

2          2 9a449c58ac6baed3b3648f0f3b5f8084 2019-03-27 21:38:34
3          3 e28e89ab554739a982c862cccf024464 2019-12-02 15:43:48
4          4 45b9aea890d3b98419cae72bb497e94b 2019-10-18 18:58:23
5          5 c4ce7434621d08f5195fbd1bfc1c20c2 2019-08-09 06:14:45
6          6 0b8a304be1015cacfcf31dd40ef6a381 2019-04-10 08:07:28

To spread the users evenly across the remainders, it is better to choose a prime number for the modulo operation: a prime such as 997 shares no factor with the base of the hash, so every digit influences the remainder rather than only the trailing ones. For this example we will use 997. We also need to convert the MD5 hash to a number, which we can do with the Rmpfr library in R. To sum up:

• We will convert the MD5 hash to a number
• We will divide that number by 997 and store the remainder

# Parse each hash as a base-16 number and keep its remainder modulo 997
df$Remainder <- as.numeric(mpfr(df$Hash_Name, base = 16) %% 997)
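The remainder of a 128-bit number does not strictly require arbitrary-precision arithmetic. As a cross-check, it can be computed one hex digit at a time in base R (Horner's scheme), so that every intermediate value stays small. The function name `hex_mod` is hypothetical, and this is a sketch rather than a replacement for the `Rmpfr` call:

```r
# Hypothetical helper: remainder of a hexadecimal string modulo `modulus`,
# processed digit by digit so no big-number library is needed.
hex_mod <- function(hex, modulus = 997) {
  # Split the string into single characters and convert each to 0..15
  digits <- strtoi(strsplit(tolower(hex), "")[[1]], base = 16L)
  remainder <- 0
  for (d in digits) {
    # Shift the running value by one hex digit, add the new digit, reduce
    remainder <- (remainder * 16 + d) %% modulus
  }
  remainder
}

hex_mod("3e5")  # 0x3e5 is 997, so the remainder is 0
hex_mod("3e6")  # 1

# Vectorised over a column of hashes (two hashes from the sample output above)
hashes <- c("9a449c58ac6baed3b3648f0f3b5f8084", "e28e89ab554739a982c862cccf024464")
vapply(hashes, hex_mod, numeric(1), USE.NAMES = FALSE)
```

The loop is slower than the vectorised `mpfr` call on 100,000 rows, but it makes explicit that every hex digit of the hash contributes to the remainder.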

### Is it Random?

This approach generates pseudo-random numbers. Let’s see if the remainders (from 0 to 996) are approximately uniformly distributed.

hist(df$Remainder)

We can apply a Chi-Square test too.

chisq.test(table(df$Remainder))

Output:

Chi-squared test for given probabilities
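The remainder then has to be turned into a group label. A minimal sketch, assuming remainders below a cutoff of 199 (about 20% of the 997 possible values) go to the UCG and the rest to Control. The helper name `split_by_remainder`, the cutoff, and the simulated remainders are assumptions for illustration, not the original code:

```r
# Hypothetical helper: label each remainder as "UCG" or "Control".
# The cutoff of 199 is an assumption (199 / 997 is roughly 20%).
split_by_remainder <- function(remainder, cutoff = 199) {
  ifelse(remainder < cutoff, "UCG", "Control")
}

# Simulated remainders stand in for df$Remainder so the sketch runs on its own.
set.seed(1)
remainder <- sample(0:996, 100000, replace = TRUE)

group <- split_by_remainder(remainder)
prop.table(table(group))  # share of users per group
```

With the data generated earlier, the equivalent call would be `df$Group <- split_by_remainder(df$Remainder)`; the exact shares depend on the cutoff that is chosen.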

After assigning each user to UCG or Control based on the remainder, the share of users in each group is:

Control     UCG
0.80002 0.19998

## Conclusion

We can use the modulo operation to split a userbase in a reproducible and efficient way.