# How to Randomly Split a Userbase Using Modulo

This article was first published on **R – Predictive Hacks** and kindly contributed to R-bloggers.


In many cases, there is a need to split a userbase into 2 or more buckets. For example:

- **UCG**: Many companies that run promotional campaigns create a Universal Control Group (UCG), a random sample of the userbase that does not receive any offer or message, in order to quantify and evaluate the performance of the campaigns.
- **Bucketize**: For testing purposes, it is common to split the userbase into buckets so that they can be compared over the long term.
- **Samples for Machine Learning**: A userbase can become too large for a machine learning model to run on, so it is common to draw random samples.

## The requirements

For the cases that we mentioned above, the splitting algorithm must satisfy the following two requirements:

- There should be a mapping function so that every **existing user** is always assigned to the same group. For instance, if the **UserID** `152514` was initially assigned to the UCG, then it will always remain in the UCG.
- There should be a mapping function so that every **new user** is assigned to a group.

We can fulfil the requirements above by applying the modulo operation.
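As a minimal sketch of why modulo meets both requirements, consider plain numeric user IDs. The helper `assign_group()` and its `modulus` and `ucg_cutoff` parameters are hypothetical names chosen for illustration, and the modulus of 100 is an assumption made only to keep the arithmetic easy to follow:

```r
# Hypothetical helper: map a numeric user ID to a group.
# The result depends only on the ID, so the same ID always lands in the same group.
assign_group <- function(user_id, modulus = 100, ucg_cutoff = 20) {
  ifelse(user_id %% modulus < ucg_cutoff, "UCG", "Control")
}

ids <- c(152514, 152515, 152599)

first_run  <- assign_group(ids)
second_run <- assign_group(ids)

# Existing users: repeated calls give identical assignments
identical(first_run, second_run)

# New users: any previously unseen ID is assigned without extra bookkeeping
new_user_group <- assign_group(999999)
```

Because no lookup table is stored, the assignment can be recomputed anywhere the user ID is available.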

## Example of Splitting the Userbase with Modulo

Let’s see how we can split the userbase into two buckets. Say we want **20%** of the users to be in the **UCG** and the remaining **80%** to be in **Control**. User IDs are usually hashed for GDPR compliance, so below we generate some random hashed IDs:

```r
library(tidyverse)
library(digest)
library(Rmpfr)

set.seed(5)

# One row per simulated user
df <- tibble(Row_Number = seq(1, 100000))

# For each row: MD5-hash a random 10-letter string and draw a random 2019 timestamp
df <- df %>%
  rowwise() %>%
  mutate(
    Hash_Name = digest(paste(sample(LETTERS, 10, replace = TRUE), collapse = ""),
                       algo = "md5", serialize = FALSE),
    Event_Date = lubridate::as_datetime(runif(1, 1546290000, 1577739600))
  )

head(df)
```

**Output:**

```
# A tibble: 6 x 3
# Rowwise:
  Row_Number Hash_Name                        Event_Date
       <int> <chr>                            <dttm>
1          1 275db34231203750f10adb24c76b9619 2019-06-10 06:15:33
2          2 9a449c58ac6baed3b3648f0f3b5f8084 2019-03-27 21:38:34
3          3 e28e89ab554739a982c862cccf024464 2019-12-02 15:43:48
4          4 45b9aea890d3b98419cae72bb497e94b 2019-10-18 18:58:23
5          5 c4ce7434621d08f5195fbd1bfc1c20c2 2019-08-09 06:14:45
6          6 0b8a304be1015cacfcf31dd40ef6a381 2019-04-10 08:07:28
```

To generate well-spread remainders, it is better to choose a prime number as the modulus. For this example we will use **997**, which is prime. We also need to convert the MD5 hashes to numbers. A 32-character MD5 hash encodes a 128-bit integer, which is too large to be represented exactly as an ordinary R double, so we use the `Rmpfr` library in R. To sum up:

- We will convert the MD5 to numeric
- We will divide the above number by **997** and keep the remainder

```r
# Parse each hex hash as an arbitrary-precision number, then take it modulo 997
df$Remainder <- as.numeric(mpfr(df$Hash_Name, base=16) %% 997)
```
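To show what this step computes, here is a rough base-R sketch that takes the remainder of a hexadecimal string one digit at a time (Horner's method), so the intermediate values always stay small. This is not the method used in this post, which relies on `Rmpfr`. The function name `hex_mod()` is hypothetical and exists only for illustration:

```r
# Hypothetical helper: remainder of a hexadecimal string modulo m,
# processed digit by digit so that no big-number library is required.
hex_mod <- function(hex, m = 997) {
  # Split into single characters and convert each hex digit to 0..15
  digits <- strtoi(strsplit(tolower(hex), "")[[1]], base = 16L)
  r <- 0
  for (d in digits) {
    # Appending a hex digit multiplies the number by 16 and adds the digit;
    # reducing modulo m at every step keeps r below m
    r <- (r * 16 + d) %% m
  }
  r
}

hex_mod("ff")   # 0xff  = 255, so the remainder is 255
hex_mod("3e5")  # 0x3e5 = 997, so the remainder is 0
hex_mod("3e6")  # 0x3e6 = 998, so the remainder is 1

# The same function accepts a full 32-character MD5 hash
r_hash <- hex_mod("275db34231203750f10adb24c76b9619")
```

The function handles one hash at a time; applying it to a whole column would need something like `vapply()`, which is likely slower than the vectorised `mpfr` call.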

**Is it Random?**

This approach generates pseudo-random numbers. Let’s check whether the remainders (from 0 to 996) are uniformly distributed.

```r
hist(df$Remainder)
```

We can apply a Chi-Square test too.

```r
chisq.test(table(df$Remainder))
```

**Output:**

```
	Chi-squared test for given probabilities

data:  table(df$Remainder)
X-squared = 995.2, df = 996, p-value = 0.5012
```

The **p-value is 0.5012**, so we cannot reject the hypothesis that the remainders are uniformly distributed. In other words, the generated numbers can be considered random for our purposes.
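To make the test less of a black box, the sketch below recomputes the chi-square statistic by hand. It runs on simulated uniform remainders, which are only a stand-in for `df$Remainder`, so its numbers will differ from the output above. The seed and variable names are assumptions made for illustration:

```r
set.seed(1)
m <- 997

# Stand-in for df$Remainder: 100,000 draws from 0..996
remainders <- sample(0:(m - 1), 100000, replace = TRUE)

# Observed count per remainder; fixing the factor levels keeps zero counts, if any
observed <- table(factor(remainders, levels = 0:(m - 1)))

# Under uniformity every remainder is expected equally often
expected <- length(remainders) / m

# Chi-square statistic and its p-value with m - 1 degrees of freedom
x_squared <- sum((observed - expected)^2 / expected)
p_value   <- pchisq(x_squared, df = m - 1, lower.tail = FALSE)

# The built-in test should agree with the manual computation
test <- chisq.test(observed)
all.equal(unname(test$statistic), x_squared)
all.equal(test$p.value, p_value)
```

A large p-value here means only that the test found no evidence against uniformity. It does not prove that the remainders are random.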

Now, we can split our userbase into **UCG** and **Control** as follows:

**If the remainder is less than 200 then UCG else Control**

```r
df$Group <- ifelse(df$Remainder<200, 'UCG', 'Control')
df
```

**Check the Proportions**

Finally, we want to make sure that the proportion is 80% vs 20% for Control and UCG respectively.

```r
prop.table(table(df$Group))
```

**Output:**

```
Control     UCG 
0.80002 0.19998 
```
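The same remainder can also be cut into more than two buckets. The sketch below is one possible way to do it with `cut()`. The bucket names and break points are hypothetical, and it runs over all possible remainders rather than over `df`. Note that with a modulus of 997 a cutoff of 200 targets 200/997, roughly 20.06%, rather than exactly 20%:

```r
# All possible remainders for a modulus of 997
remainders <- 0:996

# Hypothetical three-way split: [0, 200) -> UCG, [200, 600) -> Test_A, [600, 997) -> Test_B
bucket <- cut(remainders,
              breaks = c(0, 200, 600, 997),
              labels = c("UCG", "Test_A", "Test_B"),
              right = FALSE)

# Number of remainder values that fall into each bucket
bucket_counts <- table(bucket)
bucket_counts

# Target share of each bucket implied by the break points
round(prop.table(bucket_counts), 4)
```

Applied to real data, the first argument would be the remainder column rather than `0:996`.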

## Conclusion

We can use the modulo function to split a userbase in a reproducible and efficient way.
