Assign n Email Addresses to x Cells, Intrinsically (Part II)

March 27, 2014
(This article was first published on You Know, and kindly contributed to R-bloggers)

Part I showed the concept and general technique of a method of assigning n email addresses to x cells pseudo-randomly, without the need for maintaining a log of each assignment.

The earlier post considered the basic case in which each cell is assigned approximately the same number of email addresses. In practice, cell sizes often vary. The technique below works well when the total number of email addresses needed is less than the product of the cell sizes' greatest common divisor and the average email address length. For example, with cell sizes of 500, 500, and 1,000, the total is 2,000 and the greatest common divisor is 500, so 2,000 < 500 * 25 (taking roughly 25 characters as the average address length).
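As a rough illustration of that condition, the sketch below divides the total by the greatest common divisor to get the number of remainder buckets the method needs, then compares it with a typical address length. The helpers `gcd2` and `gcd_all` are hypothetical base-R stand-ins for a multi-value GCD function, and `typical.len` is simply the "25ish" figure from the example.

```r
# Hypothetical base-R helpers (not from the original post): Euclid's algorithm
# for two numbers, folded over a vector with Reduce().
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)
gcd_all <- function(x) Reduce(gcd2, x)

cell.sizes <- c(500, 500, 1000)
n.buckets <- sum(cell.sizes) / gcd_all(cell.sizes)  # total divided by the GCD
typical.len <- 25                                   # assumed typical address length

# The condition in the text, restated: fewer buckets than typical characters
n.buckets < typical.len
```

Dividing both sides of 2,000 < 500 * 25 by the greatest common divisor gives the same comparison in terms of bucket count.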

Assign n Email Addresses to x Cells, Intrinsically; Part 2 (Variable Cell Sizes)

Sample Use Case:
Marketing requests that an email address list be divided randomly into a given number of cells so that each cell would receive a different version of copy.
Below is a technique that takes n email addresses and pseudo-randomly assigns each to one of x cells. The advantage of this method is that the user does not need to maintain a log of each email address's assigned cell since the cell assignment can be reproduced at any time.
This post extends the technique from Part 1 to accommodate cells of varying sizes.
First, load in a randomly generated list of email addresses.
set.seed(4444)
library(numbers)

# Build a data frame of n fictitious addresses: a local part of 3-25 letters,
# a domain of 3-15 letters, and a top-level domain of 2-3 letters.
fict.email <- function(n = 5) {
  fict.emails <- data.frame(email = NA)
  for (i in 1:n) {
    fict.emails[i, "email"] <- paste0(
      paste(sample(letters, sample(3:25, 1, TRUE), TRUE), collapse = ""), "@",
      paste(sample(letters, sample(3:15, 1, TRUE), TRUE), collapse = ""), ".",
      paste(sample(letters, sample(2:3, 1, TRUE), TRUE), collapse = ""))
  }
  fict.emails
}
emails <- sample(fict.email(10000))
Next, assign the cell sizes.
cell.sizes <- c(500, 500, 1500, 2000)
Count the characters in each email address. The count never changes for a given address, which is what makes the assignment reproducible. Then find the greatest common divisor of the cell sizes, and use the modulo operator to reduce each character count to a remainder.
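cells <- length(cell.sizes)
cell.gcd <- mGCD(cell.sizes)
# nchar() on the column returns a plain vector of character counts
em.len <- nchar(emails$email)
# Add 1 so the values run from 1 to sum(cell.sizes)/cell.gcd, matching the
# 1-based ranges built below (a remainder of 0 would otherwise match no cell)
em.mod <- em.len%%(sum(cell.sizes)/cell.gcd) + 1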
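To make the remainder step concrete, here is a small sketch on three made-up addresses. The addresses and the `ex.*` names are illustrative only; `n.buckets` is 9 because the cell sizes 500, 500, 1500, and 2000 sum to 4,500 and share a greatest common divisor of 500.

```r
# Three made-up addresses (illustrative only)
ex.emails <- c("ann@example.com", "bob.smith@example.org", "c@x.io")
n.buckets <- 9                   # sum(cell.sizes) / cell.gcd for 500, 500, 1500, 2000

ex.len <- nchar(ex.emails)       # character count of each address
ex.mod <- ex.len %% n.buckets    # raw remainder, always between 0 and n.buckets - 1

data.frame(email = ex.emails, length = ex.len, remainder = ex.mod)
```

Two addresses of the same length always share a remainder, so the spread of address lengths determines how evenly the buckets fill.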
Map the mod values to cell numbers. Each cell owns a consecutive range of mod values whose width is proportional to its size.
ranges <- data.frame(start = 0, end = 0)
for (j in 1:cells) {
  ranges[j, "start"] <- (sum(cell.sizes[1:j]) - cell.sizes[j])/cell.gcd + 1
  ranges[j, "end"] <- sum(cell.sizes[1:j])/cell.gcd
}

for (k in 1:cells) {
  emails$cell[em.mod >= ranges$start[k] & em.mod <= ranges$end[k]] <- k
}
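The same range-to-cell mapping can be sketched as a lookup vector, which may be easier to inspect than start and end columns. This is an alternative formulation rather than the post's own code: `units`, `lookup`, and `ex.mod` are hypothetical names, and the example assumes zero-based remainders that are shifted by one before indexing.

```r
cell.sizes <- c(500, 500, 1500, 2000)
cell.gcd <- 500                            # greatest common divisor of cell.sizes
units <- cell.sizes / cell.gcd             # buckets owned by each cell

# lookup[b] is the cell that owns bucket b, for b = 1..sum(units)
lookup <- rep(seq_along(cell.sizes), times = units)

# Assumed zero-based remainders, shifted by one to index the lookup vector
ex.mod <- c(0, 1, 2, 4, 5, 8)
lookup[ex.mod + 1]
```

Because every remainder from 0 to 8 indexes an element of `lookup`, no address is left without a cell.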
Split the data frame by cell, then trim each list to its required cell size. These lists are the final output.
email.lists <- split(emails, emails$cell)
for (l in 1:cells) {
  email.lists[[l]] <- email.lists[[l]][[1]][1:cell.sizes[l]]
}
Now each email address has been assigned to a specific cell.
Each email address will always belong to the same cell because its number of characters will not change.
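A quick way to see this reproducibility is to wrap the steps in a function and call it twice with the list in a different order. The sketch below is a simplified stand-in, not the post's code: `assign_cell` is a hypothetical helper that uses a lookup vector in place of the ranges table, and the three addresses are made up.

```r
# Hypothetical helper: maps each address to a cell from its length alone
assign_cell <- function(email, cell.sizes, cell.gcd) {
  units <- cell.sizes / cell.gcd                       # buckets per cell
  lookup <- rep(seq_along(cell.sizes), times = units)  # bucket -> cell
  lookup[nchar(email) %% sum(units) + 1]               # shift remainder to 1-based
}

ex.emails <- c("ann@example.com", "bob.smith@example.org", "c@x.io")
first <- assign_cell(ex.emails, c(500, 500, 1500, 2000), 500)
again <- assign_cell(rev(ex.emails), c(500, 500, 1500, 2000), 500)

# Same cells on a second run, regardless of the order of the list
identical(first, rev(again))
```

Since the result depends only on the address itself and the cell sizes, no log of past assignments is needed to recover it.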
