Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

Last week, I joined a team to work on a collaborative project. The team was already established for a few months, with several scientists working together on the project. For simplicity, they used to sign documents, mention colleagues in emails, etc. with their initials (the first letter of their first name followed by the first letter of their last name).

A couple of days after joining the project, when I needed to sign my first document with my initials, we realized that another person in the team had the exact same initials than me.

This was not really an issue, as we decided that I would write my initials backward, that is, “SA” instead of “AS”, and the other person would keep signing with “AS” as usual.

It could have stopped here. However, the idea to write a post about this rather trivial anecdote came to me when the team leader claimed, in the middle of a meeting: “That’s very unfortunate that you two have the same initials! What are the chances of this happening to us?!”.

We spent a couple of minutes trying to estimate this probability, which in the end were mostly based on our intuitions rather than on a formal calculation. This piqued my curiosity.

This problem could be answered using probability theory, but given that the project requires the use of simulations, I thought that it would be nice to answer this question via simulations in R.

Furthermore, I thought that it would be a nice way to illustrate methods not often presented in my posts: for loops, replications and writing functions in R.

# How likely is it?

Before computing this probability, there are two things to note:

1. Although the team leader was curious to know the probability that two persons have the same initials, we are actually more interested in the probability that at least two persons have the same initials (as the problem also occurs if more than two persons within a team have the same initials).
2. The team consists of 8 people.

In this post, we will show how to compute this probability:

• in our context, that is, for a team of 8 persons, and
• for completeness, for teams of all sizes from 2 to 100 persons.

## For our team

We start by creating a vector of size 8, corresponding to the initials of a team of 8 persons randomly sampled among all 26 letters of the Latin alphabet:

# number of persons
n_persons <- 8

# create vector of initials
initials <- replicate(
n = n_persons, # number of replications
paste0(sample(LETTERS, size = 1), sample(LETTERS, size = 1)) # sample letters
)

# display initials
initials
## [1] "UJ" "MN" "XD" "CY" "BB" "ZB" "CU" "HQ"
# are there duplicates?
any(duplicated(initials))
## [1] FALSE

As we can see, everyone has different initials in this simulated team of 8 persons, but this will not always be the case.

To estimate, via simulations, how likely is that at least two persons have the same initials among the team, we need to replicate this vector of 8 sampled initials a large number of times (say 1,000 replications):1

# number of replications
reps <- 1000

# create and save replications
dat <- replicate(
n = reps, # number of replications
replicate(n_persons, paste0(sample(LETTERS, size = 1), sample(LETTERS, size = 1)))
)

# dimensions
dim(dat)
## [1]    8 1000
# display first 4 simulated teams
dat[, 1:4]
##      [,1] [,2] [,3] [,4]
## [1,] "VA" "BU" "LU" "PT"
## [2,] "JG" "SM" "HM" "OL"
## [3,] "BY" "NA" "VJ" "OT"
## [4,] "RT" "CM" "WT" "YT"
## [5,] "PS" "CT" "NB" "QJ"
## [6,] "MG" "KR" "SV" "US"
## [7,] "PL" "SN" "PN" "XW"
## [8,] "NJ" "BR" "DD" "ZC"

The result is a matrix of 8 rows and 1000 columns, where:

• each rows corresponds to the sampled initials of a person, and
• each column corresponds to one simulated team of 8 people.

• the row names as M1 to M8, corresponding to persons 1 to 8, and
• the column names as T1 to T1000, corresponding to teams 1 to 1000.
# rename rows
rownames(dat) <- paste0("M", 1:n_persons)

# rename columns
colnames(dat) <- paste0("T", 1:reps)

# display first 4 simulated teams
dat[, 1:4]
##    T1   T2   T3   T4
## M1 "VA" "BU" "LU" "PT"
## M2 "JG" "SM" "HM" "OL"
## M3 "BY" "NA" "VJ" "OT"
## M4 "RT" "CM" "WT" "YT"
## M5 "PS" "CT" "NB" "QJ"
## M6 "MG" "KR" "SV" "US"
## M7 "PL" "SN" "PN" "XW"
## M8 "NJ" "BR" "DD" "ZC"

We now need to compute, among the 1000 teams simulated, how many of them have at least two persons with the same initials:

# transform to data frame
dat <- as.data.frame(dat)

# save which teams have duplicates
duplicates <- rep(NA, reps) # create empty vector
for (i in 1:reps) { # for loop over i from 1 to 1,000
duplicates[i] <- any(duplicated(dat[, i])) # save results TRUE/FALSE in duplicates vector
}

# count how many teams have duplicates
sum(duplicates)
## [1] 41

Here, for each column of our data frame dat (from the first to the 1000th column), we ask whether there are duplicates or not. This is done repeatedly over all columns thanks to a for loop. For each column, the result is TRUE if there are duplicates, otherwise it is FALSE. The result of each iteration is saved in the duplicates vector. As TRUE = 1 and FALSE = 0 in R, we can then count how many columns (and thus teams) have duplicates by summing the number of TRUE in the duplicates vector.

As we can see from the output above, among the 1000 simulated teams, 41 of them have duplicates, that is, 41 of them have at least two persons with the same initials.

Therefore, based on the simulations, we can expect the probability that at least two persons with the same initials in a team of 8 persons to be close to 4.1%.

This is a good starting point. Notice, however, that I wrote close to 4.1% because this probability will vary each time it is computed via simulations.

For instance, if we repeat the exact same process a second time:

# create and save replications
dat <- replicate(
n = reps, # number of replications
replicate(n_persons, paste0(sample(LETTERS, size = 1), sample(LETTERS, size = 1)))
)

# transform to data frame
dat <- as.data.frame(dat)

# save which teams have duplicates
duplicates <- rep(NA, reps) # create empty vector
for (i in 1:reps) { # for loop over i from 1 to 1,000
duplicates[i] <- any(duplicated(dat[, i])) # save results in the duplicates vector (as TRUE/FALSE)
}

# count how many teams have duplicates
sum(duplicates)
## [1] 44

We now find a probability of 4.4%. This is not an error, but it is due to randomness when sampling initials.

Luckily, we can make the computation of this probability more robust thanks to replications. Intuitively, it works as follows. We repeat the same computation multiple times, giving us a range of possible probabilities. This allows us to assess the uncertainty of our result, and understand how the probability might vary due to taking different random samples of initials.

So the goal is to compute our probability multiple times (say 100 times), and see its distribution.

To repeat the same computation multiple times, it is best to write a function in order to avoid copy pasting the same code over and over. So we first write a function (called initials) which computes the probability that at least two persons share the same initials among a team of $$n$$ people:

initials <- function(n_persons, reps = 1000) {
# simulate data
dat <- as.data.frame(replicate(
reps,
replicate(n_persons, paste0(sample(LETTERS, size = 1), sample(LETTERS, size = 1)))
))

# save which teams have duplicates
duplicates <- rep(NA, reps)
for (i in 1:reps) {
duplicates[i] <- any(duplicated(dat[, i]))
}

# proportion of teams with duplicates
return(mean(duplicates))
}

A function in R requires to include:

• the parameters inside (), and
• the computation inside {}.

We can then use our function to compute the probability that at least two persons share the same initials among a team of 8 people. And we combine it with the replicate() function to compute this probability 100 times.

# compute and save probabilities
probs <- replicate(100, initials(n_persons = 8))

# display probabilities
probs
##   [1] 0.032 0.037 0.040 0.043 0.033 0.042 0.039 0.047 0.045 0.038 0.052 0.042
##  [13] 0.042 0.040 0.023 0.044 0.041 0.039 0.036 0.048 0.041 0.037 0.027 0.030
##  [25] 0.052 0.038 0.043 0.035 0.038 0.045 0.047 0.044 0.030 0.036 0.036 0.048
##  [37] 0.038 0.045 0.044 0.034 0.031 0.043 0.045 0.034 0.049 0.047 0.051 0.036
##  [49] 0.051 0.040 0.043 0.044 0.038 0.049 0.043 0.050 0.035 0.043 0.048 0.038
##  [61] 0.041 0.044 0.039 0.045 0.033 0.057 0.036 0.043 0.041 0.041 0.041 0.041
##  [73] 0.038 0.044 0.031 0.034 0.049 0.041 0.040 0.034 0.032 0.036 0.049 0.047
##  [85] 0.048 0.038 0.038 0.037 0.036 0.037 0.043 0.040 0.026 0.049 0.046 0.044
##  [97] 0.048 0.038 0.026 0.029

Finally, we visualize the distribution of these 100 probabilities thanks to a histogram and a boxplot (with the {ggplot2} package):

# visualize distribution of the computed probabilities
# build and save plots
library(ggplot2)

p1 <- ggplot(mapping = aes(x = probs)) +
geom_histogram(color = "black", fill = "steelblue", bins = 8) +
labs(
x = "Probabilities",
y = "Frequencies"
) +
scale_x_continuous(labels = scales::percent) # format x-axis in %

p2 <- ggplot(mapping = aes(x = probs)) +
geom_boxplot(color = "black", fill = "steelblue") +
labs(x = "Probabilities") +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
scale_x_continuous(labels = scales::percent) # format x-axis in %

# combine plots
library(patchwork)

p1 + p2

These two plots show that the probability that at least two persons share the same initials among a team of 8 people is most likely between 3.5% and 4.5%.

For the record, during the meeting at the root of all this thinking, most of us thought that it was much less likely. Indeed, I believe we were tempted to compute the probability that someone who joins the team has “AS” as initials. This is indeed much less likely, as the probability is only $$\frac{1}{26} \times \frac{1}{26} \simeq 0.15\%$$.

However, this does not take into account the fact:

• that the newcomer can have the same initials as any other person, and
• that it is not only the newcomer who can have the same initials as another person (2 people already working in the team when the newcomer arrives could have the same initials as well).

If you are puzzled by this finding, I recommend reading about the birthday’s paradox. The birthday’s paradox states that the probability of two people sharing the same birthday becomes surprisingly high with a relatively small group of individuals. In practice, in a group of just 23 people, there is a greater than 50% chance that at least two individuals share the same birthday, illustrating our counterintuitive intuitions about the likelihood of such coincidences. This phenomenon arises due to the multitude of possible birthday pairs within the group, similar to the multitude of possible pairs if initials within a team.

## For teams of different sizes

We are now interested in computing this probability not just for a team of 8 persons, but for teams of different sizes. We can do this with the help of our function defined earlier.

For the illustration, let’s compute the probability that at least two persons have the same initials, for teams of 2 and up to 100 persons:

# set lower and upper bounds of number of persons
min_persons <- 2
max_persons <- 100

# create empty vector of probabilities
probs <- rep(NA, length(min_persons:max_persons))

# compute and save probabilities for teams of size 2 to 100
for (i in min_persons:max_persons) {
probs[i] <- initials(n_persons = i)
}

# display probabilities
probs
##   [1]    NA 0.001 0.005 0.013 0.012 0.019 0.036 0.040 0.047 0.057 0.074 0.083
##  [13] 0.103 0.128 0.158 0.166 0.178 0.215 0.232 0.260 0.275 0.296 0.300 0.329
##  [25] 0.357 0.392 0.405 0.405 0.439 0.478 0.495 0.536 0.535 0.563 0.578 0.599
##  [37] 0.653 0.656 0.686 0.693 0.715 0.711 0.767 0.760 0.786 0.784 0.814 0.817
##  [49] 0.825 0.826 0.842 0.845 0.867 0.893 0.901 0.920 0.919 0.911 0.917 0.942
##  [61] 0.950 0.951 0.946 0.947 0.969 0.959 0.965 0.964 0.977 0.984 0.977 0.977
##  [73] 0.985 0.978 0.986 0.981 0.989 0.991 0.989 0.988 0.992 0.993 0.994 0.996
##  [85] 0.997 0.995 0.994 0.999 0.999 0.999 1.000 0.999 0.997 1.000 0.999 0.999
##  [97] 0.999 1.000 1.000 0.999

We are left with storing these probabilities together with the number of persons in the team in a data frame:

# create data frame with saved probabilities and number of persons
dat_plot <- data.frame(
n_persons = (min_persons - 1):max_persons,
prob = probs
)

# display first 6 rows
##   n_persons  prob
## 1         1    NA
## 2         2 0.001
## 3         3 0.005
## 4         4 0.013
## 5         5 0.012
## 6         6 0.019

Of course, two people having the same initials in a team of 1 (if we can call this a team…) is impossible, so the probability is 0. We impute this probability in our data frame, in the first row:

# set proba = 1 when n_person = 1
dat_plot[1, 2] <- 0

# display first 6 rows
##   n_persons  prob
## 1         1 0.000
## 2         2 0.001
## 3         3 0.005
## 4         4 0.013
## 5         5 0.012
## 6         6 0.019

Finally, we visualize these probabilities in function of the number of persons in the team:

# visualize probabilities
ggplot(dat_plot) +
aes(x = n_persons, y = probs) +
geom_line(linewidth = 1) +
labs(
x = "# of persons in the team",
y = "Probabilites of >= 2 persons with same initials"
) +
scale_y_continuous(labels = scales::percent) # format y-axis in %

From the plot above, we see that the probability that at least two persons have the same initials reaches 50% when the team exceeds around 30 people.

Moreover, notice that this probability becomes close to 100% when the team reaches around 75 people.

# Conclusion

The initial question, raised during a meeting, was “What is the probability that, among our team consisting of 8 persons, two have the same initials?”.

In this post, we showed how to compute this probability through simulations in R. Furthermore, we illustrated how for loops, replications and writing a function can be used in R to answer other probability problems.

As a side note, it is important to keep in mind that in this post, we assumed the following:

• All letters of the alphabet had the same probability of occurring (which is probably not the case in reality). This bias could be limited by specifying different weights when sampling initials.
• We restricted ourselves to initials of two letters. Therefore, for compound names, only the first letter is considered.

Last but not least, note that you will find slightly different results than mine, even if you use the exact same code. This is due to randomness. To replicate results as shown in this post, use set.seed(6).

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.

1. You can always use a larger number of replications, but in our case the final result is similar with more replications, and the aim of the post is more to show the development than the final answer.↩︎