Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What is the “opposite” of sampling without replacement? In a classical urn model sampling without replacement means that you don’t replace the ball that you have drawn. Therefore the probability of drawing that colour becomes smaller. How about the opposite, i.e. that the probability becomes bigger? Then you have a so-called Pólya urn model!

Many real-world processes have this self-reinforcing property, e.g. leading to the distribution of wealth or the number of followers on social media. If you want to learn how to simulate such a process with R and encounter some surprising results, read on!

The basic idea of the Pólya urn model is to return two balls of the same colour to the urn which you have just drawn. Technically this process results asymptotically in a Dirichlet distribution. While sampling with and without replacement already exists in R (the sample() function and its argument replace = TRUE or FALSE) it is a nice little coding exercise to implement this model.

The following function takes as arguments the colours of the balls at the start cols and the number of draws n (technically n-1 draws because the initial filling of the urn is defined as the first step):

# needs R 4.1.0 or higher to run
polya_urn <- function(cols = c("black", "blue", "green", "red", "yellow"), n = 100) {
urn <- cols |> table() |> as.matrix() |> t()
urn <- rbind(urn, matrix(NA, nrow = n-1, ncol = ncol(urn)))
cols_unique <- colnames(urn)

# n-1 draws from Dirichelet distribution
for (i in seq_len(n-1)) {
urn[i+1, ] <- urn[i, ]
col_sample <- sample(cols_unique, size = 1, prob = (urn[i, ] / sum(urn[i, ]))) # sample ball
urn[i+1, col_sample] <- urn[i+1, col_sample] + 1 # add ball with same colour
}

plot(x = NA, xlim = c(1, n), ylim = c(1, ceiling(max(urn[n, ], na.rm = TRUE) / 10)*10), main = "Polya urn model", xlab = "no. of draws", ylab = "no. of balls", type = "n")
sapply(cols_unique, \(x) lines(urn[ , x], col = x, lwd = 2))
invisible(urn)
}

polya_urn()


As expected little random differences result in one colour dominating the rest, in this case, green. Please note that results can and will differ because of the involved randomness. To get a feel for the possible, sometimes surprisingly complex, dynamics, let’s run some more simulations:

polya_urn()


Again, green but this time with a strong contender, black! Now, let us increase the number of draws to 200:

polya_urn(n = 200)


The effect is now much more pronounced, blue leaves all other colours clearly behind. Let’s do this one more time:

polya_urn(n = 200)


This one is really interesting! Black and red are neck and neck! Obviously, there is more going on than just one colour dominating the rest from start to finish.

Now, let us make the initial filling more pronounced, three balls of blue instead of one:

polya_urn(cols = c("black", "blue", "blue", "blue", "green", "red", "yellow"))


Clearly, blue is the winner. But that does not have to be the case, here with three red balls instead of one:

polya_urn(cols = c("black", "blue", "green", "red", "red", "red", "yellow"))


At the beginning (till about step 20) red and blue are neck and neck, then blue starts to dominate and red and green are neck and neck. After about step 90 green even overtakes red!

Sometimes, even in these “biased” models, randomness takes its toll and we see more complicated dynamics playing out. This interplay of randomness and the self-reinforcing property makes these models so interesting and fascinating.

I hope that you enjoyed this post and it got you thinking. Please share your thoughts and the results of your own experiments in the comments below!