Simulating Allele Counts in a Population Using R

This article was first published on Doodling with Data and kindly contributed to R-bloggers.

This post is inspired by the Week 7 lectures of the Coursera course “Introduction to Genetics and Evolution”. (I highly recommend this course for anyone interested in genetics, BTW.) Professor Noor uses a University of Washington program called AlleleA1 for trying out scenarios.

We can just as well use R to get an intuitive feel for how alleles and genotypes propagate or die out in populations.

Basic Scenario

There are N individuals on an isolated island. Say we are interested in two specific alleles, big “A” and small “a”. Each individual then has one of three genotypes: AA, Aa or aa. The individuals mate in pairs, produce two offspring and die. (Thus the total population stays the same generation after generation.)
The genotype of the offspring depends on those of the parents. Each parent contributes a ‘gamete’, which carries just one of that parent’s two alleles. An AA parent can only produce gametes of type A, an aa parent can only produce gametes of type a, but an Aa parent can produce either type.

A Punnett square mapping the parents’ gametes to the offspring’s genotypes:

  | A  | a
A | AA | Aa 
a | Aa | aa 
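
The same square can be enumerated in R for any pair of parents. This is only an illustrative sketch: the helper names `gametes_of` and `punnett` are made up for this example and are not from the original script.

```r
# Alleles each genotype can put into a gamete (each entry is equally likely)
gametes_of <- list(AA = c("A", "A"), Aa = c("A", "a"), aa = c("a", "a"))

# Hypothetical helper: offspring genotype probabilities for one cross
punnett <- function(parent1, parent2) {
  g1 <- gametes_of[[parent1]]
  g2 <- gametes_of[[parent2]]
  # Pair every gamete of parent 1 with every gamete of parent 2,
  # writing the heterozygote as "Aa" regardless of which parent gave "A"
  cells <- outer(g1, g2, function(x, y) {
    ifelse(x == "a" & y == "A", paste0(y, x), paste0(x, y))
  })
  table(factor(cells, levels = c("AA", "Aa", "aa"))) / length(cells)
}

punnett("Aa", "Aa")  # 1/4 AA, 1/2 Aa, 1/4 aa
punnett("AA", "aa")  # all offspring are Aa
```
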

With these simple rules, we can use R simulation scripts to observe what happens to the allele frequencies over generations. (The goal here is to learn to use R for Monte Carlo simulations.)

Writing the R Script from scratch

I toyed with the idea of using character strings for the genotypes and the alleles. But then I realized that there are only three genotypes, so I could just as easily represent them with the numbers 1, 2 and 3 in a simple R vector.
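
A minimal sketch of that representation follows. The mapping 1 = AA, 2 = Aa, 3 = aa is an assumption (the post does not say which number stands for which genotype), and `make_population` is a hypothetical helper name.

```r
# Assumed encoding: 1 = AA, 2 = Aa, 3 = aa
genotype_labels <- c("AA", "Aa", "aa")

# Hypothetical helper: build a population vector from genotype counts
make_population <- function(n_AA, n_Aa, n_aa) {
  rep(1:3, times = c(n_AA, n_Aa, n_aa))
}

pop <- make_population(n_AA = 5, n_Aa = 10, n_aa = 5)

# Tabulate with readable labels
table(factor(genotype_labels[pop], levels = genotype_labels))
```
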

With that done, we can write very simple functions for the procreation process.
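
One way such functions might look, assuming the encoding 1 = AA, 2 = Aa, 3 = aa. The names `make_gametes` and `make_offspring` and the `prob_A` parameter (the chance that an Aa parent passes on A) are assumptions for this sketch, not the original code.

```r
# Assumed encoding: genotypes 1 = AA, 2 = Aa, 3 = aa; gametes 1 = "A", 0 = "a"

# One gamete per parent. prob_A is a hypothetical parameter: the chance
# that a heterozygous (Aa) parent passes on A (0.5 is the neutral case).
make_gametes <- function(genotypes, prob_A = 0.5) {
  from_het <- rbinom(length(genotypes), size = 1, prob = prob_A)
  ifelse(genotypes == 1, 1L,              # AA can only give A
         ifelse(genotypes == 3, 0L,       # aa can only give a
                from_het))                # Aa gives either
}

# Combine one gamete from each parent into an offspring genotype
make_offspring <- function(parent1, parent2, prob_A = 0.5) {
  n_A <- make_gametes(parent1, prob_A) + make_gametes(parent2, prob_A)
  3L - n_A  # two copies of A -> 1 (AA), one -> 2 (Aa), none -> 3 (aa)
}

make_offspring(1, 3)  # AA x aa always gives 2 (Aa)
```
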
With these functions, we can take one generation and produce the next: two offspring for each pair of parents.
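
The generation step could be sketched as below. Random pairing by shuffling is an assumption about how mates are chosen, and all function names are hypothetical; the two helpers are repeated so the snippet runs on its own.

```r
# Assumed encoding: 1 = AA, 2 = Aa, 3 = aa (helpers repeated for completeness)
make_gametes <- function(genotypes, prob_A = 0.5) {
  from_het <- rbinom(length(genotypes), size = 1, prob = prob_A)
  ifelse(genotypes == 1, 1L, ifelse(genotypes == 3, 0L, from_het))
}
make_offspring <- function(parent1, parent2, prob_A = 0.5) {
  3L - (make_gametes(parent1, prob_A) + make_gametes(parent2, prob_A))
}

# Hypothetical helper: pair individuals at random, two offspring per couple
next_generation <- function(pop, prob_A = 0.5) {
  stopifnot(length(pop) >= 2, length(pop) %% 2 == 0)
  shuffled <- sample(pop)                 # random mating
  half <- length(pop) / 2
  # Each couple is listed twice, once per offspring
  parent1 <- rep(shuffled[1:half], each = 2)
  parent2 <- rep(shuffled[(half + 1):length(pop)], each = 2)
  make_offspring(parent1, parent2, prob_A)
}

pop <- rep(1:3, times = c(5, 10, 5))
table(next_generation(pop))  # population size is unchanged
```
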
Putting it all together to generate multiple trials:
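
A sketch of the trial loop, under the same assumptions as before (encoding 1 = AA, 2 = Aa, 3 = aa, random pairing). `run_trial` and the other names are hypothetical, and the helpers are repeated so the snippet is self-contained.

```r
# Helpers (assumed encoding: 1 = AA, 2 = Aa, 3 = aa)
make_gametes <- function(genotypes, prob_A = 0.5) {
  from_het <- rbinom(length(genotypes), size = 1, prob = prob_A)
  ifelse(genotypes == 1, 1L, ifelse(genotypes == 3, 0L, from_het))
}
next_generation <- function(pop, prob_A = 0.5) {
  shuffled <- sample(pop)
  half <- length(pop) / 2
  parent1 <- rep(shuffled[1:half], each = 2)
  parent2 <- rep(shuffled[(half + 1):length(pop)], each = 2)
  3L - (make_gametes(parent1, prob_A) + make_gametes(parent2, prob_A))
}

# One trial: a matrix with one row per generation (row 1 is the start)
run_trial <- function(start_pop, n_generations, prob_A = 0.5) {
  history <- matrix(NA_integer_, nrow = n_generations + 1,
                    ncol = length(start_pop))
  history[1, ] <- start_pop
  for (g in seq_len(n_generations)) {
    history[g + 1, ] <- next_generation(history[g, ], prob_A)
  }
  history
}

set.seed(42)
start_pop <- rep(1:3, times = c(5, 10, 5))
# Ten independent trials of 50 generations each, kept as a list of matrices
trials <- replicate(10, run_trial(start_pop, n_generations = 50),
                    simplify = FALSE)
dim(trials[[1]])  # generations + 1 rows, one column per individual
```
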

We also need to compute the allele counts for each generation. For plotting I use ggplot.
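
Counting alleles is simple arithmetic: every AA individual carries two copies of A, every Aa carries one of each, and every aa carries two copies of a. The sketch below assumes the encoding 1 = AA, 2 = Aa, 3 = aa and uses a tiny hand-made `history` matrix as a stand-in for real simulation output; `allele_counts` is a hypothetical helper name.

```r
# Assumed encoding: 1 = AA, 2 = Aa, 3 = aa
allele_counts <- function(pop) {
  c(A = 2 * sum(pop == 1) + sum(pop == 2),
    a = 2 * sum(pop == 3) + sum(pop == 2))
}

# Toy stand-in for simulation output:
# one row per generation, one column per individual
history <- rbind(c(1, 2, 2, 3),
                 c(1, 1, 2, 3),
                 c(1, 1, 1, 2))

counts <- t(apply(history, 1, allele_counts))
df <- data.frame(generation = seq_len(nrow(history)) - 1,
                 count_A = counts[, "A"],
                 count_a = counts[, "a"])
print(df)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = generation, y = count_A)) +
    geom_line() +
    labs(x = "Generation", y = "Copies of allele A")
  print(p)
}
```

With several trials, one option is to stack the per-trial data frames with a `trial` column and map it to `group` or `colour` in `aes()`.
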

Using this simple Monte Carlo “toy” we can develop quite a bit of intuition.

For small starting populations, either the big A or the small a allele takes over the entire population fairly quickly. Given a large enough number of generations, one of the alleles invariably gets wiped out.

As one example, we can see that even a small increase in the probability of allele A, to 0.53 from 0.5, makes it take over quite dramatically.

Conversely, setting that probability to any value under 0.5 means that the big A allele gets wiped out of the entire population.
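
To try this kind of comparison, here is a rough sketch. How the original script defines the “probability of allele A” is not shown, so this version assumes it is the chance that an Aa parent passes on A (`prob_A`, a hypothetical parameter). The encoding 1 = AA, 2 = Aa, 3 = aa and all function names are likewise assumptions.

```r
# Assumed encoding: 1 = AA, 2 = Aa, 3 = aa
make_gametes <- function(genotypes, prob_A) {
  from_het <- rbinom(length(genotypes), size = 1, prob = prob_A)
  ifelse(genotypes == 1, 1L, ifelse(genotypes == 3, 0L, from_het))
}
next_generation <- function(pop, prob_A) {
  shuffled <- sample(pop)                 # random mating
  half <- length(pop) / 2
  parent1 <- rep(shuffled[1:half], each = 2)
  parent2 <- rep(shuffled[(half + 1):length(pop)], each = 2)
  3L - (make_gametes(parent1, prob_A) + make_gametes(parent2, prob_A))
}

# Frequency of A after n_generations, starting from 1/4 AA, 1/2 Aa, 1/4 aa
# (n should be divisible by 4)
final_freq_A <- function(prob_A, n = 200, n_generations = 100) {
  pop <- rep(1:3, times = c(n / 4, n / 2, n / 4))
  for (g in seq_len(n_generations)) pop <- next_generation(pop, prob_A)
  (2 * sum(pop == 1) + sum(pop == 2)) / (2 * n)
}

set.seed(7)
probs <- c(0.47, 0.50, 0.53)
# Average final frequency of A over 20 trials for each probability
mean_freq <- sapply(probs, function(p) mean(replicate(20, final_freq_A(p))))
data.frame(prob_A = probs, mean_final_freq_A = mean_freq)
```
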

The entire R script can be found here. You can download the code and try playing with various starting scenarios, changing the starting population counts, generations and probabilities.

  1. Introduction to Genetics and Evolution by Mohamed Noor (Coursera), Week 7 lectures
  2. AlleleA1 software, University of Washington
