In general, the standard practice for correcting for population stratification in genetic studies is to use principal components analysis (PCA) to categorize samples along different ethnic axes. Price et al. published on this in 2006, and since then PCA plots are a common component of many published GWAS studies. One key advantage to using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of their ethnic ancestry. Using either full GWAS data or a set of ancestral informative markers (AIMs), PCA can be easily conducted using available software packages like EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA analysis to provide a frame of reference for the ethnic groups.
Once computed, each sample will have values that correspond to a position in the new coordinate system that effectively clusters samples together by ethnic similarity. The results of this analysis are usually plotted/visualized to identify ethnic outliers or to simply examine the structure of the data. A common problem however is that it may take more than the first two principal components to identify groups.
To illustrate, I will plot some PCs generated based on 125 AIMs markers for a recent study of ours. I generated these using GCTA software and loaded the top 5 PCs into R using the read.table() function. I loaded the top 5, but for continental ancestry, I've found that the top 3 are usually enough to separate groups. The values look something like this:
new_ruid pc1 pc2 pc3 pc4 pc5
1 11596 4.10996e-03 -0.002883830 0.003100840 -0.00638232 0.00709780
2 5415 3.22958e-03 -0.000299851 -0.005358910 0.00660643 0.00430520
3 11597 -4.35116e-03 0.013282400 0.006398130 0.01721600 -0.02275470
4 5416 4.01592e-03 0.001408180 0.005077310 0.00159497 0.00394816
5 3111 3.04779e-03 -0.002079510 -0.000127967 -0.00420436 0.01257460
6 11598 6.15318e-06 -0.000279919 0.001060880 0.00606267 0.00954331
I loaded this into a dataframe called pca, so I can plot the first two PCs using this command:
We might also want to look at the next two PCs:
Its probably best to look at all of them together:
So this is where my mind plays tricks on me. I can't make much sense out of these plots -- there should be four ethnic groups represented, but its hard to see who goes where. To look at all of these dimensions simultaneously, we need a 3D plot. Now 3D plots (especially 3D scatterplots) aren't highly regarded -- in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots -- but in this case I found them quite useful.
Using a library called rgl, I generated a 3D scatterplot like so:
Now, using the mouse I could rotate and play with the cloud of data points, and it became more clear how the ethnic groups sorted out. Just to double check my intuition, I ran a model-based clustering algorithm (mclust) on the data. Different parameters obviously produce different cluster patterns, but I found that using an "ellipsoidal model with equal variances" and a cluster size of 4 identified the groups I thought should be there based on the overlay with the HapMap samples.
fit <- Mclust(pca[2:4], G=4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification)
plot(pca[2:3], cex = fit$uncertainty*10)