# A New Dimension to Principal Components Analysis

First published on **Getting Genetics Done** and kindly contributed to R-bloggers.


*ethnic axes*. Price et al. published on this in 2006, and since then PCA plots have been a common component of many published GWAS. One key advantage of using PCA for ethnicity is that each sample is given coordinates in a multidimensional space corresponding to the varying components of its ethnic ancestry. Using either full GWAS data or a set of ancestry informative markers (AIMs), PCA can be conducted easily with available software packages such as EIGENSOFT or GCTA. HapMap samples are sometimes included in the PCA to provide a frame of reference for the ethnic groups.
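To make the idea of "coordinates for each sample" concrete, here is a minimal sketch in base R. It is not the EIGENSOFT or GCTA procedure: the genotypes are simulated, the two populations and their allele frequencies are invented for illustration, and `prcomp()` stands in for the dedicated genetics tools.

```r
# Simulated data only: 60 samples x 200 SNPs, coded as 0/1/2 minor-allele counts.
set.seed(1)
n_per_group <- 30
n_snps <- 200

# Two hypothetical populations whose allele frequencies differ slightly.
freq_a <- runif(n_snps, 0.05, 0.5)
freq_b <- pmin(pmax(freq_a + rnorm(n_snps, sd = 0.15), 0.01), 0.99)

simulate_group <- function(freq, n) {
  # One column per SNP; each column is drawn with that SNP's allele frequency.
  matrix(rbinom(n * length(freq), size = 2, prob = rep(freq, each = n)), nrow = n)
}
geno <- rbind(simulate_group(freq_a, n_per_group),
              simulate_group(freq_b, n_per_group))

# Drop SNPs with no variation, since they cannot be scaled.
keep <- apply(geno, 2, var) > 0
pc <- prcomp(geno[, keep], center = TRUE, scale. = TRUE)

# One row per sample, one column per component.
pcs <- data.frame(sample_id = seq_len(nrow(geno)), pc$x[, 1:5])
names(pcs)[-1] <- paste0("pc", 1:5)
head(pcs)
```

Each row of `pcs` is one sample's position along the first five components, which is the same shape of data the rest of this post works with.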

| | new_ruid | pc1 | pc2 | pc3 | pc4 | pc5 |
|---|---|---|---|---|---|---|
| 1 | 11596 | 4.10996e-03 | -0.002883830 | 0.003100840 | -0.00638232 | 0.00709780 |
| 2 | 5415 | 3.22958e-03 | -0.000299851 | -0.005358910 | 0.00660643 | 0.00430520 |
| 3 | 11597 | -4.35116e-03 | 0.013282400 | 0.006398130 | 0.01721600 | -0.02275470 |
| 4 | 5416 | 4.01592e-03 | 0.001408180 | 0.005077310 | 0.00159497 | 0.00394816 |
| 5 | 3111 | 3.04779e-03 | -0.002079510 | -0.000127967 | -0.00420436 | 0.01257460 |
| 6 | 11598 | 6.15318e-06 | -0.000279919 | 0.001060880 | 0.00606267 | 0.00954331 |

plot(pca$pc1, pca$pc2)

We might also want to look at the next two PCs:

plot(pca$pc2, pca$pc3)

It’s probably best to look at all of them together:

pairs(pca[2:4])

So this is where my mind plays tricks on me. I can’t make much sense of these plots: there should be four ethnic groups represented, but it’s hard to see who goes where. To look at all of these dimensions simultaneously, we need a 3D plot. Now, 3D plots (especially 3D *scatterplots*) aren’t highly regarded (in fact, I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots), but in this case I found them quite useful.

Using a library called rgl, I generated a 3D scatterplot like so:

library(rgl)  # provides plot3d()

plot3d(pca[2:4])

Now, using the mouse, I could rotate and play with the cloud of data points, and it became clearer how the ethnic groups sorted out. Just to double-check my intuition, I ran a model-based clustering algorithm (mclust) on the data. Different parameters obviously produce different cluster patterns, but I found that fitting four clusters (G = 4) with the “EEV” model (ellipsoidal, with equal volume and shape across clusters) identified the groups I thought should be there based on the overlay with the HapMap samples.

library(mclust)  # provides Mclust()

fit <- Mclust(pca[2:4], G = 4, modelNames = "EEV")

plot3d(pca[2:4], col = fit$classification)
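Since different parameters produce different cluster patterns, it can be worth checking how a fixed choice compares with what mclust would select on its own. The sketch below uses simulated three-dimensional data rather than the real PCs. The cluster centers, sample sizes, and the range `G = 2:6` are all assumptions made for illustration. When `modelNames` is left unset, `Mclust()` compares the available covariance models and picks the combination with the best BIC.

```r
library(mclust)  # provides Mclust()

# Simulated stand-in for three PC columns: four well-separated groups of 50 points.
set.seed(3)
centers <- rbind(c(0, 0, 0), c(5, 0, 0), c(0, 5, 0), c(0, 0, 5))
sim <- do.call(rbind, lapply(1:4, function(k) {
  matrix(rnorm(50 * 3), ncol = 3) +
    matrix(centers[k, ], nrow = 50, ncol = 3, byrow = TRUE)
}))
colnames(sim) <- c("pc1", "pc2", "pc3")

# Let BIC choose both the number of clusters (2 to 6) and the covariance model.
fit_auto <- Mclust(sim, G = 2:6)
fit_auto$G          # number of clusters selected
fit_auto$modelName  # covariance model selected

# Fixed choice, as in the post: four clusters with the EEV model.
fit_fixed <- Mclust(sim, G = 4, modelNames = "EEV")

# Cross-tabulate the two sets of assignments to see where they agree.
table(fixed = fit_fixed$classification, auto = fit_auto$classification)
```

If the two fits disagree badly on real data, that is a hint that the grouping depends heavily on the modeling choice rather than on clear structure in the PCs.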

In the cluster-colored 3D plot, red corresponds to the European descent group, green to the admixed African American group, black to the Hispanic group, and blue to the Asian descent group. We are still a bit confused as to why the Asian descent samples don’t form a tighter cluster; it may be due to relatively poor performance of these AIMs in Asian descent groups. Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups. The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty. We can plot this like so:

plot(pca[2:3], cex = fit$uncertainty*10)

I had to scale the uncertainty by a factor of 10 to make the questionable points more visible in this plot, where they appear as the larger hollow circles. We will likely drop these samples from any stratified analyses. We can export the cluster assignments from the fit$classification vector, and we have our samples assigned to an ethnic group.
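As a rough sketch of that last step, the snippet below builds an assignment table, drops uncertain samples, and writes the rest to a file. The posterior probabilities, sample IDs, cutoff of 0.1, and output file name are all made up for illustration. With a real fit, the cluster and uncertainty would come straight from `fit$classification` and `fit$uncertainty`. mclust defines the uncertainty as one minus the largest posterior membership probability in `fit$z`, and the code reproduces that by hand so that it runs without mclust.

```r
# Made-up posterior probabilities for 5 samples and 4 clusters (each row sums to 1).
# In a real analysis this matrix would be fit$z.
z <- rbind(
  c(0.97, 0.01, 0.01, 0.01),
  c(0.02, 0.95, 0.02, 0.01),
  c(0.48, 0.50, 0.01, 0.01),
  c(0.01, 0.01, 0.03, 0.95),
  c(0.05, 0.05, 0.55, 0.35)
)
sample_ids <- paste0("S", 1:5)  # hypothetical IDs standing in for new_ruid

# Most probable cluster, and how unsure that assignment is.
classification <- apply(z, 1, which.max)
uncertainty <- 1 - apply(z, 1, max)

assignments <- data.frame(
  new_ruid = sample_ids,
  cluster = classification,
  uncertainty = uncertainty
)

# Hypothetical cutoff: keep only samples whose assignment is reasonably certain.
max_uncertainty <- 0.1
confident <- assignments[assignments$uncertainty <= max_uncertainty, ]

# Write the retained assignments to a temporary file for downstream use.
out_file <- file.path(tempdir(), "cluster_assignments.csv")
write.csv(confident, out_file, row.names = FALSE)
```

Samples sitting between two clusters (the third and fifth rows here) are removed by the cutoff, which mirrors the decision to drop questionable samples from stratified analyses. The right cutoff depends on the data and is a judgment call.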
