  new_ruid          pc1          pc2          pc3         pc4         pc5
1    11596  4.10996e-03 -0.002883830  0.003100840 -0.00638232  0.00709780
2     5415  3.22958e-03 -0.000299851 -0.005358910  0.00660643  0.00430520
3    11597 -4.35116e-03  0.013282400  0.006398130  0.01721600 -0.02275470
4     5416  4.01592e-03  0.001408180  0.005077310  0.00159497  0.00394816
5     3111  3.04779e-03 -0.002079510 -0.000127967 -0.00420436  0.01257460
6    11598  6.15318e-06 -0.000279919  0.001060880  0.00606267  0.00954331
We might also want to look at the next two PCs:
It’s probably best to look at all of them together:
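One way to draw such an all-pairs view is a base R scatterplot matrix. The sketch below is an assumption about how a plot like this could be produced, not necessarily the code behind the original figure; it rebuilds `pca` from just the six rows printed above so that it runs on its own.

```r
# Rebuild the six rows shown in the head() output above.
# The header has one fewer field than the data rows, so read.table()
# treats the first field of each row as the row name.
pca <- read.table(header = TRUE, text = "
new_ruid pc1 pc2 pc3 pc4 pc5
1 11596 4.10996e-03 -0.002883830 0.003100840 -0.00638232 0.00709780
2 5415 3.22958e-03 -0.000299851 -0.005358910 0.00660643 0.00430520
3 11597 -4.35116e-03 0.013282400 0.006398130 0.01721600 -0.02275470
4 5416 4.01592e-03 0.001408180 0.005077310 0.00159497 0.00394816
5 3111 3.04779e-03 -0.002079510 -0.000127967 -0.00420436 0.01257460
6 11598 6.15318e-06 -0.000279919 0.001060880 0.00606267 0.00954331
")

# Column 1 is the sample ID, so columns 2:6 hold pc1 through pc5.
# pairs() draws every PC against every other PC in one grid.
pairs(pca[2:6], main = "First five principal components")
```

With the full sample set in `pca`, only the `pairs()` call is needed.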
So this is where my mind plays tricks on me. I can’t make much sense out of these plots: there should be four ethnic groups represented, but it’s hard to see who goes where. To look at three of these dimensions simultaneously, we need a 3D plot. Now 3D plots (especially 3D scatterplots) aren’t highly regarded (in fact I hear that some poor soul at the University of Washington gets laughed at for showing his 3D plots), but in this case I found them quite useful.
Using a library called rgl, I generated a 3D scatterplot like so:
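The original call isn’t reproduced here, so what follows is only a minimal sketch of what it might have looked like, assuming the `pca` data frame above with pc1 to pc3 in columns 2 to 4. It uses just the six rows printed earlier so that it is self-contained, and it skips the plot if rgl isn’t installed.

```r
# Rebuild the six rows shown in the head() output above.
pca <- read.table(header = TRUE, text = "
new_ruid pc1 pc2 pc3 pc4 pc5
1 11596 4.10996e-03 -0.002883830 0.003100840 -0.00638232 0.00709780
2 5415 3.22958e-03 -0.000299851 -0.005358910 0.00660643 0.00430520
3 11597 -4.35116e-03 0.013282400 0.006398130 0.01721600 -0.02275470
4 5416 4.01592e-03 0.001408180 0.005077310 0.00159497 0.00394816
5 3111 3.04779e-03 -0.002079510 -0.000127967 -0.00420436 0.01257460
6 11598 6.15318e-06 -0.000279919 0.001060880 0.00606267 0.00954331
")

# Keep the first three PCs (column 1 is the sample ID).
pcs <- pca[2:4]

# plot3d() opens an interactive window that can be rotated with the mouse.
if (requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(pcs)
}
```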
Now, using the mouse I could rotate and play with the cloud of data points, and it became clearer how the ethnic groups sorted out. Just to double-check my intuition, I ran a model-based clustering algorithm (mclust) on the data. Different parameters obviously produce different cluster patterns, but I found that the “EEV” model (ellipsoidal clusters with equal volume and shape) with four clusters identified the groups I thought should be there based on the overlay with the HapMap samples.
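library(mclust)  # provides Mclust()
# Columns 2:4 of pca are pc1, pc2 and pc3 (column 1 is new_ruid)
fit <- Mclust(pca[2:4], G = 4, modelNames = "EEV")
plot3d(pca[2:4], col = fit$classification)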
Basically, the red points correspond to the European descent group, green to the admixed African American group, black to the Hispanic group, and blue to the Asian descent group. We are still a bit confused as to why the Asian descent samples don’t form a tighter cluster; it may be due to relatively poor performance of these AIMs in Asian descent groups. Whatever the case, you might notice several individuals falling either outside a clear cluster or at the interface between two groups. The ethnic assignment for these individuals is questionable, but the clustering algorithm gives us a very nice measure of cluster assignment uncertainty. We can plot this like so:
plot(pca[2:3], cex = fit$uncertainty*10)
I had to scale the uncertainty by 10 to make the questionable points, which show up as the larger hollow circles, more visible in this plot. We will likely drop these samples from any stratified analyses. We can export the cluster assignments from the fit$classification vector, and with that we have our samples assigned to an ethnic group.
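As a rough sketch of that last step, the assignments and uncertainties can be collected into one table, flagged, and written out. In mclust the uncertainty of a sample is one minus its largest posterior membership probability, so a cutoff on it is a cutoff on how confidently the sample was placed. The cluster labels, uncertainty values and the 0.25 cutoff below are made-up stand-ins for illustration; in practice they would come from `fit$classification`, `fit$uncertainty` and your own judgment.

```r
# Sample IDs taken from the head() output above.
ids <- c(11596, 5415, 11597, 5416, 3111, 11598)

# Hypothetical stand-ins for fit$classification and fit$uncertainty.
classification <- c(1, 1, 3, 1, 2, 1)
uncertainty    <- c(0.01, 0.02, 0.45, 0.00, 0.30, 0.03)

assignments <- data.frame(
  new_ruid    = ids,
  cluster     = classification,
  uncertainty = uncertainty
)

# Hypothetical cutoff: samples above it are treated as ambiguous.
cutoff <- 0.25
assignments$ambiguous <- assignments$uncertainty > cutoff

# Keep only confidently assigned samples for stratified analyses.
confident <- assignments[!assignments$ambiguous, ]

# Written to a temporary directory here; point this at a real path as needed.
out_file <- file.path(tempdir(), "cluster_assignments.csv")
write.csv(confident, out_file, row.names = FALSE)
```

Keeping the flag in the full table, rather than deleting rows outright, makes it easy to revisit the cutoff later.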