Predicting height based on DNA mutations

Posted on October 7, 2018 by Florian Privé in R bloggers | 0 Comments

[This article was first published on Florian Privé, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I show some results of predicting height based on DNA mutations. This analysis aims at reproducing the analysis of this paper using my own analysis tools in.

I use a new dataset composed of 500,000 adults from UK, and genotyped over hundreds of thousands of DNA positions. This dataset is called the UK biobank, and also provide some baseline characteristics such as sex and birth date, which are useful for predicting height.

Model

Data

After some quality control and filtering, I end up with a dataset composed of 394,436 individuals genotyped over 564,148 DNA positions. This results in a matrix that counts the number of mutations (0, 1, or 2) for a given individual (observation) and a given DNA position (variable). Even if I can store each element of this matrix using one byte only, this results in ~200 GB of data.

I use 20,000 individuals as test set, and the rest as training set.

Baseline model

Before using DNA mutations, let us make a model with only two variables:

sex
birth date (year and month)

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  154.503399   0.091744 1684.075   <2e-16 ***
date           0.157243   0.001749   89.920   <2e-16 ***
sexMale       13.436380   0.132919  101.087   <2e-16 ***
date:sexMale  -0.003000   0.002544   -1.179    0.238

So basically, men are ~13.4 cm taller than women in average. There is also an effect due to societal changes (“Flynn Effects”) that results in people being 1 cm taller in average every ~6 years (I don’t mean people getting older, but people being born more recently).

Base prediction of height based on sex and birth date.

Model when adding tens of thousands of DNA mutations

Because of the size of the data and because effects of mutations are usually very small and additive, I use a linear model with lasso penalty to take into account many but not all DNA mutations (as in the paper I linked above).

Final prediction of height based on DNA mutations, sex and birth date.

Predicted height correlates with actual height at 65.5% for women and 65% for men; 61.8% and 61.4% if using DNA mutations only (without birth date). Note that it is estimated that prediction based on DNA mutations only could achieve a correlation of 70% at most.

Residuals from baseline and final predictions of height.

So, with this model combining sex, birth date and DNA mutations, predictions are precise within 8 cm for 90% of people and within 4 cm for 60% of people.

Conclusion

The UK biobank contains lots of information that could be used to predict e.g. people height. There are also some information about e.g. environmental factors that could be useful to predict height. Tell me if you have any idea and I’ll try to add new variables to the current model.

To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Predicting height based on DNA mutations

Model

Data

Baseline model

Model when adding tens of thousands of DNA mutations

Conclusion

Related

Model

Data

Baseline model

Model when adding tens of thousands of DNA mutations

Conclusion

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)