Model for nothing – and the bootstrap for free

[This article was first published on Timothée Poisot » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Reconstructing phylogenies is an interesting task, sadly one that often requires to navigate between a multitude of software. To add an unnecessary layer of complexity to the whole thing, most of these softwares speaks different languages, and requires the user to do endless conversions from fasta to phylip to nexus to whatever new format they may have came up with during the thirty seconds I needed to write this down. I can deal with complexity if it has some purpose (or when it is unavoidable), but having to go through the multitude of file format changes is always getting on my fragile little nerves.

My usual routine to transform an alignment into a ML tree (I am mostly working on 16S rDNA, by the way) is to convert the sequences from fasta (I use mostly MUSCLE to align, and the wonderfully intuitive Se-Al to check my alignments) to nexus, load the sequences in PAUP* and fire up the ModelTest block until the best model is selected. Once this is done, I convert the fasta file to phylip, load it in PhyML, spend some time setting the model only to discover that there is a missing space between a sequence ID and the sequence itself, at what point PhyML crashes, and I start asking myself why I thought it was a good idea to do science to begin with.

To be honest, I’m now used to these steps. For a reason that eludes me, I am using the command line version of these apps instead of the web-based ones (I was never a big fan of copying and pasting). Not to mention that I appreciate the strict division of labor between them. But if there is one thing I really appreciate (in a geeky kind of way), it is R. Thanks to the excellent package ape, most of these tasks can be achieved in R, with a great advantage. Instead of manipulating files, you are manipulating R objects, and the package takes care of the conversion for you!

This entry is more of a notebook for me, as you will see that some points remain unresolved. Any people with the required experience is welcome to comment and give other ideas!

Loading a fasta file in R using ape is not complicated. The instructions to use are

sequences <- read.dna(‘file.fasta’,format=‘fasta’)

What you need to do is to convert your file in phylip format, using


Now, you are ready to start finding the best model of evolution! To do so, just make sure that the phyml executable is in the current directory of R, then just do


At this point, R is instructing PhyML to go through 28 different models of substitution (ModelTest does 54). Once this is completed (depending on the size of your sample), you can either print or plot the resulting object. In my case


gives the following result:

where the models are presented according to their AIC. I shall copy here one advertisement given in the help page for this function

It is important to note that the models fitted by this function is only a small fraction of the models possible with PhyML. For instance, it is possible to vary the number of categories in the (discretized) gamma distribution of substitution rates, and many parameters can be fixed by the user. The results from the present function should rather be taken as indicative of a best model.

For my sequences, it appears that either Tamura-Ney or Hasegawa, Kishino and Yano, both with a specified α parameter to correct the Γ distribution (their AICs differs only by 0.06, so any of them will do…). The ape package offers a dist.dna function, but it is not very advanced, so we will use the phangorn, that can do a lot of different analyses (and the included vignettes are highly informative!).

Fortunately, because R users are such clever people, the two packages rely on the same structure to encode DNA alignments. We can use phangorn to build a quick tree using NJ or UPGMA. This is done with

dm <- dist.dna(sequences)
tree <- upgma(dm)

We can check the parsimony of the current tree, and the best possible parsimony using NNI, with


On with maximum likelihood, now. First, we will fit our data using the pml function

fit <- update(pml(tree,as.phyDat(sequences),model=‘HKY’),k=4)

The update part is just to add the gamma distribution to the HKY model. If we plot the object fit, there is a tree!

Now, we would like to optimize several aspects of our phylogeny. The gamma shape parameter (defaults to 1), and the edge lengths. This is done with

optree <- optim.pml(fit,optGamma=TRUE,optRooted=TRUE)

Which, finally, gives me the phylogenetic tree I was looking for!

The last step is to perform non-parametric bootstrap on this tree, using the bootstrap.pml function, which, provided you have access to distributed computing, will take advantage of it using the multicore package. Just keep in mind that bootstrap can take a lot of time… The function to perform and plot the results of the bootstrapping is

bs <- bootstrap.pml(optree,bs=50)

To conclude, using only free software, it is possible to do some phylogenetic analysis (I did not talked about the parsimony or distance-based methods also available within ape and phangorn). While not all the models are available, the selection is good, and it is definitely less cumbersome to do everything in R, rather than having to deal with multiple files in multiple formats.

To leave a comment for the author, please follow the link and comment on their blog: Timothée Poisot » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)