Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used
Introduction
In the maximum likelihood (ML) method of phylogenetic inference, the likelihood of observing a given set of sequence data under a specific substitution model is maximized for each topology, and the topology with the highest maximum likelihood is chosen as the final tree (Felsenstein, 1981; Nei and Kumar, 2000). Construction of ML trees is extremely time-consuming, especially when complex substitution models are used. Although there are heuristic algorithms that speed up computation (e.g., the fastDNAml method (Olsen et al., 1994), the NJML method (Ota and Li, 2000, 2001), and TrExML (Wolf et al., 2000)), the computational time required is still substantial (Lemmon and Milinkovitch, 2002; Rogers and Swofford, 1998; Salter, 2001). Because the actual pattern of nucleotide substitutions is very complicated, many investigators tend to use complex and therefore time-consuming substitution models rather than simple ones (Hedin and Maddison, 2001; Posada and Crandall, 2001; Reyes et al., 2000; Rice et al., 1997). However, the probability of obtaining the true topology does not depend on the computational time, and complex models may not recover the true topology with a higher probability than simpler ones (Nei et al., 1998; Sullivan and Swofford, 2001; Takahashi and Nei, 2000). It has been shown that when the number of nucleotides is small relative to the number of sequences, a simple model such as the Jukes–Cantor (JC) model performs almost as well as, or even better than, more complex models such as the Hasegawa, Kishino, and Yano + Gamma (HKY + Γ) model, even when the simulated sequence data were generated under the HKY + Γ model (Takahashi and Nei, 2000). Yet, the log likelihood score for the HKY + Γ model was always much higher than that for the JC model.
As a consequence, the tree inferred under the HKY + Γ model was always selected as the ML tree, although in terms of topological distance this ML tree was often farther from the true tree than the tree inferred under the JC model. However, these simulations were performed using only a relatively small number of nucleotides (300 bp). Therefore, we decided to investigate the efficiencies of different nucleotide substitution models in more detail, using relatively long nucleotide sequences. The relative efficiencies of various heuristic algorithms were also compared.
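To make the idea of maximizing a likelihood under a substitution model concrete, the sketch below works through the simplest possible case: two aligned sequences under the JC model. It is only an illustration and not the tree-wide computation used in ML tree construction; the function names `jc69_log_likelihood` and `jc69_ml_distance` and the toy sequences are assumptions introduced here.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by distance d
    (expected substitutions per site) under the JC model.
    Hypothetical helper for illustration only."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        # Equal base frequencies (1/4) times the transition probability.
        log_l += math.log(0.25) + math.log(p_same if x == y else p_diff)
    return log_l


def jc69_ml_distance(seq_a, seq_b):
    """Closed-form ML estimate of the JC distance: d = -(3/4) ln(1 - 4p/3),
    where p is the observed proportion of differing sites."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for a finite JC distance")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy data (made up): 20 sites, 2 differences.
a = "ACGTACGTACGTACGTACGT"
b = "ACGTACGTACGAACGTACGA"
d_hat = jc69_ml_distance(a, b)
print(round(d_hat, 4), round(jc69_log_likelihood(a, b, d_hat), 3))
```

For two sequences the maximizing distance has a closed form; for a full tree, branch lengths and model parameters are optimized numerically for each candidate topology, which is where the computational cost discussed above comes from.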
Model trees and nucleotide substitution models
DNA sequences were randomly generated according to a given model tree (Fig. 1) and a given substitution model (see below). These sequences were subsequently used for tree construction under different substitution models. The simulation scheme generally corresponds to the one described in Takahashi and Nei (2000). Because of the prohibitive amount of time required to perform ML analysis on large data sets, we concentrated our study primarily on the case of 24 sequences. Two model topologies
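The general idea of generating sequences along a model tree can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the JC model and a made-up three-taxon tree rather than the model trees of Fig. 1 or the substitution models used in the study, and the names `evolve_jc69`, `simulate`, and `simulate_alignment`, as well as the nested-list tree encoding, are hypothetical.

```python
import math
import random

BASES = "ACGT"


def evolve_jc69(seq, d, rng):
    """Evolve seq along a branch of length d (substitutions per site)
    under the JC model. Hypothetical helper for illustration."""
    # Probability that a site ends the branch in a different base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(node, seq, rng, out):
    """Recursively pass seq down the tree. A node is either a leaf name
    (str) or a list of (child, branch_length) pairs (assumed encoding)."""
    if isinstance(node, str):
        out[node] = seq
        return
    for child, branch_length in node:
        simulate(child, evolve_jc69(seq, branch_length, rng), rng, out)


def simulate_alignment(tree, length, seed):
    """Draw a random root sequence and evolve it to every leaf."""
    rng = random.Random(seed)
    root = "".join(rng.choice(BASES) for _ in range(length))
    out = {}
    simulate(tree, root, rng, out)
    return out


# Toy tree (made up): A joins the ancestor of B and C.
toy_tree = [("A", 0.1), ([("B", 0.05), ("C", 0.05)], 0.1)]
for name, seq in simulate_alignment(toy_tree, 60, seed=1).items():
    print(name, seq)
```

The resulting alignment would then be passed to tree reconstruction under each substitution model being compared.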
Efficiency of search algorithms
The results of our simulations are summarized in Tables 1–5 (see also supplementary Tables A–E). The dT values shown are the average topological distances from the true tree to each of the topologies found. To make the tables more comprehensible, the negative log likelihood values are not presented. However, we should note that in all cases examined the best log likelihood score belongs to the topology obtained under the combination of the most complex model
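A topological distance of this kind can be computed by comparing the sets of leaf bipartitions induced by the interior branches of two trees. The sketch below assumes that dT is this partition-based distance (in the style of the "Comparison of phylogenetic trees" reference); the snippet itself does not define dT, so this is an assumption. The nested-tuple tree encoding and the names `bipartitions` and `topological_distance` are hypothetical.

```python
def bipartitions(tree):
    """Return (all leaves, set of nontrivial bipartitions) for a tree given
    as nested tuples of leaf names (assumed encoding). Each bipartition is
    stored as the side that does not contain a fixed reference leaf."""
    clades = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        clades.append(clade)
        return clade

    all_leaves = walk(tree)
    ref = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Keep only interior branches: both sides need at least two leaves.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return all_leaves, splits


def topological_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    leaves_a, splits_a = bipartitions(tree_a)
    leaves_b, splits_b = bipartitions(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(splits_a ^ splits_b)


# Toy trees (made up) on five taxa.
true_tree = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))
print(topological_distance(true_tree, inferred))  # 2
print(topological_distance(true_tree, true_tree))  # 0
```

Averaging this quantity over simulation replicates would give a mean distance per model and search strategy, which is how the dT columns are described above.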
Search algorithms
The relative efficiency of different substitution models used in ML phylogenetic inference and the performance of different heuristic search algorithms under the ML criterion were examined using computer simulations. The results showed that when a relatively large number of sequences is used, the overall performance of the computationally less extensive SA + NNI search is essentially the same as that of the more extensive SA + TBR search. Similarity in the performance of these two heuristic searches
Acknowledgments
I thank Masatoshi Nei for our numerous inspirational discussions. I am also grateful to Wen-Hsiung Li and two anonymous reviewers for their comments on an earlier version of this manuscript. This work was supported by grants from NIH (GM20293) and NASA (NCC2-1057) to Masatoshi Nei.
References (35)
- et al. A combined molecular approach to phylogeny of the jumping spider subfamily Dendryphantinae (Araneae: Salticidae). Mol. Phylogenet. Evol. (2001)
- et al. Evolution of protein molecules
- et al. Long-branch attraction phenomenon and the impact of among-site rate variation on rodent phylogeny. Gene (2000)
- et al. Comparison of phylogenetic trees. Math. Biosci. (1981)
- et al. Evolution of the Adh locus in the Drosophila willistoni group: the loss of an intron, and shift in codon usage. Mol. Biol. Evol. (1993)
- et al. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. (1999)
- Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. (1981)
- et al. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985)
- et al. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. (1997)
- et al. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. (1990)
- A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol.
- The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc. Natl. Acad. Sci. USA
- Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic Acids Res.
- Molecular Evolution and Phylogenetics
- The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA
- fastDNAmL: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci.
- NJML: a hybrid algorithm for the neighbor-joining and maximum-likelihood methods. Mol. Biol. Evol.