Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

https://doi.org/10.1016/j.ympev.2003.10.011

Abstract

Choice of a substitution model is a crucial step in the maximum likelihood (ML) method of phylogenetic inference, and investigators tend to prefer complex mathematical models to simple ones. However, when complex models with many parameters are used, the extent of noise in statistical inferences increases, and thus complex models may not produce the true topology with a higher probability than simple ones. This problem was studied using computer simulation. When the number of nucleotides used was relatively large (1000 bp), the HKY + Γ model showed a smaller dT (topological distance between the inferred and the true trees) than the JC and Kimura models. For shorter sequences (300 bp), a simpler model and search algorithm, such as the JC model with the SA + NNI search, were found to be as efficient as more complicated models and searches in terms of topological distance, although the topologies obtained under the HKY + Γ model had the highest likelihood values. The performance of the relatively simple SA + NNI search algorithm was essentially the same as that of the more extensive SA + TBR search under all models studied. In agreement with the conclusions of Takahashi and Nei [Mol. Biol. Evol. 17 (2000) 1251], our results indicate that simple models can be as efficient as complex models, and that the use of complex models does not necessarily give more reliable trees than simple models.

Introduction

In the maximum likelihood (ML) method of phylogenetic inference, the likelihood of observing a given set of sequence data under a specific substitution model is maximized for each topology, and the topology with the highest maximum likelihood is chosen as the final tree (Felsenstein, 1981; Nei and Kumar, 2000). Construction of ML trees is extremely time-consuming, especially when complex substitution models are used. Although there are heuristic algorithms that speed up computation (e.g., the fastDNAml method (Olsen et al., 1994), the NJML method (Ota and Li, 2000, 2001), and TrExML (Wolf et al., 2000)), the computational time required is still substantial (Lemmon and Milinkovitch, 2002; Rogers and Swofford, 1998; Salter, 2001). Because the actual pattern of nucleotide substitution is very complicated, many investigators tend to use complex and therefore time-consuming substitution models rather than simple ones (Hedin and Maddison, 2001; Posada and Crandall, 2001; Reyes et al., 2000; Rice et al., 1997). However, the probability of obtaining the true topology does not depend on the computational time, and the use of complex models may not produce the true topology with a higher probability than the use of simpler ones (Nei et al., 1998; Sullivan and Swofford, 2001; Takahashi and Nei, 2000). It has been shown that when the number of nucleotides relative to the number of sequences used is small, a simple model such as the Jukes–Cantor (JC) model shows almost the same or even better performance than more complex models such as the Hasegawa, Kishino, and Yano + Gamma (HKY + Γ) model under which the simulated sequence data were generated (Takahashi and Nei, 2000). Yet, the log likelihood score for the HKY + Γ model was always much higher than that for the JC model.
As a consequence, the tree inferred under the HKY + Γ model was always selected as the ML tree, although in terms of topological distance this ML tree was often farther from the true tree than the tree inferred under the JC model. However, these simulations were performed using only a relatively small number of nucleotides (300 bp). Therefore, we decided to investigate the efficiencies of different nucleotide substitution models in more detail, using relatively long nucleotide sequences. The relative efficiencies of various heuristic search algorithms were also compared.
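
The observation that the more complex model always attains the higher log likelihood can be illustrated on a much smaller problem than a full tree search. The following is a minimal sketch, not the procedure used in this study: it compares the JC model with the Kimura two-parameter (K80) model on a single pair of sequences, using a plain grid search in place of a proper optimizer. The site counts, grid ranges, and function names are hypothetical and chosen only for illustration.

```python
import math

def jc69_loglik(n_same, n_ts, n_tv, d):
    """Pairwise log likelihood under JC69 at distance d (substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e  # probability of each specific different base
    # Each site contributes pi_i * P_ij(d), with equal base frequencies pi_i = 1/4.
    return (n_same * math.log(0.25 * p_same)
            + (n_ts + n_tv) * math.log(0.25 * p_diff))

def k80_loglik(n_same, n_ts, n_tv, d, kappa):
    """Pairwise log likelihood under K80; kappa is the transition/transversion rate ratio."""
    beta_t = d / (kappa + 2.0)   # expected transversions of one type per site
    alpha_t = kappa * beta_t     # expected transitions per site
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.25 - 0.25 * e1      # each of the two possible transversions
    return (n_same * math.log(0.25 * p_same)
            + n_ts * math.log(0.25 * p_ts)
            + n_tv * math.log(0.25 * p_tv))

# Hypothetical counts for a 300-bp pairwise comparison.
N_SAME, N_TS, N_TV = 240, 45, 15

D_GRID = [0.005 * i for i in range(1, 401)]   # distances 0.005 .. 2.0
KAPPA_GRID = [0.5 * i for i in range(1, 41)]  # kappa 0.5 .. 20.0 (includes 1.0)

max_jc = max(jc69_loglik(N_SAME, N_TS, N_TV, d) for d in D_GRID)
max_k80 = max(k80_loglik(N_SAME, N_TS, N_TV, d, k)
              for d in D_GRID for k in KAPPA_GRID)

print("max lnL (JC69):", round(max_jc, 3))
print("max lnL (K80): ", round(max_k80, 3))
```

Because K80 with kappa = 1 reduces to JC69, the maximized K80 log likelihood can never fall below the JC69 value on the same data. A higher likelihood therefore reflects the extra parameter and says nothing by itself about whether the inferred topology is closer to the true one.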

Section snippets

Model trees and nucleotide substitution models

DNA sequences were randomly generated according to a given model tree (Fig. 1) and a given substitution model (see below). These sequences were subsequently used for tree construction under different substitution models. The simulation scheme generally corresponds to the one described in Takahashi and Nei (2000). Because of the prohibitive amount of time required to perform ML analysis on large data sets, we concentrated our study primarily on the case of 24 sequences. Two model topologies
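
The general idea of generating sequences along a model tree can be sketched as follows. This is only an illustration under strong simplifying assumptions: it uses the JC model rather than the more complex generating model of the study, a hypothetical four-taxon tree rather than the 24-sequence model trees, and invented branch lengths. The tree encoding and function names are likewise assumptions made for this example.

```python
import math
import random

BASES = "ACGT"

def evolve_jc69(seq, branch_length, rng):
    """Evolve a sequence along one branch under JC69.

    branch_length is the expected number of substitutions per site.
    """
    # Probability that a site ends the branch in a different base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # Under JC69 the three other bases are equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(node, parent_seq, rng, leaves):
    """Recursively evolve sequences down the tree and collect the leaf sequences.

    A node is a tuple (name, branch_length, children).
    """
    name, length, children = node
    seq = evolve_jc69(parent_seq, length, rng)
    if not children:
        leaves[name] = seq
    for child in children:
        simulate(child, seq, rng, leaves)
    return leaves

# Hypothetical model tree with made-up branch lengths.
MODEL_TREE = ("root", 0.0, [
    ("n1", 0.05, [("A", 0.10, []), ("B", 0.10, [])]),
    ("n2", 0.05, [("C", 0.10, []), ("D", 0.10, [])]),
])

rng = random.Random(1)
root_seq = "".join(rng.choice(BASES) for _ in range(300))
alignment = simulate(MODEL_TREE, root_seq, rng, {})
for name in sorted(alignment):
    print(name, alignment[name][:30])
```

Each replicate data set produced this way would then be analyzed under several substitution models, and the inferred topology compared with the known model tree.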

Efficiency of search algorithms

The results of our simulations are summarized in Tables 1–5 (see also supplementary Tables A–E). The dT values shown are the average topological distances from the true tree to each of the topologies found. To keep the tables comprehensible, the negative log likelihood values are not presented. However, we should note that in all cases examined the best log likelihood score belongs to the topology obtained under the combination of the most complex model
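
A topological distance of this kind can be computed by comparing the sets of taxon partitions induced by the interior branches of two trees. The sketch below assumes that dT is a partition-based distance in the style of Robinson and Foulds; the excerpt above does not spell out the exact definition, so this is an illustration rather than the measure as implemented in the study. Trees are encoded as nested tuples of taxon names, and the example trees are hypothetical.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Return the non-trivial partitions (splits) of an unrooted tree.

    Each split is stored as the side that does not contain a fixed
    reference taxon, so the result does not depend on rooting.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found

def topological_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Hypothetical five-taxon example.
true_tree = (("A", "B"), ("C", ("D", "E")))
inferred_tree = (("A", "C"), ("B", ("D", "E")))
same_tree_rerooted = ((("A", "B"), "C"), ("D", "E"))

print(topological_distance(true_tree, inferred_tree))
print(topological_distance(true_tree, same_tree_rerooted))
```

Averaging this quantity over replicate simulations would give a summary comparable in spirit to the dT values reported in the tables.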

Search algorithms

The relative efficiency of different substitution models used in ML phylogenetic inference and the performance of different heuristic search algorithms under the ML criterion were examined using computer simulations. The results showed that when a relatively large number of sequences is used, the overall performance of the computationally less extensive SA + NNI search is essentially the same as that of the more extensive SA + TBR search. Similarity in the performance of these two heuristic searches

Acknowledgments

I thank Masatoshi Nei for our numerous inspirational discussions. I am also grateful to Wen-Hsiung Li and two anonymous reviewers for their comments on an earlier version of this manuscript. This work was supported by grants from NIH (GM20293) and NASA (NCC2-1057) to Masatoshi Nei.

References (35)

  • M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. (1980)
  • A.R. Lemmon et al. The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc. Natl. Acad. Sci. USA (2002)
  • A.T. Lloyd et al. Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic Acids Res. (1992)
  • M. Nei et al. Molecular Evolution and Phylogenetics. (2000)
  • M. Nei et al. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA (1998)
  • G.J. Olsen et al. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. (1994)
  • S. Ota et al. NJML: a hybrid algorithm for the neighbor-joining and maximum-likelihood methods. Mol. Biol. Evol. (2000)

Cited by (23)

  • Applying a multiobjective metaheuristic inspired by honey bees to phylogenetic inference. BioSystems (2013)

    Citation excerpt: Other methods such as Tree Bisection and Reconnection (TBR) and Sectorial Searches (SS) (Goloboff, 1999) have become increasingly popular for maximum parsimony reconstruction. However, the complexity required to apply them has led researchers to prefer NNI and SPR for maximum likelihood analyses, achieving similar performances (Zwickl, 2006; Piontkivska, 2004). On the other hand, binary operators generate new phylogenies by combining information about evolutionary relationships from different sources.

  • Phylogenetics and systematics in a nutshell. Avian Malaria and Related Parasites in the Tropics: Ecology, Evolution and Systematics (2020)