Elsevier

Gene

Volume 385, 30 December 2006, Pages 103-110
Theoretical foundation to estimate the relative efficiencies of the Jukes–Cantor+gamma model and the Jukes–Cantor model in obtaining the correct phylogenetic tree

https://doi.org/10.1016/j.gene.2006.03.027

Abstract

This paper develops a theoretical foundation for estimating the relative efficiency (the probability of inferring the true tree) of different nucleotide substitution models. The approach is new: it is based on the neighbor-relation method and is derived directly from the four-point condition. As a first step, theoretical formulas were developed for the Jukes–Cantor+gamma (JC+Γ) model and the Jukes–Cantor (JC) model to estimate their relative efficiencies. The formulas were applied to several model topologies under both models to estimate the probability of obtaining the true tree. Extensive simulations were performed on the same set of trees to test the theoretical approach, and the simulation results agreed well with the theoretical estimates. Overall, the theoretical foundation for estimating model efficiencies proved accurate.

Introduction

A basic process in the evolution of DNA sequences is the change in nucleotides with time. This process deserves detailed consideration because changes in nucleotide sequences are used both for estimating the rate of evolution and for reconstructing the evolutionary history of organisms. As more and more sequences are determined, attempts to refine models seem ever more worthwhile. Better models will lead to more accurate estimations of the evolutionary history of the species concerned and to a better understanding of the forces and mechanisms that affected the evolution of the sequences. For this reason, many models have been proposed for studying this process (Jukes and Cantor, 1969; Kimura, 1980; Kimura, 1981; Lanave et al., 1984; Li, 1997; Swofford et al., 1996). Each of these models has advantages and disadvantages, and the overall relative efficiency of the models in recovering the correct phylogenetic tree is still controversial.

Distance methods of phylogenetic inference seem to be quite efficient in obtaining the correct tree compared with maximum parsimony and several other methods. The efficiency of a method is usually studied by computer simulation or by empirical tests. Gojobori et al. (1982), Tajima and Nei (1984), Zharkikh (1994), and others have conducted extensive simulation studies to compare the efficiencies of various methods for estimating the expected number of nucleotide substitutions between two sequences (δij). Other researchers have conducted theoretical studies with the same aim. Cavalli-Sforza and Edwards (1967) proposed the use of ordinary least squares, whereas Fitch and Margoliash (1967) used weighted least squares. Bulmer (1991) implemented the generalized least-squares method, in which the weighted sum of squares and cross products is minimized. Rzhetsky and Nei (1992) used the ordinary least-squares and minimum-evolution methods of phylogenetic inference in the four-taxa case. For more than four taxa, the neighbor-joining (NJ) method of phylogenetic inference is quite efficient in obtaining the correct tree (Saitou and Nei, 1987). Several computer simulations have been conducted to estimate the performance of the NJ method with a large number of taxa (Kumar and Gadagkar, 2000; Takahashi and Nei, 2000), but no comparable theoretical estimation exists for more than four taxa.

This article presents a new theoretical approach for estimating the efficiency of nucleotide substitution models. Here, the efficiency of a model is defined as the probability that the correct topology is inferred when the model is used in the tree inference algorithm. The approach is entirely different from the least-squares approach: the efficiency is estimated directly from the four-point condition (Buneman, 1971) by using simple statistical techniques. Initially, the Jukes–Cantor+gamma (JC+Γ) model and the Jukes–Cantor (JC) model are considered, and the JC+Γ model is assumed to be the true model. The primary goal of this work is to establish a theoretical foundation for estimating the relative efficiencies of the JC+Γ model and the JC model in obtaining the correct tree. The neighbor-relation method is used for tree estimation (Buneman, 1971), and the two sequence evolution models are then evaluated by using the four-point condition (Eqs. (8), (9)). Computer simulations have been conducted to evaluate the relative efficiencies of these models, but there is no mathematical framework for studying the success of each model. One reason for this gap seems to be the mathematical complexity of the problem, even when only a moderate number of DNA or amino acid sequences are used. In this work, four sequences are used.
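To make the role of the four-point condition concrete, the following is a minimal Python sketch of quartet inference from pairwise distances. It uses the standard form of the condition (for the tree pairing taxa 1 and 2 against 3 and 4, d12 + d34 is no larger than the other two pairwise sums, which are equal on an additive tree). The paper's own Eqs. (8), (9) are not reproduced in this excerpt, so the function names, the dictionary layout and the example distances are illustrative assumptions rather than the author's formulation.

```python
def dist(d, i, j):
    """Look up a symmetric pairwise distance stored under the key (min, max)."""
    return d[(min(i, j), max(i, j))]


def infer_quartet(d):
    """Return the pairing of taxa 1..4 with the smallest within-pair distance sum.

    On additive (tree-like) distances the four-point condition guarantees that
    the true pairing gives the smallest of the three sums.
    """
    sums = {
        ((1, 2), (3, 4)): dist(d, 1, 2) + dist(d, 3, 4),
        ((1, 3), (2, 4)): dist(d, 1, 3) + dist(d, 2, 4),
        ((1, 4), (2, 3)): dist(d, 1, 4) + dist(d, 2, 3),
    }
    return min(sums, key=sums.get)


# Hypothetical additive distances on the tree ((1,2),(3,4)):
# every external branch has length 0.1 and the internal branch 0.05.
example = {
    (1, 2): 0.20, (3, 4): 0.20,
    (1, 3): 0.25, (1, 4): 0.25,
    (2, 3): 0.25, (2, 4): 0.25,
}
best = infer_quartet(example)
print(best)
```

With estimated rather than exact distances the three sums are random quantities, which is why the probability of picking the correct pairing (the efficiency) depends on the substitution model used to compute the distances.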

The current work has two parts. First, a theoretical foundation is established to estimate the relative efficiencies of the JC+Γ model and the JC model in obtaining the true tree. Second, simulations are used to assess the accuracy of the theoretical results and to compare the efficiency of the two models.

Section snippets

Constant rate of substitution: JC model

Consider two homologous nucleotide sequences that diverged from a common ancestral sequence t years ago, and also consider the case where the rate of nucleotide substitution is the same for all pairs of nucleotides and equal to λ per site per year. The expected number of nucleotide substitutions per site between the two sequences (say i and j) for this case is given by

$$\delta_{ij} = 2\lambda t$$

If the proportion (pij) of the nucleotide sites is known, the expected number of nucleotide substitutions (also called
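The snippet above breaks off before the distance formulas are stated. As a hedged illustration, the sketch below implements the commonly used textbook distance corrections for the two models, taking p as the observed proportion of differing sites: d = -(3/4) ln(1 - 4p/3) for JC, and d = (3/4) a [(1 - 4p/3)^(-1/a) - 1] for JC+Γ with gamma shape parameter a. These are the standard forms, not quotations from the paper, and the function names and the symbol `a` are assumptions made for this example.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_gamma_distance(p, a):
    """Jukes-Cantor distance with gamma-distributed rates (shape parameter a)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC+gamma correction")
    if a <= 0.0:
        raise ValueError("the gamma shape parameter a must be positive")
    return 0.75 * a * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / a) - 1.0)


# Illustrative values only: the same observed difference corrected two ways.
p_obs = 0.3
d_jc = jc_distance(p_obs)
d_gamma = jc_gamma_distance(p_obs, a=1.0)
print(d_jc, d_gamma)
```

For the same p, the gamma correction gives a larger distance than plain JC, and the two agree in the limit of a large shape parameter. This is the sense in which using the JC model on data generated under JC+Γ underestimates distances.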

Estimation of efficiency

Analytical formulas for the expectations, variances and correlation coefficients were derived for the JC+Γ model and the JC model (see Appendix B). For example, for topology 1 under the JC+Γ model, E(D) = E(D′) = 0.020, Var(D) = Var(D′) = 0.0006983 and ρDD = 0.57086. The same calculation was carried out for all topologies and for both models to obtain the expectations (E(D) and E(D′)), variances (Var(D) and Var(D′)) and correlation coefficient (ρDD) from the theoretical derivation.
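As a rough illustration of how such moments could be turned into a probability, the sketch below treats (D, D′) as bivariate normal, in line with the normality assumption mentioned in the Discussion, and estimates P(D > 0 and D′ > 0) by Monte Carlo sampling. Two points are assumptions of this example and are not confirmed by the excerpt: that the correct tree corresponds to the event D > 0 and D′ > 0, and that sampling is an acceptable stand-in for whatever analytical evaluation the paper uses. The function name and the sample size are also hypothetical.

```python
import math
import random


def efficiency_normal_mc(mean, var, rho, n=200_000, seed=0):
    """Monte Carlo estimate of P(D > 0 and D' > 0) for a bivariate normal.

    Assumes D and D' share the same mean and variance and have correlation rho.
    """
    rng = random.Random(seed)
    sd = math.sqrt(var)
    c = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to z1.
        z2 = rho * z1 + c * rng.gauss(0.0, 1.0)
        if mean + sd * z1 > 0.0 and mean + sd * z2 > 0.0:
            hits += 1
    return hits / n


# Moments quoted above for topology 1 under the JC+gamma model.
est = efficiency_normal_mc(mean=0.020, var=0.0006983, rho=0.57086)
print(est)
```

Because D and D′ are positively correlated, the joint probability is larger than the product of the two marginal probabilities, so ignoring the correlation would understate the efficiency under these assumptions.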

The theoretical efficiency of a

Discussion

Theoretical efficiencies of the JC+Γ model and the JC model were estimated assuming that the JC+Γ model is the true model. Simulation experiments were also conducted to assess the accuracy of the theoretical estimation and to justify the appropriateness of the normality assumption. The efficiencies of both models were estimated from the simulation experiments in two different ways: (i) direct estimation, and (ii) under the normality assumption. The two sets of results are very close (Fig. 3, Fig. 4). As an

Acknowledgements

I thank Olivier Gascuel, Indranil Mukhopadhyay and two anonymous referees for their helpful comments. I also thank Sudhir Kumar for helpful discussions. This work was supported by a grant from CNRS to Olivier Gascuel and from the National Institutes of Health to Sudhir Kumar.

References (32)

  • T.H. Jukes et al. Evolution of protein molecules.
  • M. Bulmer. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. (1991)
  • P. Buneman. The recovery of trees from measurements of dissimilarity.
  • L.L. Cavalli-Sforza et al. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. (1967)
  • J. Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. (1978)
  • J. Felsenstein. PHYLIP - Phylogeny Inference Package (version 3.2). Cladistics (1989)
  • W.M. Fitch et al. Construction of phylogenetic trees. Science (1967)
  • T. Gojobori et al. Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J. Mol. Evol. (1982)
  • M. Hasegawa et al. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985)
  • J.P. Huelsenbeck. Is the Felsenstein zone a fly trap? Syst. Biol. (1997)
  • M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. (1980)
  • M. Kimura. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. U. S. A. (1981)
  • M. Kimura et al. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. (1972)
  • S. Kumar et al. Efficiency of the neighbor-joining method in reconstructing deep and shallow evolutionary relationships in large phylogenies. J. Mol. Evol. (2000)
  • C. Lanave et al. A new method for calculating evolutionary substitution rates. J. Mol. Evol. (1984)
  • W.H. Li. Molecular Evolution (1997)