Does gene flow destroy phylogenetic signal? The performance of three methods for estimating species phylogenies in the presence of gene flow

https://doi.org/10.1016/j.ympev.2008.09.008

Abstract

Incomplete lineage sorting has been documented across a diverse set of taxa ranging from songbirds to conifers. Such patterns are expected theoretically for species characterized by certain life history characteristics (e.g. long generation times) and those influenced by certain historical demographic events (e.g. recent divergences). A number of methods to estimate the underlying species phylogeny from a set of gene trees have been proposed and shown to be effective when incomplete lineage sorting has occurred. The further effects of gene flow on those methods, however, remain to be investigated. Here, we focus on the performance of three methods of species tree inference, ESP-COAL, minimizing deep coalescence (MDC), and concatenation, when incomplete lineage sorting and gene flow jointly confound the relationship between gene and species trees. Performance was investigated using Monte Carlo coalescent simulations under four models (n-island, stepping stone, parapatric, and allopatric) and three magnitudes of gene flow (Nem = 0.01, 0.10, 1.00). Although results varied by the model and magnitude of gene flow, methods incorporating aspects of the coalescent process (ESP-COAL and MDC) performed well, with probabilities of identifying the correct species tree topology typically increasing to greater than 0.75 when five or more loci were sampled. The only exceptions to that pattern involved gene flow at moderate to high magnitudes under the n-island and stepping stone models. Concatenation performed poorly relative to the other methods. We extend these results to a discussion of the importance of species and population phylogenies to the fields of molecular systematics and phylogeography using an empirical example from Rhododendron.

Introduction

The fundamental goal of systematics is to understand the process of lineage divergence that leads to the formation of new species. Since Maddison (1997) there has been growing acceptance among systematists that gene genealogies are not always congruent with species phylogenies (i.e. the actual pattern of lineage splitting and descent from common ancestors). It is now widely recognized that processes such as gene duplication (Fitch, 1970), lateral transfer (Cummings, 1994) and incomplete lineage sorting (Tajima, 1983, Takahata and Nei, 1985, Hudson, 1992) can lead to incongruence between gene trees and species trees, and empirical examples of each process exist (cf. Syring et al., 2007 for an example of incomplete lineage sorting). This realization has prompted the development of approaches designed to estimate species phylogenies despite the process that presumably caused the incongruence. For example, gene tree parsimony (Slowinski and Page, 1999) was developed to account for gene duplication, while the minimization of deep coalescence (MDC; Maddison, 1997), COAL (Degnan and Salter, 2005), and BEST (Edwards et al., 2007, Liu and Pearl, 2007) were designed in part to estimate species phylogeny when the discord between the gene trees and species tree is a result of the incomplete sorting of ancestral polymorphisms.
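
To make the deep coalescence criterion concrete, the following minimal Python sketch counts the extra gene lineages implied when a gene tree is fitted inside a candidate species tree. It is an illustration only and not the implementation used in this study: trees are assumed to be rooted and written as nested tuples, each gene copy is labelled by the species it was sampled from, and the function names (`leaf_set`, `lineages_within`, `deep_coalescences`) are hypothetical.

```python
def leaf_set(tree):
    """Return the set of species labels below a node (a label or a tuple of children)."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaf_set(child) for child in tree))
    return frozenset([tree])


def lineages_within(gene_tree, species):
    """Count the maximal gene-tree clades whose leaves all fall inside `species`."""
    if leaf_set(gene_tree) <= species:
        return 1
    if not isinstance(gene_tree, tuple):
        return 0
    return sum(lineages_within(child, species) for child in gene_tree)


def deep_coalescences(gene_tree, species_tree):
    """Sum the extra lineages (lineages - 1) over every non-root species-tree branch."""
    def visit(node, is_root):
        count = lineages_within(gene_tree, leaf_set(node))
        extra = 0 if is_root else max(count - 1, 0)
        if isinstance(node, tuple):
            extra += sum(visit(child, False) for child in node)
        return extra
    return visit(species_tree, True)


# Hypothetical three-species example.
species_tree = (("A", "B"), "C")
congruent_gene_tree = (("A", "B"), "C")
discordant_gene_tree = (("A", "C"), "B")
print(deep_coalescences(congruent_gene_tree, species_tree))
print(deep_coalescences(discordant_gene_tree, species_tree))
```

Under this criterion, a candidate species tree would be scored by summing the count over all sampled loci, and the tree with the smallest total would be preferred. The sketch scores a single candidate and does not search tree space.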

At the initial stages of divergence, incomplete lineage sorting is ubiquitous and likely produces the majority of gene-species tree discord among closely related lineages. This is a direct outcome of population-level processes; consequently, the developers of methods have incorporated statistical models derived from the coalescent (Kingman, 1982, Hudson, 1990) into species-level phylogenetic analyses to account for these processes. However, for many empirical systems it is also these lineages that exchange migrants, particularly when they occur in sympatry. Since genetic polymorphism shared among lineages can result from either retained ancestral polymorphism or a gene copy introduced into the population via gene flow (Slatkin and Maddison, 1989), it is often difficult to determine which process produced the shared polymorphism. Fully statistical treatments of coalescence, gene flow, and divergence are currently available only for pairwise comparisons between two lineages (Nielsen and Wakeley, 2001, Hey and Nielsen, 2004, Hey and Nielsen, 2007, Hey, 2006).
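
The scale of the problem at recent divergences can be illustrated with the standard three-taxon result from coalescent theory: with one gene copy sampled per species and no gene flow, the probability that the gene tree topology matches the species tree is 1 − (2/3)e^(−T), where T is the length of the internal branch in coalescent units. The short Python sketch below evaluates that expression. It assumes a diploid, constant-size ancestral population (so that T = t / 2Ne for a branch of t generations); the function name `prob_congruent` and the example values are illustrative and are not taken from this study.

```python
import math


def prob_congruent(t_generations, ne):
    """Probability that a three-species gene tree matches the species tree.

    t_generations: length of the internal species-tree branch, in generations.
    ne: diploid effective population size of the ancestral lineage (assumed constant).
    """
    branch = t_generations / (2.0 * ne)  # branch length in coalescent units
    return 1.0 - (2.0 / 3.0) * math.exp(-branch)


# Illustrative internal branch lengths, expressed as multiples of Ne generations.
ne = 1000
for multiple in (0.1, 0.5, 1.0, 5.0):
    p = prob_congruent(multiple * ne, ne)
    print(f"internal branch = {multiple} Ne generations: P(congruent) = {p:.3f}")
```

As the internal branch shrinks toward zero the probability approaches 1/3, the value expected if the three possible topologies were equally likely, which is why very recent divergences are so prone to gene tree discord even in the absence of gene flow.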

It is an understatement to suggest that the biologist who wishes to estimate species phylogeny in a system where details such as (a) the number of lineages, (b) the relationship among lineages, and (c) the amount of gene flow are unclear is currently faced with a difficult task. Methods that estimate a species phylogeny using some approach derived from the coalescent must be robust to at least moderate levels of gene flow (e.g. levels that may not be easily recognizable) to be of any use to the majority of empirical biologists; otherwise, the use of such methods may result in spurious conclusions about the actual pattern of lineage divergence. The data we present in this manuscript were collected out of a desire to explore how the phylogenetic signal contained in DNA sequence data is affected by gene flow in recently diverged lineages. Does gene flow destroy phylogenetic signal entirely, or are some methods able to accurately estimate species phylogeny when some of the shared polymorphisms result from gene flow? In order to explore this issue, we evaluate approaches based on the coalescent that use estimated gene trees as input in an attempt to isolate gene flow as the sole factor affecting phylogenetic accuracy.

Section snippets

Statistical inference of species trees from gene trees

A renewed interest exists in the development and interpretation of statistical methods for the inference of species trees from gene trees (Maddison and Knowles, 2006). A myriad of innovative approaches have been developed (Slatkin and Maddison, 1989, Maddison, 1997, Page and Charleston, 1997, Slowinski and Page, 1999, Liu and Pearl, 2006, Edwards et al., 2007, Carstens and Knowles, 2007), as well as applied to empirical questions in phylogeography and systematics (Knowles and Carstens, 2007,

Performance of ESP-COAL

The type and magnitude of gene flow affected the ability to infer the correct ST topology using ESP-COAL (Fig. 2A). In general, models of historical gene flow did not greatly degrade the phylogenetic accuracy, regardless of the magnitude (Nem = 0.01–1.00) or duration (0.1 × Ne or 0.5 × Ne generations) of gene flow. The probability of identifying the correct ST never dipped below 0.70 for any parameter combination for either the parapatric or allopatric models. In contrast, phylogenetic accuracy was

Explanation of results

Incomplete lineage sorting has emerged as a common problem for phylogenetic inference at the species level. Given the volume of mathematical theory predicting this phenomenon (cf. Pamilo and Nei, 1988, Rosenberg, 2002, Rosenberg, 2003), this may not be surprising. Several methods of inferring species phylogenies from gene trees have incorporated the stochastic process of incomplete lineage sorting (Maddison, 1997, Degnan and Salter, 2005, Liu and Pearl, 2007). While these methods are clearly at

Acknowledgments

We thank Benjamin Hall and Wennie Chou for providing the Rhododendron DNA sequences. Special thanks to Gabriel Rosa and John Liechty for assistance with the Department of Plant Sciences computing cluster located at the University of California, Davis and with PERL scripting. We thank Amy Litt, Jeffrey Oliver, and one anonymous reviewer for providing insightful comments that significantly improved this manuscript.

References (57)

  • R.T. Brumfield et al. Comparison of species tree methods for reconstructing the phylogeny of bearded manakins (Aves: Pipridae: Manacus) from multilocus sequence data. Syst. Biol. (2008)
  • M.D. Carling et al. Integrating phylogenetic and population genetic analyses of multiple loci to test species divergence hypotheses in Passerina buntings. Genetics (2008)
  • B.C. Carstens et al. Estimating phylogeny from gene tree probabilities in Melanoplus grasshoppers despite incomplete lineage sorting. Syst. Biol. (2007)
  • P.A. Cox et al. The Encyclopedia of Rhododendron Species. (1997)
  • J.H. Degnan et al. Gene tree distributions under the coalescent process. Evolution (2005)
  • J.H. Degnan et al. Discordance of species trees with their most likely gene trees. PLoS Genet. (2006)
  • S.V. Edwards et al. High-resolution species trees without concatenation. Proc. Natl. Acad. Sci. USA (2007)
  • L. Excoffier et al. SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J. Hered. (2000)
  • J. Felsenstein. Confidence limits on phylogenies. An approach using the bootstrap. Evolution (1985)
  • J. Felsenstein. Inferring Phylogenies. (2004)
  • W.M. Fitch. Distinguishing homologous from analogous proteins. Syst. Zool. (1970)
  • L. Goetsch et al. The molecular systematics of Rhododendron (Ericaceae): a phylogeny based upon RPB2 gene sequences. Syst. Bot. (2005)
  • M.J. Hickerson et al. MsBayes: a flexible pipeline for comparative phylogeographic inference using approximate Bayesian computation (ABC). BMC Bioinformatics (2007)
  • J. Hey et al. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics (2004)
  • J. Hey et al. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc. Natl. Acad. Sci. USA (2007)
  • R.R. Hudson. Testing the constant-rate neutral allele model with protein sequence data. Evolution (1983)
  • R.R. Hudson. Gene genealogies and the coalescent process.
  • R.R. Hudson. Gene trees, species trees and the segregation of ancestral alleles. Genetics (1992)