Skip to main content
Log in

Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

  • Published:
Journal of Molecular Evolution Aims and scope Submit manuscript

Abstract

Gene duplication and gene loss as well as other biological events can result in multiple copies of genes in a given species. Because of these gene duplication and loss dynamics, in addition to variation in sequence evolution and other sources of uncertainty, different gene trees ultimately present different evolutionary histories. All of this together results in gene trees that give different topologies from each other, making consensus species trees ambiguous in places. Other sources of data to generate species trees are also unable to provide completely resolved binary species trees. However, in addition to gene duplication events, speciation events have provided some underlying phylogenetic signal, enabling development of algorithms to characterize these processes. Therefore, a soft parsimony algorithm has been developed that enables the mapping of gene trees onto species trees and modification of uncertain or weakly supported branches based on minimizing the number of gene duplication and loss events implied by the tree. The algorithm also allows for rooting of unrooted trees and for removal of in-paralogues (lineage-specific duplicates and redundant sequences masquerading as such). The algorithm has also been made available for download as a software package, Softparsmap.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig 1
Fig 2
Fig 3
Fig 4
Fig 5

Similar content being viewed by others

Sujatha Thankeswaran Parvathy, Varatharajalu Udayasuriyan & Vijaipal Bhadana

References

  • Arvestad L, Berglund AC, Lagergren J, Sennblad B (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19:I7–I15

    Article  PubMed  Google Scholar 

  • Arvestad L, Berglund AC, Lagergen J, Sennblad B (2004) Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence analysis. RECOMB 2004:326–335

    Google Scholar 

  • Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) Genbank. Nucleic Acids Res 33:D34–D38

    Article  PubMed  CAS  Google Scholar 

  • Blanchette M, Green ED, Miller W, Haussler D (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 14:2412–2423

    Article  PubMed  CAS  Google Scholar 

  • Chen K, Durand D, Farach-Colton M (2000) Notung: a program for dating gene duplications and optimizing gene family trees. J Comput Biol 7:429–447

    Article  PubMed  CAS  Google Scholar 

  • Cotton JA, Page RDM (2002) Going nuclear: Gene family evolution and vertebrate phylogeny reconciled. Proc Roy Soc London 269:1555–1561

    Article  CAS  Google Scholar 

  • Durand D, Halldorsson BV, Vernot B (2005) A hybrid micro-macroevolutionary approach to gene tree reconstruction. RECOMB 2005:250–264

    Google Scholar 

  • Eulenstein O, Mirkin B, Vingron M (1998) Duplication-based measures of difference between gene and species trees. J Comput Biol 5:135–148

    PubMed  CAS  Google Scholar 

  • Francino MP (2005) An adaptive radiation model for the origin of new gene functions. Nature Genet 37:573–577

    Article  PubMed  CAS  Google Scholar 

  • Galtier N (2001) Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol 18:866–873

    PubMed  CAS  Google Scholar 

  • Garey MR, Johnson DS (1979) Computers and intractability, a guide to the theory of NP-completeness. Freeman, New York

    Google Scholar 

  • Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 28:132–163

    Article  CAS  Google Scholar 

  • Grasso C, Lee C (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20:1546–1556

    Article  PubMed  CAS  Google Scholar 

  • Guigo R, Muchnik I, Smith TF (1996) Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213

    Article  PubMed  CAS  Google Scholar 

  • Hallett MT, Lagergren J (2000) New algorithms for the duplication-loss model. RECOMB 2000:138–146

    Google Scholar 

  • Hallet M, Lagergren J, Tofigh A (2004) Simultaneous identification of duplications and lateral transfers. RECOMB 2004:347–356

    Google Scholar 

  • Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755

    Article  PubMed  CAS  Google Scholar 

  • Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 5(2):R7

    Article  PubMed  Google Scholar 

  • Liberles DA, Schreiber DR, Govindarajan S, Chamberlin SG, Benner SA (2001) The Adaptive Evolution Database (TAED). Genome Biol 2(8):research0028.1-0028.6

    PubMed  CAS  Google Scholar 

  • Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein evolution. Mol Biol Evol 19:1–7

    PubMed  CAS  Google Scholar 

  • Lynch M, O’Hely M, Walsh B, Force A (2001) The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789–1804

    PubMed  CAS  Google Scholar 

  • Ma B, Li M, Zhang LX (2000) From gene trees to species trees. SIAM J Comput 30:729–752

    Article  Google Scholar 

  • Maddison WP (1989) Reconstructing character evolution on polytomous cladograms. Cladistics 5:365–377

    Article  Google Scholar 

  • Ohno S (1970) Evolution by gene duplication. Springer-Verlag, New York

    Google Scholar 

  • Page RDM (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol 43:58–77

    Google Scholar 

  • Page RDM (2000) Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol 14:89–106

    Article  PubMed  CAS  Google Scholar 

  • Page RDM, Cotton (2000) GeneTree: a tool for exploring gene family evolution. In: Sankoff D, Nadeau J (eds) Map alignment, and the evolution of gene families. Kluwer Academic, Dordrecht, pp 525–536

    Google Scholar 

  • Rastogi S, Liberles DA (2005) Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28

    Article  PubMed  CAS  Google Scholar 

  • Roth C, Betts MJ, Steffansson P, Saelensminde G, Liberles DA (2005) The Adaptive Evolution Database (TAED): a phylogeny-based tool for comparative genomics. Nucleic Acids Res 33:D495–D497

    Article  PubMed  CAS  Google Scholar 

  • Siltberg J, Liberles DA (2002) A simple covarion-based approach to analyse nucleotide substitution rates. J Evol Biol 15:588–594

    Article  CAS  Google Scholar 

  • Zhang LX (1997) On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 4:177–187

    Article  PubMed  CAS  Google Scholar 

  • Zmasek CM, Eddy SR (2001a) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17:383–384

    Article  CAS  Google Scholar 

  • Zmasek CM, Eddy SR (2001b) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17:821–828

    Article  CAS  Google Scholar 

  • Zmasek CM, Eddy SR (2002) RIO: Analyzing proteomes by automated phylogenomics using resamples inference of orthologs. BMC Bioinform 3:14

    Article  Google Scholar 

Download references

Acknowledgments

We are grateful to Jens Lagergren for helpful discussions and also to three anonymous reviewers for their comments. This work was funded by Vetenskapsrådet, the Swedish Foundation for Strategic Research, and FUGE, the Norwegian national functional genomics platform.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to David A. Liberles.

Additional information

[Reviewing Editor: Dr. Rafael Zardoya]

Ann-Charlotte Berglund-Sonnhammer and Pär Steffansson contributed equally to this work.

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Definitions and Notation

A tree T is a connected graph with no cycles, and consists of a vertex set V(T) and an edge set E(T). The leaves of a tree are the vertices with degree 1, and the leaf set of T is denoted L(T), which is a subset of V(T). A tree T is rooted if there is exactly one distinguished vertex, the root, which is denoted r(T). For a rooted tree T, a child of a vertex uV(T) is denoted \( c_i^T (u) \), and the set of children for a vertex u in T is denoted C T(u). The rooted subtree of T rooted at uV(T) is denoted T u . The planted subtree rooted at u containing only the child vertex c i (u) is denoted \( T_{uc_i } (u) \). Further, for a nonroot vertex vV(T), let p T(v) be the parent of v. For any u, vV(T), let vT u if vV(T u ) and let v <T u if \( v \in V(T_u )\backslash \{ u\} \). A rooted tree T is binary if all interior vertices have two children and an unrooted tree is binary if all interior vertices are connected to three other vertices. For a tree T, d = (L 1, L 2) is a split in T if there exists an edge eE(T) such that L 1, L 2 are the two leaf sets of the trees formed when e is removed. Given a tree T and a set of vertices \( V^{\prime} \subseteq V(T) \), let LCA T(V′) be the last common ancestor of the vertices in V′. Tree indexes are omitted whenever it is clear from the context, and trees are rooted if else is not specified.

A species tree S is a phylogenetic tree that describes the relationship between extant species. A gene tree G is a phylogenetic tree with a leaf labeling function σ:L(G)→L(S). The leaves in G represent genes and the leaves in S represent extant species, and thus a gene gL(G) belongs to the genome of species σ(g). We denote the set of all extant genomes that are represented by the leaves of the gene tree \( \sum {(G)} = \cup _{g \in L(G)} \sigma (g) \). For two rooted gene trees G and G′, the leaf g 1L(G) equals the leaf g 2L(G′) if g 1 and g 2 are representing the same gene. The internal vertex g 1 V(G)\L(G) equals g 2V(G′)\L(G′) if \( L(G_{g_1 } ) = L(G^{\prime}_{g_2 } ). \) Further, G is said to refine G′ if for every g 2V(G′) there exists a g 1V(G) such that g 1 = g 2. Goodman et al. (1979) introduced a mapping m, mapping the vertices in the gene tree to the vertices in the species tree. More precisely, given a species tree S and a gene tree G, m is a surjective mapping such that for all, \( g\in V(G),m(g) = LCA^S (\sum {(G_g )} ) \).

Gene Duplication and Gene Loss

Here we introduce a new soft parsimony mapping for uncertain species trees and binary gene trees. Further, we define how to compute gene duplication and gene loss events using this mapping.

Given a species tree S and a rooted binary gene tree G such that Σ(G) = L(S), let M be the mapping M: V(G)→2V(S) such that

  1. 1.

    For any gV(G) such that m(g) ∈ L(S), M(g) = {m(g)}.

  2. 2.

    For any gV(G) such that m(g) ∉ L(S), \( M(g)= {\left\{ x \in C^S (m(g))|\exists g^{\prime} \in V(G), g^{\prime} < ^G g \wedge m(g^{\prime}) \le^s x \right\}} \) and define \( Z(g) = \cup _{s \in M(g)} L(S_s ) \).

Each mapping M for a given gene tree and species tree describes a hypothesis of how the gene tree evolved with respect to the species tree and, thus, which gene tree vertices are duplication events and which are speciation events.

For any interior vertex g 1V(G)\L(G) the number of duplications associate with g is

$$ dup^{(S,G)} (g) =\left\{\matrix{1 \quad {\rm if} \, |Z(c_1 (g)) \cap Z(c_2 (g))| > 0 \cr 0 \quad {\rm otherwise}}\right. $$
(A1)

and the total number of duplications for G is

$$ Dup^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } dup^{(S,G)} (g) $$
(A2)

The number of losses associated with a vertex gV(G) and one of its two children c i (g)∈V(G) is defined as

$$ loss_1 (g,c_i (g)) = |\{ s \in V(S)\} |m(c_i (g)) <^s s <^s m(g)\} | $$
(A3)
$$ loss_2 (g,c_i (g)) =\left\{\matrix{1\quad {\rm if}\, m(c_i(g)) < ^s m(g) \wedge |C^s \cr (m(c_i (g)))\backslash M(c_i (g))| > 0 \cr 0 \quad {\rm otherwise}}\right. $$
(A4)
$$ loss_3 (g,c_i (g))=\left\{\matrix{1\quad {\rm if}\ dup^{(S,G)}\ (g) = 1 \cr \wedge M(g) \ne M(c_i (g)) \cr 0\quad {\rm otherwise}}\right. $$
(A5)

Then the number of losses assigned to an interior vertex gV(G)\L(G) and the total number of losses in the gene tree is

$$ \eqalign{loss^{(S,G)} (g) = &\sum\nolimits_{i = 1}^2 {(loss_1 (g,c_i (g))} \cr &+ loss_2 (g,c_i (g)) + loss_3 (g,c_i (g)))} $$
(A6)

and

$$ Loss^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } loss^{(S,G)} (g) $$
(A7)

Uncertain Gene Trees

In many real gene trees, vertices with more then two children exist. Here we present a heuristic algorithm for computing the minimum number of duplications and losses over the set of all gene trees refining the given gene tree. Duplications are prioritized before losses.

Let G be any gene tree, and let S be a species tree, such that Σ(G) = L(S). The set of binary gene trees that refine G is denoted BG G. Moreover, the set of trees in BG G that have the minimum number of duplications is denoted \( BG_{min }^{(S,\;G)} \) and can be expressed as \( BG_{min}^{(S,\;G)}=\{ H^{\prime} \in BG^G\mathop{|\forall}\limits^{\ \ min} H \in BG^G,\, {Dup^{(S,\;H^{\prime})}} \le {Dup^{(S,\;H)}}\mathop{\}}\limits^{min}\)

The minimization of the number of losses is constrained by the minimum number of duplications. Thus, the numbers of duplications and losses are defined as

\( Dup_{min }^{(S,\;G)} =Dup_{}^{(S,\;H)} \;{\rm where} \;H \in BG^{(S,\;G)} \) and \( Loss_{min}^{(S,\;G)} =Minimum_{H \in BG_{min}^{(S,\;G) Loss^{(S,\;H)}}} \)

Computing the number of duplications \( Dup_{min }^{(S,\;G)} \) is NP-complete (see Theorem A4), but for our heuristic approach, described below, duplications can be computed in polynomial time.

Algorithms

Since our model assumes that the number of duplications at a vertex gV(G)\L(G) is independent of the number of duplications at any other vertex g′ ∈ V(G)\L(G), we can calculate the number of duplications at the vertices of G in any order. The algorithm computing duplications is based on Theorem A1.

Theorem A1. Let S be a species tree, let G be a gene tree such that Σ(G) = L(S). For all gV(G)\L(G), denote A(g) to be all possible partitions of the set C G(g) = {g 1 ,...,g n }, let \( A^{\prime} (g) = \{ PA \in A(g)\left| {\forall D \in PA,\;\forall g_i ,\;g_j } \right. \in D;\;Z(g_i ) \cap Z(g_j ) = \phi \; \vee i = j\} \) and let \( A^{\prime} _{min}(g) = \{ PA \in A^{\prime} (g)\left| {\forall PA^{\prime} \in \;A^{\prime} } \right.(g);PA \le PA{^\prime} \} \). Then \( Dup_{min}^{(S,\;G)} = {\sum_{g \in V(G)\backslash L(G)} {\left| {PA(g)} \right|} - 1} \) where PA(g)A min (g). In order to prove Theorem A1 the following lemmas are needed.

Lemma A2. Given a gene tree G and a species tree S such that Σ(G) ⫅ ⊆ L(S), then for all g, g 1 = V (G), such that g = p (g 1 ) it holds that Z(g 1) ⫅ ⊆ Z(g).

Proof. Let G be a gene tree and S a species tree such that Σ(G) ⫅ ⊆ L(S), and let g, g 1V(G), such that g = p (g 1). Then the vertex set of \( G_{g_1 } \) is a subset of the vertex set of G g , which by the definition of the set Σ gives Σ(\( G_{g_1 } \)) ⫅ ⊆ Σ(G g ), and further m(g 1) ≤S m(g). Consequently, for all sM (g 1) there exists s′∈M(g) for which SS S′, and thus Z(g 1) ⫅ ⊆ Z(g).

For a species tree S, a gene tree G, a gene tree G′∈BG G, let the notation (g 11,g 12:g 1)g 2 :g 0 be an up triplet in G′ if g 0, g 1, g 2, g 11, g 12∈V(G′), g 1 = p (g 11) = p(g 12), g0 = p(g 1) = p(g 2), g 1∈V(G′)\V(G), dup (S, G′) (g 1) = 1, and dup (S, G′)(g 0) = 0.

Lemma A3. Let S be a species tree and let G be a gene tree such that Σ(G) = L(S), then for every gene tree G′∈BG G there exists a gene tree G′′∈BG G such that Dup (S, G′) = Dup (S, G′′) and G′′ do not contain any up triplets.

Proof. G′′ is computed through following steps.

  1. l.

    Let G n = G′, where n = 1.

  2. 2.

    Find an up triplet (g 11, g 12:g 1) g 2:g 0 in G n. If no up triplet exists, then G′′ = G n.

  3. 3.

    We know that \( dup^{( {S,G^n})} ( {g_1 } ) = 1 \), \( dup^{( {S,G^n })} ({g_0 }) = 0 \), and since \( dup^{( {S,G^n})} ( {g_1 } ) = 1 \) it follows that \( \left| {Z\left( {g_{11} } \right) \cap Z\left( {g_{12} } \right)} \right| > 0 \). According to Lemma A2 \( Z\left( {g_{12} } \right) \subseteq Z\left( {g_1 } \right) \), and since \( dup^{( {S,G^n })} ({g_0 }) = 0 \), it follows that \( 0 = \left| {Z\left( {g_1 } \right) \cap Z\left( {g_2 } \right)} \right| \ge \left| {Z\left( {g_{12} } \right) \cap Z\left( {g_2 } \right)} \right| = 0 \). Consequently, it is possible to move the duplication associated with vertex g 1 one edge closer to the root of G n by creating the following tree. Let G n+1 denote the tree obtained when we take G n and exchange places with the rooted subtrees \( G_{g_{11} }^n \) and \( G_{g_2 }^n \). Note that \( Dup^{\left( {S,G^n } \right)} = Dup^{\left( {S,G{n + 1} } \right)} \). Set n: = n + 1 and resume at 2.

Theorem A1 can now be proven.

Proof. According to Lemma A3, we know that for any \( G^{\prime} \in BS_{min}^{\left( {S,G} \right)} \) it is possible to compute gene tree \( G^{\prime \prime }\in BS_{min}^{\left( {S,G} \right)} \) such that no up triplet exists in G′′. Now for every gene vertex gV(G), let PA(g) be the partition computed using G′′ in following steps.

  1. 1.

    Locate the gene vertices g′, g1, ⋖ , g n N(G′′) such that g′ = g, g1 = g 1, ⋖ , g n = g n where C G(g) = {g 1, ⋖ , g n }.

  2. 2.

    If \( dup^{( {S, G^{\prime \prime}})} ( g^{\prime} ) = 0 \), then let PA(g) = {g 1 , ⋖ , g n } and we are done.

  3. 3.

    Let L be list such that L = {g 1 , ⋖ , g n } and PA′ (g) be an empty set.

  4. 4.

    Set g t to be the first gene vertex in L.

  5. 5.

    If \( dup^{( {S{\rm{,}}G{^{\prime \prime }} } )} ( {p( {g^{\prime} _t } )} ) = 0 \), set g t to be p(g t ) and resume at 5.

  6. 6.

    Let \( D^{\prime} _i = \left\{ {g^{\prime \prime} \in \left\{ {g^{\prime} _1 , \ldots ,g^{\prime} _n } \right\}\left| {g^{{\prime \prime}}\le } \right.g^{\prime }_t } \right\} \), add D i , to PA′(g), and remove the gene vertices in D i from L. If |L| > 0, resume at 4.

From Lemma A3 nd the above steps it holds that PA (g) ∈ A′(g). Furthermore, if PA (g) ∉ A min (g), then a binary gene tree with fewer duplications then G′′ could be constructed, but \( G^{{\prime \prime }}\in BS_{min}^{\left( {S,G} \right)} \) and thus it must hold that PA (g) ∈ A min (g). Consequently, for every gene tree \( G^{\prime }\in BS_{min}^{\left( {S,G} \right)} \) and for all gN (G)\L(G), there exists a minimum partition PA (g) ∈ A min (g) such that \( Dup_{\min }^{\left( {S,G} \right)} = Dup_{\min }^{\left( {S,G^{\prime} } \right)} = \sum\nolimits_{g \in N\left( G \right)\backslash L\left( G \right)} {\left| {PA\left( g \right)} \right|} - 1 \).

Inferring the number of duplications at a gene tree vertex

Let G be gene tree and S species tree, such that σ(G) = L(S). Then for gene tree vertex gV (G)\L(G), let A g = C G(g) and let P g be an empty list of sets of gene tree vertices.

  1. 1.

    For all gene tree vertices in A g ,

  2. 2.

    Pick a gene tree vertex g i A g such that for all g j A g it holds that \( | {Z\left( {g_j } \right)} | \le | {Z\left( {g_i } \right)} | \).

    1. 2.1.

      Put g i in the first set p k P g such that for all other gene tree vertices g k in p k it holds tha \( Z\left( {g_k } \right) \cap Z\left( {g_i } \right) = \phi \).

    2. 2.2.

      Let A g = A g \{g i }.

    3. 2.3.

      If A g is empty goto 4, otherwise goto 2.

  3. 3.

    Set A g = C G(g) and P g to an empty list of sets. For all gene tree vertices in A g ,

    1. 3.1.

      pick a gene tree vertex g i A g at random.

    2. 3.2.

      Put g i in the first set p k of P g such that for all other gene tree vertices g k in p k it holds that \( Z\left( {g_k } \right) \cap Z\left( {g_i } \right) = \phi \).

  4. 4.

    Let g i(k) denote gene vertex g i in set p k , if \( \left| { \cap _k \left( { \cup _i Z\left( {g_{i\left( k \right)} } \right)} \right)} \right| \ge 1 \) then a minimum partition is reached. The number of duplications equals the number of sets in P g minus one.

  5. 5.

    If it is possible to create a set L g of gene vertices by taking exactly one gene vertex from every set p k in P g such that for any two gene vertices \( g_i ,\;g_j \in L_g \left| {Z\left( {g_k } \right) \cap Z\left( {g_i } \right)} \right| \ge 1 \), then a minimum partition is reached. The number of duplications equals the number of sets in P g minus one.

  6. 6.

    If we have not reached the maximum number of rebuilds, goto 3. Otherwise the method has failed.

Inferring the number of losses at a gene tree vertex

Let G be a gene tree and S a species tree, such that Σ(G) = L(S). Then for a gene tree vertex gV(G)\L(G), let P g be the partition as given from the duplication algorithm. The objective is to build a rooted binary gene tree \( G_{g_0 } \) with \( L\left( {G_{g_0 } } \right) = C^G \left( g \right) \), such that for every \( g_i \in L\left( {G_{g_0 } } \right) \) and its corresponding g j C G(g), it holds that M(g i ) = M(g j ). We wish to construct a gene tree \( G_{g_0 } \) such that the number of losses is less then or equal to any other binary gene tree constructed with the vertices in C G(g) as leaves. However, the method proposed here is approximate, which means that there might exist another binary gene tree constructed with the same set of leaves that gives fewer losses. The gene tree \( G_{g_0 } \) is then used in equations (A3)–(A5) to calculate the minimum number of losses associated with the gene tree vertex g. Let L g be an empty sorted set of gene vertices such that for any two gene vertices g i , g j , ∈ L g , g i is before g j if \( \left| {Z( {g_i }) } \right| | {Z( {g_j } )} | \).

  1. 1.

    For every set in P g ,

    1. 1.1.

      build a gene tree \( G_{g_i } \) using the splits found in the species tree. Denote the root g i and add g i to L g .

  2. 2.

    If there is more then one gene vertex in L g resume at 3. Else,

    1. 2.1.

      let g 0 be the gene vertex in L g ,

    2. 2.2.

      set \( G_{g_0 } = G_{g_i } \) and find locally the mapping M that puts the duplications as close to the leaves as possible. Resume at 5.

  3. 3.

    Let j = 1.

    1. 3.1.

      Let g 0, g j L g be the roots of the first pair of subtrees for which a new gene tree \( G_{g_k } \) can be constructed by adding a gene tree vertex g k and two edges such that g k = p(g 0) = p(g j ).

    2. 3.2.

      For \( G_{g_k } \) find locally the mapping M that puts the duplications one step closer to the leaves. If no such mapping can be found, let j = j + l and resume at 3. l.

    3. 3.3.

      Remove g 0, g j from L g and add g k to L g . Resume at 2.

  4. 4.

    Let g 0, g 1L g be the roots of the first pair of subtrees in L g . Construct a new gene tree \( G_{g_k } \) by adding a gene tree vertex g k and two edges such that g k = p(g 0) = p(g 1).

    1. 4.1.

      Remove g 0, g 1 from L g and add g k to L g . Resume at 2.

  5. 5.

    Compute the losses of \( G_{g_0 } \) by using equation (A7).

In different steps of the loss algorithm it says that we find locally the mapping M that puts duplications as close to the leaves as possible. That is done in the following way. For each triplet (g 11, g 12:g 1) g 2:g 0, i.e., rooted tree with three leaves, in \( G_{g_0 } \) the mapping M is found such that duplications are assigned to vertices as close to the leaves as possible. This is successful if the root of the triplet is labeled as a duplication event and g 1 is not, and if we can swap g 1i and g 2 without increasing the number of duplication events occurring in the triplet.

NP-Completeness

The problem of computing the minimum number of duplications over the set of all gene trees refining a given gene tree using the soft parsimony mapping can be formulated as follows.

Uncertain Gene Tree (UGT)

Instance: A species tree S, a gene tree G such that Σ(G) = L(S), and an integer D. Question: Does a gene tree G′ exist such that G′ refines G and Dup (S,G′)D?

Theorem A4. UGT is NP-complete.

Proof. UGT belong to NP since given an instance of the problem and a certificate in the form of a binary gene tree G′, the answer of the question is Dup (S,G′)D which can be computed in polynomial time. To prove it to be NP-hard, we reduce the Partition into Cliques problem (Garey and Johnson 1979) to UGT.

The Partition into Cliques problem

Given an undirected graph P = (V, E) and a positive integer K ≤ |V|, can the vertices in P be partitioned into kK disjoint sets V l,V 2,⋖,V k such that for each V i , i = l, 2,⋖,k, the subgraph P i = (V i , E i ) induced by V i is a clique (Garey and Johnson 1979)?

Let P = (V, E) be any graph. Moreover, let S be a species tree and G a gene tree, for which there exists a bijection γ:VC(r(G)) such that for all v i , v j V, it holds that \( \left| {Z\left( {\gamma \left( {v_i } \right)} \right) \cap Z\left( {\gamma \left( {v_j } \right)} \right)} \right| = 0 \) if and only if the edge (v i , v j ) exists in the edge set of P. A rooted gene tree G and a rooted species tree S that satisfy these conditions can be constructed as follows. Let P′ = (V′, E′) be the complement of P. Let the species tree S have one internal vertex, the root s = r(S), such that the child set of s is the leaves of S, i.e., C S(s) = L(S), and define an injective function ρ that maps the edges in the graph P to the leaves of the species tree, i.e., ρ:E′→L(S). The gene tree G has root vertex g = r(G), where \( \left| {C^G \left( g \right)} \right| = \left| {V^{\prime }} \right| \) and γ(v i ) = g i for all g i C G(g). For every g i C G(g) add a gene vertex g ir for every edge e r = (v i , v j ) ∈ E′ and set σ(g ir ) = ρ(e r ). For all g i C G(g) such that \( \left| {C^G \left( {g_i } \right)} \right| < 2 \) add a gene vertex g ij under g i and a species vertex s j under the root vertex s, and set σ(g ij ) = s j . Furthermore, for any certificate V l,V 2,⋖,V k , a certificate G′ can be constructed such that \( Dup^{(S,G^{\prime })} + 1 \le \left| {\left\{ {V_{\rm{1}} ,V_2 , \ldots ,V_k } \right\}} \right| \). Construct a gene tree G′′ refining G, by, for every \( V_i \in \left\{ {V_1 ,V_2 , \ldots,V_k} \right\} \), adding a gene vertex g i under the root g′ = r(G′′) and, for every v r V i , adding \( G_{\gamma \left( {v_r } \right)} \) under g i . Then let G′ be any binary gene tree refining G′′. Thus, for any graph P and any positive integer K ≤ |V|, an integer D = K − 1, a gene tree G, and a species tree S can be constructed in polynomial time such that UGT gives the same answer as Partition into Cliques.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Berglund-Sonnhammer, AC., Steffansson, P., Betts, M.J. et al. Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony. J Mol Evol 63, 240–250 (2006). https://doi.org/10.1007/s00239-005-0096-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00239-005-0096-1

Keywords

Navigation