Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Berglund-Sonnhammer, Ann-Charlotte; Steffansson, Pär; Betts, Matthew J.; Liberles, David A.

doi:10.1007/s00239-005-0096-1

Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Published: 07 July 2006

Volume 63, pages 240–250, (2006)
Cite this article

Journal of Molecular Evolution Aims and scope Submit manuscript

Ann-Charlotte Berglund-Sonnhammer^1,2,
Pär Steffansson¹,
Matthew J. Betts³^nAff4 &
…
David A. Liberles^1,3^nAff5

346 Accesses
60 Citations
Explore all metrics

Abstract

Gene duplication and gene loss as well as other biological events can result in multiple copies of genes in a given species. Because of these gene duplication and loss dynamics, in addition to variation in sequence evolution and other sources of uncertainty, different gene trees ultimately present different evolutionary histories. All of this together results in gene trees that give different topologies from each other, making consensus species trees ambiguous in places. Other sources of data to generate species trees are also unable to provide completely resolved binary species trees. However, in addition to gene duplication events, speciation events have provided some underlying phylogenetic signal, enabling development of algorithms to characterize these processes. Therefore, a soft parsimony algorithm has been developed that enables the mapping of gene trees onto species trees and modification of uncertain or weakly supported branches based on minimizing the number of gene duplication and loss events implied by the tree. The algorithm also allows for rooting of unrooted trees and for removal of in-paralogues (lineage-specific duplicates and redundant sequences masquerading as such). The algorithm has also been made available for download as a software package, Softparsmap.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV bacterial viruses subcommittee

Article Open access 23 January 2023

Dann Turner, Andrey N. Shkoporov, … Evelien M. Adriaenssens

A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome

Codon usage bias

Article 25 November 2021

Sujatha Thankeswaran Parvathy, Varatharajalu Udayasuriyan & Vijaipal Bhadana

References

Arvestad L, Berglund AC, Lagergren J, Sennblad B (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19:I7–I15
Article PubMed Google Scholar
Arvestad L, Berglund AC, Lagergen J, Sennblad B (2004) Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence analysis. RECOMB 2004:326–335
Google Scholar
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) Genbank. Nucleic Acids Res 33:D34–D38
Article PubMed CAS Google Scholar
Blanchette M, Green ED, Miller W, Haussler D (2004) Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res 14:2412–2423
Article PubMed CAS Google Scholar
Chen K, Durand D, Farach-Colton M (2000) Notung: a program for dating gene duplications and optimizing gene family trees. J Comput Biol 7:429–447
Article PubMed CAS Google Scholar
Cotton JA, Page RDM (2002) Going nuclear: Gene family evolution and vertebrate phylogeny reconciled. Proc Roy Soc London 269:1555–1561
Article CAS Google Scholar
Durand D, Halldorsson BV, Vernot B (2005) A hybrid micro-macroevolutionary approach to gene tree reconstruction. RECOMB 2005:250–264
Google Scholar
Eulenstein O, Mirkin B, Vingron M (1998) Duplication-based measures of difference between gene and species trees. J Comput Biol 5:135–148
PubMed CAS Google Scholar
Francino MP (2005) An adaptive radiation model for the origin of new gene functions. Nature Genet 37:573–577
Article PubMed CAS Google Scholar
Galtier N (2001) Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol 18:866–873
PubMed CAS Google Scholar
Garey MR, Johnson DS (1979) Computers and intractability, a guide to the theory of NP-completeness. Freeman, New York
Google Scholar
Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 28:132–163
Article CAS Google Scholar
Grasso C, Lee C (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20:1546–1556
Article PubMed CAS Google Scholar
Guigo R, Muchnik I, Smith TF (1996) Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213
Article PubMed CAS Google Scholar
Hallett MT, Lagergren J (2000) New algorithms for the duplication-loss model. RECOMB 2000:138–146
Google Scholar
Hallet M, Lagergren J, Tofigh A (2004) Simultaneous identification of duplications and lateral transfers. RECOMB 2004:347–356
Google Scholar
Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755
Article PubMed CAS Google Scholar
Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 5(2):R7
Article PubMed Google Scholar
Liberles DA, Schreiber DR, Govindarajan S, Chamberlin SG, Benner SA (2001) The Adaptive Evolution Database (TAED). Genome Biol 2(8):research0028.1-0028.6
PubMed CAS Google Scholar
Lopez P, Casane D, Philippe H (2002) Heterotachy, an important process of protein evolution. Mol Biol Evol 19:1–7
PubMed CAS Google Scholar
Lynch M, O’Hely M, Walsh B, Force A (2001) The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789–1804
PubMed CAS Google Scholar
Ma B, Li M, Zhang LX (2000) From gene trees to species trees. SIAM J Comput 30:729–752
Article Google Scholar
Maddison WP (1989) Reconstructing character evolution on polytomous cladograms. Cladistics 5:365–377
Article Google Scholar
Ohno S (1970) Evolution by gene duplication. Springer-Verlag, New York
Google Scholar
Page RDM (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol 43:58–77
Google Scholar
Page RDM (2000) Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Mol Phylogenet Evol 14:89–106
Article PubMed CAS Google Scholar
Page RDM, Cotton (2000) GeneTree: a tool for exploring gene family evolution. In: Sankoff D, Nadeau J (eds) Map alignment, and the evolution of gene families. Kluwer Academic, Dordrecht, pp 525–536
Google Scholar
Rastogi S, Liberles DA (2005) Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28
Article PubMed CAS Google Scholar
Roth C, Betts MJ, Steffansson P, Saelensminde G, Liberles DA (2005) The Adaptive Evolution Database (TAED): a phylogeny-based tool for comparative genomics. Nucleic Acids Res 33:D495–D497
Article PubMed CAS Google Scholar
Siltberg J, Liberles DA (2002) A simple covarion-based approach to analyse nucleotide substitution rates. J Evol Biol 15:588–594
Article CAS Google Scholar
Zhang LX (1997) On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol 4:177–187
Article PubMed CAS Google Scholar
Zmasek CM, Eddy SR (2001a) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17:383–384
Article CAS Google Scholar
Zmasek CM, Eddy SR (2001b) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17:821–828
Article CAS Google Scholar
Zmasek CM, Eddy SR (2002) RIO: Analyzing proteomes by automated phylogenomics using resamples inference of orthologs. BMC Bioinform 3:14
Article Google Scholar

Download references

Acknowledgments

We are grateful to Jens Lagergren for helpful discussions and also to three anonymous reviewers for their comments. This work was funded by Vetenskapsrådet, the Swedish Foundation for Strategic Research, and FUGE, the Norwegian national functional genomics platform.

Author information

Matthew J. Betts
Present address: EMBL Heidelberg, Meyerhoffstrasse 1, 69117, Heidelberg, Germany
David A. Liberles
Present address: Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, 82071, USA

Authors and Affiliations

Stockholm Bioinformatics Center, Stockholm University, 10691, Stockholm, Sweden
Ann-Charlotte Berglund-Sonnhammer, Pär Steffansson & David A. Liberles
Linnaeus Centre for Bioinformatics, Uppsala University, BMC Box 598, 75124, Uppsala, Sweden
Ann-Charlotte Berglund-Sonnhammer
Computational Biology Unit, BCCS, University of Bergen, 5020, Bergen, Norway
Matthew J. Betts & David A. Liberles

Authors

Ann-Charlotte Berglund-Sonnhammer
View author publications
You can also search for this author in PubMed Google Scholar
Pär Steffansson
View author publications
You can also search for this author in PubMed Google Scholar
Matthew J. Betts
View author publications
You can also search for this author in PubMed Google Scholar
David A. Liberles
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to David A. Liberles.

Additional information

[Reviewing Editor: Dr. Rafael Zardoya]

Ann-Charlotte Berglund-Sonnhammer and Pär Steffansson contributed equally to this work.

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Definitions and Notation

A tree T is a connected graph with no cycles, and consists of a vertex set V(T) and an edge set E(T). The leaves of a tree are the vertices with degree 1, and the leaf set of T is denoted L(T), which is a subset of V(T). A tree T is rooted if there is exactly one distinguished vertex, the root, which is denoted r(T). For a rooted tree T, a child of a vertex u ∈ V(T) is denoted $ c_i^T (u) $, and the set of children for a vertex u in T is denoted C ^T(u). The rooted subtree of T rooted at u ∈ V(T) is denoted T _u. The planted subtree rooted at u containing only the child vertex c _i(u) is denoted $ T_{uc_i } (u) $. Further, for a nonroot vertex v ∈ V(T), let p ^T(v) be the parent of v. For any u, v ∈ V(T), let v ≤ ^T u if v ∈ V(T _u) and let v <^T u if $ v \in V(T_u )\backslash \{ u\} $. A rooted tree T is binary if all interior vertices have two children and an unrooted tree is binary if all interior vertices are connected to three other vertices. For a tree T, d = (L ₁, L ₂) is a split in T if there exists an edge e ∈ E(T) such that L ₁, L ₂ are the two leaf sets of the trees formed when e is removed. Given a tree T and a set of vertices $ V^{\prime} \subseteq V(T) $, let LCA ^T(V′) be the last common ancestor of the vertices in V′. Tree indexes are omitted whenever it is clear from the context, and trees are rooted if else is not specified.

A species tree S is a phylogenetic tree that describes the relationship between extant species. A gene tree G is a phylogenetic tree with a leaf labeling function σ:L(G)→L(S). The leaves in G represent genes and the leaves in S represent extant species, and thus a gene g ∈ L(G) belongs to the genome of species σ(g). We denote the set of all extant genomes that are represented by the leaves of the gene tree $ \sum {(G)} = \cup _{g \in L(G)} \sigma (g) $. For two rooted gene trees G and G′, the leaf g ₁∈L(G) equals the leaf g ₂ ∈ L(G′) if g ₁ and g ₂ are representing the same gene. The internal vertex g ₁∈V(G)\L(G) equals g ₂∈V(G′)\L(G′) if $ L(G_{g_1 } ) = L(G^{\prime}_{g_2 } ). $ Further, G is said to refine G′ if for every g ₂∈V(G′) there exists a g ₁∈V(G) such that g ₁ = g ₂. Goodman et al. (1979) introduced a mapping m, mapping the vertices in the gene tree to the vertices in the species tree. More precisely, given a species tree S and a gene tree G, m is a surjective mapping such that for all, $ g\in V(G),m(g) = LCA^S (\sum {(G_g )} ) $.

Gene Duplication and Gene Loss

Here we introduce a new soft parsimony mapping for uncertain species trees and binary gene trees. Further, we define how to compute gene duplication and gene loss events using this mapping.

Given a species tree S and a rooted binary gene tree G such that Σ(G) = L(S), let M be the mapping M: V(G)→2^V(S) such that

1.
For any g ∈ V(G) such that m(g) ∈ L(S), M(g) = {m(g)}.
2.
For any g ∈ V(G) such that m(g) ∉ L(S), $ M(g)= {\left\{ x \in C^S (m(g))|\exists g^{\prime} \in V(G), g^{\prime} < ^G g \wedge m(g^{\prime}) \le^s x \right\}} $ and define $ Z(g) = \cup _{s \in M(g)} L(S_s ) $.

Each mapping M for a given gene tree and species tree describes a hypothesis of how the gene tree evolved with respect to the species tree and, thus, which gene tree vertices are duplication events and which are speciation events.

For any interior vertex g ₁∈V(G)\L(G) the number of duplications associate with g is

$$ dup^{(S,G)} (g) =\left\{\matrix{1 \quad {\rm if} \, |Z(c_1 (g)) \cap Z(c_2 (g))| > 0 \cr 0 \quad {\rm otherwise}}\right. $$

(A1)

and the total number of duplications for G is

$$ Dup^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } dup^{(S,G)} (g) $$

(A2)

The number of losses associated with a vertex g∈V(G) and one of its two children c _i(g)∈V(G) is defined as

$$ loss_1 (g,c_i (g)) = |\{ s \in V(S)\} |m(c_i (g)) <^s s <^s m(g)\} | $$

(A3)

$$ loss_2 (g,c_i (g)) =\left\{\matrix{1\quad {\rm if}\, m(c_i(g)) < ^s m(g) \wedge |C^s \cr (m(c_i (g)))\backslash M(c_i (g))| > 0 \cr 0 \quad {\rm otherwise}}\right. $$

(A4)

$$ loss_3 (g,c_i (g))=\left\{\matrix{1\quad {\rm if}\ dup^{(S,G)}\ (g) = 1 \cr \wedge M(g) \ne M(c_i (g)) \cr 0\quad {\rm otherwise}}\right. $$

(A5)

Then the number of losses assigned to an interior vertex g∈V(G)\L(G) and the total number of losses in the gene tree is

$$ \eqalign{loss^{(S,G)} (g) = &\sum\nolimits_{i = 1}^2 {(loss_1 (g,c_i (g))} \cr &+ loss_2 (g,c_i (g)) + loss_3 (g,c_i (g)))} $$

(A6)

and

$$ Loss^{(S,G)} = \sum {_{g \in V(G)\backslash L(G)} } loss^{(S,G)} (g) $$

(A7)

Uncertain Gene Trees

In many real gene trees, vertices with more then two children exist. Here we present a heuristic algorithm for computing the minimum number of duplications and losses over the set of all gene trees refining the given gene tree. Duplications are prioritized before losses.

Let G be any gene tree, and let S be a species tree, such that Σ(G) = L(S). The set of binary gene trees that refine G is denoted BG ^G. Moreover, the set of trees in BG ^G that have the minimum number of duplications is denoted $ BG_{min }^{(S,\;G)} $ and can be expressed as $ BG_{min}^{(S,\;G)}=\{ H^{\prime} \in BG^G\mathop{|\forall}\limits^{\ \ min} H \in BG^G,\, {Dup^{(S,\;H^{\prime})}} \le {Dup^{(S,\;H)}}\mathop{\}}\limits^{min}$

The minimization of the number of losses is constrained by the minimum number of duplications. Thus, the numbers of duplications and losses are defined as

$ Dup_{min }^{(S,\;G)} =Dup_{}^{(S,\;H)} \;{\rm where} \;H \in BG^{(S,\;G)} $ and $ Loss_{min}^{(S,\;G)} =Minimum_{H \in BG_{min}^{(S,\;G) Loss^{(S,\;H)}}} $

Computing the number of duplications $ Dup_{min }^{(S,\;G)} $ is NP-complete (see Theorem A4), but for our heuristic approach, described below, duplications can be computed in polynomial time.

Algorithms

Since our model assumes that the number of duplications at a vertex g ∈ V(G)\L(G) is independent of the number of duplications at any other vertex g′ ∈ V(G)\L(G), we can calculate the number of duplications at the vertices of G in any order. The algorithm computing duplications is based on Theorem A1.

Theorem A1. Let S be a species tree, let G be a gene tree such that Σ(G) = L(S). For all g ∈ V(G)\L(G), denote A(g) to be all possible partitions of the set C ^G(g) = {g ₁ ,...,g _n }, let $ A^{\prime} (g) = \{ PA \in A(g)\left| {\forall D \in PA,\;\forall g_i ,\;g_j } \right. \in D;\;Z(g_i ) \cap Z(g_j ) = \phi \; \vee i = j\} $ and let $ A^{\prime} _{min}(g) = \{ PA \in A^{\prime} (g)\left| {\forall PA^{\prime} \in \;A^{\prime} } \right.(g);PA \le PA{^\prime} \} $. Then $ Dup_{min}^{(S,\;G)} = {\sum_{g \in V(G)\backslash L(G)} {\left| {PA(g)} \right|} - 1} $ where PA(g)∈ A′_min(g). In order to prove Theorem A1 the following lemmas are needed.

Lemma A2. Given a gene tree G and a species tree S such that Σ(G) ⫅ ⊆ L(S), then for all g, g ₁ = V (G), such that g = p (g ₁) it holds that Z(g ₁) ⫅ ⊆ Z(g).

Proof. Let G be a gene tree and S a species tree such that Σ(G) ⫅ ⊆ L(S), and let g, g ₁∈V(G), such that g = p (g ₁). Then the vertex set of $ G_{g_1 } $ is a subset of the vertex set of G _g, which by the definition of the set Σ gives Σ($ G_{g_1 } $) ⫅ ⊆ Σ(G _g), and further m(g ₁) ≤^S m(g). Consequently, for all s∈M (g ₁) there exists s′∈M(g) for which S ≤^S S′, and thus Z(g ₁) ⫅ ⊆ Z(g).

For a species tree S, a gene tree G, a gene tree G′∈BG ^G, let the notation (g ₁₁,g ₁₂:g ₁)g ₂ :g ₀ be an up triplet in G′ if g ₀, g ₁, g ₂, g ₁₁, g ₁₂∈V(G′), g ₁ = p (g ₁₁) = p(g ₁₂), g₀ = p(g ₁) = p(g ₂), g ₁∈V(G′)\V(G), dup ^{(S, G′)} (g ₁) = 1, and dup ^{(S, G′)}(g ₀) = 0.

Lemma A3. Let S be a species tree and let G be a gene tree such that Σ(G) = L(S), then for every gene tree G′∈BG ^G there exists a gene tree G′′∈BG ^G such that Dup ^{(S, G′)} = Dup ^{(S, G′′)} and G′′ do not contain any up triplets.

Proof. G′′ is computed through following steps.

l.
Let G ⁿ = G′, where n = 1.
2.
Find an up triplet (g ₁₁, g ₁₂:g ₁) g ₂:g ₀ in G ⁿ. If no up triplet exists, then G′′ = G ⁿ.
3.
We know that $ dup^{( {S,G^n})} ( {g_1 } ) = 1 $, $ dup^{( {S,G^n })} ({g_0 }) = 0 $, and since $ dup^{( {S,G^n})} ( {g_1 } ) = 1 $ it follows that $ \left| {Z\left( {g_{11} } \right) \cap Z\left( {g_{12} } \right)} \right| > 0 $. According to Lemma A2 $ Z\left( {g_{12} } \right) \subseteq Z\left( {g_1 } \right) $, and since $ dup^{( {S,G^n })} ({g_0 }) = 0 $, it follows that $ 0 = \left| {Z\left( {g_1 } \right) \cap Z\left( {g_2 } \right)} \right| \ge \left| {Z\left( {g_{12} } \right) \cap Z\left( {g_2 } \right)} \right| = 0 $. Consequently, it is possible to move the duplication associated with vertex g ₁ one edge closer to the root of G ⁿ by creating the following tree. Let G ⁿ⁺¹ denote the tree obtained when we take G ⁿ and exchange places with the rooted subtrees $ G_{g_{11} }^n $ and $ G_{g_2 }^n $. Note that $ Dup^{\left( {S,G^n } \right)} = Dup^{\left( {S,G{n + 1} } \right)} $. Set n: = n + 1 and resume at 2.

Theorem A1 can now be proven.

Proof. According to Lemma A3, we know that for any $ G^{\prime} \in BS_{min}^{\left( {S,G} \right)} $ it is possible to compute gene tree $ G^{\prime \prime }\in BS_{min}^{\left( {S,G} \right)} $ such that no up triplet exists in G′′. Now for every gene vertex g ∈ V(G), let PA(g) be the partition computed using G′′ in following steps.

1.
Locate the gene vertices g′, g′_{1, ⋖ ,} g′_n ∈ N(G′′) such that g′ = g, g′₁ = g _{1, ⋖ ,} g′_n = g _n where C ^G(g) = {g _{1, ⋖ ,} g _n}.
2.
If $ dup^{( {S, G^{\prime \prime}})} ( g^{\prime} ) = 0 $, then let PA(g) = {g ₁,_{⋖ ,} g _n} and we are done.
3.
Let L be list such that L = {g′₁,_{⋖ ,} g′_n} and PA′ (g) be an empty set.
4.
Set g′_t to be the first gene vertex in L.
5.
If $ dup^{( {S{\rm{,}}G{^{\prime \prime }} } )} ( {p( {g^{\prime} _t } )} ) = 0 $, set g′_t to be p(g′_t) and resume at 5.
6.
Let $ D^{\prime} _i = \left\{ {g^{\prime \prime} \in \left\{ {g^{\prime} _1 , \ldots ,g^{\prime} _n } \right\}\left| {g^{{\prime \prime}}\le } \right.g^{\prime }_t } \right\} $, add D′_i, to PA′(g), and remove the gene vertices in D′_i from L. If |L| > 0, resume at 4.

From Lemma A3 nd the above steps it holds that PA (g) ∈ A′(g). Furthermore, if PA (g) ∉ A′_min(g), then a binary gene tree with fewer duplications then G′′ could be constructed, but $ G^{{\prime \prime }}\in BS_{min}^{\left( {S,G} \right)} $ and thus it must hold that PA (g) ∈ A′_min(g). Consequently, for every gene tree $ G^{\prime }\in BS_{min}^{\left( {S,G} \right)} $ and for all g∈N (G)\L(G), there exists a minimum partition PA (g) ∈ A′_min(g) such that $ Dup_{\min }^{\left( {S,G} \right)} = Dup_{\min }^{\left( {S,G^{\prime} } \right)} = \sum\nolimits_{g \in N\left( G \right)\backslash L\left( G \right)} {\left| {PA\left( g \right)} \right|} - 1 $.

Inferring the number of duplications at a gene tree vertex

Let G be gene tree and S species tree, such that σ(G) = L(S). Then for gene tree vertex g∈V (G)\L(G), let A _g = C ^G(g) and let P _g be an empty list of sets of gene tree vertices.

1.
For all gene tree vertices in A _g,
2.
Pick a gene tree vertex g _i∈A _g such that for all g _j∈A _g it holds that $ | {Z\left( {g_j } \right)} | \le | {Z\left( {g_i } \right)} | $.
1. 2.1.
  Put g _i in the first set p _k ∈ P _g such that for all other gene tree vertices g _k in p _k it holds tha $ Z\left( {g_k } \right) \cap Z\left( {g_i } \right) = \phi $.
2. 2.2.
  Let A _g = A _g\{g _i}.
3. 2.3.
  If A _g is empty goto 4, otherwise goto 2.
3.
Set A _g = C ^G(g) and P _g to an empty list of sets. For all gene tree vertices in A _g,
1. 3.1.
  pick a gene tree vertex g _i∈A _g at random.
2. 3.2.
  Put g _i in the first set p _k of P _g such that for all other gene tree vertices g _k in p _k it holds that $ Z\left( {g_k } \right) \cap Z\left( {g_i } \right) = \phi $.
4.
Let g _i(k) denote gene vertex g _i in set p _k, if $ \left| { \cap _k \left( { \cup _i Z\left( {g_{i\left( k \right)} } \right)} \right)} \right| \ge 1 $ then a minimum partition is reached. The number of duplications equals the number of sets in P _g minus one.
5.
If it is possible to create a set L _g of gene vertices by taking exactly one gene vertex from every set p _k in P _g such that for any two gene vertices $ g_i ,\;g_j \in L_g \left| {Z\left( {g_k } \right) \cap Z\left( {g_i } \right)} \right| \ge 1 $, then a minimum partition is reached. The number of duplications equals the number of sets in P _g minus one.
6.
If we have not reached the maximum number of rebuilds, goto 3. Otherwise the method has failed.

Inferring the number of losses at a gene tree vertex

Let G be a gene tree and S a species tree, such that Σ(G) = L(S). Then for a gene tree vertex g ∈ V(G)\L(G), let P _g be the partition as given from the duplication algorithm. The objective is to build a rooted binary gene tree $ G_{g_0 } $ with $ L\left( {G_{g_0 } } \right) = C^G \left( g \right) $, such that for every $ g_i \in L\left( {G_{g_0 } } \right) $ and its corresponding g _j ∈ C ^G(g), it holds that M(g _i) = M(g _j). We wish to construct a gene tree $ G_{g_0 } $ such that the number of losses is less then or equal to any other binary gene tree constructed with the vertices in C ^G(g) as leaves. However, the method proposed here is approximate, which means that there might exist another binary gene tree constructed with the same set of leaves that gives fewer losses. The gene tree $ G_{g_0 } $ is then used in equations (A3)–(A5) to calculate the minimum number of losses associated with the gene tree vertex g. Let L _g be an empty sorted set of gene vertices such that for any two gene vertices g _i, g _j, ∈ L _g, g _i is before g _j if $ \left| {Z( {g_i }) } \right| | {Z( {g_j } )} | $.

1.
For every set in P _g,
1. 1.1.
  build a gene tree $ G_{g_i } $ using the splits found in the species tree. Denote the root g _i and add g _i to L _g.
2.
If there is more then one gene vertex in L _g resume at 3. Else,
1. 2.1.
  let g ₀ be the gene vertex in L _g,
2. 2.2.
  set $ G_{g_0 } = G_{g_i } $ and find locally the mapping M that puts the duplications as close to the leaves as possible. Resume at 5.
3.
Let j = 1.
1. 3.1.
  Let g ₀, g _j ∈ L _g be the roots of the first pair of subtrees for which a new gene tree $ G_{g_k } $ can be constructed by adding a gene tree vertex g _k and two edges such that g _k = p(g ₀) = p(g _j).
2. 3.2.
  For $ G_{g_k } $ find locally the mapping M that puts the duplications one step closer to the leaves. If no such mapping can be found, let j = j + l and resume at 3. l.
3. 3.3.
  Remove g ₀, g _j from L _g and add g _k to L _g. Resume at 2.
4.
Let g ₀, g ₁ ∈ L _g be the roots of the first pair of subtrees in L _g. Construct a new gene tree $ G_{g_k } $ by adding a gene tree vertex g _k and two edges such that g _k = p(g ₀) = p(g ₁).
1. 4.1.
  Remove g ₀, g ₁ from L _g and add g _k to L _g. Resume at 2.
5.
Compute the losses of $ G_{g_0 } $ by using equation (A7).

In different steps of the loss algorithm it says that we find locally the mapping M that puts duplications as close to the leaves as possible. That is done in the following way. For each triplet (g ₁₁, g ₁₂:g ₁) g ₂:g ₀, i.e., rooted tree with three leaves, in $ G_{g_0 } $ the mapping M is found such that duplications are assigned to vertices as close to the leaves as possible. This is successful if the root of the triplet is labeled as a duplication event and g ₁ is not, and if we can swap g _1i and g ₂ without increasing the number of duplication events occurring in the triplet.

NP-Completeness

The problem of computing the minimum number of duplications over the set of all gene trees refining a given gene tree using the soft parsimony mapping can be formulated as follows.

Uncertain Gene Tree (UGT)

Instance: A species tree S, a gene tree G such that Σ(G) = L(S), and an integer D. Question: Does a gene tree G′ exist such that G′ refines G and Dup ^(S,G′) ≤ D?

Theorem A4. UGT is NP-complete.

Proof. UGT belong to NP since given an instance of the problem and a certificate in the form of a binary gene tree G′, the answer of the question is Dup ^(S,G′)≤ D which can be computed in polynomial time. To prove it to be NP-hard, we reduce the Partition into Cliques problem (Garey and Johnson 1979) to UGT.

The Partition into Cliques problem

Given an undirected graph P = (V, E) and a positive integer K ≤ |V|, can the vertices in P be partitioned into k ≤ K disjoint sets V _l,V ₂,⋖,V _k such that for each V _i, i = l, 2,⋖,k, the subgraph P _i = (V _i, E _i) induced by V _i is a clique (Garey and Johnson 1979)?

Let P = (V, E) be any graph. Moreover, let S be a species tree and G a gene tree, for which there exists a bijection γ:V→C(r(G)) such that for all v _i, v _j ∈ V, it holds that $ \left| {Z\left( {\gamma \left( {v_i } \right)} \right) \cap Z\left( {\gamma \left( {v_j } \right)} \right)} \right| = 0 $ if and only if the edge (v _i, v _j) exists in the edge set of P. A rooted gene tree G and a rooted species tree S that satisfy these conditions can be constructed as follows. Let P′ = (V′, E′) be the complement of P. Let the species tree S have one internal vertex, the root s = r(S), such that the child set of s is the leaves of S, i.e., C ^S(s) = L(S), and define an injective function ρ that maps the edges in the graph P to the leaves of the species tree, i.e., ρ:E′→L(S). The gene tree G has root vertex g = r(G), where $ \left| {C^G \left( g \right)} \right| = \left| {V^{\prime }} \right| $ and γ(v _i) = g _i for all g _i ∈ C ^G(g). For every g _i ∈ C ^G(g) add a gene vertex g _ir for every edge e _r = (v _i, v _j) ∈ E′ and set σ(g _ir) = ρ(e _r). For all g _i ∈ C ^G(g) such that $ \left| {C^G \left( {g_i } \right)} \right| < 2 $ add a gene vertex g _ij under g _i and a species vertex s _j under the root vertex s, and set σ(g _ij) = s _j. Furthermore, for any certificate V _l,V ₂,⋖,V _k, a certificate G′ can be constructed such that $ Dup^{(S,G^{\prime })} + 1 \le \left| {\left\{ {V_{\rm{1}} ,V_2 , \ldots ,V_k } \right\}} \right| $. Construct a gene tree G′′ refining G, by, for every $ V_i \in \left\{ {V_1 ,V_2 , \ldots,V_k} \right\} $, adding a gene vertex g′_i under the root g′ = r(G′′) and, for every v _r ∈ V _i, adding $ G_{\gamma \left( {v_r } \right)} $ under g′_i. Then let G′ be any binary gene tree refining G′′. Thus, for any graph P and any positive integer K ≤ |V|, an integer D = K − 1, a gene tree G, and a species tree S can be constructed in polynomial time such that UGT gives the same answer as Partition into Cliques.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Berglund-Sonnhammer, AC., Steffansson, P., Betts, M.J. et al. Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony. J Mol Evol 63, 240–250 (2006). https://doi.org/10.1007/s00239-005-0096-1

Download citation

Received: 22 April 2005
Accepted: 15 April 2006
Published: 07 July 2006
Issue Date: August 2006
DOI: https://doi.org/10.1007/s00239-005-0096-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Abstract

Access this article

Similar content being viewed by others

Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV bacterial viruses subcommittee

A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome

Codon usage bias

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Definitions and Notation

Gene Duplication and Gene Loss

Uncertain Gene Trees

Algorithms

Inferring the number of duplications at a gene tree vertex

Inferring the number of losses at a gene tree vertex

NP-Completeness

Uncertain Gene Tree (UGT)

The Partition into Cliques problem

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony

Abstract

Access this article

Similar content being viewed by others

Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV bacterial viruses subcommittee

A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome

Codon usage bias

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Appendix: Mathematical Properties of the Soft Parsimony Algorithm

Definitions and Notation

Gene Duplication and Gene Loss

Uncertain Gene Trees

Algorithms

Inferring the number of duplications at a gene tree vertex

Inferring the number of losses at a gene tree vertex

NP-Completeness

Uncertain Gene Tree (UGT)

The Partition into Cliques problem

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation