A model explaining the size distribution of gene and protein families
Introduction
Gene families comprise genes with a high degree of similarity in structure and function, which are presumed to have evolved from a single ancestral gene. Protein families similarly comprise proteins sharing sequence similarity and function. The size distributions of both gene families [4] and protein families [1] have been observed to exhibit power-law behaviour over a wide range. While such distributions are characteristic of complex processes that exhibit self-organized criticality [3], they can also result from simpler mechanisms (see [6]). In this article we show that the observed power-law behaviour in the distribution of gene and protein family sizes can be explained using a simple birth-and-death process model for the evolution of families. The model can be fitted to family size data by maximum likelihood and, in an example, is shown to provide a very good fit. The model assumes that new genes in a family arise from mutations of existing genes (occurring independently and at random, at a fixed probability rate) and that individual genes in a family can be eliminated (again independently and at random, at a fixed probability rate). Furthermore, new families arise from the random splitting of existing families (again at a fixed probability rate). The model is related to one used by the authors to explain genus size distributions [5] and is a development of the model of Yule [7], proposed almost eighty years ago.
Section snippets
The model
Consider a gene family which begins with one gene (at time t=0). Suppose that in the time interval (t, t+h) there is a probability λh+o(h) that any given gene may mutate and create, in addition to replicates of itself, a new gene in the family (a speciation), and a probability μh+o(h) that the individual gene alone is selected out of the genome (an extinction). Suppose further that all speciations and extinctions are independent. Under these assumptions, N_t, the number of genes in the family in existence at
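The process described in this section can be illustrated by simulation. The following is a minimal sketch, not the authors' code: it assumes the standard event-by-event (Gillespie-style) simulation of a linear birth-and-death process, and the function name `simulate_family_size` and its parameters are hypothetical names chosen for illustration.

```python
import random


def simulate_family_size(lam, mu, t_end, rng):
    """Simulate N_t at time t_end for a family started from one gene.

    lam is the per-gene speciation rate and mu the per-gene extinction
    rate (both names are illustrative stand-ins for lambda and mu).
    """
    n, t = 1, 0.0
    while n > 0:
        # With n genes present, the time to the next event is exponential
        # with total rate n * (lam + mu).
        t += rng.expovariate(n * (lam + mu))
        if t > t_end:
            break
        # The event is a speciation with probability lam / (lam + mu),
        # otherwise an extinction of one gene.
        if rng.random() < lam / (lam + mu):
            n += 1
        else:
            n -= 1
    return n


if __name__ == "__main__":
    rng = random.Random(1)
    sizes = [simulate_family_size(1.0, 0.5, 2.0, rng) for _ in range(10)]
    print(sizes)
```

A family that reaches size zero stays extinct, which is why the loop stops when `n` hits zero. Averaging many runs should give a mean close to exp((λ−μ)t), the known mean of a linear birth-and-death process started from one individual.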
Power-law behaviour in the family size distribution
It is shown in this section that in the case λ>μ, the distribution of family size exhibits power-law behaviour in the upper tail.
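A rough heuristic for why a power-law tail can appear when λ>μ is sketched below. It is not the derivation given in the article. It assumes (an assumption not stated in the snippets here) that the age T of a randomly chosen family is exponentially distributed with rate ρ, the family-splitting rate, and it replaces N_t by its mean growth, dropping constants and random fluctuations; C denotes an unspecified constant.

```latex
% Heuristic only: assumes family age T ~ Exp(\rho) and
% approximates N_t by its mean growth e^{(\lambda-\mu)t}.
\begin{align*}
  \mathbb{E}[N_t] &= e^{(\lambda-\mu)t}, \\
  \Pr(N_T > n) &\approx \Pr\!\left(e^{(\lambda-\mu)T} > n\right)
     = \Pr\!\left(T > \frac{\ln n}{\lambda-\mu}\right)
     = n^{-\rho/(\lambda-\mu)}, \\
  q_n &\sim C\, n^{-1-\rho/(\lambda-\mu)} \qquad (n \to \infty).
\end{align*}
```

Under these assumptions the tail exponent depends on the parameters only through ρ/(λ−μ), which is in line with the p.m.f. depending only on the ratios λ/ρ and μ/ρ.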
Fitting the model to data
The model can be fitted to gene family size data by estimating model parameters by maximum likelihood (ML). The log-likelihood is
Although there are three parameters in the model, the p.m.f. {qi} depends only on the two ratios λ/ρ=Λ, say, and μ/ρ=M, say. To calculate numerically the log-likelihood for given values of Λ and M, the recursion (10) can be iterated backwards from the asymptotic value (18) for a suitably large n. ML estimates of Λ and M
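The likelihood calculation can be sketched numerically as follows. This is not the article's method: recursion (10) and the asymptotic value (18) are not reproduced in this snippet, so the sketch swaps in direct numerical integration (midpoint rule). It assumes that family age is exponentially distributed with rate ρ, that time is measured in units of 1/ρ, that only non-extinct families are observed, that Λ>M, and that the log-likelihood is the sum of f_i log q_i over observed sizes i with frequencies f_i. The function names and the example counts are hypothetical.

```python
import math


def family_size_pmf(Lam, M, n_max, t_max=40.0, steps=8000):
    """Return [q_1, ..., q_{n_max}], conditional on a non-extinct family.

    Mixes the linear birth-and-death p.m.f. over an Exp(1) family age
    using the midpoint rule. Covers only Lam > M >= 0 (assumed case).
    """
    if not (Lam > M >= 0.0):
        raise ValueError("this sketch covers only Lam > M >= 0")
    r = Lam - M
    h = t_max / steps
    q = [0.0] * n_max
    alive = 0.0  # accumulates P(N_T >= 1)
    for k in range(steps):
        t = (k + 0.5) * h                 # midpoint node
        x = math.exp(-r * t)
        denom = Lam - M * x
        beta = Lam * (1.0 - x) / denom    # geometric ratio of the p.m.f.
        w = math.exp(-t) * h              # Exp(1) age density times step
        alive += w * r / denom            # P(N_t >= 1)
        term = w * (r / denom) * (r * x / denom)  # weight times P(N_t = 1)
        for n in range(n_max):
            q[n] += term
            term *= beta                  # P(N_t = n+1) = beta * P(N_t = n)
    return [v / alive for v in q]


def log_likelihood(counts, Lam, M):
    """counts maps family size i to the number of families of that size."""
    q = family_size_pmf(Lam, M, max(counts))
    return sum(f * math.log(q[i - 1]) for i, f in counts.items())


if __name__ == "__main__":
    example_counts = {1: 50, 2: 17, 3: 8, 5: 2}  # made-up illustrative data
    print(log_likelihood(example_counts, 1.5, 0.5))
```

ML estimates could then be sought by maximising `log_likelihood` over Λ and M with any general-purpose optimiser. The fixed integration grid is a simplification and may lose accuracy for very large family sizes or extreme parameter values.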
Conclusions
Earlier authors have shown empirically that the size distributions of gene and protein families exhibit power-law behaviour. This paper offers a simple evolutionary model which predicts a theoretical size distribution which exhibits such behaviour and which can be fitted to family-size data by maximum likelihood. The model is one of neutral evolution in the sense that mutations creating new genes (or new families) and extinctions of genes occur independently and at random. It can be thought of
References (7)
- et al., On the size distribution of live genera, J. Theor. Biol. (2002)
- J.S. Bader, Evolutionary implications of a power-law distribution of protein family sizes,...
- The Elements of Stochastic Processes (1964)
Cited by (35)
Diversity and evolution of the P450 family in arthropods
2020, Insect Biochemistry and Molecular Biology
Citation Excerpt: The exception to this power-law pattern of many CYP families with few genes and few families with many genes is the L. polyphemus CYPome of 42 genes (see below, ohnologs), in which the only large clade is formed by four CYP3001T genes. This general power-law pattern has been noted long ago for the sizes of gene families in genomes (Huynen and Van Nimwegen, 1999), and has been described throughout eukaryotic CYPomes as well (Reed and Hughes, 2004; Feyereisen, 2011; Sezutsu et al., 2013). Mathematical birth/death models can closely approximate such patterns (Qian et al., 2001; Karev et al., 2004; Reed and Hughes, 2004).
Fifteen million years of evolution in the Oryza genus shows extensive gene family expansion
2014, Molecular Plant
Citation Excerpt: Hence the amplification of K-box-containing MADS-box proteins has allowed the diversification of these genes’ function in plant development through new protein–protein interactions (Hofer and Ellis, 2002). Theoretical models proposed for gene family evolution combine neutral processes like the stochastic birth and death (BD), which predicts that gene families continuously undergo random gain and loss events, and directional processes, related to the functional fates of gene duplicates (Zimmer et al., 1980; Reed and Hughes, 2004; Hahn et al., 2005). Indeed, according to evolutionary models proposed for the fate of duplicated genes, new duplicated copies are randomly fixed by genetic drift and most of them are then randomly lost through recombination-dependent deletion, or the accumulation of loss-of-function mutations (pseudogenization).
Arthropod CYPomes illustrate the tempo and mode in P450 evolution
2011, Biochimica et Biophysica Acta - Proteins and Proteomics
Reconsidering the significance of genomic word frequencies
2007, Trends in Genetics
Citation Excerpt: The answer might well lie in duplicative processes. The power-law tail of size distributions [4] for protein domains and gene families can be explained by birth and death models [10,11], in which family size changes by duplication and deletion processes, and new families are introduced by a steady innovation process. A similar model applies to genomic word frequencies.
Evolution of protein families: Is it possible to distinguish between domains of life?
2007, Gene
Citation Excerpt: We classify as non-parasitic those organisms that are either free-living or are animal or plant commensals (see Supplementary Material Table 1 for a complete list of all the species considered for each lifestyle group). The overall form of protein-family size distributions can be understood in terms of birth (duplication with or without mutations), death (loss) and innovation (de novo acquisition) (BDI) of genes (Huynen and van Nimwegen, 1998; Yanai et al., 2000; Karev et al., 2002, 2003, 2004; Koonin et al., 2002; Reed and Hughes, 2004). However, our analysis of the empirical distributions reveals that such estimations are in fact misleading.
A discrete model of evolution of small paralog families
2007, Mathematical Models and Methods in Applied Sciences