A model explaining the size distribution of gene and protein families

https://doi.org/10.1016/j.mbs.2003.11.002

Abstract

This article deals with the theoretical size distribution of gene and protein families in complete genomes. The size distribution is derived from a simple evolutionary model in which genes in a family are formed or selected against independently and at random, and in which new families are formed by the random splitting of existing families. Mathematically, this is the distribution of the state of a homogeneous birth-and-death process after an exponentially distributed time, which is shown, under certain conditions, to exhibit the power-law behaviour observed for gene and protein family sizes.

Introduction

Gene families comprise genes with a high degree of similarity in structure and function, which are presumed to have evolved from a single ancestral gene. Protein families similarly comprise proteins sharing sequence similarity and function. It has been observed that the size distributions of both gene families [4] and protein families [1] exhibit power-law behaviour over a wide range. While such distributions are characteristic of complex processes that exhibit self-organized criticality [3], they can also result from simpler mechanisms (see [6]). In this article we show that the observed power-law behaviour in the distribution of gene and protein family sizes can be explained using a simple birth-and-death process model for the evolution of families. The model can be fitted by maximum likelihood to family size data, and in an example it is shown to provide a very good fit. The model assumes that new genes in a family arise from mutations of existing genes (occurring independently and at random at a fixed probability rate) and that individual genes in a family can be eliminated (again independently and at random at a fixed probability rate). Furthermore, new families arise from the random splitting of existing families (again at a fixed probability rate). The model is related to one used by the authors to explain genus size distributions [5] and is a development of the model of Yule [7], proposed almost eighty years ago.

Section snippets

The model

Consider a gene family which begins with one gene (at time t=0). Suppose that in the time interval (t, t+h) there is a probability λh+o(h) that any given gene mutates and creates, in addition to replicates of itself, a new gene in the family (a speciation), and a probability μh+o(h) that the individual gene alone is selected out of the genome (an extinction). Suppose further that all speciations and extinctions are independent. Under these assumptions, N_t, the number of genes in the family in existence at
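The birth-and-death dynamics described above can be illustrated with a small simulation. This is a minimal sketch rather than the authors' procedure: it assumes, following the abstract, that a family is observed after an exponentially distributed time, and it takes ρ to be the rate of that exponential time. The function name `simulate_family_size` and its arguments are hypothetical.

```python
import random


def simulate_family_size(lam, mu, rho, rng):
    """Simulate one family and return its size at the observation time.

    lam: per-gene speciation rate (lambda in the text)
    mu:  per-gene extinction rate (mu in the text)
    rho: rate of the exponential observation time (an assumption here)
    rng: a random.Random instance, so that runs are reproducible
    Requires lam + mu > 0 and rho > 0.
    """
    t_end = rng.expovariate(rho)  # exponentially distributed observation time
    n, t = 1, 0.0                 # the family begins with one gene at t = 0
    while n > 0:
        # With n genes present, the time to the next event is Exp(n * (lam + mu)).
        t += rng.expovariate(n * (lam + mu))
        if t > t_end:
            break                 # the next event falls after the observation time
        # The event is a speciation with probability lam / (lam + mu).
        if rng.random() < lam / (lam + mu):
            n += 1
        else:
            n -= 1
    return n


rng = random.Random(0)
sizes = [simulate_family_size(1.0, 0.5, 1.0, rng) for _ in range(1000)]
observed = [s for s in sizes if s > 0]  # extinct families leave no trace in a genome
```

Families that reach size zero are dropped in the last line because an extinct family cannot appear in family-size data.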

Power-law behaviour in the family size distribution

It is shown in this section that in the case λ>μ, the distribution of family size exhibits power-law behaviour in the upper tail.
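A heuristic reading of this result, not the derivation given in the full text, is sketched below. It assumes that the observation time T is exponential with rate ρ, and it replaces the family size N_t by its mean growth factor e^{(λ−μ)t}, which is the standard expected size of a linear birth-and-death process started from one individual.

```latex
% Heuristic only: N_T is replaced by its mean growth factor e^{(\lambda-\mu)T}.
% Assumption: T ~ Exp(\rho) is the time at which the family is observed.
\begin{align*}
\Pr\bigl(e^{(\lambda-\mu)T} > n\bigr)
  &= \Pr\Bigl(T > \tfrac{\log n}{\lambda-\mu}\Bigr) \\
  &= \exp\Bigl(-\rho\,\tfrac{\log n}{\lambda-\mu}\Bigr) \\
  &= n^{-\rho/(\lambda-\mu)}, \qquad n \ge 1,\ \lambda > \mu .
\end{align*}
```

Under this heuristic the upper tail decays as a power of n, which appears as a straight line on a log-log plot. The exact constants and the rigorous argument are not reproduced in this excerpt.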

Fitting the model to data

The model can be fitted to gene family size data by estimating model parameters by maximum likelihood (ML). The log-likelihood is

$$\ell=\sum_{i=1}^{N} f_i \log \tilde{q}_i=\sum_{i=1}^{N} f_i \log q_i-\log(1-q_0)\sum_{i=1}^{N} f_i.$$
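The log-likelihood above can be evaluated with a few lines of code once the p.m.f. values are available. The sketch below reads the second equality as a zero-truncated p.m.f., that is, q̃_i = q_i/(1−q_0), and assumes that f_i is the number of observed families of size i. The function name `truncated_log_likelihood` and the list layout are hypothetical.

```python
import math


def truncated_log_likelihood(q, f):
    """Log-likelihood of family-size counts under a zero-truncated p.m.f.

    q: list with q[i] = P(family size = i) for i = 0..N (untruncated p.m.f.)
    f: list with f[i] = number of observed families of size i (f[0] is ignored)
    """
    if len(q) != len(f):
        raise ValueError("q and f must have the same length")
    total = sum(f[1:])
    # Sum of f_i * log(q_i), skipping sizes that were never observed.
    main = sum(fi * math.log(qi) for fi, qi in zip(f[1:], q[1:]) if fi > 0)
    # Subtract log(1 - q_0) once per observed family (zero truncation).
    return main - math.log(1.0 - q[0]) * total


# Toy numbers for illustration only, not data from the article.
ll = truncated_log_likelihood([0.5, 0.25, 0.125], [0, 2, 1])
```

Sizes with a zero count are skipped so that a vanishing q_i at an unobserved size does not trigger a log of zero.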

Although there are three parameters in the model, the p.m.f. {qi} depends only on the two ratios λ/ρ=Λ, say, and μ/ρ=M, say. To calculate numerically the log-likelihood for given values of Λ and M, the recursion (10) can be iterated backwards from the asymptotic value (18) for a suitably large n. ML estimates of Λ and M
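Recursion (10) and the asymptotic value (18) are not reproduced in this excerpt, so the sketch below swaps in a different technique: direct numerical integration. It assumes that q_n is the probability that a linear birth-and-death process started from one gene has n genes after an Exp(ρ) time, and it uses the standard closed-form transition probabilities of that process. With the substitution u = e^{−ρt} the integral runs over (0, 1) and depends only on Λ and M. The function name `family_size_pmf` and the midpoint rule are illustrative choices, not the authors' method.

```python
def family_size_pmf(Lam, M, n_max, steps=20000):
    """Approximate q_0..q_{n_max} by midpoint integration over u = exp(-rho * t).

    Lam: lambda / rho, M: mu / rho (the two ratios named in the text).
    Assumes Lam != M; the equal-rates case needs a separate formula.
    """
    if Lam == M:
        raise ValueError("this sketch requires Lam != M")
    q = [0.0] * (n_max + 1)
    h = 1.0 / steps
    for k in range(steps):
        u = (k + 0.5) * h                  # midpoint of the k-th subinterval of (0, 1)
        growth = u ** (-(Lam - M))         # e^{(lambda - mu) t} at t = -ln(u) / rho
        denom = Lam * growth - M
        alpha = M * (growth - 1.0) / denom   # P(N_t = 0)
        beta = Lam * (growth - 1.0) / denom  # geometric ratio for n >= 1
        q[0] += alpha * h
        term = (1.0 - alpha) * (1.0 - beta) * h  # contribution to q_1
        for n in range(1, n_max + 1):
            q[n] += term
            term *= beta                   # P(N_t = n+1) = P(N_t = n) * beta
    return q


q = family_size_pmf(2.0, 0.5, 50)
```

For large n the integrand concentrates near u = 0, so the midpoint rule may need many more steps there. A backward recursion from a known asymptote, as described in the text, avoids that cost.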

Conclusions

Earlier authors have shown empirically that the size distributions of gene and protein families exhibit power-law behaviour. This paper offers a simple evolutionary model that predicts a theoretical size distribution exhibiting such behaviour and that can be fitted to family-size data by maximum likelihood. The model is one of neutral evolution in the sense that mutations creating new genes (or new families) and extinctions of genes occur independently and at random. It can be thought of

References (7)

  • W.J. Reed et al., On the size distribution of live genera, J. Theor. Biol. (2002)
  • J.S. Bader, Evolutionary implications of a power-law distribution of protein family sizes,...
  • N.T.J. Bailey, The Elements of Stochastic Processes (1964)
There are more references available in the full text version of this article.
