Trends in Genetics
Update
Genome Analysis
Reconsidering the significance of genomic word frequencies
Introduction
Determining which sequences are surprisingly frequent or rare in a genome is a fundamental and ongoing issue in genomics [1]. Sequence motifs might be unusually rare or frequent because they belong to mobile, structural or regulatory elements, and are thus subject to selective and adaptive forces. After examining oligonucleotide occurrences in more than sixty diverse genomes, we found that a Pareto-lognormal distribution captures the crucial features of oligonucleotide frequency distributions in all the studied genomes. Whereas prevailing random sequence models fail to produce such features, a neutral model of random duplications can. We illustrate our claim with a completely random copy-and-paste process that induces a distribution similar to those observed in real-life sequences.
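The copy-and-paste idea can be made concrete with a small simulation. The sketch below is not the authors' model: the function names, the seed length, the maximum copy length and the substitution rate are all illustrative assumptions. It grows a sequence by inserting copies of random segments at random positions, then tabulates how many distinct words occur a given number of times.

```python
import random
from collections import Counter


def copy_paste_sequence(target_len, seed_len=100, max_copy=50,
                        mut_rate=0.01, rng=None):
    """Grow a DNA string by random segment duplication (hypothetical model).

    Starts from a uniformly random seed, then repeatedly copies a random
    segment and inserts the copy at a random position. Every parameter
    value here is an assumption chosen for illustration.
    """
    rng = rng or random.Random()
    seq = [rng.choice("ACGT") for _ in range(seed_len)]
    while len(seq) < target_len:
        length = rng.randint(1, min(max_copy, len(seq)))
        start = rng.randrange(len(seq) - length + 1)
        segment = seq[start:start + length]
        # Occasionally redraw a base in the copy (may redraw the same base).
        segment = [rng.choice("ACGT") if rng.random() < mut_rate else base
                   for base in segment]
        pos = rng.randrange(len(seq) + 1)
        seq[pos:pos] = segment
    return "".join(seq[:target_len])


def word_counts(seq, k):
    """Count overlapping occurrences of every k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum(counts):
    """Map an occurrence count to the number of distinct words having it."""
    return Counter(counts.values())


if __name__ == "__main__":
    rng = random.Random(0)
    duplicated = copy_paste_sequence(20000, rng=rng)
    uniform = "".join(rng.choice("ACGT") for _ in range(20000))
    for name, seq in (("copy-and-paste", duplicated), ("i.i.d.", uniform)):
        counts = word_counts(seq, 6)
        print(name, "most frequent 6-mer count:", max(counts.values()))
```

Comparing the two spectra for the same sequence length gives a rough feel for how duplication alone can inflate the counts of some words; the exact numbers depend on the assumed parameters and the random seed.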
Random sequence models
The simplest sequence motif is an oligonucleotide, or DNA ‘word’. A definite word frequency distribution that characterizes a neutrally evolving sequence is necessary to establish whether a word appears unusually often or rarely in a genome sequence. For instance, the statistical significance of the (hypothetical) overrepresentation of a word w is routinely measured by the ‘tail probability’ P{N(w) ≥ n}, where n is the number of times w occurs in the studied sequence, and N(w) is the random
The distribution of oligonucleotide frequencies
To date, genomic spectra have not been fully characterized, aside from the observation of power-law behavior for certain word sizes [4,5,6] in the right-hand tail. Here we point out that a parametric distribution describes word frequencies extremely well. The distribution in question is the so-called ‘double Pareto-lognormal’ (DPL) distribution [7]. The DPL distribution fits many real-life size distributions, including that of personal incomes, human settlements, and files on the Internet [8].
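For readers unfamiliar with the DPL distribution, the following sketch draws samples from it using a standard representation: the logarithm of a DPL variate is a normal variate plus the difference of two scaled exponential variates. The parameter names `alpha`, `beta`, `nu` and `tau` and the function name `sample_dpl` are assumptions made for this illustration; the snippet does not reproduce the fit to genomic data described in the text.

```python
import math
import random


def sample_dpl(alpha, beta, nu, tau, rng=None):
    """Draw one double Pareto-lognormal variate (illustrative sketch).

    log X = Z + E1/alpha - E2/beta, with Z ~ Normal(nu, tau) and E1, E2
    independent standard exponentials. The upper tail of X then decays
    like x**(-alpha), and the density near zero behaves like x**(beta - 1).
    """
    rng = rng or random.Random()
    z = rng.gauss(nu, tau)
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return math.exp(z + e1 / alpha - e2 / beta)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_dpl(2.0, 3.0, 1.0, 0.5, rng) for _ in range(10)]
    print([round(x, 2) for x in draws])
```

The power-law right-hand tail mentioned above corresponds to the `alpha` term; the lognormal body is governed by `nu` and `tau`.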
Random evolution by duplication
Why would the DPL distribution systematically appear in genomic spectra? The answer might well lie in duplicative processes. The power-law tail of size distributions [4] for protein domains and gene families can be explained by birth and death models [10,11], in which family size changes by duplication and deletion processes, and new families are introduced by a steady innovation process. A similar model applies to genomic word frequencies. Consider the occurrences of a particular word along the
Practical implications
As we have just suggested, the birth and death model implies that some words occur often simply by chance, and not because of their functionality. Words that are abundant at an early point of evolution tend to stay frequent in the course of random events. Therefore, even the high frequency of a particular word across many related species does not imply functionality on its own, as the word might have been frequent by chance in a common ancestor already.
The success of computational sequence
Conclusion
Word frequencies bear witness to a long history of evolutionary tinkering: copying, deleting and changing different parts of the genome. We argue that global features of genomic spectra arise from duplicative evolutionary processes, and not necessarily from intricate word level selection on point mutations and deletions that are enacting adaptation and conservation, or that are simply obeying structural constraints. In practice, the heavy tail of word frequency distributions means that caution
Acknowledgement
This project was supported by a Natural Sciences and Engineering Research Council of Canada grant, no. 250391-02.
References (21)
- et al. Oligonucleotide frequencies in DNA follow a Yule distribution. Comput. Chem. (1996)
- et al. A model explaining the size distribution of gene families. Math. Biosci. (2004)
- Standard deviations and correlations of GC levels in DNA sequences. Gene (2001)
- Statistical signals in bioinformatics. Proc. Natl. Acad. Sci. U. S. A. (2005)
- Probabilistic and statistical properties of words: An overview. J. Comput. Biol. (2000)
- The mosaic genome of warmblooded vertebrates. Science (1985)
- The dominance of the population by a selected few: power-law behavior applies to a wide variety of genomic properties. Genome Biol. (2002)
- Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Topics (1995)
- et al. The double Pareto-lognormal distribution - a new parametric model for size distributions. Comm. Statist. Theory Methods (2004)
- Dynamic models for file sizes and double Pareto distributions. Internet Math. (2004)
Cited by (27)
Spectrum structures and biological functions of 8-mers in the human genome
2019, Genomics
Citation Excerpt: After the systematic study of k-mer (4 ≤ k ≤ 13) spectra in Archea, Bacteria and Eukaryots, Chor et al. [32] demonstrated that multimodal k-mer spectra occur in the genomes of the tetrapod clade of mammals, but exons, introns, 3′ untranslated regions (UTRs), 5′UTRs, and gene promoter regions do not all share the same modality in the human genome. Other researchers tried to characterize the spectra by probabilistic models [33,34], and draw the conclusion that such spectra could be the result of completely random evolution through a copying process [34]. The most outstanding property of the multimodal is the two- or three-distinct peak distribution.
Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter
2015, Journal of Theoretical Biology
Citation Excerpt: The heavy tail is not entirely caused by the repeat elements, but also frequent words in large genomes, particularly for short k-mers (Castellini et al., 2012; Csűrös et al., 2007). Several attempts have been done to model the statistical background for the multi-modal spectra using various distributions and approaches such as Bernoulli, copy/insert model, Pareto log normal (PLN), and Markov chain (Chor et al., 2009; Csűrös et al., 2007; Reinert et al., 2000). The Markov chain can model the multi-modal spectra of the human genome generally well although have inherent limitations to fully model the heavy tail and heterogeneity in the genome.
Protein homorepeats: Sequences, structures, evolution, and functions
2010, Advances in Protein Chemistry and Structural Biology
Citation Excerpt: Depending on the statistical models and assumptions, the minimum length used in the genome analyses is 5–7 residues (Karlin, 1995; Katti et al., 2001; Karlin et al., 2002; Faux et al., 2005). These values, based on standard random models of sequence evolution, may not be the most accurate in view of recent data that the frequency distribution of genomic motifs may arise from duplicative evolutionary processes, and not necessarily from selection based on point mutations and deletions (Csuros et al., 2007). Nevertheless, the calculated minimum length of 5–7 residues is close to the limit at which some of the homorepeats start to affect structure and function.
Exhaustive computation of exact duplications via super and non-nested local maximal repeats
2014, Journal of Bioinformatics and Computational Biology
Measurement of word frequencies in genomic DNA sequences based on partial alignment and fuzzy set
2014, Journal of Bioinformatics and Computational Biology