Trends in Genetics
Volume 23, Issue 11, November 2007, Pages 543-546

Update
Genome Analysis
Reconsidering the significance of genomic word frequencies

https://doi.org/10.1016/j.tig.2007.07.008

By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in randomly evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law features found across all known genomes. Such a distribution could be the result of completely random evolution by a copying process. Our characterization of the entire frequency distribution of genomic words opens a way to a more accurate reasoning about their over- and underrepresentation in genomic sequences.

Introduction

Determining which features are surprisingly frequent or rare in a genome is a fundamental and ongoing problem in genomics [1]. Sequence motifs might be unusually rare or frequent because they belong to mobile, structural or regulatory elements, and are thus subject to selective and adaptive forces. After examining oligonucleotide occurrences in more than sixty diverse genomes, we found that a Pareto-lognormal distribution captures the crucial features of oligonucleotide frequency distributions in all the studied genomes. Whereas prevailing random sequence models fail to produce such features, a neutral model of random duplications can. We illustrate our claim with a completely random copy-and-paste process that induces a distribution similar to those observed in real-life sequences.
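As a rough illustration of what a copy-and-paste process can look like, the following Python sketch grows a random sequence by repeatedly copying a random segment and pasting it at a random position, then counts word occurrences. It is a minimal sketch and not the model used in the article: the function names, the sequence lengths, the maximum segment length and the word size k = 6 are all assumptions chosen for illustration, and mutations and deletions are left out.

```python
import random
from collections import Counter

def copy_paste_sequence(initial_length=200, target_length=20000,
                        max_copy=50, seed=0):
    """Grow a random DNA sequence by copy-and-paste events only.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    # Start from a short uniformly random sequence.
    seq = [rng.choice("ACGT") for _ in range(initial_length)]
    while len(seq) < target_length:
        # Pick a random segment of the current sequence...
        length = rng.randint(1, min(max_copy, len(seq)))
        start = rng.randrange(len(seq) - length + 1)
        segment = seq[start:start + length]
        # ...and paste a copy of it at a random position.
        insert_at = rng.randrange(len(seq) + 1)
        seq[insert_at:insert_at] = segment
    return "".join(seq[:target_length])

def word_counts(seq, k):
    """Count overlapping occurrences of every word of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = copy_paste_sequence()
counts = word_counts(seq, k=6)
# Spectrum: for each occurrence count, how many distinct words have it.
spectrum = Counter(counts.values())
```

Plotting `spectrum` on logarithmic axes is one way to inspect the shape of the resulting frequency distribution.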

Section snippets

Random sequence models

The simplest sequence motif is an oligonucleotide, or DNA ‘word’. A definite word frequency distribution that characterizes a neutrally evolving sequence is necessary to establish whether a word appears unusually often or rarely in a genome sequence. For instance, the statistical significance of the (hypothetical) overrepresentation of a word w is routinely measured by the ‘tail probability’ P{N(w) ≥ n}, where n is the number of times w occurs in the studied sequence, and N(w) is the random

The distribution of oligonucleotide frequencies

To date, genomic spectra have not been fully characterized, aside from the observation of power-law behavior for certain word sizes in the right-hand tail [4, 5, 6]. Here we point out that a parametric distribution describes word frequencies extremely well. The distribution in question is the so-called ‘double Pareto-lognormal’ (DPL) distribution [7]. The DPL distribution fits many real-life size distributions, including those of personal incomes, human settlements, and files on the Internet [8].
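To give a feel for the shape of the DPL distribution, the sketch below draws samples from it in Python. It assumes the commonly used four-parameter form, in which the logarithm of a DPL variable is the sum of a normal variable and the difference of two independent exponential variables; the parameter names (alpha, beta, nu, tau) and the values used are assumptions for illustration, not estimates from any genome.

```python
import math
import random

def sample_dpl(n, alpha, beta, nu, tau, seed=0):
    """Draw n samples from a double Pareto-lognormal distribution.

    Assumed representation: log X = nu + tau*Z + E1/alpha - E2/beta,
    with Z standard normal and E1, E2 standard exponential.
    alpha controls the upper power-law tail, beta the lower one.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        log_x = (nu + tau * rng.gauss(0.0, 1.0)
                 + rng.expovariate(1.0) / alpha
                 - rng.expovariate(1.0) / beta)
        samples.append(math.exp(log_x))
    return samples

# Hypothetical parameter values, chosen only for illustration.
samples = sample_dpl(20000, alpha=2.5, beta=1.5, nu=3.0, tau=0.5)
```

Under this representation the body of the distribution looks lognormal, while both tails decay as power laws, which is the combination of features the text attributes to genomic spectra.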

Random evolution by duplication

Why would the DPL distribution systematically appear in genomic spectra? The answer might well lie in duplicative processes. The power-law tail of size distributions [4] for protein domains and gene families can be explained by birth and death models [10, 11], in which family size changes by duplication and deletion processes, and new families are introduced by a steady innovation process. A similar model applies to genomic word frequencies. Consider the occurrences of a particular word along the

Practical implications

As we have just suggested, the birth and death model implies that some words occur often simply by chance, and not because of their functionality. Words that are abundant at an early point of evolution tend to stay frequent in the course of random events. Therefore, even the high frequency of a particular word across many related species does not imply functionality on its own, as the word might already have been frequent by chance in a common ancestor.
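The persistence of early abundance can be illustrated with a deliberately simplified Python sketch in which every existing occurrence of a word is equally likely to be duplicated at each step. This is an assumption-laden toy version of a birth process, without the deletion and innovation components of a birth and death model; the words, starting counts and number of steps are hypothetical.

```python
import random

def duplicate_occurrences(initial_counts, steps, seed=0):
    """Pure-duplication toy model: at each step, one existing occurrence
    is chosen uniformly at random and duplicated, so a word gains a copy
    with probability proportional to its current count."""
    rng = random.Random(seed)
    words = list(initial_counts)
    counts = [initial_counts[w] for w in words]
    for _ in range(steps):
        # Choose a word with probability proportional to its count.
        i = rng.choices(range(len(words)), weights=counts)[0]
        counts[i] += 1
    return dict(zip(words, counts))

# Hypothetical starting point: one word happens to be abundant early on.
start = {"ACGTAC": 20, "TTGACA": 5, "GGCCGG": 5, "CATATG": 5}
final = duplicate_occurrences(start, steps=2000)
```

Comparing `final` with `start` over several seeds shows how often an early lead is retained even though no word is favored by selection in this sketch.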

The success of computational sequence

Conclusion

Word frequencies bear witness to a long history of evolutionary tinkering: copying, deleting and changing different parts of the genome. We argue that global features of genomic spectra arise from duplicative evolutionary processes, and not necessarily from intricate word-level selection on point mutations and deletions that enact adaptation and conservation or simply obey structural constraints. In practice, the heavy tail of word frequency distributions means that caution

Acknowledgement

This project was supported by a Natural Sciences and Engineering Research Council of Canada grant, no. 250391-02.

References (21)

  • C. Martindale et al.

    Oligonucleotide frequencies in DNA follow a Yule distribution

    Comput. Chem.

    (1996)
  • W.J. Reed et al.

    A model explaining the size distribution of gene families

    Math. Biosci.

    (2004)
  • O. Clay

    Standard deviations and correlations of GC levels in DNA sequences

    Gene

    (2001)
  • S. Karlin

    Statistical signals in bioinformatics

    Proc. Natl. Acad. Sci. U. S. A.

    (2005)
  • G. Reinert

    Probabilistic and statistical properties of words: An overview

    J. Comput. Biol.

    (2000)
  • G. Bernardi

    The mosaic genome of warmblooded vertebrates

    Science

    (1985)
  • N.M. Luscombe

    The dominance of the population by a selected few: power-law behavior applies to a wide variety of genomic properties

    Genome Biol.

    (2002)
  • R.N. Mantegna

    Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics

    Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Topics

    (1995)
  • W.J. Reed et al.

    The double Pareto-lognormal distribution - a new parametric model for size distributions

    Comm. Statist. Theory Methods

    (2004)
  • M. Mitzenmacher

    Dynamic models for file sizes and double Pareto distributions

    Internet Math.

    (2004)
