Chris P. Ponting, Gerton Lunter, Signatures of adaptive evolution within human non-coding sequence, Human Molecular Genetics, Volume 15, Issue suppl_2, 15 October 2006, Pages R170–R175, https://doi.org/10.1093/hmg/ddl182
Abstract
The human genome is often portrayed as consisting of three sequence types, each distinguished by its mode of evolution. Purifying selection is estimated to act on 2.5–5.0% of the genome, whereas virtually all remaining sequence is considered to have evolved neutrally and to be devoid of functionality. The third mode of evolution, positive selection of advantageous changes, is considered rare. Such instances have been inferred only for a handful of sites, and these lie almost exclusively within protein-coding genes. Nevertheless, the majority of positively selected sequence is expected to lie within the wealth of functional ‘dark matter’ present outside of the coding sequence. Here, we review the evolutionary evidence for the majority of human-conserved DNA lying outside of the protein-coding sequence. We argue that within this non-coding fraction lies at least 1 Mb of functional sequence that has accumulated many beneficial nucleotide replacements. Illuminating the functions of this adaptive dark matter will lead to a better understanding of the sequence changes that have shaped the innovative biology of our species.
Although the amount of the human genome that harbours functional, yet non-coding, elements remains ill-determined, models of sequence evolution are unanimous in predicting at least as much functional non-coding sequence as protein-coding material in the genome (1–3). Beyond the repertoire of sequences known to be promoting or regulating transcription and translation, there appears to be a large set of functional sequence [‘dark matter’, (4)] whose importance is yet to be understood. Determining the evolution and function of dark matter is critical to resolving an on-going debate as to whether it specifies much of the morphological diversity of animals, the phenotypic diversity of humans and an individual's susceptibility to disease (5).
Between 10 and 15% of patients with rare Mendelian phenotypes exhibit no changes to a gene's coding sequence, despite incontrovertible evidence of its association with disease (6). In these cases, it must be assumed that the as-yet-unknown mutations lie in the functional non-coding portions of the human genome. Indeed, mutations in intronic elements (6,7), promoters (8) and untranslated regions (UTRs) (9) have, on occasion, been associated with disease. The identification of other disease-associated mutations in the non-coding sequence is hindered by the fact that we currently possess few insights into how to identify functional sequence outside of the coding exons by computational means.
Here, we review recently published evidence for substantial amounts of human functional sequence outside of the protein-coding sequence. For this, we need to consider the nature and extent of neutrally evolved sequence in the human genome because such sequence represents the exact complement of all functional regions. Thereafter, we discuss the amount of constrained (‘conserved’) genomic sequence. Finally, we entertain the possibility that a significant proportion of human sequence has evolved adaptively and thus has diverged by a greater extent than expected from neutral evolution.
NEUTRALLY EVOLVED SEQUENCE
How much human DNA, during evolution, has been purified of deleterious mutations (‘purifying selection’); how much has accepted mutations because of their benefit (‘positive selection’) and thus what remaining proportion of the human genome has accumulated mutations that have not been selected for or against (‘neutral evolution’)? Because of their abundance and their ease of estimation from aligned sequences, nucleotide substitutions have provided the principal mutational signature from which neutrality or selection has been inferred. However, later in this review, we discuss a recently developed approach that harnesses nucleotide insertions and deletions, rather than substitutions, to distinguish between selected and neutrally evolved sequence.
Three distinct classes of nucleotides have often been considered as having evolved neutrally: pseudogenes, the remnants of defunct genes or reverse-transcribed messenger RNA (10,11); ancestral repeats, the debris of transposons present in the last common ancestor of, for example, human and mouse (1,2); and 4-fold degenerate (4D) sites, the third position of codons that encode one particular amino acid whichever base is present (12). Unfortunately, none of these three types of sites is, in fact, universally neutral. In rare cases, pseudogenes appear to have been subject to selection (13,14) and a minority of transposable elements have acquired innovative function (15–17). Furthermore, 4D sites in mammals and invertebrates have been shown to be subject to weak and strong selection, respectively (18,19). Nevertheless, lacking alternatives, these types of sites remain widely used proxies for unselected sequence.
Substitution rates in these putatively neutral sequences may well be relatively constant in small (<100 kb) regions, but certainly vary dramatically across mammalian genomes (1,12,20,21). Why this is so remains unclear, although there are predicted contributions to this variation from the hypermutability of CpG dinucleotides (22,23), from recombination (24), from the repair of sequence transcribed in the germ-line (25,26) and from base composition not being at equilibrium (27). Whatever the cause, the effect of substitution rate variation across the human genome is that a single neutral rate for the whole mammalian genome does not exist and, thus, that such rates need to be estimated locally.
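The need to estimate neutral rates locally can be illustrated with a minimal sketch. It assumes a hypothetical input format, pairs of (genomic position, whether the aligned base differs) restricted to putatively neutral sites such as ancestral repeats, and a 100 kb window chosen to match the scale mentioned above. The function name and the use of the raw proportion of differing sites (with no correction for multiple substitutions) are simplifying assumptions for illustration, not the procedure of any of the cited studies.

```python
from collections import defaultdict


def local_neutral_rates(sites, window=100_000):
    """Estimate a per-window substitution proportion from putatively neutral sites.

    sites  -- iterable of (position, is_substituted) pairs (hypothetical format)
    window -- window size in bases; 100 kb is an illustrative choice
    Returns a dict mapping window index to the fraction of sites that differ.
    """
    counts = defaultdict(lambda: [0, 0])  # window index -> [differing, total]
    for position, is_substituted in sites:
        index = position // window
        counts[index][0] += int(is_substituted)
        counts[index][1] += 1
    return {index: diff / total for index, (diff, total) in counts.items()}


# Toy data: two neutral sites in the first window, two in the second.
example_sites = [(10, True), (20, False), (150_000, True), (160_000, True)]
rates = local_neutral_rates(example_sites)
print(rates)  # {0: 0.5, 1: 1.0}
```

A region's observed substitution rate would then be compared with the estimate for its own window rather than with a single genome-wide value.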
CONSTRAINED NON-CODING SEQUENCE
Two essentially complementary approaches predict that only a small proportion of the human genome has been subject to strong purifying selection. The first of these, from Chiaromonte and coworkers (1,2), indicates that ∼5% of the mammalian genome has been purged of substitutions since the last common ancestor of mouse and human. Theirs is a relatively simple model and thus unlikely to yield more than a rough estimate (28,29). It is predicated upon two assumptions: that substitutions in ancestral repeats were accumulated without selection, and that neutral rates in ancestral repeats and neighbouring sequence are equivalent, despite base composition differences (30).
We recently introduced a second, complementary, approach, one which considers insertions and deletions (‘indels’) between human, dog and mouse sequences rather than substitutions (3). The method, which also accounts for neutral rate variation genome-wide, predicts that between 2.56 and 3.25% of the human genome has been selectively purged of indels and thus is functional. Moreover, it provides quantitative support for the assumption of Chiaromonte et al. that ancestral repeats predominantly evolve neutrally, predicting that only ∼0.1% of all transposable elements are selectively constrained.
Given that 1.2% of the human genome encodes protein (31), each of these two approaches thus indicates a greater amount (∼1.3–4%) of functional sequence residing outside of the coding sequence than inside. Many of these selectively constrained non-coding sequences are even better conserved than protein-coding sequence, yet on the whole their functions remain mysterious (32–36). Nevertheless, functional clues may be elicited from their non-uniform distribution among introns. It is observed that longer introns in general, and in particular, introns of genes regulating transcription, morphogenesis or organogenesis and introns within nervous-system-expressed genes on average possess higher than expected densities of conserved sequence (37,38). This suggests that conserved intronic sequence might often regulate processes during transcription and development.
Conversely, highly conserved intronic regions are significantly under-represented among genes with roles in response to pathogenic insults (38). Consequently, because the amino acid sequences of proteins involved in transcription and development generally evolve slowly, and those involved in immunity and host defence evolve rapidly, it appears that selection has acted in a relatively uniform manner across genomic loci: divergent protein sequences are encoded by genes whose introns are subject to relaxed constraints, whereas genes of conserved protein sequences contain longer and more conserved introns. In particular, it is notable that genes whose expression is limited to the nervous system often possess highly conserved protein-coding, UTR and intronic sequences (38–40).
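The comparison made earlier in this section between constrained and protein-coding fractions is simple arithmetic, shown here as a short sketch. The percentages are those quoted in the text (∼5% from the substitution-based approach, 2.56–3.25% from the indel-based approach and 1.2% coding); the function name is a hypothetical label, and the subtraction assumes that all coding sequence falls within the constrained fraction.

```python
CODING_PCT = 1.2  # percentage of the human genome that encodes protein (31)


def noncoding_constrained_pct(constrained_pct, coding_pct=CODING_PCT):
    """Constrained sequence left over after subtracting coding sequence.

    Assumes, for illustration, that all coding sequence is constrained.
    """
    return constrained_pct - coding_pct


estimates = {
    "substitution-based (1,2)": 5.0,
    "indel-based, lower bound (3)": 2.56,
    "indel-based, upper bound (3)": 3.25,
}

for label, pct in estimates.items():
    print(f"{label}: {noncoding_constrained_pct(pct):.2f}% non-coding")
```

The lowest and highest values obtained this way bracket the ∼1.3–4% range quoted above, and each exceeds the 1.2% of the genome that is coding.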
ADAPTIVE EVOLUTION WITHIN NON-CODING SEQUENCE
The higher abundance of non-coding over coding sequence within the constrained portion of the human genome indicates that the majority of functional sequence is non-coding. It thus appears possible that recent adaptive events too might have involved more non-coding than coding sequence. Despite this, most attention has been paid to detecting positive selection in coding sequence. Partly, this is because protein-coding sequence is more easily identified and annotated and partly because synonymous sites can be exploited to provide an estimate of the local neutral rate against which substitution rates within proximal non-synonymous sites can be compared (41,42). These methods, for example, have been exploited to identify adaptive amino acid substitutions proposed to be linked to the development of speech or to the enlargement of the hominin brain (43–45).
In contrast, predicting positive selection in non-coding sequence is hindered by the difficulty of identifying functional sequence when it has rapidly evolved, by the lack of proximal presumed neutral sites and by variations in neutral rate (46). Nevertheless, non-coding substitutions close to human LCT or CYP1A2 genes have been identified which appear to affect their expression levels. These genes encode lactase and cytochrome P450 1A2, and the identified alleles have been associated with acquired tolerances to lactose or toxins (47–49).
An approach for detecting adaptive evolution within modern populations is to identify genomic regions that show evidence of a recent selective sweep (50). Such regions of diminished sequence variation and high linkage disequilibrium are indeed enriched in the vicinity of protein-coding genes, such as those involved in the immune response and sensory perception, which are most expected to have been the targets of positive selection (51). Although a powerful approach, it only identifies relatively large intervals within which the site subject to positive selection still remains to be identified.
INDELS AND HETEROGENEOUS SELECTION
Identifying the substrates of adaptive evolution would be more straightforward if the functional portion of the genome were already identified and separated from the sea of neutral sequence in which it lies scattered. The method of Chiaromonte and coworkers (1,2) that predicted ∼5% of sequence to be conserved, and thus functional, cannot by itself exactly pinpoint the conserved bases, whereas other phylogenetic approaches are able to do so (17). Unfortunately, although conservation over long time spans does imply function (36), not all functional sequence is conserved. Consequently, these methods are less effective at pinpointing sequences that have evolved by positive selection than they are at identifying selectively purified sequence. What was required instead was a robust method to identify rapidly substituting, yet functional, sequence among all non-coding genomic regions.
To this end, we sought first to identify genomic regions that demonstrate another hallmark of purifying selection, the purging of inserted or deleted nucleotides, to obtain a set of likely functional sequence. Then, among these regions, we kept only those that, additionally, exhibit nucleotide substitutions at rates exceeding the expected neutral rates. We termed this confluence of purifying selection on indels and of positive selection on nucleotide substitutions as heterogeneous selection (3).
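The two-step logic of this definition can be sketched as a simple filter. This is a deliberately simplified illustration and not the published method (3), which is statistical: the `Region` record, its field names and the hard thresholds (zero indels, substitution rate strictly above the local neutral rate) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical summary of one aligned genomic region."""
    name: str
    indels: int          # indels fixed since the common ancestor
    subs_rate: float     # observed substitution rate in the region
    neutral_rate: float  # local estimate of the neutral substitution rate


def heterogeneous_candidates(regions):
    """Keep indel-free regions whose substitution rate exceeds the local neutral rate."""
    # Step 1: purifying selection on indels (simplified here to 'no indels').
    indel_purified = [r for r in regions if r.indels == 0]
    # Step 2: substitutions faster than expected under neutrality.
    return [r for r in indel_purified if r.subs_rate > r.neutral_rate]


regions = [
    Region("A", indels=0, subs_rate=0.60, neutral_rate=0.45),  # passes both steps
    Region("B", indels=0, subs_rate=0.20, neutral_rate=0.45),  # indel-free but slow
    Region("C", indels=3, subs_rate=0.70, neutral_rate=0.45),  # fast but has indels
]
print([r.name for r in heterogeneous_candidates(regions)])  # ['A']
```

Region B in this toy example is what conventional conservation-based methods would detect, whereas region A is the kind of sequence the heterogeneous selection criterion is intended to capture.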
Specifically, we identified 54.4 Mb of sequence in which no indels had been fixed since the common ancestor of human and mouse (3). This set contained a predicted 1% of false positives that have evolved neutrally but had escaped any indel events purely by chance. If no regions of the human genome evolved by heterogeneous selection, then we might expect only 1% of this 54.4 Mb set to exhibit nucleotide substitution rates at, or over, their neutral rates. Nevertheless, we observed five times this amount of rapidly substituting sequence within the indel-purified 54.4 Mb (Fig. 1). This implies the strong admixture of sequences whose nucleotide substitutions either have been under relaxed constraints or have often been subject to positive selection. Much of this indel-purified and sequence-divergent DNA is within known functional material, such as protein-coding exons, and it is substantially under-represented within transposable elements. This suggests that the non-coding portion of this material contains functional sequence that has been the target of positive selection on nucleotide substitutions.
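The expectation described in this section can be made explicit with a short calculation. The inputs (54.4 Mb, a 1% false-positive fraction and a five-fold excess) are taken from the text; treating 'five times' as an exact multiplier is an assumption, and the function and key names are hypothetical. The resulting excess is not itself an estimate of adaptive sequence, because part of it may reflect relaxed constraint rather than positive selection.

```python
def rapid_sequence_summary(total_mb=54.4, false_positive_fraction=0.01, observed_fold=5.0):
    """Expected and observed amounts (in Mb) of rapidly substituting, indel-purified sequence."""
    # Amount expected if only neutrally evolved false positives substitute rapidly.
    expected_mb = total_mb * false_positive_fraction
    # Amount observed, taking the reported fold-excess at face value.
    observed_mb = expected_mb * observed_fold
    return {
        "expected_mb": expected_mb,
        "observed_mb": observed_mb,
        "excess_mb": observed_mb - expected_mb,
    }


for key, value in rapid_sequence_summary().items():
    print(f"{key}: {value:.3f}")
```

Under these assumptions, roughly 0.5 Mb would be expected by chance, against roughly 2.7 Mb observed.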
AMOUNT OF POSITIVELY SELECTED SEQUENCE IN NON-CODING REGIONS
These results lead us to consider the possibility that ∼1 Mb of sequence has evolved adaptively. This low value (∼0.03% of the human genome) is consistent with the results of others who compared human polymorphism with human–chimpanzee divergence data (52).
This small proportion pales in comparison with the estimated 20% contribution from positive selection to the divergence between the fruit fly Drosophila melanogaster and its sister species D. simulans in intronic and intergenic sequence (53). [This is in addition to the large contribution to fruit fly amino acid sequence divergence predicted to arise from adaptive evolution (54–57).] It is to be expected that adaptive evolution would impact most on species such as fruit flies, whose effective population sizes are considerably larger than those of mammals, simply because selection on mutations is more efficient (58). However, although likely to be less widespread, it would be curious if positive selection were not also to have acted upon mammalian introns and intergenic sequences as it has on fruit fly sequences.
For four reasons, the 1 Mb of adaptive human DNA may be a considerable under-estimate. First, the method necessarily only exploits orthologous regions that retain sufficient resemblance to allow their accurate alignment. Lineage-specific or orthologous segments whose sequences have diverged greatly as a consequence of positive selection therefore cannot be aligned and are not counted towards the genomic total. For example, it has been shown that sequences unalignable between human and mouse often contain structural RNA elements (59). Secondly, sequences are often not included if they have recently gained function, owing to their sequence divergence being intermediate between those of neutral and constrained sequences. Thirdly, the method misses adaptive sequence within which selection has not acted heterogeneously, but instead has driven both beneficial indels and substitutions to fixation. Finally, it also overlooks positively selected sites, or short regions, that are scattered among a majority of constrained bases. Conversely, the 1 Mb total may wrongly include sequences that have not evolved adaptively, such as regions that either have lost constraint recently or have evolved by a combination of constraint and neutrality, because a minority of these regions will have accumulated high numbers of unselected substitutions purely by chance. The 1 Mb of sequence under positive selection should thus be considered an approximate first estimate.
More recently, we exploited this signature of selection upon indels to conduct a genome-wide scan for positive selection on small functional intronic elements. We find such elements to be especially abundant in the introns of genes that are expressed in the brain (Lunter and Ponting, submitted for publication). These results are consistent with a recent study of human sequences whose evolution has been rapid only in the few million years since our last common ancestor with chimpanzees. Haussler and coworkers found that genes involved in transcriptional regulation or in neurodevelopment are significantly associated with such human accelerated regions (HARs). One particularly striking example involves a 118 bp region (HAR1) that is expressed specifically in Cajal–Retzius neurons. Exceptionally, this region, which folds into a stable RNA structure, has undergone 18 base changes since the human–chimpanzee ancestor, of which 10 have been compensatory and thus consistent with the predicted secondary structure (K.S. Pollard, S.R. Salama and D. Haussler, personal communication). This example, perhaps the single most striking example of human-specific positive selection in non-coding regions to date, hints at a larger role of positive selection in non-coding sequence than hitherto appreciated.
POSITIVE SELECTION AND TURN-OVER OF FUNCTIONAL SEQUENCE
The division of genomic DNA into the well-known trichotomy of neutral, conserved and positively selected sites is, of course, an over-simplification. In particular, it does not consider sequence whose functionality has been intermittent over the long timescales separating mammalian species. The impermanence of functional sequence is most apparent within transcription factor binding sites (60). On the basis of limited experimental data, it is estimated that approximately one-third of these sites in human or rodents are not functional in the other species (61); a similar proportion is observed between two Drosophila species (62). Mammalian promoter and transcription start sites also appear to have been particularly prone to rapid evolution due to possible contributions from elevated mutation rates, reduced constraints, redundancy and positive selection (63,64).
If selection were often to be fleeting, rather than permanent, it would begin to explain the increasingly common identification of functional sequence that has not been conserved between diverse mammals. There are thousands of newly identified Piwi-interacting RNAs, for example, that are not conserved between mouse and more distant species (65). More generally, large numbers of non-coding sequences exhibit divergence levels, between mouse and either human or rat, that are similar to those of putatively neutral sequence (66–68).
If such sequences are indeed rapidly interchanging between neutrality and functionality, then our model organisms will not yield experimental findings on these sequences that are sufficiently relevant to human biology. Comparative genomics will remain central to the study of selection, but current evolutionary models and statistical techniques will need to be adapted to cope with transiently selected sequence. Moreover, the genomic data we currently have to hand will be too coarse-grained: we will need to determine the genome sequences of more nearly related species in order to investigate the more rapid fluctuations of selection relevant to the biology of our own species.
ACKNOWLEDGEMENTS
C.P.P. would like to thank Professor John Mattick (University of Queensland) for his generous hospitality during the writing of this review. We gratefully acknowledge the financial support of the UK Medical Research Council.
Conflict of Interest statement. None declared.