Characterization of frequencies and distribution of single nucleotide insertions/deletions in the human genome

doi:10.1016/j.gene.2006.04.009

Gene

Volume 376, Issue 2, 19 July 2006, Pages 268-280


Abstract

Most studies on single nucleotide variations address substitutions rather than insertions/deletions. In this study, we examined the distribution and characteristics of single nucleotide insertions/deletions (SNindels), using data available from dbSNP for all the human chromosomes. There are almost 300,000 SNindels in the database, of which only 0.8% are validated. They occur at a frequency of 0.887 per 10 kb on average for the whole genome, or approximately 1 for every 11,274 bp. More than half occur in regions with mononucleotide repeats, the longest of which is 47 bases. Overall, the mononucleotide repeats involving C and G are much shorter than those involving A and T. About 12% are surrounded by palindromes.

There is a general correlation between chromosome size and the total number of SNindels on each chromosome. Inter-chromosomal variation in density ranges from 0.6 to 21.7 per kilobase. The overall spectrum shows a very high proportion of SNindels of types –/A and –/T, at over 81%. The proportion of –/A and –/T SNindels on each chromosome is correlated with its AT content. Less than half of the SNindels are within or near known genes and even fewer (< 0.5%) are in coding regions; more than 1.4% of –/C and –/G SNindels are in coding regions, compared to 0.2% for the –/A and –/T types. SNindels of –/A and –/T types make up 80% of those found within untranslated regions but less than 40% of those within coding regions.

A separate analysis using the subset of 2324 validated SNindels showed a slightly lower AT bias of 74%, and SNindels not within mononucleotide repeats showed an even lower AT bias of 58%. The density of validated SNindels is 0.007/10 kb overall, and 90% are found within or near genes. Among all chromosomes, Y has the lowest numbers and densities for all SNindels, validated SNindels, and SNindels not within repeats.

Introduction

Since the completion of the first draft of the human genome, the study of the human genome sequence has shifted its focus to the identification and characterization of variations. However, most such studies address substitutions rather than insertions/deletions, particularly single nucleotide polymorphisms (SNPs) involving only single base substitutions. As the most abundant class of polymorphisms, SNPs have the advantage of being available for any region of the DNA, including exons, introns, promoters and other regulatory regions. Their ubiquitous presence throughout all chromosomes means they can be used as very high density markers for linkage studies. They are also easy and inexpensive to genotype. However, SNPs are usually biallelic, which limits their polymorphism information content because the maximum heterozygosity is only 0.5.
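The 0.5 ceiling on heterozygosity for a biallelic marker can be checked with a few lines of arithmetic. The sketch below assumes the standard expected heterozygosity H = 1 - Σp² under Hardy-Weinberg proportions; the function name `expected_heterozygosity` is illustrative and not taken from the paper.

```python
def expected_heterozygosity(allele_freqs):
    """Return H = 1 - sum(p_i^2) for a list of allele frequencies.

    Assumes Hardy-Weinberg proportions; the frequencies must sum to 1.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# A biallelic marker (e.g. a typical SNP) peaks when both alleles are equally common.
biallelic_max = expected_heterozygosity([0.5, 0.5])

# A hypothetical marker with four equally common alleles can be more informative.
four_allele = expected_heterozygosity([0.25, 0.25, 0.25, 0.25])

print(biallelic_max, four_allele)
```

Any other pair of frequencies for a biallelic marker gives a value below 0.5, which is why markers with more alleles, such as microsatellites, can carry more information per locus.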

Insertion/deletion polymorphisms (indels) involve a difference in length between alleles. The difference can be anything from 1 base pair (bp) to hundreds or even more than a thousand. Indels can arise from random events during replication, when slippage occurs during base-pairing in regions with repeating units, or from the highly organized actions of transposable elements. Insertions and deletions are common variations in the genomes of many organisms. Based on size, they can be grouped into different classes: single nucleotide indels (SNindels), microsatellites, short interspersed repeat elements (SINEs), and long interspersed repeat elements (LINEs).

For studies on insertions/deletions, much of the earlier work has been on transposable elements such as SINEs and LINEs. The largest group of SINEs, Alu insertion elements, are approximately 300 bp long and could number as many as 1 million per haploid genome (Rubin et al., 1980, Gu et al., 2000, International Human Genome Sequencing Consortium, 2001). They may be inherited through the germ line or arise through de novo transposition. LINEs are longer repeats of around 6 kb/unit. Although they are less abundant, there are still more than 800,000 copies and they might make up as much as 15–17% of the genome (Smit, 1996, Gu et al., 2000, Lander et al., 2001, Lutz et al., 2003).

For smaller indels, previous studies have concentrated on microsatellites, which are repeats of very short simple sequences with multiple alleles. The last decade saw their emergence as tools for linkage mapping of disease genes. They occur as tandem repeats of typically hundreds of units at each locus. The variable lengths of microsatellites are, in a way, due to insertion or deletion of multiple repeat units. They have been used for linkage analysis in genome scans for disease gene mapping and for individual identification because of their high frequency, heterozygosity and polymorphism information content. Microsatellites have also been used for assessment of genetic diversity and tracing of parental lineage because of their greater polymorphism information content and allelic heterogeneity compared to other types of polymorphisms. However, most of the studies were done with di- or tetranucleotide repeats, owing to greater ease of scoring and better characterization of their size range and allele frequencies.

Compared to single nucleotide substitutions and microsatellites, single nucleotide insertions/deletions (SNindels) are a neglected area in the study of genomic variation. To date, no study has systematically evaluated the distribution and density of SNindels in the human genome as a whole or among the different chromosomes. In this study, we determined the frequency of the 4 different SNindel types and the variation in density between different human chromosomes. We also examined the composition of the sequences immediately before and after each SNindel to investigate the relationship between the observed variation and known mutation mechanisms, and to identify patterns that might shed light on context-sensitive mechanisms affecting biological processes such as replication and homologous recombination. Finally, we compared the pattern of distribution of the 4 types in the intronic, exonic and non-gene regions.
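As an illustration of what examining the flanking sequence can involve, the sketch below measures the mononucleotide run that an inserted or deleted base would sit in and tests whether the two flanks form a palindrome. The function names, the `arm` length of 4 and the palindrome criterion (left arm equal to the reverse complement of the right arm) are assumptions made for this example, not the definitions used in the study.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def homopolymer_context(left_flank, indel_base, right_flank):
    """Length of the mononucleotide run containing the indel base.

    Counts the indel base itself plus identical bases directly adjacent
    to it in the left (5') and right (3') flanks.
    """
    base = indel_base.upper()
    run = 1
    for ch in reversed(left_flank.upper()):
        if ch != base:
            break
        run += 1
    for ch in right_flank.upper():
        if ch != base:
            break
        run += 1
    return run


def flanks_are_palindromic(left_flank, right_flank, arm=4):
    """True if the last `arm` bases of the left flank equal the reverse
    complement of the first `arm` bases of the right flank (assumed criterion)."""
    if len(left_flank) < arm or len(right_flank) < arm:
        return False
    return left_flank[-arm:].upper() == reverse_complement(right_flank[:arm].upper())


run_length = homopolymer_context("GCAAA", "A", "AAT")
palindromic = flanks_are_palindromic("TTGAAC", "GTTCAA")
print(run_length, palindromic)
```

A run length of 1 means the site is not inside a mononucleotide repeat under this definition; a study would also need to choose a minimum run length before calling a site repeat-associated.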

Section snippets

SNindel data and data mining

Data were based on NCBI dbSNP build 120 (updated on March 18, 2004) with a total of 293,601 entries recorded. SNindels for the human genome were downloaded from http://ftp.ncbi.nlm.nih.gov/dbSNP as zipped FASTA-format data for rs records, subdivided by chromosome. This format provides the flanking sequence for each report of variation in dbSNP, as well as all submitted sequences that have no variations. Variations which mapped to multiple chromosomes or did not map to any chromosome were
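A minimal sketch of how allele strings might be sorted into the four SNindel types (–/A, –/C, –/G, –/T) is given below. It assumes each record has already been reduced to an allele string such as "-/A"; the function names and the sample list are hypothetical, and the actual dbSNP record layout and the authors' filtering steps are not reproduced here.

```python
from collections import Counter

NUCLEOTIDES = {"A", "C", "G", "T"}


def snindel_type(alleles):
    """Return '-/X' if `alleles` describes a single nucleotide indel, else None.

    Accepts either allele order ('-/A' or 'A/-') and either dash character.
    """
    parts = [p.strip().upper() for p in alleles.replace("–", "-").split("/")]
    if len(parts) != 2 or "-" not in parts:
        return None
    other = parts[1] if parts[0] == "-" else parts[0]
    return f"-/{other}" if other in NUCLEOTIDES else None


def snindel_spectrum(allele_strings):
    """Count SNindel types, ignoring records that are not SNindels."""
    types = (snindel_type(a) for a in allele_strings)
    return Counter(t for t in types if t is not None)


# Hypothetical allele strings, for illustration only.
sample = ["-/A", "T/-", "-/A", "-/C", "A/G", "-/AT"]
spectrum = snindel_spectrum(sample)
at_share = (spectrum["-/A"] + spectrum["-/T"]) / sum(spectrum.values())
print(spectrum, at_share)
```

The AT share computed this way corresponds to the kind of proportion reported for the –/A and –/T types, here on toy data only.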

Frequency and distribution of SNindels among chromosomes

A total of 9,098,790 SNPs were recorded in the dbSNP database at that time, of which 3.2% or 293,601 were identified as SNindels. They come from all 22 autosomes and the 2 sex chromosomes. More than 20,000 were identified for each of the three largest chromosomes, and fewer than 5000 for each of the two smallest (Table 1A). Although the number of SNindels follows the normal distribution according to chromosome size (P > 0.15), correlation of chromosome size with number for each of the 4 types is not
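The headline proportions can be reproduced from the counts quoted in the text. The short calculation below uses only figures stated in this article (9,098,790 SNPs, 293,601 SNindels, 2324 validated SNindels, and a density of 0.887 per 10 kb); the variable names are illustrative.

```python
total_snps = 9_098_790       # all dbSNP entries at the time
snindels = 293_601           # entries identified as SNindels
validated_snindels = 2_324   # validated subset reported in the abstract
density_per_10kb = 0.887     # genome-wide average SNindel density

# Share of dbSNP entries that are SNindels, in percent.
snindel_percent = 100 * snindels / total_snps

# Share of SNindels that are validated, in percent.
validated_percent = 100 * validated_snindels / snindels

# Average spacing implied by the density: one SNindel every N base pairs.
bp_per_snindel = 10_000 / density_per_10kb

print(round(snindel_percent, 1), round(validated_percent, 1), round(bp_per_snindel))
```

Rounded, these give 3.2%, 0.8% and 11,274 bp, matching the values reported for the whole genome.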

Data validation and bias

In this study, our analysis used only data from the dbSNP database, which identifies variations through in silico comparison of DNA sequences submitted by different sources. As the existence of the identified SNindels was not validated experimentally, many of them may well turn out to be sequencing errors and not true SNindel polymorphisms. As an indication of the true existence of such variations identified through in silico alignment and catalogued in the database, only 0.8% of the single

References (34)

P.L. Deininger et al.
Alu repeats and human disease
Mol. Genet. Metab.
(1999)
E.P. Economou et al.
Detection of mutations in the factor VIII gene using single-stranded conformational polymorphism (SSCP)
Genomics
(1992)
Z. Gu et al.
Densities, length proportions, and other distributional features of repetitive sequences in the human genome estimated from 430 megabases of genomic sequence
Gene
(2000)
S.M. Lutz et al.
Allelic heterogeneity in LINE-1 retrotransposition activity
Am. J. Hum. Genet.
(2003)
M. Oldridge et al.
De novo Alu-element insertions in FGFR2 identify a distinct pathological basis for Apert syndrome
Am. J. Hum. Genet.
(1999)
R. Ophir et al.
Patterns and rates of indel evolution in processed pseudogenes from humans and murids
Gene
(1997)
A.F. Smit
The origin of interspersed repeats in the human genome
Curr. Opin. Genet. Dev.
(1996)
P. Stankiewicz et al.
Molecular-evolutionary mechanisms for genomic disorders
Curr. Opin. Genet. Dev.
(2002)
J.L. Weber et al.
Human diallelic insertion/deletion polymorphisms
Am. J. Hum. Genet.
(2002)
Z. Zhao et al.
Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution
Gene
(2003)

A. Astolfi et al.
Frequency and coverage of trinucleotide repeats in eukaryotes
Gene
(2003)
N.A. Chuzhanova et al.
Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity
Hum. Mutat.
(2003)
A.J. Gentles et al.
Genome-scale compositional comparisons in eukaryotes
Genome Res.
(2001)
International Human Genome Sequencing Consortium
Initial sequencing and analysis of the human genome
Nature
(2001)
S. Karlin et al.
Comparisons of eukaryotic genomic sequences
Proc. Natl. Acad. Sci. U. S. A.
(1994)
M.V. Katti et al.
Differential distribution of simple sequence repeats in eukaryotic genome sequences
Mol. Biol. Evol.
(2001)
E.S. Lander et al.
Initial sequencing and analysis of the human genome
Nature
(2001)

