Characterization of frequencies and distribution of single nucleotide insertions/deletions in the human genome
Introduction
Since the completion of the first draft of the human genome, the study of the human genome sequence has shifted its focus to the identification and characterization of variations. However, most such studies address substitutions rather than insertions/deletions, particularly single nucleotide polymorphisms (SNPs), which involve only single base substitutions. As the most abundant class of polymorphisms, SNPs have the advantage of being available in any region of the DNA, including exons, introns, promoters and other regulatory regions. Their ubiquitous presence throughout all chromosomes means they can be used as very high density markers for linkage studies. They are also easy and inexpensive to genotype. However, SNPs are usually biallelic, which limits their polymorphism information content: the maximum heterozygosity of a biallelic marker is only 0.5.
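The 0.5 ceiling on heterozygosity for a biallelic marker follows from the standard expected-heterozygosity formula, H = 1 - sum(p_i^2), where p_i are the allele frequencies. The following is a minimal sketch of that arithmetic, assuming Hardy-Weinberg proportions; the function name and the toy frequencies are illustrative and do not come from the study.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) for allele frequencies p_i."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# Biallelic marker: scan the frequency of one allele from 0 to 1 in steps of 0.01
# and keep the largest heterozygosity seen.
biallelic_max = max(
    expected_heterozygosity([p / 100, 1 - p / 100]) for p in range(101)
)

# Hypothetical marker with four equally frequent alleles, for comparison.
four_allele = expected_heterozygosity([0.25] * 4)

print(biallelic_max, four_allele)
```

Under these assumptions the biallelic maximum is reached at equal allele frequencies, whereas a marker with four equally frequent alleles would reach 0.75, which is why multiallelic markers can carry more information per locus.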
Insertion/deletion polymorphisms (indels) involve a difference in length between alleles. The difference can range from 1 base pair (bp) to hundreds or even more than a thousand bp. Indels can arise from random events during replication, such as slippage during base-pairing in regions with repeating units, or from the highly organized action of transposable elements. Insertions and deletions are common variations in the genomes of many organisms. Based on size, they can be grouped into different classes: single nucleotide indels (SNindels), microsatellites, short interspersed repeat elements (SINEs), and long interspersed repeat elements (LINEs).
Among studies of insertions/deletions, much of the earlier work has been on transposable elements such as SINEs and LINEs. The largest group of SINEs, the Alu insertion elements, are approximately 300 bp long and may number as many as 1 million per haploid genome (Rubin et al., 1980; Gu et al., 2000; International Human Genome Sequencing Consortium, 2001). They may be inherited through the germ line or arise through de novo transposition. LINEs are longer repeats of around 6 kb per unit. Although they are less abundant, there are still more than 800,000 copies, and they might make up as much as 15–17% of the genome (Smit, 1996; Gu et al., 2000; Lander et al., 2001; Lutz et al., 2003).
For smaller indels, previous studies have concentrated on microsatellites, which are repeats of very short simple sequences with multiple alleles. The last decade saw their emergence as tools for linkage mapping of disease genes. They occur as tandem repeats of typically hundreds of units at each locus. The variable lengths of microsatellites are, in effect, due to the insertion or deletion of multiple repeat units. They have been used for linkage analysis in genome scans for disease gene mapping and for individual identification because of their high frequency, heterozygosity and polymorphism information content. Microsatellites have also been used to assess genetic diversity and to trace parental lineage because of their greater polymorphism information content and allelic heterogeneity compared to other types of polymorphisms. However, most of these studies were done with di- or tetranucleotide repeats, which are easier to score and whose size ranges and allele frequencies are better characterized.
Compared to single nucleotide substitutions and microsatellites, single nucleotide insertions/deletions (SNindels) are a neglected area in the study of genomic variation. To date, no study has systematically evaluated the distribution and density of SNindels in the human genome as a whole or among the different chromosomes. In this study, we determined the frequency of the 4 different SNindels and the variation in density between different human chromosomes. We also examined the composition of the sequences immediately before and after the SNindel to investigate the relationship between observed variation and known mutation mechanisms, and to identify patterns that might shed light on context-sensitive mechanisms affecting biological processes such as replication and homologous recombination. Finally, we compared the pattern of distribution of the 4 types in the intronic, exonic and non-gene regions.
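The flanking-sequence analysis described above can be pictured as a simple tally of the bases immediately 5' and 3' of each SNindel, grouped by the inserted or deleted base (A, C, G or T). The sketch below is only an illustration of that idea and is not the authors' pipeline: the record layout (5' flank, indel base, 3' flank), the function name and the toy sequences are all assumptions.

```python
from collections import Counter


def flank_context(records):
    """Tally the base just before and just after each SNindel, per indel base.

    `records` is an iterable of (left_flank, indel_base, right_flank) tuples;
    this layout is hypothetical and chosen only for illustration.
    """
    before = {b: Counter() for b in "ACGT"}
    after = {b: Counter() for b in "ACGT"}
    for left, base, right in records:
        base = base.upper()
        # Skip ambiguous bases and records with a missing flank.
        if base not in before or not left or not right:
            continue
        before[base][left[-1].upper()] += 1
        after[base][right[0].upper()] += 1
    return before, after


# Toy records, invented for this example.
toy = [
    ("GGCA", "A", "ATTC"),  # A indel inside a run of A
    ("TTTT", "T", "TGCA"),  # T indel inside a run of T
    ("ACGC", "G", "CCAT"),  # G indel flanked by C on both sides
    ("ACGT", "N", "ACGT"),  # ambiguous base, ignored
]
before, after = flank_context(toy)
print(before["A"], after["G"])
```

Comparing such tallies across the 4 SNindel types is one way to see whether an indel base tends to match its neighbours, as would be expected if slippage in homopolymer runs contributes to the observed variation.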
Section snippets
SNindel data and data mining
Data were drawn from NCBI dbSNP build 120 (updated on March 18, 2004), with a total of 293,601 entries recorded. SNindels for the human genome were downloaded from http://ftp.ncbi.nim.nih.gov/dbSNP as zipped FASTA-format data for rs records, subdivided by chromosome. This format provides the flanking sequence for each report of variation in dbSNP, as well as all submitted sequences that have no variations. Variations which mapped to multiple chromosomes or did not map to any chromosome were
Frequency and distribution of SNindels among chromosomes
A total of 9,098,790 SNPs were recorded in the dbSNP database at that time, of which 3.2% or 293,601 were identified as SNindels. They come from all 22 autosomes and the 2 sex chromosomes. More than 20,000 were identified for each of the three largest chromosomes, and fewer than 5,000 for the two smallest (Table 1A). Although the number of SNindels follows the normal distribution according to chromosome size (P > 0.15), correlation of chromosome size with number for each of the 4 types is not
Data validation and bias
In this study, our analysis used only data from the dbSNP database, which identifies variations through in silico comparison of DNA sequences submitted by different sources. As the existence of the identified SNindels was not validated experimentally, many of them may well turn out to be sequencing errors and not true SNindel polymorphisms. As an indication of the true existence of such variations identified through in silico alignment and catalogued in the database, only 80% of the single
References (34)
- et al., Alu repeats and human disease, Mol. Genet. Metab. (1999)
- et al., Detection of mutations in the factor VIII gene using single-stranded conformational polymorphism (SSCP), Genomics (1992)
- et al., Densities, length proportions, and other distributional features of repetitive sequences in the human genome estimated from 430 megabases of genomic sequence, Gene (2000)
- et al., Allelic heterogeneity in LINE-1 retrotransposition activity, Am. J. Hum. Genet. (2003)
- De novo Alu-element insertions in FGFR2 identify a distinct pathological basis for Apert syndrome, Am. J. Hum. Genet. (1999)
- et al., Patterns and rates of indel evolution in processed pseudogenes from humans and murids, Gene (1997)
- The origin of interspersed repeats in the human genome, Curr. Opin. Genet. Dev. (1996)
- et al., Molecular-evolutionary mechanisms for genomic disorders, Curr. Opin. Genet. Dev. (2002)
- et al., Human diallelic insertion/deletion polymorphisms, Am. J. Hum. Genet. (2002)
- et al., Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution, Gene (2003)
- Frequency and coverage of trinucleotide repeats in eukaryotes, Gene
- Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity, Hum. Mutat.
- Genome-scale compositional comparisons in eukaryotes, Genome Res.
- Initial screening and analysis of the human genome, Nature
- Comparisons of eukaryotic genomic sequences, Proc. Natl. Acad. Sci. U. S. A.
- Differential distribution of simple sequence repeats in eukaryotic genome sequences, Mol. Biol. Evol.
- Initial sequencing and analysis of the human genome, Nature