Elsevier

Gene

Volume 376, Issue 2, 19 July 2006, Pages 268-280
Characterization of frequencies and distribution of single nucleotide insertions/deletions in the human genome

https://doi.org/10.1016/j.gene.2006.04.009

Abstract

Most studies of single nucleotide variations concern substitutions rather than insertions/deletions. In this study, we examined the distribution and characteristics of single nucleotide insertions/deletions (SNindels), using data available from dbSNP for all the human chromosomes. There are almost 300,000 SNindels in the database, of which only 0.8% are validated. They occur at a frequency of 0.887 per 10 kb on average for the whole genome, or approximately 1 for every 11,274 bp. More than half occur in regions with mononucleotide repeats, the longest of which is 47 bases. Overall, the mononucleotide repeats involving C and G are much shorter than those involving A and T. About 12% are surrounded by palindromes.

There is a general correlation between chromosome size and the total number of SNindels on each chromosome. Inter-chromosomal variation in density ranges from 0.6 to 21.7 per kilobase. The overall spectrum shows a very high proportion of SNindels of types –/A and –/T, at over 81%. The proportion of –/A and –/T SNindels on each chromosome is correlated with its AT content. Less than half of the SNindels are within or near known genes, and even fewer (< 0.183%) are in coding regions; more than 1.4% of –/C and –/G SNindels are in coding regions, compared to 0.2% for the –/A and –/T types. SNindels of the –/A and –/T types make up 80% of those found within untranslated regions but less than 40% of those within coding regions.

A separate analysis using the subset of 2324 validated SNindels showed a slightly lower AT bias of 74%, and SNindels not within mononucleotide repeats showed an even lower AT bias of 58%. The density of validated SNindels is 0.007/10 kb overall, and 90% are found within or near genes. Among all chromosomes, Y has the lowest numbers and densities for all SNindels, validated SNindels, and SNindels not within repeats.
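The densities quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the total of 293,601 SNindels is the dbSNP build 120 count reported in the data section, and the last step assumes that validated and unvalidated SNindels are counted over the same genome length.

```python
def bp_per_event(density_per_10kb: float) -> float:
    """Convert a density in events per 10 kb into the mean spacing in bp."""
    return 10_000 / density_per_10kb


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Figures taken from the text: 0.887 SNindels per 10 kb,
# 2324 validated SNindels out of 293,601 in total.
spacing = bp_per_event(0.887)
validated_pct = percent(2324, 293_601)

# Assumption: the validated density is the overall density scaled by
# the validated fraction (same genome length in both denominators).
validated_density = 0.887 * validated_pct / 100.0

print(f"{spacing:.0f} bp per SNindel")
print(f"{validated_pct:.2f}% validated")
print(f"{validated_density:.3f} validated SNindels per 10 kb")
```

Under these assumptions the three figures agree with the rounded values in the abstract: a spacing of about 11,274 bp, a validated fraction of about 0.8%, and a validated density of about 0.007 per 10 kb.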

Introduction

Since the completion of the first draft of the human genome, the study of the human genome sequence has shifted its focus to the identification and characterization of variations. However, most such studies focus on substitutions, particularly single nucleotide polymorphisms (SNPs) involving only single base substitutions, rather than on insertions/deletions. As the most abundant class of polymorphisms, SNPs have the advantage of being available for any region of the DNA, including exons, introns, promoters or regulatory regions. Their ubiquitous presence throughout all chromosomes means they can be used as very high density markers for linkage studies. They are also easy and inexpensive to genotype. However, SNPs are usually biallelic, which means the polymorphism information content is limited, as the maximum heterozygosity is only 0.5.
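The 0.5 ceiling on heterozygosity follows from the standard definition of expected heterozygosity, H = 1 − Σ pᵢ², which for two alleles with frequencies p and 1 − p reduces to 2p(1 − p) and peaks at p = 0.5. A minimal sketch of that calculation is given below; the function name is hypothetical and the definition assumes random mating (Hardy-Weinberg proportions).

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) for one locus.

    Assumes Hardy-Weinberg proportions; frequencies must sum to 1.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Scan biallelic loci with p = 0.00, 0.01, ..., 1.00 and take the maximum.
biallelic_max = max(
    expected_heterozygosity([i / 100, 1 - i / 100]) for i in range(101)
)

# A locus with four equally frequent alleles, for comparison.
four_allele = expected_heterozygosity([0.25, 0.25, 0.25, 0.25])

print(biallelic_max, four_allele)
```

The comparison with a four-allele locus illustrates why multiallelic markers can carry more information per locus than biallelic ones.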

Insertion/deletion polymorphisms (indels) involve a difference in length between alleles. The difference can be anything from 1 base pair (bp) to hundreds of bp, or even more than a thousand. Indels can arise from random events during replication, when there is slippage during base-pairing in regions with repeating units, or from the very organized actions of transposable elements. Insertions and deletions are common variations in the genomes of many organisms. Based on size, they can be grouped into different classes: single nucleotide indels (SNindels), microsatellites, short interspersed repeat elements (SINEs), and long interspersed repeat elements (LINEs).

For studies on insertions/deletions, much of the earlier work has been on transposable elements like SINEs and LINEs. The largest group of SINEs, Alu insertion elements, is approximately 300 bp long and could number as many as 1 million per haploid genome (Rubin et al., 1980; Gu et al., 2000; International Human Genome Sequencing Consortium, 2001). They may be inherited through the germ line or arise through de novo transposition. LINEs are longer repeats of around 6 kb/unit. Although they are less abundant, there are still more than 800,000 copies and they might make up as much as 15–17% of the genome (Smit, 1996; Gu et al., 2000; Lander et al., 2001; Lutz et al., 2003).

For smaller indels, previous studies have concentrated on microsatellites, which are repeats of very short simple sequences with multiple alleles. The last decade saw their emergence as tools for linkage mapping of disease genes. They occur as tandem repeats of typically hundreds of units at each locus. The variable lengths of microsatellites are in a way due to insertion or deletion of multiple repeat units. They have been used for linkage analysis in genome scans for disease gene mapping and for individual identification due to their high frequency, heterozygosity and polymorphism information content. Microsatellites have also been used for assessment of genetic diversity and tracing of parental lineage due to their greater polymorphism information content and allelic heterogeneity compared to other types of polymorphisms. However, most of the studies were done with di- or tetranucleotide repeats due to greater ease of scoring and better characterization of their size range and allele frequencies.

Compared to single nucleotide substitutions and microsatellites, single nucleotide insertions/deletions (SNindels) are a neglected area in the study of genomic variation. To date, no study has systematically evaluated the distribution and density of SNindels in the human genome as a whole or among the different chromosomes. In this study, we determined the frequency of the 4 different SNindel types and the variation in density between different human chromosomes. We also looked into the composition of the sequences immediately before and after the SNindel to investigate the relationship between observed variation and known mutation mechanisms, and to identify patterns which might shed light on context-sensitive mechanisms which affect biological processes such as replication and homologous recombination. Finally, we compare the pattern of distribution of the 4 types in the intronic, exonic and non-gene regions.
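One way to examine the sequence immediately before and after an SNindel is sketched below. This is an illustration under stated assumptions, not the procedure used in the study: the function names, the flank representation (two plain strings), and the palindrome window size `k` are all hypothetical, and "palindrome" is taken here to mean a sequence equal to its own reverse complement.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def run_length_with_indel(left_flank: str, base: str, right_flank: str) -> int:
    """Length of the mononucleotide run containing the indel base.

    Counts the inserted/deleted base itself plus identical bases
    directly adjacent to it in the 5' and 3' flanks.
    """
    base = base.upper()
    left = 0
    for ch in reversed(left_flank.upper()):
        if ch != base:
            break
        left += 1
    right = 0
    for ch in right_flank.upper():
        if ch != base:
            break
        right += 1
    return left + 1 + right


def flanks_form_palindrome(left_flank: str, right_flank: str, k: int = 3) -> bool:
    """True if the k bases on each side of the indel site together
    equal their own reverse complement (hypothetical criterion)."""
    if len(left_flank) < k or len(right_flank) < k:
        return False
    window = (left_flank[-k:] + right_flank[:k]).upper()
    return window == reverse_complement(window)


def is_at_type(base: str) -> bool:
    """True for SNindels of type -/A or -/T."""
    return base.upper() in {"A", "T"}


# Example: a -/A SNindel inside a run of A bases.
print(run_length_with_indel("ACGAAA", "A", "AATCG"))
# Example: flanks that together spell GAATTC, a reverse-complement palindrome.
print(flanks_form_palindrome("TTGAA", "TTCAA", k=3))
```

Applied to every record, functions like these would yield the run-length distribution, the share of SNindels in palindromic contexts, and the AT proportion discussed in the abstract, although the thresholds the authors actually used are not stated in this excerpt.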

SNindel data and data mining

Data were based on NCBI dbSNP build 120 (updated on March 18, 2004), with a total of 293,601 entries recorded. SNindels for the human genome were downloaded from http://ftp.ncbi.nlm.nih.gov/dbSNP as zipped FASTA-format data for rs records, subdivided by chromosome. This format provides the flanking sequence for each report of variation in dbSNP, as well as all submitted sequences that have no variations. Variations which mapped to multiple chromosomes or did not map to any chromosome were

Frequency and distribution of SNindels among chromosomes

A total of 9,098,790 SNPs were recorded in the dbSNP database at that time, of which 3.2% or 293,601 were identified as SNindels. They come from all 22 autosomes and the 2 sex chromosomes. More than 20,000 are identified for each of the three largest chromosomes, and fewer than 5000 for each of the two smallest (Table 1A). Although the number of SNindels follows the normal distribution according to chromosome size (P > 0.15), correlation of chromosome size with number for each of the 4 types is not

Data validation and bias

In this study, our analysis used only data from the dbSNP database, which identifies variations through in silico comparison of DNA sequences submitted by different sources. As the existence of the identified SNindels was not validated experimentally, many of them may well turn out to be sequencing errors and not true SNindel polymorphisms. As an indication of the true existence of such variations identified through in silico alignment and catalogued in the database, only 80% of the single

References (34)

  • A. Astolfi et al. Frequency and coverage of trinucleotide repeats in eukaryotes. Gene (2003)
  • N.A. Chuzhanova et al. Meta-analysis of indels causing human genetic disease: mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum. Mutat. (2003)
  • A.J. Gentles et al. Genome-scale compositional comparisons in eukaryotes. Genome Res. (2001)
  • International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature (2001)
  • S. Karlin et al. Comparisons of eukaryotic genomic sequences. Proc. Natl. Acad. Sci. U. S. A. (1994)
  • M.V. Katti et al. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. (2001)
  • E.S. Lander et al. Initial sequencing and analysis of the human genome. Nature (2001)