Review
The essence of SNPs
Introduction
The Human Genome Project (HGP) is progressing rapidly, with over one million partial cDNA sequences and approximately 10% of a ‘reference’ genomic sequence now in public databases. With this advance has come an appreciation of the need to also study naturally occurring sequence variations, i.e. to understand human DNA polymorphism, about 90% of which is single nucleotide polymorphism (SNP) (Collins et al., 1998). Significant efforts towards large-scale characterisation of human SNPs have been initiated in the last year or so, a somewhat late stage given that almost two decades ago the original incarnation of SNPs [as restriction fragment length polymorphisms (RFLPs)] clearly indicated the existence of widespread subtle genome variation. Now, the renewed and extensive interest in genome polymorphism signifies a development in human genetics research that will have a major impact upon population genetics, drug development, forensics, cancer and genetic disease research. One consequence of all this activity is that the acronym ‘SNP’ (pronounced ‘S’ ‘N’ ‘P’ or ‘SNiP’) has appeared in many diverse articles and reviews, leading many to ponder “what are SNPs and why all the fuss?”. This review is an attempt to answer these questions.
We can start with a working definition — SNPs are single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater. Thus, single base insertion/deletion variants (indels) would not formally be considered to be SNPs. In principle, SNPs could be bi-, tri-, or tetra-allelic polymorphisms. However, in humans, tri-allelic and tetra-allelic SNPs are rare almost to the point of non-existence, and so SNPs are sometimes simply referred to as bi-allelic markers (or di-allelic to be etymologically correct). This is somewhat misleading because SNPs are only a subset of all possible bi-allelic polymorphisms (e.g. indels, multiple base variations).
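The working definition above can be expressed as a short decision rule. The following is a minimal sketch, not a method from this review: the function name `classify_site`, the `-` code for a deleted base, and the returned labels are all illustrative assumptions. It reads the definition literally, so the least frequent allele observed in the sample must reach 1%.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}
ALLELE_COUNT_NAMES = {2: "bi-allelic", 3: "tri-allelic", 4: "tetra-allelic"}


def classify_site(observed_alleles):
    """Classify one genomic position from alleles seen on sampled chromosomes.

    `observed_alleles` holds one single-character allele per chromosome.
    Any character outside A/C/G/T (for example '-', a hypothetical code
    for a deleted base) is treated as a non-SNP variant.
    """
    counts = Counter(observed_alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no chromosomes were sampled")
    if any(allele not in BASES for allele in counts):
        return "not an SNP (indel or other variant)"
    if len(counts) == 1:
        return "monomorphic in this sample"
    # Integer form of "least frequent allele >= 1%", avoiding float rounding.
    if 100 * min(counts.values()) < total:
        return "rare variant (least frequent allele below 1%)"
    return ALLELE_COUNT_NAMES[len(counts)] + " SNP"


if __name__ == "__main__":
    print(classify_site("C" * 95 + "T" * 5))   # 5% minor allele
    print(classify_site("C" * 999 + "T" * 1))  # 0.1% minor allele
    print(classify_site("C" * 50 + "-" * 50))  # single-base deletion
```

The result describes the sample only; whether the site meets the definition in "some population(s)" depends on who was sampled and how many chromosomes were tested.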
In practice, the term SNP is typically used more loosely than required by the above definition. Single base variants in cDNAs (cSNPs) are usually classed as SNPs since most of these will reflect underlying genomic DNA variants. This, however, ignores the possibility that they may be the result of RNA editing. Genomic DNA indels involving single or multiple bases are commonly discovered in SNP search efforts and so can become deposited in SNP lists and databases. In a similar way, such data sets also contain single base variants with an allele frequency of less than 1%. Complications with the above definition also exist. Specifically, some people might not want to consider disease predisposing single base variants to be SNPs, but the above definition would encompass such things as recessively acting, low penetrance dominant, quantitative trait loci, or risk associated alleles, since all of these will occur in some normal (non-diseased) individuals. Also the ‘some population’ component of the definition is limited by practical challenges of obtaining and surveying representative global population samples. Consequently, claims of non-polymorphic sequences should always be accompanied by statements of the actual populations and the numbers of chromosomes tested. Overall, it is therefore apparent that the term ‘SNP’ is being widely and imprecisely used as a catch-all label for many different types of subtle sequence variation. To maintain clarity within this review, I shall restrict myself to the SNP definition given above. I shall also use the term polymorphism consistently and correctly to refer to the set of alleles at a locus, rather than to any one allele alone.
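The point about reporting the number of chromosomes tested can be made concrete with a simple calculation. The sketch below assumes chromosomes are sampled independently and at random from a single population, which real surveys only approximate; the function names are illustrative. Under that assumption, an allele of frequency p is absent from all n sampled chromosomes with probability (1 - p)^n.

```python
import math


def prob_allele_missed(allele_freq, n_chromosomes):
    """Probability that none of n independently sampled chromosomes
    carries an allele whose population frequency is `allele_freq`."""
    if not 0.0 < allele_freq < 1.0:
        raise ValueError("allele_freq must lie strictly between 0 and 1")
    return (1.0 - allele_freq) ** n_chromosomes


def chromosomes_needed(allele_freq, detection_prob):
    """Smallest n for which the allele is seen at least once with
    probability >= detection_prob (same independence assumption)."""
    if not 0.0 < detection_prob < 1.0:
        raise ValueError("detection_prob must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - detection_prob) / math.log(1.0 - allele_freq))


if __name__ == "__main__":
    # A 1% allele is missed in roughly a third of 100-chromosome samples.
    print(round(prob_allele_missed(0.01, 100), 3))
    print(chromosomes_needed(0.01, 0.95))
```

On these assumptions, a site that appears non-polymorphic in 100 chromosomes could still carry an allele at the 1% threshold, which is why the sample size belongs alongside any such claim.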
SNP basics
Bi-allelic SNPs comprise four distinct types. Using the abbreviation X⇔Y (X1⇔Y1) to represent allelic nucleotides X and Y of an SNP on one DNA strand, with their base pairing nucleotides X1 and Y1 of the second strand shown in parentheses, then the four SNP alternatives include one transition C⇔T (G⇔A) and three transversions C⇔A (G⇔T), C⇔G (G⇔C), and T⇔A (A⇔T). This four-way classification is valid if one considers each DNA strand to be equivalent, so C⇔T (G⇔A) is an identical ‘mirror image’
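The four-way classification can be sketched in code by treating an allele pair and its complementary-strand pair as the same class. This is an illustrative sketch only; the function name `snp_class` and the returned tuple format are assumptions, not notation from the review.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# The four classes named in the text, keyed by the alleles on one strand.
CANONICAL_CLASSES = [
    (frozenset("CT"), "C⇔T (G⇔A)", "transition"),
    (frozenset("CA"), "C⇔A (G⇔T)", "transversion"),
    (frozenset("CG"), "C⇔G (G⇔C)", "transversion"),
    (frozenset("TA"), "T⇔A (A⇔T)", "transversion"),
]


def snp_class(allele_x, allele_y):
    """Return (class label, 'transition' or 'transversion') for two alleles
    observed on either DNA strand."""
    if allele_x not in COMPLEMENT or allele_y not in COMPLEMENT:
        raise ValueError("alleles must be one of A, C, G, T")
    if allele_x == allele_y:
        raise ValueError("a bi-allelic SNP needs two different alleles")
    pair = frozenset((allele_x, allele_y))
    # The same polymorphism as read from the opposite strand.
    mirror = frozenset((COMPLEMENT[allele_x], COMPLEMENT[allele_y]))
    for canonical, label, kind in CANONICAL_CLASSES:
        if canonical in (pair, mirror):
            return label, kind
    raise AssertionError("unreachable: every valid pair maps to a class")


if __name__ == "__main__":
    print(snp_class("G", "A"))  # same class as C/T read on the other strand
    print(snp_class("G", "T"))
```

Of the six unordered pairs of different bases, two collapse onto their strand mirrors (G/A onto C/T, G/T onto C/A), leaving the four classes listed above.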
SNP discovery and scoring
Significant efforts towards large-scale SNP discovery have now begun, in what started as something of a hectic race between industry and academia. Both camps appreciate the functional importance and practical utility of SNPs, and whilst the former is keen to secure intellectual property protection on them, the latter would generally like them to be available to all as a research tool. With so many SNPs out there to be gathered and no real indication as to which will be the most useful (with the
Population genetics and linkage disequilibrium
Population genetics is the study of the genetic composition of populations and the inter-relationships between them. The major research tool it uses is DNA polymorphism. Unfortunately, population genetics and human molecular genetics have in some ways been running along parallel research paths, with much population genetics effort over the last few decades being directed towards non-human organisms. With the new SNP era, these fields are beginning to interact far more closely. Population genetics
Complex phenotypes and genome variation
The myriad of human phenotype variations one might wish to study are likely to be caused by genetic and non-genetic (environmental) factors, as well as by an interplay between the two and even a sprinkling of chance events. Clearly, many clinical phenotypes do seem to have a considerable genetic component. The underlying genetic factors of relevance will be encoded in the spectrum of genomic variation that is primarily SNPs. Thus, risks of major common diseases such as cancer, cardiovascular
SNP based association studies
If a factor contributes an increased risk for disease occurrence, then that factor should be found at higher frequency in individuals with that disease compared to non-diseased controls, i.e. associated with the phenotype. A non-genetic example would be smoking which is associated with lung cancer (Vial, 1986), and a good genetic example would be the ε4 allele of the apolipoprotein E gene (APOE4) which is associated with Alzheimer's Disease (Strittmatter and Roses, 1995). In common diseases
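The idea of a factor being found at higher frequency in cases than in controls can be illustrated with a 2×2 table of allele counts. The sketch below is one common way to summarise such a table, using an odds ratio and a Pearson chi-square statistic with one degree of freedom; the review does not prescribe this test, and the function name and the counts in the example are hypothetical.

```python
def allele_association(case_risk, case_other, control_risk, control_other):
    """Odds ratio and Pearson chi-square (1 d.f.) for a 2x2 table of
    allele counts: risk allele vs. other allele, in cases vs. controls."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0 or b * c == 0:
        raise ValueError("table has an empty margin or undefined odds ratio")
    odds_ratio = (a * d) / (b * c)
    chi_square = n * (a * d - b * c) ** 2 / margins
    return odds_ratio, chi_square


if __name__ == "__main__":
    # Hypothetical counts: 40 of 100 case chromosomes carry the allele,
    # against 20 of 100 control chromosomes.
    odds_ratio, chi_square = allele_association(40, 60, 20, 80)
    print(round(odds_ratio, 2), round(chi_square, 2))
```

An odds ratio above 1 indicates that the allele is more frequent among cases. A large chi-square value suggests the difference is unlikely to be due to sampling alone, although an association of this kind does not by itself show that the allele is causal.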
Conclusions
An SNP revolution has begun which promises to challenge and stimulate DNA technologists, population geneticists, and molecular genetics researchers alike, and should bring them closer together than ever before. The field is new and important, with the consequence that much money is being spent with some very different ideas about what are the best initial experiments to perform. Industry is a major player, but joining forces with academia could be the most effective way to reach their goals, as
Acknowledgments
Thanks are given to J.D. Terwilliger for his critical reading of this manuscript. Input from members of our research group is recognised and appreciated. Generous financial support for the research activities that brought the author into this field, provided by the Swedish Medical Research Council, Professor U. Pettersson and the Beijer Foundation, is gratefully acknowledged.
References (73)
- et al. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Human Genet. (1998)
- et al. The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell (1994)
- et al. DNA methylation and mutation. Mutat. Res. (1993)
- et al. Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC. J. Mol. Biol. (1998)
- Contamination of the genome by very slightly deleterious mutations: why have we not died 100 times over? J. Theor. Biol. (1995)
- et al. Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am. J. Human Genet. (1998)
- et al. Mapping genes by drift-generated linkage disequilibrium. Am. J. Human Genet. (1998)
- et al. A 4 Mb high-density single nucleotide polymorphism-based map around human APOE. Genomics (1998)
- et al. Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet. Evol. (1996)
- et al. A missense mutation of the endothelin-B receptor gene in multigenic Hirschsprung's disease. Cell (1994)