Journal of Molecular Biology
Identification and Analysis of Deleterious Human SNPs
Introduction
Knowledge of the human genome sequence,1, 2 together with the large number of single nucleotide polymorphisms (SNPs) known to be present in the human population,3, 4, 5, 6 opens the way for the development of a detailed understanding of the mechanisms by which genetic variation results in phenotype variation. In particular, it should now be possible to identify the contribution of SNPs to human disease. It is estimated that the human population has approximately one SNP with a frequency of more than 1% every 290 base-pairs, implying a total of about ten million.7 Missense SNPs, resulting in an amino acid change in a protein, are most accessible to analysis. There is an average of about four coding region SNPs per gene, half of which are non-synonymous or missense SNPs,8, 9 implying a total of about 50,000.
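The two totals quoted above follow from simple arithmetic, which the short sketch below reproduces. The genome length of about 3 × 10^9 base-pairs and the gene count of about 25,000 are assumptions not stated in the text (the gene count is simply the value that yields the quoted 50,000), and all names are illustrative.

```python
# Back-of-envelope check of the SNP totals quoted in the Introduction.
GENOME_LENGTH_BP = 3.0e9    # assumption: approximate haploid genome size
BP_PER_COMMON_SNP = 290     # from the text: one SNP (>1% frequency) per 290 bp
GENE_COUNT = 25_000         # assumption: implied by the quoted total of ~50,000
CODING_SNPS_PER_GENE = 4    # from the text
MISSENSE_FRACTION = 0.5     # from the text: half of coding SNPs are missense


def estimate_common_snps(genome_length_bp, bp_per_snp):
    """Number of common SNPs expected at the given density."""
    return genome_length_bp / bp_per_snp


def estimate_missense_snps(gene_count, coding_snps_per_gene, missense_fraction):
    """Number of missense SNPs expected across all genes."""
    return gene_count * coding_snps_per_gene * missense_fraction


total_snps = estimate_common_snps(GENOME_LENGTH_BP, BP_PER_COMMON_SNP)
missense_snps = estimate_missense_snps(
    GENE_COUNT, CODING_SNPS_PER_GENE, MISSENSE_FRACTION
)

print(f"common SNPs: about {total_snps / 1e6:.1f} million")
print(f"missense SNPs: about {missense_snps:,.0f}")
```

Under these assumptions the estimates come out at roughly ten million common SNPs and 50,000 missense SNPs, matching the figures in the text.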
Genetic mapping, especially linkage analysis,10 has successfully mapped more than 1000 human inheritable diseases to genes. Most of those are rare monogenic (one gene/one trait) diseases, following a simple Mendelian inheritance pattern. On the other hand, common human diseases, such as hypertension, diabetes, Alzheimer's disease, stroke, and heart disease, which follow a more complicated inheritance pattern, are proving much harder to analyze.11, 12, 13 Difficulties are caused by incomplete penetrance (a person carrying a predisposing allele may not exhibit the disease phenotype); genetic heterogeneity (mutations in any one of several genes may result in identical phenotypes); and polygenic inheritance (a trait is controlled by multiple gene interactions, such that each individual predisposing allele has a low risk factor and shows weak correlation with the disease trait). In addition, environmental factors may play an important role in shaping disease phenotypes. Genetic mapping does not provide direct insight into the relationship between the presence of a SNP and susceptibility to a particular disease. There are a variety of mechanisms that may be involved, including effects on transcription rate, protein folding, protein function and protein half-life.
Here, we analyze non-synonymous or missense SNPs, that is, those which change an amino acid, and so may affect protein folding, function, or half-life. More than half of the known causes of monogenic disease are single amino acid substitutions of this kind.10 We use two methods to identify which non-synonymous SNPs (nsSNPs) are deleterious to protein function. Both methods have been developed and tested on amino acid changes causative of monogenic disease, and a control set of single residue changes fixed between closely related mammalian species.14 One method analyzes the impact of amino acid changes on protein stability, making use of the three-dimensional structural environment.15 We find that the majority of single base changes that cause monogenic disease significantly destabilize the folded state of the protein concerned. The second method, reported here, makes use of the tendency of critical amino acids to be conserved within a protein family. The more conserved and restricted the type of amino acid at a position, the more likely that a substitution not consistent with that pattern will have a deleterious impact on protein function. This method is more general than the stability model, encompassing all types of protein-level effect. It is also more widely applicable, since it does not require knowledge of three-dimensional structure. On the other hand, it provides less direct insight into the mechanism by which a missense SNP affects protein function. The principles of sequence conservation methods have also been explored by others.14, 16, 17, 18, 19, 20, 21 We have used a machine learning method, the support vector machine, trained on five simple features that capture the relative sequence conservation at each position in a multiple sequence alignment. The support vector machine allows the identification of a subset of high confidence predictions. Both methods are carefully benchmarked. The use of two separate methods provides an additional means of assessing the reliability of the conclusions.
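To make the idea of position-specific conservation features concrete, the sketch below computes three simple quantities for one column of a multiple sequence alignment: the Shannon entropy of the column and the observed frequencies of the wild-type and substituted residues. These are illustrative stand-ins only. The five features actually used are not listed in this section, and the function name, argument names, and gap handling are all assumptions.

```python
import math
from collections import Counter


def column_features(column, wild_type, mutant):
    """Illustrative conservation features for one alignment column.

    column    -- residues observed at this position, one character per
                 aligned sequence; '-' marks a gap and is ignored
    wild_type -- residue in the reference protein at this position
    mutant    -- residue introduced by the substitution
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("alignment column contains only gaps")

    counts = Counter(residues)
    total = len(residues)
    freqs = {aa: n / total for aa, n in counts.items()}

    # Shannon entropy in bits: 0 for a fully conserved column,
    # larger for more variable columns.
    entropy = sum(-p * math.log2(p) for p in freqs.values())

    return {
        "entropy": entropy,
        "wild_type_freq": freqs.get(wild_type, 0.0),
        "mutant_freq": freqs.get(mutant, 0.0),
    }


# A fully conserved leucine column, with a substitution to proline.
conserved = column_features("LLLLLLLL", wild_type="L", mutant="P")

# A variable column in which the substituted residue is already observed.
variable = column_features("LLIV", wild_type="L", mutant="I")
```

In this toy setting, a substitution at the conserved position (zero entropy, substituted residue never observed) would be the more suspicious of the two. A feature vector of this kind, computed per substitution, is the sort of input a support vector machine could be trained on.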
The two methods have been used to analyze sets of non-synonymous SNPs found in the human population, extracted from the dbSNP database,4 and a subset of those for which population frequency data are available. The subset comprises data from Perlegen5 and the HapMap consortium.6 Using stringent criteria, we find that about one quarter of these SNPs are classified as deleterious at the same level as those causing monogenic disease in other genes. These are very likely to have a significant impact on protein function, and so probably contribute to complex disease traits, and provide a basis for prioritization in association studies.
We have also examined a number of aspects of the relationship between monogenic disease genes and the rest. First, we have compared the occurrence of deleterious SNPs in monogenic versus non-monogenic disease genes. We find that, whereas in monogenic disease genes nearly all deleterious SNPs occur at low frequency in the population, in other genes a larger proportion are found at high frequencies, consistent with the idea that the effect of deleterious SNPs in other genes is buffered. Second, we have looked at the rate of sequence divergence of monogenic versus other genes. An interesting variation with conservation level is found. Third, we have found that there is a correlation between the phenotypic impact of mouse knockouts and whether or not the orthologous human gene is implicated in monogenic disease. Finally, we have checked to see if monogenic disease genes are less likely to have paralogs than the others, exploring the idea that paralogs sometimes can provide substitute function. No such effect was found.
Training and testing data used for the classification methods
Table 1 summarizes the monogenic disease and control datasets used for training and benchmarking the sequence profile and structure stability methods. There were a total of 10,263 deleterious mutants in 731 proteins and 16,682 control substitutions in 348 proteins available. The profile model includes 92% and 71% of these, respectively, since profiles can be built for most proteins. In testing, high confidence (HC, |SVM score| > 0.5) classifications were obtained for over 80% of these.
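The high confidence criterion can be sketched as a simple threshold on the magnitude of the SVM score. In the sketch below, only the 0.5 cutoff comes from the text. The sign convention (negative scores taken as deleterious), the label strings, and the function names are assumptions made for illustration.

```python
HC_THRESHOLD = 0.5  # from the text: high confidence when |SVM score| > 0.5


def classify(score, threshold=HC_THRESHOLD):
    """Return (label, confidence) for one SVM score.

    Assumption: negative scores indicate a deleterious substitution and
    non-negative scores a neutral one; the text does not state the sign
    convention.
    """
    label = "deleterious" if score < 0 else "neutral"
    confidence = "high" if abs(score) > threshold else "low"
    return label, confidence


def high_confidence_fraction(scores, threshold=HC_THRESHOLD):
    """Fraction of scores whose magnitude exceeds the threshold."""
    if not scores:
        raise ValueError("no scores supplied")
    return sum(abs(s) > threshold for s in scores) / len(scores)


# Hypothetical scores, for illustration only.
example_scores = [-1.2, 0.3, 0.8, -0.1]
example_labels = [classify(s) for s in example_scores]
example_hc_fraction = high_confidence_fraction(example_scores)
```

Restricting attention to the high confidence subset trades coverage for reliability: substitutions scoring close to the decision boundary are set aside rather than classified.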
Discussion
The main conclusion of this study is that about one quarter of the known missense SNPs in the human population are significantly deleterious to protein function in vivo. Others have reported a figure of about one third.17, 18 It has also been suggested that the fraction is much lower,16 with false positives, errors in dbSNP, and known monogenic disease mutations inflating the apparent value. We have taken into account the effect of false positives and false negatives to obtain a corrected value
Construction of the deleterious variant dataset
The deleterious variants are a set of single amino acid substitutions known to cause monogenic disease. Genes associated with monogenic disease were identified by checking all 16,220 human gene names in the NCBI LocusLink35 database (as of 26 April 2002) against the Human Gene Mutation Database10 (HGMD) (as of 9 February 2002). HGMD contains the most comprehensive collection of mutations related to monogenic disease. Most cause monogenic disease, although a few may be associated with disease as
Acknowledgements
This work was supported by grant LM07174 from the National Library of Medicine. We thank Eugene Melamud for help with the database infrastructure, and many useful discussions.
References (38)
et al. SNP association studies in Alzheimer's disease highlight problems for complex disease analysis. Trends Genet. (2001)
et al. Loss of protein structure stability as a major causative factor in monogenic disease. J. Mol. Biol. (2005)
et al. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. (2001)
et al. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol. (2002)
et al. Human disease genes: patterns and predictions. Gene (2003)
et al. Susceptibility to infection and altered hematopoiesis in mice deficient in both P- and E-selectins. Cell (1996)
et al. The sequence of the human genome. Science (2001)
et al. Initial sequencing and analysis of the human genome. Nature (2001)
et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature (2001)
et al. dbSNP: the NCBI database of genetic variation. Nucl. Acids Res. (2001)
Whole-genome patterns of common DNA variation in three human populations. Science
Variation is the spice of life. Nature Genet.
Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nature Genet.
Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet.
Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat.
Mapping complex disease loci in whole-genome association studies. Nature
Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nature Genet.
Prediction of deleterious human alleles. Hum. Mol. Genet.