Journal of Molecular Biology
Volume 356, Issue 5, 10 March 2006, Pages 1263-1274

Identification and Analysis of Deleterious Human SNPs

https://doi.org/10.1016/j.jmb.2005.12.025

We have developed two methods of identifying which non-synonymous single base changes have a deleterious effect on protein function in vivo. One method, described elsewhere, analyzes the effect of the resulting amino acid change on protein stability, utilizing structural information. The other method, introduced here, makes use of the conservation and type of residues observed at a base change position within a protein family. A machine learning technique, the support vector machine, is trained on single amino acid changes that cause monogenic disease, with a control set of amino acid changes fixed between species.

Both methods are used to identify deleterious single nucleotide polymorphisms (SNPs) in the human population. After carefully controlling for errors, we find that approximately one quarter of known non-synonymous SNPs are deleterious by these criteria, providing a set of possible contributors to human complex disease traits.

Introduction

Knowledge of the human genome sequence,1, 2 together with a large number of single nucleotide polymorphisms (SNPs) present in the human population,3, 4, 5, 6 opens the way for the development of a detailed understanding of the mechanisms by which genetic variation results in phenotypic variation. In particular, it should now be possible to identify the contribution of SNPs to human disease. It is estimated that the human population has approximately one SNP with a frequency of more than 1% every 290 base-pairs, implying a total of about ten million.7 Missense SNPs, resulting in an amino acid change in a protein, are most accessible to analysis. There is an average of about four coding region SNPs per gene, half of which are non-synonymous or missense SNPs,8, 9 implying a total of about 50,000.
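The two totals quoted above follow from simple arithmetic, which can be checked with a short sketch. The genome length of roughly three billion base-pairs is an assumption supplied here (it is not stated in the text), and the gene count is not given either; it is merely the value implied by the quoted figures.

```python
# Back-of-the-envelope check of the SNP counts quoted in the Introduction.
GENOME_BP = 3.0e9          # ASSUMPTION: approximate haploid genome length
BP_PER_SNP = 290           # from the text: one common SNP per ~290 base-pairs
CODING_SNPS_PER_GENE = 4   # from the text
MISSENSE_FRACTION = 0.5    # from the text: half are non-synonymous
TOTAL_MISSENSE = 50_000    # from the text

# Total common SNPs implied by the density estimate.
total_snps = GENOME_BP / BP_PER_SNP

# Gene count implied by ~2 missense SNPs per gene and ~50,000 in total.
implied_genes = TOTAL_MISSENSE / (CODING_SNPS_PER_GENE * MISSENSE_FRACTION)

print(f"common SNPs: ~{total_snps / 1e6:.1f} million")
print(f"implied gene count: ~{implied_genes:,.0f}")
```

Under these assumptions the density estimate gives a little over ten million common SNPs, and the missense total corresponds to roughly 25,000 genes.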

Genetic mapping, especially linkage analysis,10 has successfully mapped more than 1000 human inheritable diseases to genes. Most of those are rare monogenic (one gene/one trait) diseases, following a simple Mendelian inheritance pattern. On the other hand, common human diseases, such as hypertension, diabetes, Alzheimer's disease, stroke, and heart disease, which follow a more complicated inheritance pattern, are proving much harder to analyze.11, 12, 13 Difficulties are caused by incomplete penetrance (a person carrying a predisposing allele may not exhibit the disease phenotype); genetic heterogeneity (mutations in one of several genes may result in identical phenotypes); and polygenic inheritance (a trait is controlled by multiple gene interactions, such that each individual predisposing allele has a low risk factor and shows weak correlation with the disease trait). In addition, environmental factors may also play an important role in shaping disease phenotypes. Genetic mapping does not provide direct insight into the relationship between the presence of a SNP and susceptibility to a particular disease. There are a variety of mechanisms that may be involved, including effects on transcription rate, protein folding, protein function and protein half-life.

Here, we analyze non-synonymous or missense SNPs, that is, those which change an amino acid, and so may affect protein folding, function, or half-life. More than half of the known causes of monogenic disease are single amino acid substitutions of this kind.10 We use two methods to identify which non-synonymous SNPs (nsSNPs) are deleterious to protein function. Both methods have been developed and tested on amino acid changes causative of monogenic disease, and a control set of single residue changes fixed between closely related mammalian species.14 One method analyzes the impact of amino acid changes on protein stability, making use of the three-dimensional structural environment.15 We find that the majority of single base changes that cause monogenic disease significantly destabilize the folded state of the protein concerned. The second method, reported here, makes use of the tendency of critical amino acids to be conserved within a protein family. The more conserved and restricted the type of amino acid at a position, the more likely that a substitution not consistent with that pattern will have a deleterious impact on protein function. This method is more general than the stability model, including all types of protein level effect. It is also more widely applicable, since it does not require knowledge of three-dimensional structure. On the other hand, it provides less direct insight into the mechanism by which a missense SNP affects protein function. The principles of sequence conservation methods have also been explored by others.14, 16, 17, 18, 19, 20, 21 We have used a machine learning method, the support vector machine, trained on five simple features that capture the relative sequence conservation at each position in a multiple sequence alignment. The support vector machine allows the identification of a subset of high confidence predictions. Both methods are carefully benchmarked. The use of two separate methods provides an additional means of assessing the reliability of the conclusions.
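To illustrate the kind of per-position conservation signal described above, the following minimal sketch computes two quantities for one column of a multiple sequence alignment. These are illustrative stand-ins chosen here, not the five features used in the paper (which are not listed in this excerpt), and the toy alignment is hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, gaps ignored."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def variant_frequency(column, variant):
    """Fraction of non-gap residues in the column that equal the variant residue."""
    residues = [r for r in column if r != "-"]
    return residues.count(variant) / len(residues) if residues else 0.0

# HYPOTHETICAL toy alignment: one string per family member, equal lengths.
alignment = ["MKLVG", "MKIVG", "MRLVG", "MKLAG"]
position = 2  # zero-based column index of the substituted residue
column = [seq[position] for seq in alignment]  # ['L', 'I', 'L', 'L']

# Low entropy means a conserved position; a variant frequency of zero means
# the substituted residue is never observed in the family at this position.
print(column_entropy(column))
print(variant_frequency(column, "I"))  # substitution seen in the family
print(variant_frequency(column, "P"))  # substitution never seen in the family
```

A substitution to a residue that never appears at a low-entropy position is the case the conservation argument flags as most likely to be deleterious. In a full pipeline, values like these would form part of the feature vector passed to a classifier.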

The two methods have been used to analyze sets of non-synonymous SNPs found in the human population, extracted from the dbSNP database,4 and a subset of those for which population frequency data are available. The subset comprises data from Perlegen5 and the HapMap consortium.6 Using stringent criteria, we find that about one quarter of these SNPs are classified as deleterious at the same level as those causing monogenic disease in other genes. These are very likely to have a significant impact on protein function, and so probably contribute to complex disease traits, and provide a basis for prioritization in association studies.

We have also examined a number of aspects of the relationship between monogenic disease genes and the rest. First, we have compared the occurrence of deleterious SNPs in monogenic versus non-monogenic disease genes. We find that, whereas in monogenic disease genes nearly all deleterious SNPs occur at low frequency in the population, in other genes a larger proportion are found at high frequencies, consistent with the idea that the effect of deleterious SNPs in other genes is buffered. Second, we have looked at the rate of sequence divergence of monogenic versus other genes. An interesting variation with conservation level is found. Third, we have found that there is a correlation between the phenotypic impact of mouse knockouts and whether or not the orthologous human gene is implicated in monogenic disease. Finally, we have checked to see if monogenic disease genes are less likely to have paralogs than the others, exploring the idea that paralogs sometimes can provide substitute function. No such effect was found.


Training and testing data used for the classification methods

Table 1 summarizes the monogenic disease and control datasets used for training and benchmarking the sequence profile and structure stability methods. There were a total of 10,263 deleterious mutants in 731 proteins and 16,682 control substitutions in 348 proteins available. The profile model includes 92% and 71% of these respectively, since profiles can be built for most proteins. In testing, high confidence (HC, |SVM score| > 0.5) classifications were obtained for over 80% of these.
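The dataset coverage and the high confidence cut-off described above can be sketched as follows. The dataset sizes, percentages, and the 0.5 threshold come from the text; the list of SVM scores is hypothetical, and no sign convention for deleterious versus neutral is assumed.

```python
HC_THRESHOLD = 0.5  # from the text: high confidence means |SVM score| > 0.5

def is_high_confidence(score, threshold=HC_THRESHOLD):
    """True when the absolute SVM score exceeds the high confidence cut-off."""
    return abs(score) > threshold

# Approximate number of substitutions covered by the profile model.
profiled_deleterious = round(0.92 * 10_263)  # 92% of the deleterious mutants
profiled_control = round(0.71 * 16_682)      # 71% of the control substitutions

# HYPOTHETICAL scores, used only to show the filter in action.
scores = [-1.3, -0.4, 0.2, 0.9, 0.5]
hc_scores = [s for s in scores if is_high_confidence(s)]

print(profiled_deleterious, profiled_control)
print(f"{len(hc_scores)} of {len(scores)} classifications are high confidence")
```

Note that a score exactly at the threshold is not counted as high confidence, because the criterion in the text is a strict inequality.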

Discussion

The main conclusion of this study is that about one quarter of the known missense SNPs in the human population are significantly deleterious to protein function in vivo. Others have reported a figure of about one third.17, 18 It has also been suggested that the fraction is much lower,16 with false positives, errors in dbSNP, and known monogenic disease mutations inflating the apparent value. We have taken into account the effect of false positives and false negatives to obtain a corrected value

Construction of the deleterious variant dataset

The deleterious variants are a set of single amino acid substitutions known to cause monogenic disease. Genes associated with monogenic disease were identified by checking all 16,220 human gene names in the NCBI LocusLink35 database (as of 26 April 2002) against the Human Gene Mutation Database10 (HGMD) (as of 9 February 2002). HGMD contains the most comprehensive collection of mutations related to monogenic disease. Most cause monogenic disease, although a few may be associated with disease as

Acknowledgements

This work was supported by grant LM07174 from the National Library of Medicine. We thank Eugene Melamud for help with the database infrastructure, and for many useful discussions.

References (38)

  • D.A. Hinds et al. Whole-genome patterns of common DNA variation in three human populations. Science (2005)
  • The International HapMap Consortium. (2003). The International HapMap Project. Nature, 426,...
  • L. Kruglyak et al. Variation is the spice of life. Nature Genet. (2001)
  • M.K. Halushka et al. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nature Genet. (1999)
  • M. Cargill et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. (1999)
  • P.D. Stenson et al. Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat. (2003)
  • C.S. Carlson et al. Mapping complex disease loci in whole-genome association studies. Nature (2004)
  • D. Botstein et al. Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nature Genet. (2003)
  • S. Sunyaev et al. Prediction of deleterious human alleles. Hum. Mol. Genet. (2001)