A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST)

https://doi.org/10.1016/j.mrfmmm.2006.09.003

Abstract

A method is described to discover if a gene carries one or more allelic mutations that confer risk for any specified common disease. The method does not depend upon genetic linkage of risk-conferring mutations to high frequency genetic markers such as single nucleotide polymorphisms. Instead, the sums of allelic mutation frequencies in case and control cohorts are determined and a statistical test is applied to discover if the difference in these sums is greater than would be expected by chance. A statistical model is presented that defines the ability of such tests to detect significant gene–disease relationships as a function of case and control cohort sizes and key confounding variables: zygosity and genicity, environmental risk factors, errors in diagnosis, limits to mutant detection, linkage of neutral and risk-conferring mutations, ethnic diversity in the general population and the expectation that among all exonic mutants in the human genome greater than 90% will be neutral with regard to any effect on disease risk. Means to test the null hypothesis for, and determine the statistical power of, each test are provided. For this “cohort allelic sums test” or “CAST”, the statistical model and test are provided as an Excel™ program, CASTAT© at http://epidemiology.mit.edu. Based on genetics, technology and statistics, a strategy of enumerating the mutant alleles carried in the exons and splice sites of the estimated ∼25,000 human genes in case cohort samples of 10,000 persons for each of 100 common diseases is proposed and evaluated: A wide range of possible conditions of multi-allelic or mono-allelic and monogenic, multigenic or polygenic (including epistatic) risk are found to be detectable using the statistical criteria of 1 or 10 “false positive” gene associations per 25,000 gene–disease pair-wise trials and a statistical power of >0.8. 
Using estimates of the distribution of both neutral and gene-inactivating nondeleterious mutations in humans and the sensitivity of the test to multigenic or multicausal risk, it is estimated that about 80% of nullizygous, heterozygous and functionally dominant gene–common disease associations may be discovered. Limitations include relative insensitivity of CAST to about 60% of possible associations given homozygous (wild type) risk and, more rarely, other stochastic limits when the frequency of mutations in the case cohort approaches that of the control cohort and biases such as absence of genetic risk masked by risk derived from a shared cultural environment.

Introduction

The identification of genes carrying risk-conferring alleles for rare inherited diseases represents a major contribution of genetics. More than 2000 conditions of rare disease–gene association have been discovered (see http://www.hgmd.org). Two genes have been convincingly associated with common diseases: the melanocortin 1 receptor gene, MC1R, with skin cancers [1], and the complement factor H gene, HF1/CFH, with macular degeneration [2], [3], [4]. By “common diseases” we mean diseases that may afflict ∼1% or more of the population during a lifetime. They include vascular diseases, cancers, diabetes and late-onset conditions broadly associated with aging.

The method employed to discover many of the rare and one of the common (HF1) gene–disease associations has been the method of linkage disequilibrium. For case-control cohort studies in genetically heterogeneous populations this method depends on the assumption that risk is wholly or predominantly represented by the descended copies of a single mutational event, i.e., a restricted condition of mono-allelic risk [5], [6]. The assumption of mono-allelic risk was based on Kimura and Crow's 1964 arguments that genes would be expected to carry one or fewer gene altering mutations in a small aboriginal human population and that all individual point mutation rates were so low that the chance of the same mutational event occurring more than once in human genetic evolution was negligible [7], [8], [9].

However, by the early agricultural period the human population exceeded several million persons. Multiple non-deleterious mutations per gene would have been expected by then to have survived stochastic extinction (genetic drift) in the growing population. Furthermore, in 1958 Benzer and Freese demonstrated that spontaneous and mutagen-induced, gene-inactivating bacteriophage mutations occurred at highly mutable base pairs (“hotspots”), which comprised about half of all gene-inactivating mutations [10]. Similar observations in bacteria, human cells and human populations have confirmed and extended their observations [11], [12], [13]. The spectra of non-deleterious, gene-inactivating mutations generated in the human germ line representing risk for common diseases would thus be expected to be essentially “multi-allelic” and contain many identical mutations that occurred independently. The present day set of more frequent risk-conferring alleles would be expected to be a mixture of isoallelic mutations that arose as independent hotspot events and rarer mutations that survived stochastic extinction in aboriginal populations or were selected by unknown evolutionary pressures. Selective pressures are known in some instances to have significantly skewed risk toward a particular risk-conferring allele creating essentially mono-allelic risk [14], [15]. Any strategy that would apprehend gene–disease relationships should account for the possibility that risk for a particular disease could be mono-allelic or multi-allelic.

Multi-allelicism, in our view, is probably the major factor that has confounded attempts to discover gene–disease associations by disequilibrium-based methods. But it is by no means the only important confounding condition for gene–disease association discovery. The historically significant changes in age-specific mortality rates for common diseases such as cancers, vascular diseases and diabetes attest to the importance of usually unknown environmental risk factors that may act in concert with, or independently of, genetic risk (see http://epidemiology.mit.edu). Furthermore, when considering all common diseases demonstrating familial risk, a wide range of genetic possibilities must be considered, including risk arising independently from mutations in more than one gene (multigenic risk) or from interactions of mutations in more than one gene, as in polygenic risk including epistatic interactions. In this paper, we have attempted to create a comprehensive algebraic model incorporating all of these theoretical possibilities into a practical but general analytical framework of pair-wise trials of gene–common disease association.

Experimental strategy and technological approach

Recognizing the limitation of linkage studies in conditions of multi-allelic risk, we have previously prescribed comparing the summed frequencies of multiple point mutations in young versus aged persons to discover if a particular gene carried risks for late-onset mortal diseases [16]. Appreciating that large population samples (∼10,000 persons per case cohort) would be required in conditions of even low levels of multigenic risk, we suggested application of technical approaches to mutational

Form of experimental data: sum of all mutant alleles carried by each gene

As we are experienced in the use of denaturing capillary electrophoresis (DCE) to scan a gene for mutations as a series of sequences that together comprise the exons and splice sites of a protein-coding gene, we use the forms of data derived from that technology as an example of an experimental tactic. The use of other exon scanning approaches including accelerated modes of direct DNA sequencing [27], [28], would encounter similar statistical uncertainties and confounding biological variables.

Confounding biological variables

Having created a statistical model delimiting the minimum significant difference between the number of mutant sequences in a sample of cases and the number in a sample of controls, one may enquire as to the value of such an experiment in discovering different possible forms of genetic risk. Answering requires consideration of the several biological or genetic conditions that may affect the relationship between the actual number of mutations and the experimentally observed number of mutations in

The expected difference in mutant allelic sums between two populations

The number of risk-conferring mutations in the case population of N(cases) is the sum of the risk-conferring mutations from the accurately diagnosed persons carrying the condition of risk plus the accurately diagnosed persons not carrying the condition of risk plus the misdiagnosed persons who carry the same degree of risk as in the control population.

Using the parameters defined above this sum is

$$2\,N(\text{cases})\,C\,D\,[A\,R + (1 - A)\,Q],$$

where $Q = -\ln(1 - q)$ and where R is defined by the function of M and q
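The sum above can be evaluated numerically. The following is a minimal Python sketch, not the authors' CASTAT program. It reads the expression as 2·N(cases)·C·D·[A·R + (1 − A)·Q] with Q = −ln(1 − q). The function name and the example values are hypothetical. R is taken as an input because its definition in terms of M and q is not reproduced in this excerpt, and C, D and A are treated purely as numeric factors.

```python
import math

def expected_case_mutations(n_cases, C, D, A, R, q):
    """Evaluate 2*N(cases)*C*D*[A*R + (1 - A)*Q] with Q = -ln(1 - q).

    R is supplied by the caller because its dependence on M and q
    is not given in this excerpt (assumption of this sketch).
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    Q = -math.log(1.0 - q)
    return 2.0 * n_cases * C * D * (A * R + (1.0 - A) * Q)

# Hypothetical illustrative values only (not taken from the paper).
example = expected_case_mutations(n_cases=10_000, C=1.0, D=1.0, A=0.9, R=0.05, q=0.01)
print(round(example, 2))
```

Setting A = 1 removes the Q term, so the result reduces to 2·N(cases)·C·D·R, which is a quick sanity check on any parameter set.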

Determination of the statistical significance of an observed difference S(cases) − S*(controls)

As noted above, the null hypothesis that no difference exists in the frequency of total point mutations in a gene between the case and control cohorts may be tested by discovering if the test statistic

$$|T| = |S(\text{cases}) - S^{*}(\text{controls})| - \text{quantile} \times [\text{total variance}]^{1/2} > 0.$$

The quantile value is chosen on the basis of the perceived costs of false negative and false positive identifications of gene–disease associations. As noted above, accepting 10 false positives on average per disease on the basis of random
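The decision rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CASTAT implementation. The statistic is read as |S(cases) − S*(controls)| minus the quantile times the square root of the total variance. The total variance is supplied by the caller because its formula is not reproduced in this excerpt. The quantile is taken from a standard normal distribution, with the accepted false-positive rate split equally between the two tails. All function names and numbers are hypothetical.

```python
from statistics import NormalDist

def normal_quantile(false_positives, n_trials):
    """Two-sided standard normal quantile for an accepted false-positive count.

    Assumption: alpha = false_positives / n_trials is split equally
    between the two tails of a normal distribution.
    """
    alpha = false_positives / n_trials
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def cast_statistic(s_cases, s_controls_star, quantile, total_variance):
    """|S(cases) - S*(controls)| - quantile * sqrt(total variance)."""
    return abs(s_cases - s_controls_star) - quantile * total_variance ** 0.5

def rejects_null(s_cases, s_controls_star, quantile, total_variance):
    """True when the statistic is positive, i.e. the null hypothesis is rejected."""
    return cast_statistic(s_cases, s_controls_star, quantile, total_variance) > 0.0

# Hypothetical numbers for illustration only.
z = normal_quantile(false_positives=10, n_trials=25_000)
print(rejects_null(s_cases=120.0, s_controls_star=80.0, quantile=z, total_variance=100.0))
```

Moving from 10 to 1 accepted false positives per 25,000 trials raises the quantile, so a larger observed difference is needed before the null hypothesis is rejected.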

Statistical power

The power of a statistical test is the probability of detecting an actual effect, which we desire to approach 1.0. Given the cost and human effort in our proposed strategy, we wish to exclude as few forms of true gene–disease relationships as possible. Whether or not a particular gene–disease causative relationship can be detected depends on the cohort sample sizes, N(cases) and N(controls), the values of the several parameters, A, C, D, M, P, Q, R, and the quantile value chosen to exclude
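A rough feel for how power depends on the expected difference and the quantile can be had from a normal approximation. The sketch below is an assumption of this illustration, not the paper's model. It supposes that S(cases) − S*(controls) is normally distributed around a hypothetical expected difference with the same total variance used in the test, and it computes the probability that the absolute difference exceeds the quantile times the standard deviation. The function name and example values are hypothetical.

```python
from statistics import NormalDist

def approximate_power(expected_difference, total_variance, quantile):
    """Probability that |difference| exceeds quantile * sd.

    Assumes the observed difference is Normal(expected_difference,
    total_variance); the paper's own variance model may differ.
    """
    sd = total_variance ** 0.5
    shift = expected_difference / sd
    std = NormalDist()
    upper_tail = 1.0 - std.cdf(quantile - shift)
    lower_tail = std.cdf(-quantile - shift)
    return upper_tail + lower_tail

# Hypothetical values: expected difference of 4.5 standard deviations.
print(approximate_power(expected_difference=4.5, total_variance=1.0, quantile=3.54) > 0.8)
```

Under this approximation, power exceeds 0.8 once the expected difference is larger than the chosen quantile by roughly 0.84 standard deviations, which shows why a stricter false-positive criterion demands larger cohorts.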

Discussion

We have expected that common genetic risks will be found to be multi-allelic and have here endeavored to develop an argument grounded on genetics and statistics to permit useful pair-wise trials of hypothesized multi-allelic gene–common disease relationships [16]. The relationship between a large set of MC1R mutations, individual tanning responses and skin cancer risk appears to be the first clear example for a common, formerly lethal disease [1, http://epidemiology.mit.edu]. Situations in

Summary

With certain important limitations, the difference of sums of total exonic mutant allele frequencies between case and control cohorts, CAST, appears to offer a statistically robust means to test gene–disease relationships. Applied as a series of pangenomic tests of all known genes against 100 common diseases, using case cohorts of 10,000 persons for each disease, it should net nearly all gene–common disease relationships in which risk is conferred by a nullizygous condition, ∼80% of relationships

Acknowledgments

Drs. Per Olaf Ekstrom (Oslo) and Xiao-Cheng Li-Sucholeiki (Winchester, MA, USA) have provided important contributions through discussion of this work well beyond the scope of analytical variance. The late Prof. Lars Ehrenberg (Stockholm) participated in and encouraged our first efforts to apply mutational spectrometry to human population genetics. Prof. Kari Hemminki (Heidelberg) shared his data and insights with regard to familial risks for common forms of cancer. Dr. Eivind Hovig (Oslo)

References (48)

  • R.E. Saunders et al. An interactive web database of factor H associated hemolytic uremic syndrome mutations: insights into the structural consequences of disease associated mutations. Hum. Mutat. (2005)
  • K.P. Magnusson et al. CFH Y402H confers similar risk of soft drusen and both forms of advanced AMD. PLoS Med. (2005)
  • E.S. Lander et al. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics (1989)
  • M. Kimura et al. The number of alleles that can be maintained in a finite population. Genetics (1964)
  • L.L. Cavalli-Sforza et al. The Genetics of Human Populations (1971)
  • D.L. Hartl. A Primer of Population Genetics (2000)
  • S. Benzer et al. Induction of specific mutations with 5-bromouracil. Proc. Natl. Acad. Sci. U.S.A. (1958)
  • P.J. Farabaugh et al. Genetic studies of the lac repressor. VII. On the molecular nature of spontaneous hotspots in the lacI gene of Escherichia coli. J. Mol. Biol. (1978)
  • K. Khrapko et al. Mitochondrial mutational spectra in human cells and tissues. Proc. Natl. Acad. Sci. U.S.A. (1997)
  • A. Tomita-Mitchell et al. The mutational spectrum of the HPRT gene from human T cells in vivo shares a significant concordant set of hot spots with MNNG-treated human cells. Cancer Res. (2003)
  • G. Cabrera et al. The sickle cell trait is associated with enhanced immunoglobulin G antibody responses to Plasmodium falciparum variant surface antigens. J. Infect. Dis. (2005)
  • G.B. Pier et al. Salmonella typhi uses CFTR to enter intestinal epithelial cells. Nature (1998)
  • X.-C. Li-Sucholeiki et al. A sensitive scanning technology for low frequency nuclear point mutations in human genomic DNA. Nucl. Acids Res. (2000)
  • J.N. Hirschhorn et al. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. (2005)