Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis
A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST)
Introduction
The identification of genes carrying risk-conferring alleles for rare inherited diseases represents a major contribution of genetics. More than 2000 rare disease–gene associations have been discovered (see http://www.hgmd.org). Two genes have been convincingly associated with common diseases: the melanocortin 1 receptor gene, MC1R, with skin cancers [1], and the complement factor H gene, HF1/CFH, with macular degeneration [2], [3], [4]. By “common diseases” we mean diseases that may afflict ∼1% or more of the population during a lifetime. They include vascular diseases, cancers, diabetes and late-onset conditions broadly associated with aging.
Many of the rare gene–disease associations and one of the common ones (HF1) were discovered by the method of linkage disequilibrium. For case-control cohort studies in genetically heterogeneous populations, this method depends on the assumption that risk is wholly or predominantly represented by the descended copies of a single mutational event, i.e., a restricted condition of mono-allelic risk [5], [6]. The assumption of mono-allelic risk was based on Kimura and Crow's 1964 arguments that genes would be expected to carry one or fewer gene-altering mutations in a small aboriginal human population and that all individual point mutation rates were so low that the chance of the same mutational event occurring more than once in human genetic evolution was negligible [7], [8], [9].
However, by the early agricultural period the human population exceeded several million persons. Multiple non-deleterious mutations per gene would have been expected by then to have survived stochastic extinction (genetic drift) in the growing population. Furthermore, in 1958 Benzer and Freese demonstrated that spontaneous and mutagen-induced, gene-inactivating bacteriophage mutations occurred at highly mutable base pairs (“hotspots”), which comprised about half of all gene-inactivating mutations [10]. Similar observations in bacteria, human cells and human populations have confirmed and extended their observations [11], [12], [13]. The spectra of non-deleterious, gene-inactivating mutations generated in the human germ line representing risk for common diseases would thus be expected to be essentially “multi-allelic” and contain many identical mutations that occurred independently. The present day set of more frequent risk-conferring alleles would be expected to be a mixture of isoallelic mutations that arose as independent hotspot events and rarer mutations that survived stochastic extinction in aboriginal populations or were selected by unknown evolutionary pressures. Selective pressures are known in some instances to have significantly skewed risk toward a particular risk-conferring allele creating essentially mono-allelic risk [14], [15]. Any strategy that would apprehend gene–disease relationships should account for the possibility that risk for a particular disease could be mono-allelic or multi-allelic.
Multi-allelicism, in our view, is probably the major factor that has confounded attempts to discover gene–disease associations by disequilibrium-based methods. But it is by no means the only important confounding condition for gene–disease association discovery. The historically significant changes in age-specific mortality rates for common diseases such as cancers, vascular diseases and diabetes attest to the importance of usually unknown environmental risk factors that may act in concert with, or independently of, genetic risk (see http://epidemiology.mit.edu). Furthermore, when considering all common diseases demonstrating familial risk, a wide range of genetic possibilities must be considered, including risk arising independently from mutations in more than one gene (multigenic risk) or from interactions of mutations in more than one gene (polygenic risk, including epistatic interactions). In this paper, we have attempted to create a comprehensive algebraic model incorporating all of these theoretical possibilities into a practical but general analytical framework of pair-wise trials of gene–common disease association.
Section snippets
Experimental strategy and technological approach
Recognizing the limitation of linkage studies in conditions of multi-allelic risk, we have previously prescribed comparing the summed frequencies of multiple point mutations in young versus aged persons to discover if a particular gene carried risks for late-onset mortal diseases [16]. Appreciating that large population samples (∼10,000 persons per case cohort) would be required in conditions of even low levels of multigenic risk, we suggested application of technical approaches to mutational
Form of experimental data: sum of all mutant alleles carried by each gene
As we are experienced in the use of denaturing capillary electrophoresis (DCE) to scan a gene for mutations as a series of sequences that together comprise the exons and splice sites of a protein-coding gene, we use the forms of data derived from that technology as an example of an experimental tactic. The use of other exon scanning approaches including accelerated modes of direct DNA sequencing [27], [28], would encounter similar statistical uncertainties and confounding biological variables.
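As a minimal sketch of this form of data, assume each scanned target sequence of a gene yields a count of mutant alleles observed in a cohort sample, and that the gene is autosomal (two alleles per person). The per-gene allelic sum is then the total over target sequences. The target labels, counts and function names below are invented for illustration and are not data or code from this study.

```python
from typing import Dict


def allelic_sum(mutant_counts_by_target: Dict[str, int]) -> int:
    """Sum mutant allele counts over all scanned target sequences of one gene."""
    return sum(mutant_counts_by_target.values())


def summed_frequency(mutant_counts_by_target: Dict[str, int], n_persons: int) -> float:
    """Allelic sum divided by the number of alleles sampled.

    Assumption: an autosomal gene, so each person contributes two alleles.
    """
    if n_persons <= 0:
        raise ValueError("n_persons must be positive")
    return allelic_sum(mutant_counts_by_target) / (2 * n_persons)


# Hypothetical counts for one gene scanned as three exon targets in 10,000 persons.
example_counts = {"exon1": 12, "exon2": 0, "exon3": 30}
S = allelic_sum(example_counts)
freq = summed_frequency(example_counts, n_persons=10_000)
print(S, freq)
```

The single number S per gene and cohort is the quantity that a cohort comparison would then operate on.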
Confounding biological variables
Having created a statistical model delimiting the minimum significant difference between the number of mutant sequences in a sample of cases and the number in a sample of controls, one may enquire as to the value of such an experiment in discovering different possible forms of genetic risk. Answering requires consideration of the several biological or genetic conditions that may affect the relationship between the actual number of mutations and the experimentally observed number of mutations in
The expected difference in mutant allelic sums between two populations
The number of risk-conferring mutations in the case population of N(cases) is the sum of the risk-conferring mutations from the accurately diagnosed persons carrying the condition of risk plus the accurately diagnosed persons not carrying the condition of risk plus the misdiagnosed persons who carry the same degree of risk as in the control population.
Using the parameters defined above, this sum is … where Q = −ln(1 − q) and where R is defined by the function of M and q
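The transformation Q = −ln(1 − q) can be illustrated numerically without knowing the rest of the model. The sketch below shows only the algebra: Q is close to q when q is small, and q = 1 − exp(−Q) recovers q exactly. The function name is hypothetical, and no interpretation of q beyond a value in [0, 1) is assumed.

```python
from math import exp, log1p


def q_to_Q(q: float) -> float:
    """Return Q = -ln(1 - q) for 0 <= q < 1."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must satisfy 0 <= q < 1")
    # log1p(-q) computes ln(1 - q) accurately even for very small q.
    return -log1p(-q)


for q in (0.001, 0.01, 0.1, 0.3):
    Q = q_to_Q(q)
    # The inverse relation q = 1 - exp(-Q) recovers the original value.
    print(f"q={q:<6} Q={Q:.6f} 1-exp(-Q)={1 - exp(-Q):.6f}")
```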
Determination of the statistical significance of an observed difference S(cases) − S*(controls)
As noted above, the null hypothesis that no difference exists in the frequency of total point mutations in a gene between the case and control cohorts may be tested by discovering if the test statistic … The quantile value is chosen on the basis of the perceived costs of false negative and false positive identifications of gene–disease associations. As noted above, accepting 10 false negatives on average per disease on the basis of random
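One way to picture a test of this kind is sketched below. It is not the authors' test statistic, which is not reproduced in this excerpt. It assumes, for illustration only, that allelic sums behave as Poisson counts, that S*(controls) is the control sum rescaled to the case cohort size, and that a one-sided normal approximation is adequate. All function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def cast_z(s_cases: int, n_cases: int, s_controls: int, n_controls: int) -> float:
    """Approximate z-score for S(cases) - S*(controls).

    Assumptions (not from the source): Poisson-distributed sums, and
    S*(controls) = S(controls) * n_cases / n_controls.
    """
    scale = n_cases / n_controls
    s_star = s_controls * scale
    # Poisson variance equals the mean; rescaling multiplies variance by scale**2.
    variance = s_cases + scale ** 2 * s_controls
    if variance == 0:
        raise ValueError("no mutant alleles observed in either cohort")
    return (s_cases - s_star) / sqrt(variance)


def exceeds_quantile(z: float, alpha: float) -> bool:
    """One-sided comparison of z with the (1 - alpha) standard normal quantile."""
    return z > NormalDist().inv_cdf(1.0 - alpha)


# Hypothetical sums: 150 mutant alleles in 10,000 cases, 100 in 10,000 controls.
z = cast_z(s_cases=150, n_cases=10_000, s_controls=100, n_controls=10_000)
print(round(z, 3), exceeds_quantile(z, alpha=1e-3), exceeds_quantile(z, alpha=1e-4))
```

The same observed difference passes a lenient quantile and fails a stricter one, which is the trade-off between false positives and false negatives that the choice of quantile expresses.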
Statistical power
The power of a statistical test is the probability of detecting an actual effect, which we desire to approach 1.0. Given the cost and human effort in our proposed strategy, we wish to exclude as few forms of true gene–disease relationships as possible. Whether or not a particular gene–disease causative relationship can be detected depends on the cohort sample sizes, N(cases) and N(controls), the values of the several parameters, A, C, D, M, P, Q, R, and the quantile value chosen to exclude
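How power responds to the size of the expected difference can be sketched under simplifying assumptions that are not the authors' model: equal case and control cohort sizes, Poisson-distributed allelic sums, and a one-sided normal approximation. The parameters A, C, D, M, P, Q and R are not represented here; the two expected sums are supplied directly as hypothetical inputs.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(mu_cases: float, mu_controls: float, alpha: float) -> float:
    """Approximate power to detect mu_cases > mu_controls at one-sided level alpha.

    Assumptions (illustrative): equal cohort sizes, Poisson sums, so the
    difference has standard deviation sqrt(mu_cases + mu_controls).
    """
    if mu_cases + mu_controls <= 0:
        raise ValueError("expected sums must not both be zero")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    shift = (mu_cases - mu_controls) / sqrt(mu_cases + mu_controls)
    return 1.0 - std_normal.cdf(z_crit - shift)


# Hypothetical expected sums for a control baseline of 100 mutant alleles.
for mu_cases in (100, 150, 200):
    print(mu_cases, round(approx_power(mu_cases, 100, alpha=1e-3), 3))
```

With no true difference the computed power equals the chosen alpha, and it rises toward 1.0 as the expected case sum grows relative to the control sum.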
Discussion
We have expected that common genetic risks will be found to be multi-allelic and have here endeavored to develop an argument grounded on genetics and statistics to permit useful pair-wise trials of hypothesized multi-allelic gene–common disease relationships [16]. The relationship between a large set of MC1R mutations, individual tanning responses and skin cancer risk appears to be the first clear example for a common, formerly lethal disease [1, http://epidemiology.mit.edu]. Situations in
Summary
With certain important limitations, the difference of sums of total exonic mutant allele frequencies between case and control cohorts, CAST, appears to offer a statistically robust means to test gene–disease relationships. Applied as a series of pangenomic tests of all known genes against 100 common diseases, using case cohorts of 10,000 persons for each disease, it should net nearly all gene–common disease relationships in which risk is conferred by a nullizygous condition, ∼80% of relationships
Acknowledgments
Drs. Per Olaf Ekstrom (Oslo) and Xiao-Cheng Li-Sucholeiki (Winchester, MA, USA) have provided important contributions through discussion of this work well beyond the scope of analytical variance. The late Prof. Lars Ehrenberg (Stockholm) participated in and encouraged our first efforts to apply mutational spectrometry to human population genetics. Prof. Kari Hemminki (Heidelberg) shared his data and insights with regard to familial risks for common forms of cancer. Dr. Eivind Hovig (Oslo)
References (48)
The genetics of sun sensitivity in humans. Am. J. Hum. Genet. (2004)
Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. (2001)
et al. Single nucleotide polymorphism spectra in newborns and centenarians: identification of genes coding for risk of mortal disease. Gene (1998)
et al. A comprehensive review of genetic association studies. Genet. Med. (2002)
et al. Genetic association studies of complex traits: design and analysis issues. Mutat. Res. (2005)
et al. Detection and frequency estimation of rare variants in pools of genomic DNA from large populations using mutational spectrometry. Mutat. Res. (2005)
et al. Ionizing radiation and genetic risks. XI. The doubling dose estimates from the mid-1950s to the present and the conceptual change to the use of human data on spontaneous mutation rates and mouse data on induced mutation rates for doubling dose calculations. Mutat. Res. (2000)
et al. Population screening of single-nucleotide polymorphisms exemplified by analysis of 8000 alleles. J. Biomol. Screen. (2002)
The sampling theory of selectively neutral alleles. Theor. Popul. Biol. (1972)
et al. A common haplotype in the complement regulatory gene factor H (HF1/CFH) predisposes individuals to age-related macular degeneration. Proc. Natl. Acad. Sci. U.S.A. (2005)
An interactive web database of factor H associated hemolytic uremic syndrome mutations: insights into the structural consequences of disease associated mutations. Hum. Mutat.
CFH Y402H confers similar risk of Soft Drusen and both forms of advanced AMD. PLoS Med.
Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics
The number of alleles that can be maintained in a finite population. Genetics
The Genetics of Human Populations
A Primer of Population Genetics
Induction of specific mutations with 5-bromouracil. Proc. Natl. Acad. Sci. U.S.A.
Genetic studies of the lac repressor. VII. On the molecular nature of spontaneous hotspots in the lacI gene of Escherichia coli. J. Mol. Biol.
Mitochondrial mutational spectra in human cells and tissues. Proc. Natl. Acad. Sci. U.S.A.
The mutational spectrum of the HPRT gene from human T cells in vivo shares a significant concordant set of hot spots with MNNG-treated human cells. Cancer Res.
The sickle cell trait is associated with enhanced immunoglobulin G antibody responses to Plasmodium falciparum variant surface antigens. J. Infect. Dis.
Salmonella typhi uses CFTR to enter intestinal epithelial cells. Nature
A sensitive scanning technology for low frequency nuclear point mutations in human genomic DNA. Nucl. Acids Res.
Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet.