A simulated annealing algorithm for maximum likelihood pedigree reconstruction
Introduction
The maximum likelihood approach to the reconstruction of pedigrees based on genotypic data was developed in a series of articles by Thompson 1974, Thompson 1975, Thompson 1976a and Cannings and Thompson (1981). This work was largely concerned with human pedigrees. In Meagher and Thompson 1986, Meagher and Thompson 1987 and Marshall et al. (1998) pedigree reconstruction techniques are developed further in the context of natural populations. Pedigree reconstruction has also been explored based on DNA fingerprint data in Geyer et al. (1993). The special case of sibling relationships within a single generation has been undertaken by Blouin et al. (1996), Painter 1994, Painter 1996 and Almudevar and Field (1999).
In general, parent–offspring triplets suffice to define an entire pedigree (Cannings and Thompson, 1981), so that pedigree reconstruction can proceed from an initial enumeration of these triplets. In Thompson (1976a) a sequential procedure based on sibship membership is proposed which exploits age and sex information. The method is heuristic, undertaking a sequential consideration of parent–offspring triplets, and will not necessarily achieve the global maximum likelihood value. In a sequential method a single mistaken inference early in the procedure may preclude the determination of the true maximum likelihood. This problem is made more acute by the fact, noted in Thompson 1976a, Thompson 1976b that a sibling of an individual not genetically excluded as a parent will, on average, have a higher likelihood under the hypothesis of a parent–offspring relationship than a true parent.
The technique proposed in this article can be considered as an extension of triplet-based techniques, in that the first step is a triplet enumeration. The problem considered here is that of assembling the triplets into a feasible pedigree, possibly without using age or sex data, based on the maximum likelihood principle. A reformulation of this optimization problem, developed below, proves to be crucial. The effect of this reformulation is to divide the set of all admissible pedigrees into subsets on which the likelihood is easily maximized. These sets are in one-to-one correspondence with a permutation space, which then becomes the domain of optimization. This is a feature shared by the travelling salesman problem, so that techniques suitable for that problem can be adapted to this one.
A number of issues regarding the likelihood should be discussed. First, we distinguish between a complete sample and an incomplete sample. In the former, the parents of each individual are either in the sample or are unrelated to all other members of the sample. In particular, this implies that founders are mutually unrelated. The likelihood given in Thompson (1976a), and the one used in this article, assumes a complete sample. Clearly, a complete solution to the problem requires treatment of likelihoods for incomplete samples. Such a situation would arise, for example, when the grandparents but not the parents of an individual are contained in the sample. The inference should then include this set of relationships, which would not happen with a parent–offspring triplet approach.
The second issue is the use of age data. In Cannings and Thompson (1981) it is suggested that age data or some other type of ordering is necessary for the successful reconstruction of a pedigree. If such data were not present then were the likelihood to be maximized simply by assigning the maximum likelihood parents to each individual independently, it is probable that two individuals will be estimated to be parents of each other, resulting in an inadmissible pedigree. The simulated annealing algorithm proposed in this article is designed precisely to arrange individuals into their correct generations, so that maximum likelihood pedigrees can be constructed in the absence of age data. Any available age or sex data can still be incorporated in a straightforward manner.
In Section 2 the likelihood function is defined. In Section 3 a tree-based algorithm for the explicit enumeration of all admissible pedigrees of nonzero likelihood is given, suitable for smaller pedigrees. In Section 4 a mathematical representation of admissible pedigrees is developed, leading to an alternative formulation of the maximization problem. In Section 5 an implementation of a simulated annealing algorithm is introduced and applied to the maximization of the likelihood, following which some numerical examples are given in Section 6. In Section 7 some resulting issues are discussed, and future directions in this area proposed.
Section snippets
The pedigree likelihood function
The likelihood function of a pedigree based on genotypic or phenotypic information is based on the rules of Mendelian inheritance. Suppose we have (multilocus) genotype data for NI individuals labelled . We assume that the loci are unlinked. Let Gi represent the genotype of individual i. A pedigree can be completely defined by a specification of each individual's parents, taking consideration of the fact that one or both parents of an individual may not be represented in the pedigree.
Enumeration of feasible pedigrees
The space over which L(T) is maximized is formally 2NI dimensional, each dimension representing one putative parent for each offspring. There would therefore be on the order of NI2NI possible pedigrees to examine in an enumerative approach, which would be computationally unfeasible except in the smallest problems. The problem is reduced by introducing two constraints. First, that the pedigree be a member of , and second that the likelihood function be nonzero. The second constraint is
An algebraic representation of pedigrees
For a given pedigree we will define a pedigree matrix as an NI×NI matrix in which element (r,c) is a 1 if c is a parent of r and is 0 otherwise. Let M2 be the set of all 0–1 matrices with at most 2 nonzero entries in each row. Clearly, a pedigree matrix must be in M2, but a member of M2 need not be valid pedigree matrix corresponding to a pedigree in . Let MT be the class of M2 matrices representing admissible pedigrees in . In order to characterize MT we require the following definitions. Definition 1 An n
A simulated annealing algorithm
When the number of feasible pedigrees is too large for exhaustive enumeration, an alternative is the use of simulated annealing, an optimization tool which has proven effective in a large variety of combinatorial optimization problems (Kirkpatrick et al., 1983).
Suppose we wish to maximize a function f on a state space X. We construct a Monte Carlo Markov chain (MCMC) on X as follows. For each x∈X there is a neighbourhood of states, assumed to be a constant size N. A transition from state xi to x
Examples
We consider here 3 examples. Genotypes are simulated for test pedigrees 1 and 2 given in Fig. 1, Fig. 2. Ten loci were assumed, each with 8 equally frequent alleles, for a heterozygosity of 0.875 per locus. This is comparable in allelic diversity to a set of DNA collected from a sample of 17 sperm whales analyzed in Example 3. No sex distinction is made in test pedigrees 1 and 2.
Discussion
A simulated annealing algorithm adapted to maximize the likelihood function of a joint pedigree was presented. The algorithm appears to converge properly, and has demonstrated success in inferring pedigrees without using age or sex data, provided that there is sufficient kinship structure in the data.
A number of issues remain. A complete solution to the problem requires an estimate of the error of the procedure. In Almudevar (2001) a bootstrap procedure is proposed which constructs a confidence
Acknowledgements
This research is supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The author thanks the referees for their valuable recommendations and for correcting an error in the original manuscript. In addition, the author acknowledges the generous support of Chris Field. The author is also grateful to Heydar Radjavi for some insightful observations. The sperm whale data is courtesy of Jenny Christal.
References (24)
- et al.
Microsatellites and their application to population genetic studies
Curr. Opin. Genet. Dev.
(1993) Partition-distancea problem and class of perfect graphs arising in clustering
Inf. Process. Lett.
(2002)- et al.
The relationship between single parent and parent pair genetic likelihoods in genealogy reconstruction
Theor. Popul. Biol.
(1986) - et al.
Simulated Annealing and Boltzmann Machines
(1989) - et al.
Statistical coolinga general approach to combinatorial optimization problems
Philips J. Res.
(1985) - Aarts, E., van Laarhoven, P.J.M., 1985b. A new polynomial-time cooling schedule. Proceedings of IEEE International...
A bootstrap assessment of variability in pedigree reconstruction based on DNA markers
Biometrics
(2001)- et al.
Estimation of single generation sibling relationships based on DNA markers
J. Agric. Biol. Environ. Stat.
(1999) - et al.
Use of microsatellite loci to classify individuals by relatedness
Mol. Ecol.
(1996) - et al.
Genealogical and Genetic Structure
(1981)
Analysis of relatedness in the California condorsfrom DNA fingerprints
Mol. Biol. Evol.
Identification of kin structure among Guam rail foundersa comparison of pedigrees and DNA profiles
Mol. Ecol.
Cited by (51)
Bonsai: An efficient method for inferring large human pedigrees from genotype data
2021, American Journal of Human GeneticsOn the use of dense SNP marker data for the identification of distant relative pairs
2016, Theoretical Population BiologyAn information theoretic approach to pedigree reconstruction
2016, Theoretical Population BiologyCitation Excerpt :Section 8 summarizes the results in a conclusion. There is a clear relationship between graphical models and pedigree reconstruction that has been noted in the literature (Almudevar, 2003, 2007b; Riester, 2009; Cowell, 2009; Almudevar and LaCombe, 2012; Sheehan et al., 2014). The DAG is a natural representation for a pedigree, constructed using directed edges from parent to offspring.
Bayesian pedigree inference with small numbers of single nucleotide polymorphisms via a factor-graph representation
2016, Theoretical Population BiologyCitation Excerpt :The second assumption is that there is no genotyping error, or that it is negligible (which is rarely true in molecular ecology). The latter appears to be the case in Almudevar (2003), Cowell (2009), Almudevar and LaCombe (2012), and Sheehan et al. (2014). Riester et al. (2009), by contrast, include a correction in the genetic transmission probability from parent(s) to offspring to allow for genotyping error, but the correction is applied independently for every transmission in a way that does not allow the observed genotypes of an individual’s offspring and parents to be jointly informative of the true state of the individual’s imperfectly-observed genotype.
Special issue on New Developments in Relatedness and Relationship Estimation
2016, Theoretical Population Biology