A simulated annealing algorithm for maximum likelihood pedigree reconstruction

https://doi.org/10.1016/S0040-5809(02)00048-5Get rights and content

Abstract

The calculation of maximum likelihood pedigrees for related organisms using genotypic data is considered. The problem is formulated so that the domain of optimization is a permutation space. This is a feature shared by the travelling salesman problem, for which simulated annealing is known to be effective. Using this technique it is found that pedigrees can be reconstructed with minimal error using genotypic data of a quality currently realizable. In complex pedigrees accurate reconstruction can be done with no a priori age or sex information. For smaller numbers of individuals a method of efficiently enumerating all admissible pedigrees of nonzero likelihood is given.

Introduction

The maximum likelihood approach to the reconstruction of pedigrees based on genotypic data was developed in a series of articles by Thompson 1974, Thompson 1975, Thompson 1976a and Cannings and Thompson (1981). This work was largely concerned with human pedigrees. In Meagher and Thompson 1986, Meagher and Thompson 1987 and Marshall et al. (1998) pedigree reconstruction techniques are developed further in the context of natural populations. Pedigree reconstruction has also been explored based on DNA fingerprint data in Geyer et al. (1993). The special case of sibling relationships within a single generation has been undertaken by Blouin et al. (1996), Painter 1994, Painter 1996 and Almudevar and Field (1999).

In general, parent–offspring triplets suffice to define an entire pedigree (Cannings and Thompson, 1981), so that pedigree reconstruction can proceed from an initial enumeration of these triplets. In Thompson (1976a) a sequential procedure based on sibship membership is proposed which exploits age and sex information. The method is heuristic, undertaking a sequential consideration of parent–offspring triplets, and will not necessarily achieve the global maximum likelihood value. In a sequential method a single mistaken inference early in the procedure may preclude the determination of the true maximum likelihood. This problem is made more acute by the fact, noted in Thompson 1976a, Thompson 1976b that a sibling of an individual not genetically excluded as a parent will, on average, have a higher likelihood under the hypothesis of a parent–offspring relationship than a true parent.

The technique proposed in this article can be considered as an extension of triplet-based techniques, in that the first step is a triplet enumeration. The problem considered here is that of assembling the triplets into a feasible pedigree, possibly without using age or sex data, based on the maximum likelihood principle. A reformulation of this optimization problem, developed below, proves to be crucial. The effect of this reformulation is to divide the set of all admissible pedigrees into subsets on which the likelihood is easily maximized. These sets are in one-to-one correspondence with a permutation space, which then becomes the domain of optimization. This is a feature shared by the travelling salesman problem, so that techniques suitable for that problem can be adapted to this one.

A number of issues regarding the likelihood should be discussed. First, we distinguish between a complete sample and an incomplete sample. In the former, the parents of each individual are either in the sample or are unrelated to all other members of the sample. In particular, this implies that founders are mutually unrelated. The likelihood given in Thompson (1976a), and the one used in this article, assumes a complete sample. Clearly, a complete solution to the problem requires treatment of likelihoods for incomplete samples. Such a situation would arise, for example, when the grandparents but not the parents of an individual are contained in the sample. The inference should then include this set of relationships, which would not happen with a parent–offspring triplet approach.

The second issue is the use of age data. In Cannings and Thompson (1981) it is suggested that age data or some other type of ordering is necessary for the successful reconstruction of a pedigree. If such data were not present then were the likelihood to be maximized simply by assigning the maximum likelihood parents to each individual independently, it is probable that two individuals will be estimated to be parents of each other, resulting in an inadmissible pedigree. The simulated annealing algorithm proposed in this article is designed precisely to arrange individuals into their correct generations, so that maximum likelihood pedigrees can be constructed in the absence of age data. Any available age or sex data can still be incorporated in a straightforward manner.

In Section 2 the likelihood function is defined. In Section 3 a tree-based algorithm for the explicit enumeration of all admissible pedigrees of nonzero likelihood is given, suitable for smaller pedigrees. In Section 4 a mathematical representation of admissible pedigrees is developed, leading to an alternative formulation of the maximization problem. In Section 5 an implementation of a simulated annealing algorithm is introduced and applied to the maximization of the likelihood, following which some numerical examples are given in Section 6. In Section 7 some resulting issues are discussed, and future directions in this area proposed.

Section snippets

The pedigree likelihood function

The likelihood function of a pedigree based on genotypic or phenotypic information is based on the rules of Mendelian inheritance. Suppose we have (multilocus) genotype data for NI individuals labelled I={1,…,NI}. We assume that the loci are unlinked. Let Gi represent the genotype of individual i. A pedigree can be completely defined by a specification of each individual's parents, taking consideration of the fact that one or both parents of an individual may not be represented in the pedigree.

Enumeration of feasible pedigrees

The space over which L(T) is maximized is formally 2NI dimensional, each dimension representing one putative parent for each offspring. There would therefore be on the order of NI2NI possible pedigrees to examine in an enumerative approach, which would be computationally unfeasible except in the smallest problems. The problem is reduced by introducing two constraints. First, that the pedigree be a member of T, and second that the likelihood function be nonzero. The second constraint is

An algebraic representation of pedigrees

For a given pedigree we will define a pedigree matrix as an NI×NI matrix in which element (r,c) is a 1 if c is a parent of r and is 0 otherwise. Let M2 be the set of all 0–1 matrices with at most 2 nonzero entries in each row. Clearly, a pedigree matrix must be in M2, but a member of M2 need not be valid pedigree matrix corresponding to a pedigree in T. Let MT be the class of M2 matrices representing admissible pedigrees in T. In order to characterize MT we require the following definitions.

Definition 1

An n

A simulated annealing algorithm

When the number of feasible pedigrees is too large for exhaustive enumeration, an alternative is the use of simulated annealing, an optimization tool which has proven effective in a large variety of combinatorial optimization problems (Kirkpatrick et al., 1983).

Suppose we wish to maximize a function f on a state space X. We construct a Monte Carlo Markov chain (MCMC) on X as follows. For each xX there is a neighbourhood of states, assumed to be a constant size N. A transition from state xi to x

Examples

We consider here 3 examples. Genotypes are simulated for test pedigrees 1 and 2 given in Fig. 1, Fig. 2. Ten loci were assumed, each with 8 equally frequent alleles, for a heterozygosity of 0.875 per locus. This is comparable in allelic diversity to a set of DNA collected from a sample of 17 sperm whales analyzed in Example 3. No sex distinction is made in test pedigrees 1 and 2.

Discussion

A simulated annealing algorithm adapted to maximize the likelihood function of a joint pedigree was presented. The algorithm appears to converge properly, and has demonstrated success in inferring pedigrees without using age or sex data, provided that there is sufficient kinship structure in the data.

A number of issues remain. A complete solution to the problem requires an estimate of the error of the procedure. In Almudevar (2001) a bootstrap procedure is proposed which constructs a confidence

Acknowledgements

This research is supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The author thanks the referees for their valuable recommendations and for correcting an error in the original manuscript. In addition, the author acknowledges the generous support of Chris Field. The author is also grateful to Heydar Radjavi for some insightful observations. The sperm whale data is courtesy of Jenny Christal.

References (24)

  • C.J. Geyer et al.

    Analysis of relatedness in the California condorsfrom DNA fingerprints

    Mol. Biol. Evol.

    (1993)
  • S.M. Haig et al.

    Identification of kin structure among Guam rail foundersa comparison of pedigrees and DNA profiles

    Mol. Ecol.

    (1994)
  • Cited by (51)

    • An information theoretic approach to pedigree reconstruction

      2016, Theoretical Population Biology
      Citation Excerpt :

      Section 8 summarizes the results in a conclusion. There is a clear relationship between graphical models and pedigree reconstruction that has been noted in the literature (Almudevar, 2003, 2007b; Riester, 2009; Cowell, 2009; Almudevar and LaCombe, 2012; Sheehan et al., 2014). The DAG is a natural representation for a pedigree, constructed using directed edges from parent to offspring.

    • Bayesian pedigree inference with small numbers of single nucleotide polymorphisms via a factor-graph representation

      2016, Theoretical Population Biology
      Citation Excerpt :

      The second assumption is that there is no genotyping error, or that it is negligible (which is rarely true in molecular ecology). The latter appears to be the case in Almudevar (2003), Cowell (2009), Almudevar and LaCombe (2012), and Sheehan et al. (2014). Riester et al. (2009), by contrast, include a correction in the genetic transmission probability from parent(s) to offspring to allow for genotyping error, but the correction is applied independently for every transmission in a way that does not allow the observed genotypes of an individual’s offspring and parents to be jointly informative of the true state of the individual’s imperfectly-observed genotype.

    View all citing articles on Scopus
    View full text