S. Piry, A. Alapetite, J.-M. Cornuet, D. Paetkau, L. Baudouin, A. Estoup, GENECLASS2: A Software for Genetic Assignment and First-Generation Migrant Detection, Journal of Heredity, Volume 95, Issue 6, November/December 2004, Pages 536–539, https://doi.org/10.1093/jhered/esh074
Abstract
GENECLASS2 is a software program that computes various genetic assignment criteria to assign or exclude reference populations as the origin of diploid or haploid individuals, as well as of groups of individuals, on the basis of multilocus genotype data. In addition to traditional assignment aims, the program allows the specific task of first-generation migrant detection. It includes several Monte Carlo resampling algorithms that compute, for each individual, its probability of belonging to each reference population or of being a resident (i.e., not a first-generation migrant) in the population where it was sampled. A user-friendly interface facilitates the treatment of large datasets.
The general aim of genetic assignment methods is to assign or exclude reference populations as possible origins of individuals on the basis of multilocus genotypes. Faster and cheaper development of highly polymorphic genetic markers, such as microsatellites, has increased the efficiency of such methods (see Estoup et al. [1998] for an empirical comparison of assignment results using microsatellite and protein loci). Genetic assignment methods are useful in addressing issues such as relationships, structure, and classification at the individual level (reviewed in Estoup and Angers [1998]). Because assignment methods allow us to draw inferences about where individuals were or were not born, they also have the potential to provide direct estimates of real-time dispersal through the detection of immigrant individuals (Paetkau et al., 2004; Rannala and Mountain 1997).
Several methods based on different assignment criteria computed for likelihood estimation have been developed to reach these goals (Cornuet et al. 1999; Paetkau et al. 1995; Rannala and Mountain 1997). The question of individual assignment to population samples also prompted the development of statistical methods that distinguish real immigrant individuals from resident individuals that are “misassigned” by chance, that is, residents whose genotype is most likely to occur in a population other than the one in which they were sampled; wrongly excluding such a resident is a type I error. Monte Carlo resampling methods have been proposed to identify a statistical threshold beyond which individuals are likely to be excluded from a given reference population sample (Cornuet et al. 1999; Paetkau et al. 2004; Rannala and Mountain 1997). The principle behind these resampling methods is to approximate the distribution of genotype likelihoods in a reference population sample and then compare the likelihood computed for the to-be-assigned individual to that distribution. Paetkau et al. (2004) have shown that the Monte Carlo resampling methods of Cornuet et al. (1999) and Rannala and Mountain (1997) generally result in an excess of resident individuals being excluded. In fact, identifying accurate critical values requires resampling methods that preserve the linkage disequilibrium deriving from recent generations of immigrants (i.e., admixture linkage disequilibrium) (Stephens et al. 1994) and that reflect the sampling variance inherent in the limited size of reference datasets. Paetkau et al. (2004) proposed a new Monte Carlo resampling method that takes these aspects into account and better controls type I error rates. In particular, this resampling method was found to perform better than the others for the detection of first-generation migrants (Paetkau et al. 2004).
Most available computer programs have been developed for assigning individual diploid genotypes (e.g., GENECLASS [Cornuet et al. 1999] and IMMANC [Rannala and Mountain 1997]). GENECLASS2 provides an efficient and user-friendly tool that (1) computes various genetic assignment criteria used for likelihood estimation, (2) treats datasets with diploid or haploid data, (3) assigns or excludes individuals as well as groups of individuals to reference populations, and (4) computes probabilities that each individual belongs to each reference population or is a resident (i.e., not a first-generation immigrant) in the population where the individual has been sampled (cf. the first-generation migrant detection task). For the computations performed in (4), different Monte Carlo resampling algorithms, including that of Paetkau et al. (2004), have been implemented.
Statistical Criteria
Three types of criteria used for likelihood estimation have been implemented in GENECLASS2: genetic distance-based criteria, a criterion directly based on allele frequencies, and Bayesian criteria (see Cornuet et al. [1999] for a comparative study).
Distance Criteria
The assignment criterion is a genetic distance computed between the individual or group of individuals to be assigned and each reference population (Cornuet et al. 1999). The following distances have been implemented: Nei's standard genetic distance (Nei 1972), Nei's minimum genetic distance (Nei 1973 in Nei 1987), Nei's Da distance (Nei et al. 1983), the chord distance (Cavalli-Sforza and Edwards 1967), and a distance taking into account the allele size of microsatellite markers (Goldstein et al. 1995). For a review and mathematical description of these distances, see Takezaki and Nei (1996).
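As an illustration of a distance criterion, the following minimal Python sketch computes Nei's standard distance and Nei's Da distance between one individual and one reference population. It is not the GENECLASS2 code: the function names and the data layout (one dictionary of allele frequencies per locus) are hypothetical, and the sketch assumes that an individual is represented by its own allele frequencies (1.0 for a homozygote, 0.5 and 0.5 for a heterozygote).

```python
import math


def genotype_to_frequencies(genotype):
    """Turn one diploid multilocus genotype into per-locus allele frequencies.

    Assumption of this sketch: a homozygote contributes frequency 1.0 and a
    heterozygote 0.5 for each of its two alleles.
    """
    freqs = []
    for a, b in genotype:
        locus = {}
        locus[a] = locus.get(a, 0.0) + 0.5
        locus[b] = locus.get(b, 0.0) + 0.5
        freqs.append(locus)
    return freqs


def nei_standard_distance(fx, fy):
    """Nei's standard distance: -ln(Jxy / sqrt(Jx * Jy)), summed over loci."""
    jx = jy = jxy = 0.0
    for x, y in zip(fx, fy):
        jx += sum(p * p for p in x.values())
        jy += sum(q * q for q in y.values())
        jxy += sum(p * y.get(a, 0.0) for a, p in x.items())
    if jxy == 0.0:
        return math.inf  # no shared allele at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


def nei_da_distance(fx, fy):
    """Nei's Da distance: 1 - (1/r) * sum over loci and alleles of sqrt(x*y)."""
    shared = 0.0
    for x, y in zip(fx, fy):
        shared += sum(math.sqrt(p * y.get(a, 0.0)) for a, p in x.items())
    return 1.0 - shared / len(fx)


individual = genotype_to_frequencies([("A", "B")])  # one heterozygous locus
population = [{"A": 1.0}]                           # population fixed for A
print(nei_standard_distance(individual, population))
print(nei_da_distance(individual, population))
```

With a distance criterion, the individual is assigned to the reference population for which the distance is smallest.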
Frequency Criterion
Bayesian Criteria
Self-Assignment and Detection of Migrants
These procedures are based on the computation of the above statistical criteria for all individuals included in a given population dataset. When the population considered is the one in which the individual was sampled (i.e., the self-assignment task), each individual is excluded from its own population sample when its criterion is computed (leave-one-out procedure; Efron 1983).
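The leave-one-out step can be sketched as follows in Python. This is an illustrative sketch only: the function name and the data layout (each individual is a list of diploid genotypes, one tuple of two alleles per locus) are assumptions, not the GENECLASS2 implementation.

```python
from collections import Counter


def leave_one_out_frequencies(sample, index, locus):
    """Allele frequencies at one locus, estimated without individual `index`.

    sample: list of individuals; each individual is a list with one
            (allele, allele) tuple per locus (hypothetical layout).
    """
    counts = Counter()
    for i, individual in enumerate(sample):
        if i == index:
            continue  # the focal individual does not contribute to its own population
        counts.update(individual[locus])
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


sample = [[("A", "A")], [("A", "B")], [("B", "B")]]
# Frequencies used to evaluate individual 0 against its own population.
print(leave_one_out_frequencies(sample, 0, 0))
```

Excluding the focal individual keeps its own alleles from inflating the likelihood of its home population.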
For the specific task of first-generation migrant detection, the statistical criterion computed for likelihood estimation can be one of three types: (1) the likelihood of the individual genotype within the population where the individual has been sampled (L_home), (2) the ratio of L_home to the highest likelihood value among all available population samples including the population where the individual was sampled (L_max) (Paetkau et al. 2004), and (3) the ratio of L_home to the highest likelihood value among all population samples excluding the population where the individual was sampled (L_max_not_home). Note that the likelihood ratios L_home/L_max and L_home/L_max_not_home have more power than the L_home statistic (cf. Paetkau et al. [2004] for L_home/L_max versus L_home, and unpublished results for L_home/L_max_not_home versus L_home). Such likelihood ratios are appropriate when all source populations for immigrants are thought to be sampled. However, if some source populations are clearly missing, it becomes more appropriate to use L_home as the test statistic for the detection of first-generation migrants.
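The three test statistics can be illustrated with a short Python sketch. It assumes that the likelihood of the individual's genotype in each population has already been computed and is stored on a log scale, so that each ratio becomes a difference; the function name, the dictionary layout, and the use of logarithms are assumptions of this sketch rather than a description of the program's internals.

```python
def migrant_statistics(log_likelihoods, home):
    """Return the three statistics for one individual, on a log scale.

    log_likelihoods: dict mapping population name -> log-likelihood of the
                     individual's genotype in that population (hypothetical
                     layout; at least two populations are required).
    home:            name of the population where the individual was sampled.
    """
    l_home = log_likelihoods[home]
    l_max = max(log_likelihoods.values())
    l_max_not_home = max(
        value for pop, value in log_likelihoods.items() if pop != home
    )
    return {
        "L_home": l_home,
        "L_home/L_max": l_home - l_max,                    # always <= 0
        "L_home/L_max_not_home": l_home - l_max_not_home,  # can be > 0
    }


logs = {"pop1": -10.0, "pop2": -8.0, "pop3": -12.0}
print(migrant_statistics(logs, home="pop1"))
```

In this example the individual sampled in pop1 is more likely in pop2, so both ratios are negative on the log scale; whether that is enough to flag it as a migrant depends on the resampling threshold.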
Probability Computation
The program computes, for each individual, the probability that its multilocus genotype is encountered in a given population. Monte Carlo methods are used to simulate the multilocus genotypes of a large number of individuals (e.g., 1000 or 10,000). The assignment criterion values of the simulated individuals are then computed, stored, and sorted, so that the probability of an observed multilocus genotype can be estimated from the rank of its criterion value within the distribution of simulated criterion values (Cornuet et al. 1999; Rannala and Mountain 1997).
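The rank-based estimate can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the program: genotypes are simulated by drawing two alleles independently per locus from fixed allele frequencies, the criterion is a simple log-likelihood under random mating, and all names are hypothetical.

```python
import bisect
import math
import random


def log_likelihood(genotype, freqs):
    """Log10 likelihood of a diploid genotype given per-locus allele frequencies."""
    total = 0.0
    for (a, b), f in zip(genotype, freqs):
        p = f.get(a, 0.0) * f.get(b, 0.0) * (1 if a == b else 2)
        if p == 0.0:
            return float("-inf")  # allele absent from the reference frequencies
        total += math.log10(p)
    return total


def simulate_genotype(freqs, rng):
    """Draw two alleles per locus, independently, from the allele frequencies."""
    genotype = []
    for f in freqs:
        alleles = list(f)
        a, b = rng.choices(alleles, weights=[f[x] for x in alleles], k=2)
        genotype.append((a, b))
    return genotype


def rank_probability(observed, freqs, n_sim=1000, seed=0):
    """Proportion of simulated criterion values not exceeding the observed one."""
    rng = random.Random(seed)
    simulated = sorted(
        log_likelihood(simulate_genotype(freqs, rng), freqs) for _ in range(n_sim)
    )
    return bisect.bisect_right(simulated, log_likelihood(observed, freqs)) / n_sim


freqs = [{"A": 0.9, "B": 0.1}] * 5
print(rank_probability([("A", "A")] * 5, freqs))  # most common genotype
print(rank_probability([("B", "B")] * 5, freqs))  # very rare genotype
```

An individual would be excluded from the population when this probability falls below the chosen threshold (for example 0.01). For a distance criterion the comparison would run in the opposite direction, since large distances, not small ones, indicate a poor fit.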
Historically, probabilities were computed by simulating a single set of a large number of individuals by the random drawing of alleles using allele frequencies directly estimated from the reference population samples (e.g., the programs GENECLASS [Cornuet et al. 1999] or IMMANC [Rannala and Mountain 1997]). Paetkau et al. (2004) show that these Monte Carlo resampling methods introduce a bias that leads to overrejection of resident individuals. To correct for this bias, Paetkau et al. (2004) propose a new Monte Carlo resampling method which has been implemented in GENECLASS2, in addition to those used in Cornuet et al. (1999) and Rannala and Mountain (1997). This new simulation algorithm generates population samples of the same size as the reference population sample. The assignment criterion is then computed for each individual of the newly simulated population minus itself (leave-one-out procedure). The program iterates until the total number of simulated assignment criterion values is reached (e.g., 1000 or 10,000). Because this method takes into account the sample size of the reference population, it better reflects the sampling variance associated with the analyzed dataset than the resampling procedure of Cornuet et al. (1999) and Rannala and Mountain (1997).
The second important feature of the resampling method of Paetkau et al. (2004) is that the multilocus genotypes of the simulated individuals are generated by the random drawing of multilocus gametes, using the following procedure. In the case of sexually reproducing diploid individuals, one individual (i.e., a potential “parent”) is randomly drawn from the reference population sample, and, for each locus, one of its two gene copies (with the corresponding allelic state) is randomly drawn. A second gamete is constructed in the same way from a second individual (“parent”), and the two gametes are combined to give a simulated multilocus diploid genotype. This method was also adapted to haploid individuals with a diploid reproduction phase. The random generation of gametes as a basis for constructing simulated individual genotypes has the advantage of preserving the potential admixture linkage disequilibrium deriving from recent generations of immigrants (Paetkau et al. 2004).
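The gamete-based simulation described above, together with the generation of samples of the same size as the reference sample and the leave-one-out computation, can be sketched in Python. This is a minimal sketch under stated assumptions, not the published algorithm's code: the two parents are drawn with replacement, `criterion` is a hypothetical user-supplied function taking an individual and the rest of its simulated sample, and all names are illustrative.

```python
import random


def draw_gamete(parent, rng):
    """Pick one of the two gene copies at each locus of a diploid parent."""
    return [rng.choice(locus) for locus in parent]


def simulate_individual(reference, rng):
    """Combine gametes from two randomly drawn parents of the reference sample.

    Assumption of this sketch: parents are drawn with replacement.
    """
    gamete1 = draw_gamete(rng.choice(reference), rng)
    gamete2 = draw_gamete(rng.choice(reference), rng)
    return list(zip(gamete1, gamete2))


def simulate_sample(reference, rng):
    """Simulated population sample of the same size as the reference sample."""
    return [simulate_individual(reference, rng) for _ in reference]


def simulated_criterion_values(reference, criterion, n_values=1000, seed=0):
    """Collect criterion values, each computed with the leave-one-out procedure."""
    rng = random.Random(seed)
    values = []
    while len(values) < n_values:
        sample = simulate_sample(reference, rng)
        for i, individual in enumerate(sample):
            rest = sample[:i] + sample[i + 1:]  # simulated sample minus itself
            values.append(criterion(individual, rest))
            if len(values) == n_values:
                break
    return values


reference = [
    [("A", "A"), ("x", "y")],
    [("A", "B"), ("y", "y")],
    [("B", "B"), ("x", "x")],
]
# Placeholder criterion: size of the sample the individual is compared with.
values = simulated_criterion_values(reference, lambda ind, rest: len(rest), 10)
print(values)
```

Because whole gametes are taken from real individuals, alleles that co-occur across loci in recent immigrants tend to stay together in the simulated genotypes, which independent per-locus allele draws would not preserve.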
It is worth noting that, while the computation of the assignment criteria was generalized for any level of ploidy or for groups of individuals, the three resampling methods implemented in GENECLASS2 only apply to haploid or diploid biological organisms with a sexual reproduction phase.
The computation of the probability of belonging is relatively time consuming. Typically, computation without the probabilities of belonging option takes a few seconds to a few minutes to run, while the computation of probabilities takes a few minutes to several hours, depending on the sizes of the analyzed and reference datasets, the number of simulated individuals, and the speed of the processor. Note also that the computations based on the algorithm of Paetkau et al. (2004) are more time consuming than those based on the algorithm of Cornuet et al. (1999) or Rannala and Mountain (1997).
Management of Missing Data
Missing data are treated as follows. A locus is excluded from all computations when one or more reference population samples have no observation (i.e., no genotype) at this locus. On the other hand, when the to-be-assigned entity (individual or group of individuals) has been genotyped for several loci but not for a locus l, computations are done using all loci except locus l. A list of used and unused loci is given for each individual computation in the output file. It is worth mentioning that criterion values are not comparable among individuals when they are based on different numbers of loci.
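These rules can be expressed as a short Python sketch. The data layout (a missing genotype coded as `None`) and the function names are assumptions made for illustration; they do not describe the program's internal representation.

```python
def has_observation(genotypes, locus):
    """True if at least one genotype in the list is observed at this locus."""
    return any(genotype[locus] is not None for genotype in genotypes)


def split_loci(reference_pops, entity, n_loci):
    """Return (used, unused) locus indices for one to-be-assigned entity.

    reference_pops: dict population name -> list of multilocus genotypes.
    entity:         list of multilocus genotypes (one item for an individual,
                    several for a group); missing genotypes are None.
    """
    used, unused = [], []
    for locus in range(n_loci):
        in_all_references = all(
            has_observation(genotypes, locus)
            for genotypes in reference_pops.values()
        )
        if in_all_references and has_observation(entity, locus):
            used.append(locus)
        else:
            unused.append(locus)
    return used, unused


reference_pops = {
    "pop1": [[("A", "A"), ("x", "y"), None], [("A", "B"), None, None]],
    "pop2": [[("B", "B"), ("x", "x"), ("k", "k")]],
}
individual = [[None, ("x", "y"), ("k", "m")]]
print(split_loci(reference_pops, individual, n_loci=3))
```

In this example locus 0 is dropped because the individual is not genotyped there, and locus 2 is dropped because pop1 has no observation at it, so only locus 1 is used.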
Program Features
Input Data Files
For the simple computation of criteria, a data file containing a mixture of diploid and haploid data is accepted, and the level of ploidy is taken into account during computations. However, probabilities based on Monte Carlo resampling methods are computable only for data files containing either diploid or haploid data, not a mixture of both. The file formats accepted by GENECLASS2 are those used by the following population genetics software programs: GENEPOP (Raymond and Rousset 1995), GENETIX (Belkhir et al. 1996–2001), and FSTAT (Goudet 1995). GENECLASS2 also converts input data files into any of these three file formats.
Datasets of virtually any size can be treated, provided the computer has enough memory to load the data and make the computations. Two files are needed for the assignment of individuals not included in the reference population sample. One file contains the reference dataset, the other the individuals or groups of individuals to be assigned. Only a single file is needed when the to-be-assigned individuals are included in the reference population sample (cf. self-assignment or migrants detection).
Output Data Files
All results can be printed or saved in CSV format (i.e., values are written by rows, with fields separated by semicolons) for further treatment in a spreadsheet (e.g., Microsoft Excel). A tool included in GENECLASS2 also displays basic summary statistics of the analyzed datasets: the number of alleles and genes per population and locus, allelic frequencies, heterozygote proportions, and Nei's gene diversity (i.e., expected heterozygosity) (Nei 1987).
Running Environment
GENECLASS2 was developed in the Object Pascal programming language and compiled with Borland Delphi 6 and Kylix 2. The software can therefore be run on a Microsoft Windows or a Linux platform. An easy-to-use graphical interface has been designed to guide the user through the assignment process: choice of task (assign/exclude population as origin of individuals or detection of migrants), choice of statistical criterion for likelihood estimation, computation of probabilities, etc. The package includes a user-friendly help file with graphical interfaces that explain how to run the program to perform the two above tasks.
Program Availability
GENECLASS2 is freely available in English or French at http://www.montpellier.inra.fr/CBGP/softwares. A self-extracting setup executable leads the user through installation of the software on Windows-based machines. An RPM file containing the program file and the libraries allows the package to be installed on Mandrake and Red Hat Linux platforms. A .tar.Z archive allows manual installation of the binaries on other kinds of Linux platforms. A registration form allows users to be kept informed of new releases.
Corresponding Editor: Sudhir Kumar
This work was supported by grants from the CIRAD/INRA on “Approches biomathématiques et biotechnologiques pour l'identification génétique et la gestion adaptée des populations animales et végétales” (to S.P., J.-M.C., and L.B.) and from the INRA SPE department on methods associated with real-time population genetics (to J.-M.C. and A.E.).
References
Baudouin L and Lebrun P,
Belkhir K, Borsa P, Chikhi L, Raufaste N, and Bonhomme F,
Cavalli-Sforza LL and Edwards AWF,
Cornuet JM, Piry S, Luikart G, Estoup A, and Solignac M,
Efron B,
Estoup A and Angers B,
Estoup A, Rousset F, Michalakis Y, Cornuet JM, Adriamanga M, and Guyomard R,
Goldstein DB, Ruiz Linares A, Cavalli-Sforza LL, and Feldman MW,
Nei M, Tajima F, and Tateno Y,
Paetkau D, Calvert W, Stirling I, and Strobeck C,
Paetkau D, Slade R, Burden M, and Estoup A,
Rannala B and Mountain JL,
Raymond M and Rousset F,
Stephens JC, Briscoe D, and O'Brien SJ,