Qing-Peng Kong, Hans-Jürgen Bandelt, Chang Sun, Yong-Gang Yao, Antonio Salas, Alessandro Achilli, Cheng-Ye Wang, Li Zhong, Chun-Ling Zhu, Shi-Fang Wu, Antonio Torroni, Ya-Ping Zhang, Updating the East Asian mtDNA phylogeny: a prerequisite for the identification of pathogenic mutations, Human Molecular Genetics, Volume 15, Issue 13, 1 July 2006, Pages 2076–2086, https://doi.org/10.1093/hmg/ddl130
Knowledge about the world phylogeny of human mitochondrial DNA (mtDNA) is essential not only for evaluating the pathogenic role of specific mtDNA mutations but also for performing reliable association studies between mtDNA haplogroups and complex disorders. In the past few years, the main features of the East Asian portion of the mtDNA phylogeny have been determined on the basis of complete sequencing efforts, but representatives of several basal lineages were still lacking. Moreover, some recently published complete mtDNA sequences apparently did not fit into the known phylogenetic tree and conflicted with the established nomenclature. To refine the East Asian mtDNA tree and resolve data conflicts, we first completely sequenced 20 carefully selected mtDNAs (likely representatives of novel sub-haplogroups) and then, in order to distinguish diagnostic mutations of novel haplogroups from private variants, we applied a ‘motif-search’ procedure to a large sample collection. The novel information was incorporated into an updated East Asian mtDNA tree encompassing more than 1000 (near-)complete mtDNA sequences. A reassessment of the mtDNA data from a series of disease studies testified to the usefulness of such a refined mtDNA tree in evaluating the pathogenicity of mtDNA mutations. In particular, the claimed pathogenic role of mutations G3316A, T3394C, A4833G and G15497A appears to be most questionable, as those initial claims were derived from anecdotal findings rather than, e.g., appropriate association studies. Following a guideline based on the phylogenetic knowledge as proposed here could help to avoid similar problems in the future.
INTRODUCTION
Sequencing of entire molecules represents the ultimate approach to acquire information from the maternally inherited mitochondrial genome. With this approach in the past few years, more and more data have been obtained for reconstructing the world mitochondrial DNA (mtDNA) phylogeny and for discerning the phylogenetic status of the (sub)continentally specific haplogroups (1–15). A well-reconstructed phylogeny helps to gain unique and valuable insights for elucidating human evolution and pioneer settlement patterns; for instance, complete mtDNA sequence data indicate that our ancestors adopted a single route [viz. the ‘southern route’; (16)] to leave Africa and then migrated rapidly along the Asian coast (8,10,15,17,18). In mitochondrial disease studies, sequencing of the entire mitochondrial genome has become a routine method too. In this regard, the phylogenetic system of the world's mtDNAs has a great potential in assessing the pathogenicity of mtDNA mutations [e.g. see (19,20)].
In case–control studies, the attribution of a pathogenic or predisposing role to a specific variant is essentially based on the observation that the frequency of such a variant is significantly higher in the case subjects than in the normal controls; such results, however, could be easily affected by, e.g., population stratification (leading to false positive associations; type I error; 21–23). Even with well-matched controls, a significant association between the variant and the disease is not sufficient to prove its role as a disease-marker because of a number of reasons, including that some other mutation tightly linked with the nucleotide variant under study could be responsible for the significant association. Such a scenario is even more likely in a non-recombining molecule, such as mtDNA. Fortunately, the now emerging worldwide mtDNA phylogeny can provide a comprehensive understanding of where, when and how many times certain mtDNA mutations occurred, thus helping to evaluate the role of the variant under study.
To acquire a well-reconstructed mtDNA phylogeny, high-quality genome data are indispensable. Furthermore, to facilitate communication and avoid misunderstandings such a phylogeny has to be encoded by employing a consistent nomenclature. In this aspect, the recent inclusion in the East Asian mtDNA phylogeny of 672 new Japanese mtDNA genomes (24) should be treated with caution, as the genome data set bears a number of shortcomings, some of which are even reflected in the published summary trees (see Materials and Methods for further discussion); in addition, haplogroup designation partially conflicts with the already existing nomenclature and some haplogroup definitions seem to be problematic in a few instances. Meanwhile, with an increasing number of East Asian mtDNAs being analyzed, some novel (sub)haplogroups were recognized and awaited further characterization.
To address these issues, we reconstructed an updated East Asian mtDNA phylogeny based on the available complete sequences plus 20 newly sequenced ones. This resulting refined phylogeny was then employed to re-evaluate some previous disease studies that lacked a phylogenetic perspective. It appears that some mutations have repeatedly been the target of disease studies simply because they were not recorded in MITOMAP (http://www.mitomap.org/) although they clearly constitute polymorphisms in East Asia. With the phylogenetic bookkeeping of mutations at hand, such cases would demand a well-planned experimental design for extensive association studies (25). Moreover, the mtDNA phylogeny would also assist in the phylogenetic proof-reading of complete sequencing attempts (19) [see (26) for a response] as well as of packages of seemingly somatic mutations (20).
RESULTS
Reconciliation of previous nomenclature conflicts
A number of conflicts in the East Asian nomenclature have arisen among past studies, most of which were introduced because already existing haplogroup names were neglected or because results were not updated in the light of the most recent information. For example, the widely accepted haplogroup definitions M8a and M8 (5,27,28) were altered to S and CSZ, respectively, by Starikovskaya et al. (29). In the same study, the newly named haplogroup C2 is identical to the already defined haplogroup C4 (5). Table S1 (Supplementary Material) summarizes the conflicts as well as our corresponding reconciliation according to a chronological criterion (see Materials and Methods). Moreover, with our newly collected sequences, it was also possible to gain more insights into the phylogeny of haplogroup G, whose overall variation was still poorly understood.
As shown in Figure 1, the haplogroup G phylogeny, especially for G3 [defined by the control-region motif 16223-16274-16362 (30)], has been refined considerably. It is now evident that haplogroup G3 comprises two potential sister clades: G3a (further characterized by mutations at 143 and 15746) and G3b (recognizable by mutations at 13477, 14605 and 15927). Here we follow the convention that G3 is characterized by the 16274 transition (30), which, however, is now seen to be the sole mutation shared by both G3 branches. In retrospect, this definition was too optimistic in anticipating coding-region mutations accompanying the transition at 16274, which is a highly variable site. In haplogroup G2, a new branch appeared, named G2b, whose signature mutations (transitions at 3593, 4853, 8877, 11151, 16172 and a back mutation at 263) were identified by motif search (Supplementary Material, Table S2). The G sub-haplogroup ‘G3’, as identified by Tanaka et al. (24), is renamed here as G4 (Figure 1 or Supplementary Material, Table S1) and is determined by an array of mutations: the insertion 191+A, transitions at 194, 4541, 5051, 5460, 6216, 7521, 7660, 9670, 11383 and 15940 and the 16114A transversion.
Newly identified haplogroups
Our motif-search results (Supplementary Material, Table S2) generally support the definitions of some previously described (sub)haplogroups (5,24) for which only limited information was available, namely, G1a, M7b1, M7b2, F1a, F1c, F2a, R9b, B4a1, B5a and Y1. Our study also provides additional information for supporting several recently identified haplogroups. For instance, haplogroup M13 [here we adopted the nomenclature of (27) which was published online earlier than that of (24); cf. Supplementary Material, Table S1], originally identified by the transition motif 10411-16145-16188-16223 (27), is further characterized by eight mutations at sites 152, 3644, 5773, 6023, 6253, 6620, 10790 and 15924. The major sub-clade, M13a, of haplogroup M13 is defined by the coding-region transition at 13135. The lineage with the hypervariable segment (HVS) I motif 16086-16297-16324 that was suggested to belong to M7b and named M7b3 in Yao et al. (27), gets additional support from the coding region (Fig. 1). Similarly, the newly defined haplogroup B6 (27) gets clear support from coding-region information (Fig. 2). A novel clade, which was characterized by mutations at 9968, 16261 and 16292, emerged in haplogroup B4 and is named B4g here. Furthermore, a new basal haplogroup, M12, could be identified. It shares the mutation 14569 with haplogroup G and was shown to be further defined by mutations at 4170, 5580, 12030, 12372, 14727, 15010, 16172, 16234 and 16290 (see Fig. 1 or Supplementary Material, Table S2).
Updated East Asian mtDNA tree
The East Asian phylogenetic tree (Figs 1 and 2) was greatly refined on the basis of >1000 (near-)complete mtDNA sequences available so far: a large number of novel ‘boughs’ and ‘twigs’ (30) were recognized, thus largely broadening our understanding of the East Asian mtDNA phylogeny. In particular, a few artificial branches that were introduced in the past by relying predominantly on early (problematic) mtDNA data from disease studies (cf. 5,19,30) could now be eliminated. For example, the formerly defined haplogroup D4b1 turned out to be a sub-clade of what was then called haplogroup D4b2 (5), which was introduced because of the omission of variant 15440 in one haplogroup D4b sequence reported by Ozawa et al. (31). Subsequently, the definition of D4b was broadened as being based solely on the 8020 transition (24): it embraces sub-haplogroups D4b1 (identifiable by variants 10181, 15440, 15951 and 16319) and D4b2 (defined by 1382C, 8964 and 9824A) (Fig. 1).
Pathogenic mutations in the mtDNA tree?
The bookkeeping of mtDNA mutations as organized in the current updated tree for East Asian mtDNAs permits the researcher to distinguish shared mutations of considerable age from private mutations of potentially much younger age. For instance, the substitution 3644 announced as ‘uncharacterized’ by Munakata et al. (32) is in fact characteristic of haplogroups M13 and D4h (Fig. 1) and thus is widely distributed across East Asia, although at a minor frequency. The mtDNAs of the other five cases also presented in Table 1 of Munakata et al. (32) can then be tentatively allocated to East Asian haplogroups: the transition 13651 defines D4a1b; 12311 defines G2a3; 11084 defines M7a1a and 4705 defines F1b1a1.
Similarly, a number of earlier claims for pathogenic and somatic mtDNA mutations clearly need revision. For example, Chiu et al. (33) analyzed the entire mitochondrial genomes of 10 patients suffering from hydatidiform mole and one affected by choriocarcinoma, for which only the putatively ‘new mtDNA polymorphisms’ were reported. However, the vast majority of these mutations can be found in previously published complete mtDNA genome data. Specifically, mutations at 6392, 10535 and 13928 detected in patient M72 define haplogroup F2, those at 10031, 11061 and 13681 in patient M78 define haplogroup R11, and variants 5585, 5913, 6392 and 10320 in patient M89 are characteristic of haplogroup F3 (5). Moreover, Chiu et al. (33) claimed to have detected six somatic point mutations (viz. transitions at sites 153, 489, 1719, 16051, 16209 and 16519) in a single choriocarcinoma sample; however, these mutations have jointly been observed in samples Hui45 [(27) and this study] and Bouyei50 (Supplementary Material, Table S2) which belong to haplogroup M9b, as revealed by our present complete sequencing efforts. Evidently, lab contamination or sample mix-up would easily explain the results, and the authors’ claim that the high frequency of ‘somatic’ mutations is responsible for the gestational trophoblastic disease is therefore unsupported. This case would thus parallel a multitude of similar instances of sample mix-up, which appear to be predominant in the field of cancer research (20).
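The reallocation of the ‘new polymorphisms’ of Chiu et al. (33) amounts to checking each patient's reported variants against known haplogroup motifs. A minimal sketch in Python, assuming only the partial motifs quoted above for F2, R11 and F3 (not the full haplogroup definitions of the tree) and a hypothetical helper named `candidate_haplogroups`, might look as follows:

```python
# Partial motifs as quoted in the text above; the full definitions in the
# tree (Figs 1 and 2) contain more information than is reproduced here.
MOTIFS = {
    "F2": {6392, 10535, 13928},
    "R11": {10031, 11061, 13681},
    "F3": {5585, 5913, 6392, 10320},
}

# Variants reported for three patients of Chiu et al. (33), as listed above.
PATIENTS = {
    "M72": {6392, 10535, 13928},
    "M78": {10031, 11061, 13681},
    "M89": {5585, 5913, 6392, 10320},
}


def candidate_haplogroups(variants, motifs=MOTIFS):
    """Return haplogroups whose complete motif is contained in `variants`.

    Hypothetical helper: a haplogroup is listed only if every site of its
    motif was observed, so a single shared site (e.g. 6392) is not enough.
    """
    return sorted(name for name, motif in motifs.items() if motif <= variants)


for patient, variants in PATIENTS.items():
    print(patient, candidate_haplogroups(variants))
```

Requiring the whole motif rather than a single site is a deliberate choice in this sketch: site 6392 occurs in both the F2 and the F3 motif, so a one-site lookup would be ambiguous.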
DISCUSSION
In spite of having utilized more than 1000 complete genomes for phylogeny reconstruction, we cannot expect that our mtDNA tree covers absolutely all deep lineages in the East Asian matrilineal gene pool. In view of the enormous population size in East Asia (>1.4 billion), some basal additions and peripheral refinement will become necessary in the near future with more and more data available. This entails that not every novel combination of previously observed or hitherto unobserved mtDNA mutations found in some patient would necessarily point to mutations that were very recently acquired in the matriline. Only complete sequencing of the entire mtDNA and systematic searches for potentially related mtDNAs could then help to clarify the situation.
Knowledge about the mtDNA phylogeny is particularly necessary for disease studies when assessing the pathogenicity of certain mtDNA mutation(s). By elucidating the evolutionary relationships of certain mtDNA variants in a geographic context, we can (i) pinpoint the potential sequencing or documentation errors that could easily lead to spurious positive associations with diseases, (ii) anticipate the potential effect of population stratification, which would cause serious problems with case–control studies when ignored and (iii) distinguish the haplogroup-diagnostic variants from the private ones and determine whether the single mutation or a set of mutations or matrilineal backgrounds could play a role in the susceptibility to a disease. Unfortunately, clinical studies have hitherto shown a systematic lack of attention to the available mtDNA phylogenetic knowledge, which had some serious consequences (19,20,34–36).
Effect of population stratification
The 3394 transition has a long record as a suspect for association with diseases, although its status as a pathogenic mutation is still ‘unclear’ according to MITOMAP. Johns et al. (37) had first brought the 3394 mutation into play in connection with patients of Leber's hereditary optic neuropathy (LHON). The mtDNAs of those patients harboring the 3394 change evidently belonged to haplogroup J1c1, for which this mutation is characteristic within haplogroup J1c (8). Quite a different scenario had been evoked in Japan, where this mutation was believed to contribute to hypertrophic cardiomyopathy (38). In East Asia, however, this mutation is a marker mainly for haplogroup M9a’b (besides D4g1a). In fact, the mtDNA of patient 6 described in Obayashi et al. (38) is a member of haplogroup M9a1. Later, this mutation was assumed to contribute to the manifestation of non-insulin-dependent diabetes mellitus (NIDDM) (39–41) and was most recently observed in the case of carnitine palmitoyl-transferase II deficiency and ragged red fibers at muscle biopsy (42).
One of the arguments put forward by Hirai et al. (39) was that the mutation at 3394 was observed in 18/365 (4.9%) of the NIDDM patients but only in 3/225 (1.3%) of non-diabetic controls, which at face value could be regarded as significant. It is then interesting to compare these figures with the corresponding percentages in different mtDNA samples from Japan. In the data of Tanaka et al. (24), the frequency of the 3394 variant is 14/672 (2.1%), due to 11 M9a mtDNAs (contributing 1.6% alone) and 3 D4g1a mtDNAs. Judging from the HVS-I motifs, the haplogroup M9a contribution alone would, for example, be as high as 6/150 (4.0%) in Nishimaki et al. (43), take intermediate values of 5/162 (3.1%) in the HVS-I&II data set of Imaizumi et al. (44) and 5/231 (2.2%) in the HVS-I data set of Tajima et al. (45), but merely 2/211 (0.9%) in the data from Maruyama et al. (46). Thus, by extrapolation from the Tanaka et al. sample (under the optimistic hypothesis that no other minor haplogroup would bear the 3394 mutation), we would expect the frequency of the 3394 transition to range freely between 1 and 5% in samples of size >150, depending on the sampling region, thus framing the frequencies found in both the patient and control cohorts by Hirai et al. (39). It is certainly not surprising that rather minor haplogroups, such as M9a, can show considerably different frequencies across a broad geographical area, which could turn out to be significantly different with large enough samples and thus positively testify to population substructure. Arbitrarily drawn control groups, constituting convenience samples, are never controlled for such hidden variables. As repeatedly demanded in the literature (21–23), control for population stratification is crucial in order to avoid spurious positive association in population-based disease studies, such as the classical case–control association design.
The burden of proof for demonstrating that population structure did not influence the results rests on the researcher who has designed the study.
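The arithmetic behind this argument can be retraced with a short Python sketch. It uses only the counts quoted above; Fisher's exact test is our choice for illustration (the test actually used by Hirai et al. is not restated here), and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are cohorts (patients, controls); columns are carriers and
    non-carriers of the variant. Illustrative implementation only.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k carriers in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hirai et al. (39): 18/365 patients versus 3/225 controls carry 3394.
p_value = fisher_exact_two_sided(18, 365 - 18, 3, 225 - 3)
print(f"Fisher exact p (patients vs controls): {p_value:.4f}")

# M9a-based 3394 frequencies in the Japanese samples quoted in the text.
samples = {
    "Nishimaki et al. (43)": (6, 150),
    "Imaizumi et al. (44)": (5, 162),
    "Tajima et al. (45)": (5, 231),
    "Maruyama et al. (46)": (2, 211),
}
percentages = {name: 100 * k / n for name, (k, n) in samples.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct:.1f}%")
```

The point of the comparison is that the spread among unselected Japanese samples is about as wide as the patient versus control difference, so a small p-value for the 2x2 table does not by itself exclude stratification.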
Haplogroup-diagnostic mutation in disease study
An interesting case of a haplogroup-defining mutation that came up in connection with a number of diseases and mitochondrial disorders, mainly in patients from Japan [see (47) or MITOMAP for pertinent references], is represented by the 3316 transition, which defines haplogroup D4e1. This haplogroup encompasses the Native Northwest American haplogroup D2, which is, for instance, virtually fixed in the Aleuts of the Commander Islands (48). The 3316 mutation is, however, not restricted to haplogroup D4e1 but actually found in mtDNAs of all continental origins (cf. 3,49). Grazina et al. (47) reported the 3316 mutation along with the 3337 transition in a (Portuguese) patient with frontotemporal dementia (FTD). To our knowledge, there exists no record that would claim considerable frequency of FTD in the Aleuts. Therefore, the anecdotal finding of this mutation in a Portuguese patient cannot suggest potential disease status of the 3316 variant, especially as another mutation (at 3337) was also observed in the patient.
To give another example, the 4833 transition was seen in connection with abnormal glucose tolerance (50), but it was not realized that this transition actually defines the entire haplogroup G and does not appear in any other part of the mtDNA phylogeny (among more than 2500 complete or coding mtDNA genomes scrutinized). Indeed, the 4833 change, recognizable by the gain of a HaeII site at 4830, originally defined the East Asian haplogroup G within the high-resolution system of restriction fragment length polymorphisms and has therefore been known for a long time.
The 15497 transition was claimed to be associated with obesity-related variables and lipid metabolism (51), but this mutation, in fact, defines haplogroup G1, which is mainly represented by sub-haplogroup G1a1 in Japan [with frequency 3.3%, according to Tanaka et al. (24)]. Another sub-haplogroup, G1b, is even the predominant haplogroup in Koryaks and Itel'men (52). The same mutation was also reported in a haplogroup H4 mtDNA from a 22-year-old male triathlete and thought to be connected with the development of exercise-induced paracrystalline inclusions (53). These authors erroneously asserted that this mutation would not occur among the 560 coding-region sequences provided by Herrnstadt et al. (3,49). However, it does, viz. in sample no. 423, which is closely related to sample no. 11 of Achilli et al. (6). Both samples along with that described by Tarnopolsky et al. (53) belong to a common sub-haplogroup that is supported by transitions at sites 7581, 15497 and 15930 within haplogroup H4. In West Eurasia, the 15497 transition (recognizable by the gain of the DdeI restriction site at 15494) is also characteristic of a sub-haplogroup of haplogroup R1 (54). Tarnopolsky et al. (53) performed functional assays using cybrids (constructed by fusing the subject's platelets with U87MG glioblastoma and SH-SY5Y neuroblastoma rho-zero cells) and found that cybrids containing this mtDNA generated higher basal levels of reactive oxygen species but lower ATP production. Moreover, they were more sensitive to oxygen and glucose deprivation and peroxynitrite generation compared with control cybrids. It is unclear whether the 15497 transition is really responsible for these effects. Most problematically, this mutation, in the same individual, changed from heteroplasmic state (55) to homoplasmic state (53) in two studies performed by the same laboratory.
To put the number of dubious cases into relation to the total list of putative disease-associated mutations, we have taken a survey of mutations seen in mitochondrial encephalomyopathies (published in Neuromuscular Disorders, January 2004) by way of example. From a total of 97 substitutions listed there, six appear in our East Asian mtDNA tree: three are the ‘secondary LHON mutations’ (at 3394, 4216 and 13708) and one is the haplogroup G2a3 mutation at 12311 [first found by Hattori et al. (56) in chronic progressive external ophthalmoplegia] discussed earlier, and the other two are the 11696 transition [found in Leigh syndrome by De Vries et al. (57) and in LHON by Zhou et al. (58)] that defines haplogroup D4j and the 8296 transition [found in diabetes and deafness by Kameoka et al. (59), in myoclonus epilepsy with ragged-red fibers by Arenas et al. (60), in hypertrophic cardiomyopathy by Akita et al. (61) and in MELAS by Sakuta et al. (62)] that defines haplogroup D4a2a. This could suggest that the number of haplogroup-defining mutations that came under suspicion of being associated with a disease is not too large, but any such mutation would be visited repeatedly in the context of different diseases, because, by chance, each cohort of patients could include a member of the respective haplogroups. Specifically, Sato and Hayasaka (63) already suggested in response to the Hattori et al. (56) study that the 12311 mutation may be a polymorphism. As for the 8296 mutation, Arenas et al. (60) judged it of unclear significance and a rare polymorphism. Indeed, it has been shown by Bornstein et al. (64,65) through cybrid analysis that ‘the only effect detected in the A8296G mutation is a moderate decrease in the aminoacylation capacity, which does not affect mitochondrial protein biosynthesis’.
A phylogenetic solution
Before attributing a role in disease expression to a certain mtDNA haplogroup-diagnostic mutation, especially when it appears to be basal, great caution should be taken. As all the haplogroup-diagnostic variants shown deeply in the phylogenetic tree have been under selective pressure for thousands of years, their broad distribution among normal populations would suggest them to be rather benign. It is also possible that, in some cases, haplogroup-diagnostic variant(s) might have an additive (or multiplicative, by way of epistatic effects with some nuclear loci) effect on disease susceptibility, as proposed by the common variant–common disease hypothesis, whereas in other cases, the variant(s) could be beneficial in certain environments but deleterious in others. However, to avoid spurious positive associations [as demonstrated in a recent study on East Asian families with hearing impairment (66)] in association studies, it is strongly recommended that, before drawing any hasty conclusion about the pathogenic role of such variant(s), a phylogenetic test should be performed. Any results that do not pass the test of phylogenetic analysis should be considered with major caution, further studies with a much more refined experimental design should be carried out, and an extended large data set (targeting the same population group) should be screened. Specifically, when a direct functional assay cannot be performed, a feasible solution based on phylogenetic knowledge could provide additional support for a proper evaluation of the pathogenicity of certain haplogroup-specific mutation(s). First, such a study should have a phylogenetic focus and evaluate mtDNA lineages from a large number of well-matched patients and controls in different populations, in order to minimize and/or evaluate the effect of population stratification and efficiently determine whether the single mutation or a set of mutations (or matrilineal background) is associated with the disease.
Secondly, the full motifs of the involved haplogroup and its subclade(s) should be taken into consideration according to the reconstructed phylogenetic tree, so as to rule out the possibility that the observed significant association of a certain haplogroup or haplogroup-diagnostic mutation is in fact attributed to its subclade(s) or the other tightly linked mutation(s). Thirdly, to further evaluate the pathogenicity of the diagnostic mutation, additional evidence could also come from the genetic epidemiological analysis by screening the mutation or the haplogroup in larger samples from general populations, which then helps to reveal whether individuals bearing the mutation or belonging to the haplogroup are more susceptible to the disease. Finally, if possible, further evidence from studying the correlation of the mutation and the disease in other ethnic group(s) of different matrilineal background would be most helpful for a better understanding of the pathogenesis of the disease.
In this spirit, the detailed knowledge of the West Eurasian and East Asian mtDNA phylogenies opens a perspective to reassess the role of mtDNA backgrounds in both mitochondrial and complex disorders, for instance in LHON. In Europe and in populations of predominantly European ancestry, the two frequent primary LHON mutations in ND (NADH dehydrogenase) subunits (viz. 11778/ND4 and 14484/ND6) are consistently associated with haplogroup J (67–70). This suggests that the sequence variation within this haplogroup, in particular the transitions at 4216 (characteristic of haplogroups JT and R2), 10398 (back-mutated on haplogroup J but not on T), 13708 (characteristic of haplogroup J) and 3394 (characteristic of haplogroup J1c1), which cause amino acid changes in ND subunits, increases the penetrance of LHON (69,70). A recent study on LHON revealed that such a preferential association (with 11778/ND4) is attributable to sub-haplogroups J1c and J2b (71), thus well explaining the puzzling observation that this association is not confirmed for West Asian populations (72) and further indicating that only a complex pattern of several mutations combined has a notable influence on the etiology of LHON (71). In order to find out whether single occurrences or specific combinations of mutations play any role in this respect, one could envision large-scale studies that target other (sub)continents, e.g. East Asian-specific haplogroups bearing exactly one or two of those mutations. Specifically, the 4216 transition is part of the motif for haplogroup D5c, the 13708 transition is characteristic of haplogroups F2, D5a1 and D4b2a1, and 3394 of M9a’b and D4g1a; further, the ancestral state at site 10398 is present in virtually all macrohaplogroup M lineages as well as in particular sub-haplogroups of macrohaplogroup N that changed back at 10398 (as is the case for haplogroups R11, B5, B4c1c, F3b1 and Y).
Such analyses, however, demand a reconstructed phylogeny with a much higher resolution. Therefore, the progress of mtDNA genomics studies should not stop at the identification of basal haplogroups, but aim at identifying more specific sub-haplogroups. This will not only help to reveal the subtle patterns in the past evolution of modern humans (cf. 6,7), but also play an important role in determining the involved mutation(s) or lineage(s) in the association studies (cf. 71).
Nevertheless, it is still possible that the same mtDNA variant (or haplotype) could contribute differently to the susceptibility to the disease in different human populations (73), mainly due to epistatic effects between, e.g., some nuclear DNA factor and the mtDNA variant (haplotype) under study. Then, ideally, genotyping of an appropriate number (e.g. 40) of (apparently) neutral unlinked markers (e.g. SNPs) in cases and controls (the genomic control procedure) (74) would help to rule out the existence of a certain level of population stratification that could affect the association study. The structure assessment method popularized by Pritchard et al. (75,76) could be equally helpful, but again, it requires genotyping of unlinked autosomal genetic markers.
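The core of a genomic control correction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any study cited here: it uses the commonly described median-based estimate of the inflation factor, the chi-square values for the null markers are hypothetical placeholders, and the function names are ours.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom (approx.).
CHI2_1DF_MEDIAN = 0.4549


def inflation_factor(null_chi2):
    """Estimate lambda from association statistics of neutral markers.

    Values above 1 suggest that stratification inflates the statistics;
    lambda is bounded below by 1 so that no test is made less conservative.
    """
    return max(1.0, median(null_chi2) / CHI2_1DF_MEDIAN)


def corrected_statistic(chi2, lam):
    """Deflate a candidate marker's 1-d.f. chi-square statistic by lambda."""
    return chi2 / lam


# Hypothetical 1-d.f. chi-square statistics from unlinked neutral SNPs
# (only seven shown; the text suggests e.g. 40 markers).
null_chi2 = [0.10, 0.30, 0.50, 0.70, 0.91, 1.20, 2.00]
lam = inflation_factor(null_chi2)

# Hypothetical statistic for the mtDNA variant under study.
candidate = 5.25
print(f"lambda = {lam:.2f}, corrected chi-square = {corrected_statistic(candidate, lam):.2f}")
```

With an inflated lambda, a nominally significant association can drop below the significance threshold after correction, which is the situation the paragraph above warns about.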
In summary, the combination of virtually all previously reported and our newly produced complete East Asian mtDNA sequences has helped to define a large number of novel (sub)haplogroups as well as further solidify previously reported haplogroups, which in turn has greatly refined the phylogenetic tree of mtDNAs found in East Asian populations. This updated phylogenetic tree with its standardized nomenclature resolves the confusion of nomenclature introduced by inadvertent neglect of the literature and provides an essential reference guide for disease, anthropological and forensic studies in East Asian populations.
MATERIALS AND METHODS
Data collection
To determine the full motifs of the newly emerging novel haplogroups, a total of 20 samples (GenBank accession nos. DQ272107–DQ272126), representing 20 different (sub)haplogroups, were selected from our mtDNA database (27,77,78) (authors’ unpublished data) for complete sequencing. To identify the motifs of novel (sub)haplogroups, we systematically screened candidate mutations in control mtDNAs which appear to be closely related (as judged from the control-region and partial coding-region information) to complete mtDNA sequences constituting so far singular terminal branches of the estimated mtDNA tree, with the aim to distinguish private mutations from shared ones that are diagnostic for some twigs of the mtDNA tree (‘motif search’).
Motif search
The motifs (strings of characteristic/diagnostic mutations shared by descent) of the most basal haplogroups could immediately be read off from the ‘trunks’ (30) or other major branches of the mtDNA tree, whereas the motifs of some rare basal haplogroups or minor lineages at the ‘twig’ level are more difficult to determine, as these haplogroups may be represented by only one complete sequence in the database. To identify the motifs of such lineages, the direct way is to completely sequence more representative samples. An alternative is to screen the polymorphisms of the representative sample allocated to the corresponding (terminal) branch of the tree in other mtDNAs (viz. controls) that appear to be closely related or (near-)matched according to available control-region and/or partial coding-region information. Then the motif of the haplogroup (viz. the shared variants) can be easily distinguished from the private polymorphisms of the representative sample. Such a motif-search strategy can thus serve as a useful auxiliary method of mtDNA genome study for phylogeny reconstruction. In total, 59 control samples, which were selected from our mtDNA database (27,28,77–80) (authors’ unpublished data), were analyzed for short coding-region segments. The amplification and sequencing procedures performed in the present study have been fully described elsewhere (5,8,15), and the resulting sequences were edited with DNASTAR software (DNASTAR Inc.). Mutations were scored relative to the revised Cambridge reference sequence (81). To ensure high quality of our sequencing results, we adopted previously described quality-control measures (5) and some suggestions (82–84) for data generation and editing.
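The logic of the motif search can be illustrated with a small Python sketch. It is a simplification under stated assumptions: every control is taken to have been screened at every candidate site, a variant counts as shared once at least `min_support` controls carry it (a hypothetical parameter), and the two ‘private’ sites 7000 and 12000 as well as the control names are invented placeholders. The shared sites are the G2b transitions listed in the Results.

```python
def motif_search(representative, controls, min_support=1):
    """Split the representative's variants into motif and private variants.

    representative: set of variant sites in the complete sequence.
    controls: iterable of sets, one per closely related control mtDNA.
    Returns (motif, private); both are sets of sites.
    """
    controls = list(controls)
    # Count how many controls carry each variant of the representative.
    support = {site: sum(site in ctrl for ctrl in controls) for site in representative}
    motif = {site for site, n in support.items() if n >= min_support}
    return motif, set(representative) - motif


# G2b transitions from the text plus two hypothetical private sites.
representative = {3593, 4853, 8877, 11151, 16172, 7000, 12000}

# Hypothetical screening results for two related control mtDNAs.
controls = {
    "control_A": {3593, 4853, 8877, 11151, 16172},
    "control_B": {3593, 4853, 8877, 11151, 16172, 9000},
}

motif, private = motif_search(representative, controls.values())
print("motif:", sorted(motif))
print("private:", sorted(private))
```

Raising `min_support` makes the call more conservative, at the cost of missing motif sites when a control happens to carry a back mutation.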
Annotated tree reconstruction
Besides our newly produced data, a total of 1001 reported (near-)complete mtDNA sequences belonging to East Asian haplogroups (1–5,10,12,14,18,19,24,29,30,48,85–88) were reanalyzed in order to further annotate the available East Asian mtDNA phylogeny (5). To reconcile the nomenclature conflicts between different studies, a chronological criterion was employed in the present study: the haplogroup name or definition that was consistent with the previous haplogroup system and was reported first was preferred.
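The chronological criterion can be written down as a simple selection rule. The Python sketch below is illustrative only: the class `Proposal` and the function `preferred_name` are hypothetical, the `order` values encode nothing more than the relative order stated in the Results (M8a and C4 predate S and C2), and the `harmonious` flags are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    name: str
    order: int        # relative order of first report, 1 = earliest (illustrative)
    harmonious: bool  # consistent with the pre-existing system (assumed flag)


def preferred_name(proposals):
    """Pick the earliest-reported name among the harmonious proposals."""
    eligible = [p for p in proposals if p.harmonious]
    if not eligible:
        raise ValueError("no proposal is consistent with the existing system")
    return min(eligible, key=lambda p: p.order).name


# Two conflicts mentioned in the Results; flags and ranks are assumptions.
conflicts = {
    "M8a vs S": [Proposal("M8a", 1, True), Proposal("S", 2, False)],
    "C4 vs C2": [Proposal("C4", 1, True), Proposal("C2", 2, True)],
}

for label, proposals in conflicts.items():
    print(label, "->", preferred_name(proposals))
```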
Several prerequisites needed to be in place before entering the published sequences for tree reconstruction. (i) The 13 sequences from Tanaka et al. (24) that were very likely affected by artificial recombination (authors' unpublished data) were excluded from the analysis. Some of the potential recombinants are even reflected in the summary trees offered by the authors (24); for instance, the seemingly ‘private’ variants 6455, 12406, 12771, 12882, 13759, 13928C and 14002 of ND168 (haplogroup M7a2) in their Figure 1a are in fact faithfully ‘borrowed’ from the haplogroup F1a1b branch. (ii) Several obvious editing errors, such as 16569+G and 16569+GATCACAG, were corrected. (iii) In consideration of the relatively high frequency of coding-region homoplasy in those 672 sequences, as well as the possibility that some recombinants involving only one mutation remained undetected, we adopted a more conservative approach here for the tree reconstruction: any observed novel intermediate node that was introduced by the presence or absence of one or two adjacent mutations, when supported by only one sequence, was not considered. (iv) Errors in the three complete mtDNAs of Zhao et al. (88) were eliminated according to Yao et al. (66).
SUPPLEMENTARY MATERIAL
Supplementary Material is available at HMG online.
ACKNOWLEDGEMENTS
We thank the two anonymous referees for their valuable comments. The work was supported by grants from the Natural Science Foundation of Yunnan Province, Natural Science Foundation of China (NSFC, No. 30021004), the Chinese Academy of Sciences (KSCX2-SW-2010), Fondo Investimenti Ricerca di Base 2001 and Fondazione Cariplo (to A.T.), the National Key Technologies R&D Program of China and the Science and Technology Committee of Yunnan Province (2005NG07).
Conflict of Interest statement. None declared.
REFERENCES
GenBank accession nos for the mtDNA complete genome data: DQ272107–DQ272126.