Main

Genetic mapping using genetic crosses is a powerful approach for identifying genes underlying various traits of parasites7,8,9,10. However, high costs and laborious procedures have prevented widespread application of genetic crosses for genetic studies in P. falciparum. Association mapping using field isolates is an alternative, but this approach requires typing large numbers of genetic markers and parasite isolates because of the complexity of the parasite population structure and highly variable recombination rates11. A genetic map consisting of 800 microsatellite markers has been constructed for P. falciparum12; however, compared with the methods recently developed for large-scale SNP typing13,14,15,16, microsatellite typing is relatively laborious. A genome-wide map with high-density SNP markers will greatly facilitate our ability to identify important genes underlying parasite traits and address important biological questions pertaining to the genome of P. falciparum. Unfortunately, a high-density SNP map is currently available only for chromosome 3 (ref. 11).

We searched for polymorphisms in the P. falciparum genome by amplifying and sequencing 3,539 predicted genes or fragments (19% of each isolate genome; 91% coding sequence) from four cloned isolates (Dd2, HB3, D10 and 7G8) (Table 1 and Supplementary Table 1 online). After aligning the DNA sequences to the 3D7 genome sequence1, we identified 3,918 well-validated SNPs, giving a genome-wide average of one SNP per 5.9 kb DNA (Table 1 and Fig. 1). Of the genes surveyed, approximately half (54.3%) had one or more SNPs (Supplementary Fig. 1 online); the majority (65%) of SNPs were nonsynonymous (nsSNP; Table 1). The predominance of nsSNPs is probably due to codon bias and high frequencies of nonsynonymous sites in the parasite genome17,18. The estimate of the genome-wide average population mutation rate, 4Nμ (Watterson's θ), where N is the effective population size and μ is the per-nucleotide mutation rate, is 5.05 × 10⁻⁴, and the estimate of average pairwise nucleotide diversity (π) is 4.83 × 10⁻⁴ (Table 1), similar to previous reports11. Although these diversity values are lower than those from many model organisms, this could be due to our sequencing of mostly coding regions and our high stringency of SNP calling (see Methods).
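As an illustration of the two diversity statistics quoted above, the following minimal Python sketch computes per-site Watterson's θ and π from a set of aligned sequences using the standard textbook estimators. It is not the pipeline used in this study; the function names and the toy alignment are hypothetical, and the sketch assumes equal-length, gap-free sequences.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns in which more than one base is present."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Per-site Watterson's theta: S / (a_n * L), with a_n = sum_{i=1}^{n-1} 1/i."""
    n, length = len(seqs), len(seqs[0])
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / (a_n * length)


def nucleotide_diversity(seqs):
    """Per-site pi: mean number of pairwise differences divided by length."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)


# Hypothetical toy alignment of five sequences (not data from this study).
toy = [
    "ATGCATGCAT",
    "ATGCATGCAT",
    "ATGAATGCAT",
    "ATGAATGCAT",
    "ATGCATGTAT",
]

print(segregating_sites(toy), watterson_theta(toy), nucleotide_diversity(toy))
```

With only five sequences the harmonic term a_n is small (about 2.08), so θ is sensitive to each additional segregating site; this is one reason stringent SNP calling can lower the estimate.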

Table 1 Summary of sequenced genes and diversity among five P. falciparum isolates
Figure 1: Physical maps showing distribution of polymorphic sites across 14 P. falciparum chromosomes.

Vertical bars represent SNPs (black, nonsynonymous substitutions; red, synonymous substitutions) or microsatellites (blue, under the horizontal lines). Only one nonsynonymous SNP and one synonymous SNP are presented if there was more than one SNP in a gene (noncoding SNPs were grouped with synonymous SNPs). Most of the chromosomal ends (green vertical bars) were excluded because of gene families such as var, rifin and stevor.

We resequenced 45 kb of DNA from 99 worldwide isolates11, covering 183 known SNPs on chromosome 3, 108 of which are common SNPs (minor allele frequency ≥ 0.05). We discovered 185 new SNPs, 29 of which were common SNPs (Supplementary Table 2 online). Because only 22% of the 202 kb sequenced for the five isolates was resequenced for the 99 isolates, we would expect to miss 132 (29/0.22) common SNPs in five isolates. This indicates that our survey of five isolates captures 45% of common SNPs (108/(132 + 108) = 0.45) relative to the worldwide sample of 99 isolates and gives us a frequency of one common SNP per 842 bp (202 kb/240) and a global genome-wide expectation of >27,000 common SNPs, as estimated based on SNPs largely from single-copy nontelomeric genes.
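The extrapolation in this paragraph can be retraced step by step. The short Python sketch below reproduces the arithmetic from the figures given in the text. The genome size is not stated in this paragraph; here it is assumed to be roughly 3,918 SNPs × 5.9 kb per SNP (about 23.1 Mb), inferred from the genome-wide SNP density reported earlier, and the variable names are illustrative only.

```python
def extrapolate_common_snps():
    """Retrace the common-SNP extrapolation using numbers quoted in the text."""
    resequenced_kb = 45    # resequenced in the 99 worldwide isolates
    surveyed_kb = 202      # chromosome 3 sequence surveyed in five isolates
    known_common = 108     # common SNPs already found in the five isolates
    new_common = 29        # new common SNPs found by resequencing

    # Fraction of the surveyed region that was resequenced (about 22%).
    fraction = round(resequenced_kb / surveyed_kb, 2)
    # Common SNPs expected to be missed across the whole 202 kb.
    missed = round(new_common / fraction)
    total_common = known_common + missed
    capture_rate = known_common / total_common
    bp_per_common_snp = surveyed_kb * 1000 / total_common

    # ASSUMPTION: genome size inferred as 3,918 SNPs x 5.9 kb per SNP.
    genome_bp = 3918 * 5900
    genome_wide_common = genome_bp / bp_per_common_snp

    return {
        "fraction": fraction,
        "missed": missed,
        "total_common": total_common,
        "capture_rate": capture_rate,
        "bp_per_common_snp": bp_per_common_snp,
        "genome_wide_common": genome_wide_common,
    }


result = extrapolate_common_snps()
for key, value in result.items():
    print(key, value)
```

The estimate scales linearly with the assumed genome size and treats chromosome 3 as representative of the whole genome, which is why the text qualifies it as based largely on single-copy nontelomeric genes.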

Microsatellites are also abundant in the genome, averaging one polymorphic microsatellite per 1.3 genes (Fig. 1 and Table 1). The true frequency of polymorphic microsatellites throughout the genome would probably be much higher if more noncoding regions were assayed. Combining polymorphic microsatellites and SNPs, our data constitute a map with an average of one polymorphic marker per 3.6 kb for the P. falciparum genome (Fig. 1), providing a powerful tool for genetic studies of the parasite.

SNPs are not distributed evenly across chromosomes; rather, some regions have consecutive genes without any SNPs, and other DNA segments have consecutive genes with multiple SNPs (Fig. 1). The percentage of genes with SNPs varies from chromosome to chromosome, ranging from 47.1% (chromosome 13) to 67.5% (chromosome 7) (Table 1). The number of SNPs per gene differs more than twofold between chromosomes, averaging 0.85–2.08 SNPs per sequenced gene, with large chromosomes having fewer SNPs per gene (Table 1 and Supplementary Table 1). Indeed, excluding chromosome 7, there is a negative correlation between chromosome size and the number of SNPs per gene (Supplementary Fig. 2 online). Comparison of the average θ values from genes at chromosome ends (15% of the sequenced genes from each chromosome end) with genes in the remainder of the chromosome showed significantly higher θ values for genes at chromosome ends (P = 0.0001, Wilcoxon signed rank test; similarly, P = 0.0001 if we compared and tested ten genes from each end). These results suggest that a generally higher level of polymorphism at chromosome ends may contribute to this negative correlation, because these regions take up a relatively larger proportion of the small chromosomes.

With these genome-wide markers, we estimated the number of recombination events using methods described previously11. We detected recombination events at relatively high frequencies (Table 1); they were distributed nonuniformly both within and among chromosomes, clustering in subtelomeric regions, as previously described for a larger sample on chromosome 3 (ref. 11). Understanding the patterns and rates of recombination is of critical importance for genetic studies, particularly association mapping.

Genes encoding surface antigens, cell adhesion molecules and proteins involved in drug interactions are mostly polymorphic (Supplementary Fig. 3 online). The antigen group has a high ratio of nonsynonymous pairwise differences per nonsynonymous site (pN) relative to synonymous pairwise differences per synonymous site (pS) (pN/pS = 5.8), suggestive of balancing, diversifying or partial directional selection. Additionally, estimates of Tajima's D, a measure of the frequency distribution of alleles, across chromosomes also identified some genes that show an excess of diversity indicative of balancing selection, such as eba-175, which has been shown to be under strong balancing selection (Supplementary Table 1)19.
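For readers unfamiliar with Tajima's D, the sketch below implements the standard formula (Tajima, 1989) from the number of sequences, the number of segregating sites and the mean number of pairwise differences. It is an illustrative implementation, not the code used in this study. Returning 0.0 when there are no segregating sites is a convention chosen here (D is undefined in that case), and the function name is hypothetical.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D from sample size n, segregating sites S and mean pairwise differences k.

    Positive values indicate an excess of intermediate-frequency alleles
    (as expected under balancing selection); negative values indicate an
    excess of rare alleles.
    """
    if n < 4:
        # The variance term is zero for n = 3, so D cannot be computed.
        raise ValueError("Tajima's D needs at least four sequences")
    if seg_sites == 0:
        return 0.0  # convention used in this sketch; D is undefined here

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diff - seg_sites / a1) / math.sqrt(variance)


# Hypothetical inputs: five sequences, two segregating sites.
print(tajimas_d(5, 2, 1.0))  # alleles at intermediate frequency
print(tajimas_d(5, 2, 0.8))  # both variants are singletons
```

With a sample of five sequences the statistic has little power for any single gene, so per-gene values are best read as a ranking aid rather than as a formal test.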

Chromosomal regions flanking many var clusters are more polymorphic than the genome-wide average (Supplementary Fig. 4 and Supplementary Table 1 online). In addition to many subtelomeric var genes, three of the four internal var clusters in the 3D7 genome (on chromosomes 7, 8, and 12) are flanked by five or more consecutive polymorphic genes, with some extending 100 kb from the core var cluster, particularly the two clusters on chromosomes 7 and 8 (Supplementary Fig. 4). On one side of the chromosome 7 internal var locus, elevated polymorphism extended over 200 kb to the chloroquine-resistance transporter gene (pfcrt). Indeed, genes flanking the var loci on chromosomes 7 (25 genes) and 8 (14 genes) have significantly higher θ values than the average values for their chromosome (Table 1, P < 0.0001 for chromosome 7 and P < 0.001 for chromosome 8, Wilcoxon signed rank test). The var genes encode a family of variant antigens called PfEMP1 that are important for immune evasion and disease pathogenesis20,21,22 and that may be under strong balancing selection from the host immune response23, thus maintaining more variation than expected under neutrality. Genes flanking var clusters may have correlated evolutionary histories with the var genes, preserving diverse alleles linked to each unique var haplotype. Indeed, a peak of elevated nucleotide diversity surrounding a selected target is one of the signatures of balancing selection24,25,26. Some var clusters, however, do not have obvious elevated polymorphism in the flanking regions, which could be due to the absence of var genes at a specific location in some parasites, or it could be that some var clusters are subject to lesser selective pressures (perhaps expressed less frequently). If this is true, vaccine development based on var genes should probably give emphasis to var genes that are under strong selection.

Signatures of selection can be exploited to identify genes encoding new antigens or drug targets. We searched for genomic regions with consecutive polymorphic genes or peaks of polymorphism (indicative of balancing selection) that may harbor genes encoding antigens. Indeed, approximately 55% of the 83 loci with five or more consecutive polymorphic genes contain genes encoding known antigens (Supplementary Table 3 and Supplementary Fig. 4 online). These results suggest that most of the parasite antigens are under selection from the host immune system. Further investigation of the other 37 loci (Supplementary Table 3) with unknown genes may lead to some new vaccine candidates. Table 2 lists 56 highly polymorphic genes having θ values 2 s.d. higher than the mean θ value for 1,920 genes with one or more SNPs. Although more than half of these genes encode proteins of unknown function, 18% of them (10) are known antigens (Table 2). Two genes encode proteins involved in lipid metabolism, one of which is a known drug target in other organisms27. According to the annotation in PlasmoDB28, 32 of these genes (57%) have one or more predicted transmembrane domains (38%) and/or a signal peptide (36%), whereas only approximately 11% and 31% of the genes in the genome have a predicted signal peptide or transmembrane domain, respectively. The higher proportion of proteins with signal peptides and transmembrane domains suggests potential membrane and/or surface localization that may be recognized by the host immune system.
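The selection rule for highly polymorphic genes (θ more than 2 s.d. above the mean across genes with at least one SNP) can be sketched as follows. This is a minimal illustration, not the authors' script: the gene names and θ values are hypothetical, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, stdev


def highly_polymorphic(theta_by_gene, n_sd=2.0):
    """Return the cutoff (mean + n_sd * s.d.) and the genes whose theta exceeds it.

    ASSUMPTION: the sample standard deviation is used; the source text
    does not say which estimator was applied.
    """
    values = list(theta_by_gene.values())
    cutoff = mean(values) + n_sd * stdev(values)
    flagged = sorted(g for g, t in theta_by_gene.items() if t > cutoff)
    return cutoff, flagged


# Hypothetical per-gene theta values (not data from this study).
toy_theta = {f"gene_{c}": 1.0e-3 for c in "abcdefghi"}
toy_theta["gene_j"] = 10.0e-3

cutoff, flagged = highly_polymorphic(toy_theta)
print(cutoff, flagged)
```

Because θ values across genes are typically right-skewed, a mean-plus-2-s.d. cutoff is conservative and selects only the extreme tail, which matches the small fraction of genes (56 of 1,920) reported here.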

Table 2 Highly polymorphic genes that are potential immune or drug targets

We next expressed the 56 genes in Table 2 plus 52 genes that have five or more SNPs and encode a predicted signal peptide and/or transmembrane domains (Supplementary Table 4 online) using an E. coli cell-free rapid expression system. Expression was verified by protein blot using antibodies to the His tag incorporated at the C terminus of each expressed protein, and the expressed proteins were then probed with pooled human immune sera. Eleven of the 65 expressed proteins were recognized by pooled human immune sera but not by pooled nonimmune sera (Fig. 2); seven of these represented previously unknown antigens that require further evaluation as potential vaccine candidates.

Figure 2: Immunoblots of candidate antigens expressed in vitro.

Separated proteins were probed with (a) pooled human immune sera from affected individuals, (b) pooled human nonimmune sera or (c) antibodies to the His tag. Eleven of the fourteen proteins (all except PFB0105c, PF080012, and PFE1600w) were clearly recognized by human immune sera but not by nonimmune sera. PFB0105c seems to cross-react with nonimmune sera. PF080012 and PFE1600w showed low expression, and no bands were detected with the immune sera, so they act as negative controls. EBA-175, Rh2, AMA-1, and MSP-1 are known antigens (positive controls). Genes PF11-0034, PFI-0170w, PFE1600w, and PFL1358c also have multiple SNPs, but their θ values are not high enough to be included in Table 2. Gene identifications are as in PlasmoDB. Arrowheads indicate specific bands recognized by human immune sera.

This study identifies thousands of well-validated SNPs and polymorphic microsatellites for mapping genes that may be important in drug resistance, parasite development and disease pathogenesis. Developing high-density markers and high-throughput methods for genotyping large numbers of parasites is critical for mapping genes associated with malaria phenotypes, particularly in high-transmission populations where limited linkage disequilibrium exists. A high-throughput array-based genotyping method is being developed for use in the malaria community. The genome-wide data also show that the P. falciparum genome is highly polymorphic, with at least one polymorphic site per 0.5 kb in only five isolates. Additionally, this work shows that a genome-wide survey for polymorphisms and signatures of selection is a valuable approach for identifying antigens, which should lead to the identification of many new antigens as potential vaccine targets. Our study also suggests that different var clusters may be under variable immune selective pressures that should be taken into consideration when designing a var-based vaccine. Further characterization of the proteins encoded by these genes may lead to new vaccines, which are urgently needed to combat this deadly disease.

Methods

Parasites and DNA amplification and sequencing.

DNA sequences of the 3D7 parasite were downloaded from PlasmoDB. Primers for PCR (Supplementary Table 1) and DNA sequencing were designed from predicted ORFs larger than 400 bp (1.5 kb was sequenced for large genes, excluding the well-known gene families var, stevor and rifin) using proprietary primer-selection software (a Visual Basic script). Primers for sequencing both strands of DNA (18–25 bp, with four to seven G/Cs and spaced 400 bp apart) were automatically selected and commercially synthesized. Genomic DNA from cultured parasites Dd2 (Thailand), HB3 (Honduras), 7G8 (Brazil) and D10 (Papua New Guinea) was amplified and sequenced as described previously18. Direct sequencing of PCR products eliminates artificial polymorphism frequently introduced when cloning AT-rich DNA into bacteria.
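The primer criteria stated above (18–25 bp, four to seven G/C bases, primers spaced 400 bp apart) can be expressed as a simple filter. The Python sketch below is a hypothetical illustration of those rules only. The original software was a Visual Basic script whose internals are not described here, so the function names and any behavior beyond the three stated criteria are assumptions.

```python
def passes_primer_rules(primer, min_len=18, max_len=25, min_gc=4, max_gc=7):
    """Check a candidate primer against the length and G/C-count rules in the text."""
    seq = primer.upper()
    gc_count = seq.count("G") + seq.count("C")
    return min_len <= len(seq) <= max_len and min_gc <= gc_count <= max_gc


def tile_starts(region_length, spacing=400):
    """Hypothetical helper: start coordinates for primers placed every `spacing` bp."""
    return list(range(0, region_length, spacing))


# Hypothetical AT-rich candidates (not primers from this study).
print(passes_primer_rules("ATATATGCATATATGCATAT"))  # 20 bp, 4 G/C
print(passes_primer_rules("ATATATATATATATATATAT"))  # 20 bp, 0 G/C
print(tile_starts(1500))
```

Capping the G/C count at a small absolute number, rather than a percentage, suits the extremely AT-rich P. falciparum genome, where conventional 40–60% GC primer rules would reject most candidate sites.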

Sequence alignment and analysis.

A Java package was written to process raw sequence data, including trimming and aligning DNA sequences using Phred/Phrap and Sequencher 4.5 (GeneCodes). The 3D7 genomic sequence and the corresponding annotated coding sequences from PlasmoDB 5.0 (both sets of chromosomal flat files dated 2002 and 2005) provided the gene annotations, including annotation of coding and noncoding regions and gene ontology classifications. The program also mapped alignment files to the 3D7 genome and characterized any variation found in the alignment as SNPs, indels or microsatellites. A set of scripts was then used to calculate summary statistics of diversity (θ, π and Tajima's D values). All SNPs and microsatellites were confirmed by visually inspecting chromatogram traces of all potential polymorphisms. All alignments of indels and microsatellites were manually adjusted to minimize mismatches and size polymorphism. SNPs in repetitive regions were not called, because misalignments may create artificial SNPs. The number of recombination events throughout the genome was estimated using the nonparametric methodology of ref. 29 as described previously11.
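To make the variant-classification step concrete, the following Python sketch labels each differing column of a pairwise alignment as a SNP or an indel. It is a simplified, hypothetical stand-in for the Java package described above: it does not detect microsatellites, does not mask repetitive regions, and assumes that gaps are written as "-" and ambiguous bases as "N".

```python
def classify_columns(ref_aln, sample_aln, gap="-"):
    """Label differing alignment columns as 'SNP' or 'indel'.

    Returns a list of (column_index, kind) tuples. Columns containing an
    ambiguous base (N) are skipped rather than called.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")

    calls = []
    for idx, (r, s) in enumerate(zip(ref_aln.upper(), sample_aln.upper())):
        if r == s or "N" in (r, s):
            continue
        kind = "indel" if gap in (r, s) else "SNP"
        calls.append((idx, kind))
    return calls


# Hypothetical aligned pair (reference first), not data from this study.
print(classify_columns("ATGCA-TT", "ATGTAGT-"))
```

In practice each call from a step like this would still need the visual chromatogram check described in the text, since a column-wise comparison cannot distinguish true variants from base-calling or alignment errors.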

Protein expression and blotting.

Cell-free expression of proteins was performed using a rapid translation system (Roche Diagnostics) according to the manufacturer's instructions. Proteins expressed in 50 μl of an E. coli cell-free expression system were enriched using paramagnetic precharged nickel particles (Promega), separated on 4%–12% polyacrylamide gels, transferred to PVDF membrane and detected using antibodies to the His tag or pooled human antisera from villagers in Mali.

Accession codes.

SNPs have been deposited at NCBI dbSNP database (accession codes 65654288–65658180) and also at PlasmoDB (v5.2).

URLs.

PlasmoDB: http://www.plasmodb.org/. Phred/Phrap: http://www.phrap.org/.

Note: Supplementary information is available on the Nature Genetics website.