Introduction

The origins of anatomically modern humans can be traced back to Africa as early as 200,000 years before present (YBP), as evidenced by many archeological, linguistic and genetic data, the whole of which renders the “Out of Africa” hypothesis a generally accepted scientific precept (Watson et al. 1997; Quintana-Murci et al. 1999; Underhill et al. 2000; Ke et al. 2001; Maca-Meyer et al. 2001). The genetic composition of Africa has been described as extremely diverse (Chen et al. 1995; Ingman et al. 2000; Jorde et al. 2000; Watkins et al. 2001; Salas et al. 2002), in part due to this early rise of mankind, but also due to complex demic expansions both out of and back into the continent (Hammer et al. 1998; Cruciani et al. 2002; Semino et al. 2002; Maca-Meyer et al. 2003). These waves of migrations and undoubtedly back-migrations occurring at various times throughout human history bestow a unique importance upon East and Northeast Africa. Specifically, the Levantine region in the Nile River delta and the Horn of Africa near the Strait of Sorrow have been proposed as major corridors into Arabia and vice versa (Cavalli-Sforza et al. 1994; Kivisilid et al. 1999; Quintana-Murci et al. 1999; Stringer 2000; Underhill et al. 2000; Bar-Yosef 2002; Nebel et al. 2004; Kivisilid et al. 2004; Luis et al. 2004). Furthermore, Arabia represents the strategic multi-directional thruway intersecting three continents (Europe, Asia and Africa) and four major linguistic families (Caucasian, Elamite, Indo-European and Afro-Asiatic) (Renfrew 1996, 2000). The northern ridge of Arabia, also known as the Fertile Crescent, was the birthplace of agriculture some 8,000 YBP. Additionally, the initial domestication of animals associated with the practice of nomadic pastoralism followed sometime later to the east (Zvelebil 1980). Thus, this region, as a focal point of early technological advances of civilization, provided the impetus for major demic diffusions permeating in all directions (Quintana-Murci et al. 2001). In terms of extant populations, this report aims to investigate geographically targeted groups covering an expanse from east sub-Saharan Africa northeastward into northern India with an emphasis on the Near East.

Although Neolithic agriculture arose independently in as many as nine different areas (Diamond and Bellwood 2003), the spread of farming and animal domesticates to Europe (approx. 8,000 YBP), North Africa (approx. 7,000 YBP) and further east into Asia (approx. 6,000 YBP) is generally attributed to Near Eastern origins based on linguistic and archeological evidence (Hassan 2002; Militarev 2002). Considering the recent timescale of these pervasive landmark events in human prehistory, hypervariable genetic markers may offer the finely tuned resolution needed to characterize demic movements stemming from this cradle of civilization. Autosomal short tandem repeat (STR) loci are informative markers for determining short-term reconstructive phylogenies (Bowcock et al. 1994; Jorde et al. 1995; Bosch et al. 2000; Lum et al. 2002), especially among closely related groups of individuals (Rowold and Herrera 2003; Shepard and Herrera 2005). Accordingly, this study employed 15 tetranucleotide STRs to generate allelic frequencies from a total of 885 individuals from eleven contemporary, anthropologically well-defined and geographically targeted populations from sub-Saharan Africa (Kenya, Rwandan Hutus, and Sudan), Southwest Asia and the Levant (Bahrain, Egypt, Jordan, Georgia, Oman, and Yemen) and South Asia (Pakistan and Indian Punjabis). Additionally, eight relevant datasets from previously published works from Cabinda (Angola), Mozambique, Tutsis (Rwanda), Iran, Japan, Taiwan, Belgium, and US (Caucasian dataset) were utilized as reference populations to provide information from another 2,173 individuals that were included in several inter-population assessments (Table 1).

Upon examination of these 15 highly polymorphic loci, we find that analysis of the distribution of genetic variation of these 19 populations reveals a strong overall correlation to both language and geography among groups of populations. Also, phylogenetic examinations show genetic segregation/affinities that roughly follow geographic and linguistic clines. Populations from Arabia exhibit a greater degree of genetic homogeneity compared to groups in adjacent regions. Overall, the genetic landscape paints a picture of geographic barriers to gene flow encapsulating a region centered in Arabia extending from Northeast Africa to Persia. This may suggest a basin of close contact among the human groups within this area. Increasingly differentiated populations are encountered beyond the barriers, south of the Sahara into Africa and further east of Iran into the remainder of Asia. Although geographical obstacles to gene exchange are not a novel concept in population genetics, the collection of human groups in this study provides an opportunity to empirically examine these phenomena. The implications of these findings are discussed within the context of other genetic evidence.

Materials and methods

Population information and sample collection

The eleven populations examined in the present study include the following: sub-Saharan Africa (Kenya, Rwandan Hutus, and Sudan), Southwest Asia and the Levant (Bahrain, Egypt, Jordan, Georgia, Oman, and Yemen) and South Asia (Pakistan and Indian Punjabis) (Table 1). Their geographic locations are indicated in Fig. 1. Data from eight additional worldwide populations were obtained from the literature and employed for comparison purposes in two analyses (Table 1 includes reference database citations). Individuals from each population were identified by biogeographical information gathered and traced back at least two generations. Informed consent was obtained from each individual prior to collection. Each collection was arranged through the community leaders and/or elders of each region and supervised by the same. Sample collections were performed according to the ethical guidelines indicated by Florida International University’s Institutional Review Board (IRB).

Table 1 Description of populations examined
Fig. 1 Biogeographical area studied

DNA isolation, PCR amplification and detection of STRs

All samples from these 11 populations were collected as whole blood in EDTA Vacutainer™ tubes. Genomic DNA was extracted by the standard phenol-chloroform extraction and ethanol precipitation method (Sambrook and Russell 2001). The samples were amplified by PCR using the commercial AmpFlSTR Identifiler kit (Applied Biosystems, Foster City, CA, USA) at the following loci: D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, FGA, and Amelogenin. Amplifications were carried out in a GeneAmp PCR System 9600 thermocycler (Applied Biosystems, Foster City, CA, USA) with the following cycling parameters: 11 min denaturation at 95°C; 28 cycles of 1 min denaturation at 94°C, 1 min primer annealing at 59°C and 1 min primer extension at 72°C; and a final soak for 60 min at 60°C. A portion of each amplified sample was mixed with formamide and GS500 LIZ as an internal size standard, as recommended by the manufacturer (Applied Biosystems, Foster City, CA, USA), and the amplicons were then separated by capillary electrophoresis (CE) on an ABI PRISM 3100 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). GeneScan® 3.7 was employed to determine the fragment sizes and Genotyper® 3.7 NT software was utilized to designate alleles by comparison with the allelic ladder provided by the manufacturer.

Statistical and phylogenetic analysis

Allelic frequencies of the 15 STR loci for 11 populations were calculated by the gene counting method (Li 1976). The Arlequin software package Version 2.000 (Levene 1949; Guo and Thompson 1992; Schneider et al. 2000) was used to assess Hardy–Weinberg equilibrium (HWE) using Fisher’s exact test with the modified Markov-chain Monte Carlo method, as well as to determine Nei’s gene diversity index (GD) (Nei 1987). The following parameters of population genetics interest were examined: matching probability (MP), power of discrimination (PD), polymorphic information content (PIC) and power of exclusion (PE) using PowerStats program Version 1.2 (Jones 1972; Brenner and Morris 1990; Tereba 1999).
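As an illustration of the quantities named above, the following minimal sketch computes allele frequencies by gene counting and the four per-locus parameters from a list of diploid genotypes. It is not the PowerStats or Arlequin implementation. The function names and the toy genotypes are hypothetical, and the formulas used (MP as the sum of squared observed genotype frequencies, PD = 1 - MP, the standard PIC expression, and PE = h^2(1 - 2hH^2) with h and H the observed heterozygote and homozygote proportions) are assumed here as the commonly cited definitions.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(genotypes):
    """Gene counting: each diploid genotype contributes two allele copies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)
    return {allele: n / total for allele, n in counts.items()}


def forensic_parameters(genotypes):
    """Return MP, PD, PIC and PE for one locus from observed genotypes."""
    n = len(genotypes)
    freqs = allele_frequencies(genotypes)

    # Matching probability: sum of squared observed genotype frequencies.
    genotype_counts = Counter(tuple(sorted(g)) for g in genotypes)
    mp = sum((c / n) ** 2 for c in genotype_counts.values())

    # Polymorphic information content:
    # 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2.
    p = list(freqs.values())
    pic = 1 - sum(x ** 2 for x in p) - sum(
        2 * (a ** 2) * (b ** 2) for a, b in combinations(p, 2)
    )

    # Power of exclusion from observed heterozygosity h and homozygosity H,
    # using the assumed form PE = h^2 * (1 - 2 * h * H^2).
    h = sum(1 for a, b in genotypes if a != b) / n
    big_h = 1 - h
    pe = h ** 2 * (1 - 2 * h * big_h ** 2)

    return {"MP": mp, "PD": 1 - mp, "PIC": pic, "PE": pe}


# Hypothetical toy genotypes at a single locus (repeat numbers), not study data.
toy_locus = [(6, 9), (6, 6), (9, 9.3), (6, 9.3), (7, 9), (6, 9)]
toy_freqs = allele_frequencies(toy_locus)
toy_params = forensic_parameters(toy_locus)
```

In this toy example allele 6 is counted 5 times among 12 allele copies, so its gene-counting frequency is 5/12.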

In two analyses, Neighbor-Joining (NJ) phylogeny and Multi-Dimensional Scaling (MDS), datasets from Iranians (Shepard and Herrera 2005) and Rwandan Tutsis (Regueiro 2004) were included in addition to the eleven populations because of their regional geographic relevance. Allelic frequencies of the groups studied were employed to generate the NJ tree based on Fst distances using the PHYLIP 3.52c software (Felsenstein 2002). Bootstrap consensus scores (1,000 replications) were generated by the SEQBOOT and GENDIST options of the PHYLIP software, while the CONSENSE program determined the best-fit tree. MDS analysis was performed using the Statistical Package for the Social Sciences (SPSS) software program to summarize multivariate genetic relationships among the 13 groups.
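To make the notion of a frequency-based genetic distance concrete, the sketch below computes a simple Nei-style G_ST between pairs of populations from per-locus allele frequencies and assembles a symmetric distance matrix of the kind a tree-building or scaling method takes as input. This is a simplified stand-in chosen for illustration. The estimators implemented in the software used in this study may differ, and the function names and toy frequencies are hypothetical.

```python
def gene_diversity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) at one locus."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_gst(pop_a, pop_b):
    """Nei-style G_ST between two populations, pooled over loci.

    Each population is a list of dicts (one per locus) mapping
    allele -> frequency. Both populations are weighted equally.
    """
    within_total = 0.0   # sum over loci of H_S (mean within-population diversity)
    overall_total = 0.0  # sum over loci of H_T (diversity of the pooled pair)
    for locus_a, locus_b in zip(pop_a, pop_b):
        h_s = 0.5 * (gene_diversity(locus_a) + gene_diversity(locus_b))
        alleles = set(locus_a) | set(locus_b)
        pooled = {
            a: 0.5 * (locus_a.get(a, 0.0) + locus_b.get(a, 0.0)) for a in alleles
        }
        within_total += h_s
        overall_total += gene_diversity(pooled)
    if overall_total == 0.0:
        return 0.0  # no variation at any locus
    return (overall_total - within_total) / overall_total


def distance_matrix(populations):
    """Return sorted population names and the matrix of pairwise G_ST values."""
    names = sorted(populations)
    matrix = [
        [pairwise_gst(populations[r], populations[c]) for c in names]
        for r in names
    ]
    return names, matrix


# Hypothetical allele frequencies at two loci, not study data.
toy_populations = {
    "PopA": [{8: 0.5, 9: 0.5}, {12: 0.7, 13: 0.3}],
    "PopB": [{8: 0.5, 9: 0.5}, {12: 0.7, 13: 0.3}],
    "PopC": [{8: 0.1, 9: 0.9}, {12: 0.2, 14: 0.8}],
}
toy_names, toy_matrix = distance_matrix(toy_populations)
```

Populations with identical frequencies (PopA and PopB here) are separated by a distance of zero, whereas two populations fixed for different alleles at every locus would reach the maximum of 1.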

Genetic structuring was analyzed among all 19 populations according to both geographic proximity and linguistic subfamily affiliation through hierarchical analysis of molecular variance (AMOVA) (Excoffier et al. 1992) to examine potential partitioning along these lines on a global scale.

Results

Intra-population STR diversity

Allelic frequency distributions for eleven populations from Africa (Egypt, Hutus from Rwanda, Kenya, and Sudan) and Asia (Bahrain, Georgia, Jordan, Oman, Pakistan, Punjabis from northwest India, and Yemen) are presented for the first time as part of this study and are available as electronic supplementary materials (Tables 1 through 11 at the following URL: http://www.fiu.edu/~herrerar/Frequency_data.htm). In addition, important population genetics parameters are summarized in Table 2 for each group under study. The combined matching probability (CMP) for these eleven populations ranges from 1 in 6.922×10^15 in the Jordanian dataset to 1 in 2.032×10^17 in the Omani dataset. The combined power of exclusion (CPE) ranges from 0.99998973 in the Pakistanis to 0.99999985 in the Hutu from Rwanda. Each of the eleven datasets generates a combined power of discrimination (CPD) value >0.999999999999999. Table 2 lists the loci in each population that do not meet Hardy–Weinberg equilibrium (HWE) expectations when P<0.05 (14 loci out of 165 possible tests). However, after the application of the Bonferroni adjustment (α=0.05/15 or 0.0033), only three loci persist in their departure from HWE: D16S539 and D5S818 in the Pakistani dataset and vWA in the Punjabi collection.
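The arithmetic behind the Bonferroni threshold and the combined parameters can be sketched as follows. The combination rules shown (a product of per-locus matching probabilities, and complements of products for the combined powers) assume independence among loci and are given as the conventional definitions, not as a record of the software's internals. The per-locus values are hypothetical and serve only to show the order of magnitude such products reach with 15 loci.

```python
import math

N_POPULATIONS = 11   # populations typed in this study
N_LOCI = 15          # autosomal STR loci per population
ALPHA = 0.05         # nominal significance level

n_tests = N_POPULATIONS * N_LOCI      # total number of HWE tests
bonferroni_alpha = ALPHA / N_LOCI     # adjusted threshold, about 0.0033
persisting_departures = 3             # departures remaining after adjustment
departure_fraction = persisting_departures / n_tests


def combined_matching_probability(per_locus_mp):
    """CMP: product of per-locus matching probabilities (independent loci)."""
    return math.prod(per_locus_mp)


def combined_power_of_discrimination(per_locus_mp):
    """CPD: 1 minus the combined matching probability."""
    return 1.0 - math.prod(per_locus_mp)


def combined_power_of_exclusion(per_locus_pe):
    """CPE: 1 minus the probability that no locus excludes."""
    return 1.0 - math.prod(1.0 - pe for pe in per_locus_pe)


# Hypothetical per-locus values, chosen only for illustration.
toy_mp = [0.1] * N_LOCI
toy_pe = [0.5] * N_LOCI
toy_cmp = combined_matching_probability(toy_mp)
toy_cpd = combined_power_of_discrimination(toy_mp)
toy_cpe = combined_power_of_exclusion(toy_pe)
```

With 11 populations and 15 loci there are 165 tests, so three persisting departures correspond to roughly 2% of all tests.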

Table 2 Parameters of population genetics interest

Inter-population STR diversity

To ascertain the genetic relationships between the 13 African and Asiatic populations, an NJ tree was generated using Fst distances based upon the allelic frequencies of 15 STR loci. Figure 2 illustrates the NJ tree based on the 11 populations presented in this study for the first time plus Iran and the Tutsis from Rwanda. Within the overall topology, 4 out of the 11 bootstrap values are below 50% incidence. There are roughly three major clusters of populations within the dendrogram (clockwise from the bottom): (1) African/Bahrain/Oman, (2) Jordan/Egypt/Yemen and (3) Georgia/Iran with the two South Asian populations from Pakistan and Punjab. The three sub-Saharan groups (Hutus, Tutsis and Kenyans) cap the end of the African/Arab cluster distantly from the nearest bifurcation from Sudan. Similarly, in the second clade, the groups of Pakistan and Punjab segregate far off from Iran and Georgia. It is interesting to note that 8 of the 13 populations (Sudan, Bahrain, Oman, Iran, Georgia, Jordan, Egypt, and Yemen) fall in close proximity to the trifurcation (bootstrap value 79%) of these three clusters. Oman segregates singly from the main African/Arab cluster initially (69%), followed next by Bahrain (87%). The Sudanese are distant from the main clustering of continental African groups, bifurcating much closer to the former two Arab populations (100%). Within that African cluster, the two Rwandan populations, the Hutu and Tutsi, do not segregate together. Instead, the Hutus bifurcate first (100%) followed by the Tutsi and Kenyan groups at the extreme of the branch. The length of the Tutsi branch is long relative to the other immediate populations, indicating distinct genetic differentiation. Moving in a counter-clockwise direction to the next cluster, Georgia initially separates from the remaining populations (37%) followed by Iran (80%), then more distantly by Pakistan from the Punjabis from India. Lastly, Jordan initially separates from Egypt and Yemen (37%), while the latter two groups segregate from each other with a 47% bootstrap value.

An MDS analysis based on Fst distances was performed using these same 13 populations to examine the phylogenetic relationships based on the 15 STR loci (Fig. 3). The layout of the two-dimensional map in terms of spacing of the populations is consistent with the overall topology of the NJ tree. The proportion of variance accounted for by the corresponding distances of the scaled data is 95.67%. The main differences in this particular analysis from the phylogram are with regard to the proximities of the populations to each other within their respective main clusters. For example, while the sub-Saharan African populations fall in a linear arrangement together to the right side within the upper and lower right quadrants, the Hutu and Tutsi populations from Rwanda are closer in proximity to each other while the Kenyans remain distant. Like the Hutu and Tutsi, the Pakistan and Punjabi groups are in closer proximity than in the NJ analysis. The same eight groups that agglutinated together surrounding the trifurcation of the dendrogram fall into or in close proximity to each other in the upper left quadrant of the MDS plot. However, the arrangement is slightly different; for instance, while the small distance between Sudan and Oman is repeated, Bahrain is farther away and Egypt is now closer to the former two groups. Both Georgia and Jordan fall in the middle of the upper left quadrant with Yemen close to the upper left limit of the graph. Both Iran and Bahrain lie in singular positions within the lower left quadrant just below the boundary that divides the upper from the lower quadrants, with Bahrain closer to the crosshairs of the plot. The same two phylogenetic analyses (NJ and MDS) were performed with all 19 groups including the eight reference populations. The topology of both the NJ tree and MDS plot mirrors the ones based on the 13 populations described above. One notable difference is that Georgia segregated in the same clade as Belgium and US Caucasians instead of within the Asian cluster (data not shown).
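The "proportion of variance accounted for" reported with an MDS solution can be read as a goodness-of-fit measure between the input distances and the distances among points on the fitted map. The sketch below computes one common version of this idea, the squared correlation between the two sets of pairwise distances. It is an illustrative approximation under that assumption and does not reproduce the SPSS procedure. The coordinates and distance matrices are hypothetical.

```python
import math
from itertools import combinations


def squared_correlation(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def proportion_of_variance(input_distances, coords):
    """Fit between input distances and Euclidean distances on the map."""
    pairs = list(combinations(range(len(coords)), 2))
    observed = [input_distances[i][j] for i, j in pairs]
    fitted = [math.dist(coords[i], coords[j]) for i, j in pairs]
    return squared_correlation(observed, fitted)


# Hypothetical two-dimensional configuration of four populations.
toy_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]

# Input distances that the map reproduces exactly.
exact = [[math.dist(a, b) for b in toy_coords] for a in toy_coords]

# Input distances with one pair distorted, so the map fits less well.
distorted = [row[:] for row in exact]
distorted[0][3] = distorted[3][0] = 1.0

perfect_fit = proportion_of_variance(exact, toy_coords)
poorer_fit = proportion_of_variance(distorted, toy_coords)
```

A value close to 1 indicates that the two-dimensional map preserves nearly all of the structure in the input distances, while distortion in even one pair lowers the value.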

Partitioning of populations based on geography and language

Distribution of genetic variance along linguistic and geographic lines among the 19 total populations was investigated using hierarchical AMOVA. Potential linguistic correlations were assessed based on the following classification of language families: Niger-Congo (Cabinda of Angola, Hutu, and Tutsi from Rwanda, Kenya and Mozambique); Afro-Asiatic (Bahrain, Egypt, Jordan, Oman, Sudan and Yemen); Kartvelian (Georgia); Indo-European (Belgium, Iran, Pakistan, Punjabis from India and Caucasians from the US); Japanese (Japan); and Sino-Tibetan (Han Chinese from Taiwan). Possible correlations based on geography were ascertained according to the following five regional groups: sub-Saharan Africa (Cabinda of Angola, Hutu and Tutsi from Rwanda, Kenya, Mozambique, and Sudan); Southwest Asia and the Levant region (Bahrain, Egypt, Georgia, Iran, Jordan, Oman and Yemen); South Asia (Pakistan and Punjabi from India); East Asia (Japan and Taiwan); and Europe (Belgium and US Caucasian). Table 3 lists the results locus-by-locus and in combination. The overall linguistic and geographic analyses based on the 15 loci exhibit significant correlation to genetic structure (P<0.05) among groups of populations. With the exception of locus D3S1358 in the linguistic test, all loci individually reflect that the genetic differences among groups of populations correlate significantly (P<0.05) with linguistic and geographical partitioning. On the other hand, the overall linguistic and geographic analyses based on all loci do not significantly correspond to genetic structure among populations within groups. Only three loci (D21S11, TH01, and D18S51) show significant genetic parallelism to language and five (D8S1179, TH01, D13S317, D18S51, and TPOX) to geography.

Table 3 Significant locus-by-locus and combined loci AMOVA values for 19 populations

Discussion

This study presents novel databases for 15 autosomal STR loci from five previously poorly characterized regions (Bahrain, Georgia, Jordan, Kenya, and Sudan). In addition, new data augment previously published autosomal STR datasets for the following six groups: Egypt and Hutu (two novel loci each) (Tahir et al. 2003; Tofanelli et al. 2003, respectively); Oman and Yemen (six new loci each) (Tahir et al. 2000a and Klintschar et al. 2001, respectively); Punjabi (seven novel loci) (Tahir et al. 2000b) and Pakistan (12 new loci) (Hadi et al. 2004). After the application of the Bonferroni adjustment for number of loci tested (α=0.05/15 or 0.0033), only 2% or 3/165 tests fail to conform to Hardy–Weinberg equilibrium expectations (Table 2). More information regarding the intrapopulation STR diversity of these 11 groups is available within the allelic frequency distributions as electronic supplementary material (Tables 1 through 11 at the following URL: http://www.fiu.edu/~herrerar/Frequency_data.htm).

In the NJ analysis (Fig. 2), the Near Eastern populations, including Sudan, Egypt and the Southwest Asian groups, closely encircle the major trifurcation at the center of the three clusters. This intermediary position and tight clustering of these groups may be indicative of the pivotal role of this region in cross-continental migrations. It is possible that the dendrogram topology reflects the importance of Southwest Asia and the Levantine region as bidirectional crossroads of human migration involving eastern Asia, Europe and Africa. While this notion is corroborated by the distribution of populations in the MDS plot (Fig. 3), here the affinities of the Arabic and Iranian populations are even more pronounced. It is possible that these populations may have experienced a high degree of gene flow not seen in sub-Saharan Africa, East Asia or even South Asia. The phylogenetic analyses do not indicate any affinity between Arabic and sub-Saharan groups (with the noted exception of the Sudanese population) nor between Arabic and South Asian populations. Overall, the topological layout of the tree and MDS map are consistent with major geographical barriers to gene flow, namely the Sahara in Africa, the Dasht-e Kavir and Dasht-e Lut deserts of Iran and the Hindu Kush Mountains on the Afghan–Pakistani border. These obstacles likely have provided an enclosure that allows for gene exchange among populations within its limits while semipermeably encapsulating the region of the Near East.

Fig. 2 Neighbor-Joining (NJ) phylogenetic analyses of 13 populations based on STR allelic frequencies using Fst distances. The GENDIST option of the PHYLIP software created branch distances onto which the corresponding bootstrap values (based on 1,000 replications) were transferred to the corresponding nodes of the NJ tree

Fig. 3 Multi-dimensional scaling (MDS) analyses of 13 populations using Fst distances based on STR allele frequencies. Abbreviations are as follows: EGT Egypt, HUT Hutu (Rwanda), KEN Kenya, TUT Tutsi, BHR Bahrain, GEO Georgia, IRN Iran, JOR Jordan, OMN Oman, PAK Pakistan, PUN Punjabi (India) and YEM Yemen. The proportion of variance accounted for by the corresponding scaled data is 95.67%

An examination of the hierarchical AMOVA results (Table 3) indicates that a majority of the loci exhibit significant genetic variance partitioning along both linguistic (14/15 loci) and geographic (15/15 loci) divisions among groups of populations. On the other hand, at most only one-fifth to one-third of the loci exhibit significant correlations to language and geography, respectively, among populations within groups. It is likely that the allelic distributions at these particular loci provide sufficiently high resolution to detect genetic partitioning along linguistic and geographic lines at the level of populations within groups, which is lost by dilution when combined with the rest of the STR markers in the overall AMOVA assessment. Since genetic differences are generally larger among groups of populations than within groups of populations, the greater number of loci generating significant correlations to linguistic and geographic partitioning comes as no surprise. The results imply that the genetic structure of the populations in this study parallels both regional subdivision and linguistic hierarchical classification.

Egypt, Arabia and Persia form a tri-continental nexus for initial migrational routes of anatomically modern humans out of Africa and into the remainder of Eurasia beginning some 60,000 YBP (Tishkoff et al. 1996; Watson et al. 1997; Quintana-Murci et al. 1999, 2004) and therefore represent the earliest region of the Asian continent to be inhabited by early man. This conduit was also utilized in “Back to Africa Episodes” (Luis et al. 2004). In addition, the Neolithic advancements in agriculture and animal domestication in the Fertile Crescent about 8,000 YBP likely had a pronounced effect on the contemporary genetic landscape, by nurturing an incubation period of close contact among Near Eastern populations prior to demic diffusion out of the area in all directions. We envision that the Sahara desert to the south and west, the Dasht-e Kavir and Dasht-e Lut deserts of northern and eastern Iran, respectively, as well as the Hindu Kush mountains in eastern Afghanistan formed an encirclement of structural barriers to gene flow into and out of the Near East. Since these obstacles do not completely isolate the region, they more likely allowed sporadic, bottleneck-type migrational events (some possibly by coastal routes), thus shaping Arabia and Persia into a basin of genetic homogeneity with narrow passageways of restricted gene flow. In this scenario, it is possible that multiple temporal and spatial bottleneck episodes may account for the decreasing clines of genetic diversity emanating from the Near East eastward into Asia and westward into Europe. Both mtDNA (Quintana-Murci et al. 2004) and Y chromosome studies (Quintana-Murci et al. 2001; Wells et al. 2001; Qamar et al. 2002) suggest the deserts of Persia and the Hindu Kush mountains between the Near East and surrounding territories may have played a key role in limiting gene flow.

Although this study was geared towards a comprehensive characterization of the geographic area from eastern sub-Saharan Africa northeastward into South Asia and the Indian subcontinent, some interesting and specific observations became evident. For instance, within the African clade of the NJ tree the two populations from Rwanda, the Hutu and the Tutsi, distinctly separate from each other. Although this phylogenetic analysis indicates a greater differentiation of the Tutsi population in comparison to the other sub-Saharan groups based on branch length in the NJ tree, the bootstrap value is less than 50%. Since both groups cohabitate with one another in the same region, it is likely that their genetic uniqueness stems from their socio-political separation (Chretein 2003). It is generally believed that Hutu agriculturalists living in loosely organized clans were the first tribe to supplant the indigenous Twa pygmies of Rwanda. The pastoral Tutsis later settled in from the north, eventually imposing a feudalistic system of government which was the impetus for hundreds of years of socially, economically and politically based isolation of these two groups within the same region (July 1992). A maximum likelihood phylogeny in a study based on Y chromosome haplogroup frequencies (Luis et al. 2004) showed that the Hutus segregated closer to a Bantu group of Kenyans than to the Tutsis, who segregated into the same clade with mixed Nilo-Saharan and pygmy groups from Central Africa, thus lending support to the evidence of the genetic uniqueness of the Hutus and Tutsis. Yet, the close positioning of the two populations in the MDS analysis suggests some degree of genetic similarity.

Genetically, the Sudanese lie in an intermediary position between the other sub-Saharan populations and the Afro-Asiatic speaking peoples in the phylogenetic results. The fact that the Sudanese segregate closer to the Arab populations than to the sub-Saharan African groups in both NJ and MDS phylogenetic analyses is supported in the literature by studies involving NRY haplogroups in which the frequency of the E3a-M2 mutation characteristic of the Bantu expansion is close to 0% in the Sudanese (Underhill et al. 2000) and up to 52% in neighboring Kenya (Luis et al. 2004). Bahrain and Oman segregate close to Sudan in both NJ and MDS phylogenetic examinations belying their substantial geographic distance from that African population. This particular arrangement may hint at a possible migrational route and gene flow involving Sudan and the Near East by way of the Nile River waterway and/or the Horn of Africa.

Georgia represents the northernmost border of the Southwest Asian populations examined in this study and displays the longest branch length in the NJ phylogram of that group. The Dasht-e Kavir desert in northern Iran may have played a role in the relative singularity of Georgia in comparison to the populations to the south. When an NJ tree was generated using all 19 populations, Georgia segregated with the Europeans instead of the Asians (data not shown), which may lend support to this idea. Interestingly, the Iranian group segregates within the Asian clade, but distant from the Pakistani and Punjabi populations, which may argue for a barrier to gene flow between these groups. A recent study of mtDNA on Iranian and Indian populations indicated a similar situation of relatively little gene exchange between the two groups (Metspalu et al. 2004). On the other hand, there is evidence of genetic homogeneity involving Iran and populations to the west within the Near East. This is illustrated in our phylogenetic analyses and is mirrored in the literature in a report showing genetic homogeneity between Iran and the East Anatolia region of Turkey (Shepard and Herrera 2005). Overall, the results obtained in this study are more compatible with well-defined physical barriers limiting gene flow within the Near East basin and not just the product of differences in geographical distances between populations. This is evident in the distant phylogenetic relationships of geographically close populations like Iran and Pakistan, as well as Yemen and Kenya, in contrast to the strong genetic affinity of geographically distant groups like Sudan and Oman. These affinities are not merely socio-political in nature since regions like Iran and Pakistan share religion, beliefs and culture.

Conclusion

The genetic information generated from this battery of autosomal STR markers has applications not only to population genetics but also to recent human evolution. The hypervariability of these microsatellite markers generates enough evolutionary signal to clarify genetic relationships of closely related groups. The findings of this study expose a region of genetic homogeneity within the Near East with increasing genetic heterogeneity moving outwards in a southerly, northerly, and easterly direction. The Sahara in Africa, the Dasht-e Kavir and Dasht-e Lut deserts of Iran, and the difficult terrain east of Iran in the Hindu Kush Mountain range appear to be the major geographic bounds of this genetic homogeneity. Substantial geographical confinement would allow for limited gene flow between this region and the outside. Thus, beyond these barriers, we detect a greater degree of genetic differentiation that is consistent with population bottlenecks and more specifically a decrease in genetic diversity in Asia to the east of Persia.