Introduction

Autism (OMIM: %209850) is a complex neurodevelopmental disorder characterized by impairments in reciprocal social interaction, difficulties in verbal and nonverbal communication, stereotyped behaviors and interests, and an onset in the first 3 years of life. Autism belongs to the group of pervasive developmental disorders (PDD), also known as autism spectrum disorders (ASDs), which also include Asperger syndrome and pervasive developmental disorder—not otherwise specified (PDD-NOS). The estimated population prevalence of core autism is around 15–20 in 10 000, with a male/female sex ratio of approximately 4:1. When all ASD subtypes are combined the prevalence is several times higher, reaching 116 in 10 000.1, 2, 3

Several lines of evidence indicate that genetic factors are important in susceptibility to idiopathic autism. Twin studies show a concordance of 60–92% for monozygotic (MZ) twins and 0–10% for dizygotic (DZ) twins, depending on phenotypic definitions, and the sibling recurrence risk is 25–60 times higher than the population prevalence.4 Furthermore, relatives of affected probands show a higher incidence of milder cognitive or behavioral features, consistent with the hypothesis of a ‘spectrum’ of severity.5

Autism spectrum disorders exhibit wide clinical variability and a high degree of genetic heterogeneity. A variety of chromosomal abnormalities are found in a small proportion of affected individuals (6–7%), most frequently in syndromic cases with dysmorphic features and cognitive impairment.6 The autism phenotype is also associated with known genetic conditions such as the Fragile X syndrome and tuberous sclerosis. Recently, rare ASD-causing mutations were reported in a number of genes, including NLGN3, NLGN4,7 NRXN1,8 SHANK39 and NHE9.10

In recent years, the development of DNA microarray technologies has revealed that submicroscopic deletions and duplications of DNA, known as copy number variants (CNVs), may be significant in autism susceptibility.11, 12, 13, 14 Recent surveys identified a higher rate of de novo CNVs in autism pedigrees compared to controls, with the increased rate becoming more exaggerated in singleton than in multiplex families.10, 12, 13 Nevertheless, it remains difficult to interpret the significance of the numerous CNVs identified in ASDs, to distinguish those that influence susceptibility from normal polymorphic variation and to understand how they might interact with other genetic and non-genetic factors.

Although individually rare, highly penetrant abnormalities, such as microdeletions/microduplications or point mutations, may have a significant function in ASDs. It is also likely that genetic susceptibility may also result from the combined action of several common genetic variants. Common variation in several candidate genes has been implicated in autism (MET, CNTNAP2, SLC6A4, RELN, GABRB3),15 but in most cases consistent replication has not been achieved.

Because the strong genetic component in ASDs was clearly demonstrated over a decade ago, a large number of molecular genetic studies have searched for susceptibility genes, following the general approach of a genome-wide linkage scan using affected sibling/relative pair families. The International Molecular Genetic Study of Autism Consortium (IMGSAC) identified the first autism linkage locus on chromosome 7q21–q32 (designated autism susceptibility locus 1, AUTS1) with a multipoint maximum LOD score (MLS) of 2.53 in 87 families.16 This result was confirmed in follow-up studies conducted by the IMGSAC using additional families and markers.17, 18 Another linkage susceptibility locus (AUTS5) was identified by IMGSAC on chromosome 2q24–q33 with an MLS of 3.74 in 152 affected sibling pairs.17

Replication of linkage signals in independent studies has proven difficult for ASDs. To date, 13 whole-genome linkage scan for ASDs have been published,15 and no single locus has been consistently confirmed in all studies. This finding is likely to result from the small effect size attributable to individual genes, as well as from the clinical and genetic complexity of ASDs; differences in ascertainment and inclusion criteria may have been additional factors. However, AUTS1 is one of the few identified loci that has been supported by overlapping positive results in multiple multiplex collections,19, 20 and in meta-analyses.21, 22 Similarly, the chromosome 2q locus is supported by overlapping linkage findings in another two independent genome scans,23, 24 and by homozygosity mapping in consanguineous families.10 The largest genome scan published to date, carried out by the Autism Genome Project (AGP) using Affymetrix 10K single nucleotide polymorphism (SNP) arrays and 1181 multiplex families, also provided some support for both the chromosome 2q and 7q loci within the families of inferred European ancestry.8

Despite the support for linkage on chromosomes 2q and 7q, the candidate genomic intervals remain broad, each spanning approximately 40 Mb and containing approximately 200 known genes. Systematic screening and association studies of several positional candidate genes on chromosomes 2q and 7q have been conducted by the IMGSAC,25, 26, 27, 28, 29 but these studies have not led to the identification of confirmed autism susceptibility variants. Owing to the recent technological advances in high-density SNP genotyping and bioinformatic resources, we focused our efforts on performing a gene-based high-density SNP association study of the autism susceptibility loci on chromosomes 2q and 7q implicated by IMGSAC linkage studies. SNP genotype data were also used to investigate copy number variation within these regions. The genetic architecture of ASDs is likely to be extremely complex, with disease risk determined by both common variants of modest effect, as well as rare variants with a range of effect sizes. The strategy of focusing on linkage regions for fine-mapping studies by high-density association screens will prioritize genes containing penetrant rare variants, which would not be well identified through association analysis. However, we might expect that genes containing such variants also contain more common variants of lesser effect and thus are still natural candidates to follow-up through association studies.

Genotyping was conducted in two stages, based on HapMap Phase I and Phase II data, respectively. In total, 3002 SNPs were genotyped in each region, directly testing 173 genes on chromosome 2 and 270 genes on chromosome 7. The study sample consisted of 126 and 127 affected individuals and their parents, selected from 293 IMGSAC multiplex families based on identity-by-descent (IBD) sharing on chromosomes 2q and 7q, respectively, as well as 188 gender-matched controls. This study design, where the same probands are used for family-based and case–control analysis, should be more robust against the respective weaknesses of the case–control and TDT approaches (such as population structure and segregation distortion, respectively), and extract the maximum information from our sample.30 Moreover, by selecting families showing excess allele sharing in the region of interest, we are likely to increase the frequency of the disease-associated alleles in the case sample, thereby increasing the power of association studies.31 Power calculations were performed over a range of risk allele frequencies and odds ratios (OR), confirming that the strategy of selecting families for increased IBD sharing outperformed a strategy in which families are selected at random, given fixed genotyping resources (see Supplementary Information).

Our study thus represents a deep exploration of SNP and copy number variation within genic regions of the two autism linkage loci on chromosomes 2q and 7q and pinpoints several genes that need further investigation.

Materials and methods

Study populations

The chromosome 2 primary sample included 126 independent autism families, for 371 individuals (119 parent–parent–child trios and 7 single parent–child pairs). The chromosome 7 primary sample included 127 independent autism families (117 parent–parent–child trios and 10 single parent–child pairs). All families were Caucasian (Table 1). The assessment methods and diagnostic criteria used by the IMGSAC have been described in detail previously.17 Diagnosis was based on the Autism Diagnostic Interview—Revised (ADI-R) and the Autism Diagnostic Observation Schedule (ADOS) and clinical evaluation. Karyotypes were obtained on all affected individuals when possible, and gross karyotypic abnormalities were excluded in at least one affected individual per family in 93% of families and in both affected individuals in 83% of families.

Table 1 Description of samples

Trios for the primary sample were selected from the 293 multiplex families in the IMGSAC multiplex collection (using one affected sib per family) based on IBD sharing on chromosomes 2q and 7q, respectively. Calculation of IBD states was based on microsatellite marker data available from our genome scan18 and fine-mapping studies (unpublished data). Ranked Z-scores were calculated for each family using Merlin32 at the linkage peak position (D2S2302-D2S2310 and D7S2430-D7S684 for chromosomes 2 and 7, respectively).

Two main sample collections were used for replication (Table 1): (1) ‘IMGSAC replication’ (IMGSAC-R) sample: 260 parent-affected child trios or pairs and 34 single cases and (2) ‘Northern Dutch’ sample (ND): 96 singleton families from the north of the Netherlands, including 82 parent–parent–child trios and 14 parent–child pairs. Both replication sample collections fulfilled diagnostic criteria for Case ‘Type 1’ or ‘Type 2’ as defined by IMGSAC17 (meet ADI-R criteria or one point below threshold on one behavioral domain, meet ADOS/ADOS-G criteria for autism or PDD, performance IQ>35). An extended Northern Dutch sample (ND-all; Table 1) was available, including 108 cases that did not meet stringent criteria for one of the following reasons: (1) met ADI-R criteria but failed to meet ADOS criteria or did not undergo ADOS evaluation, (2) met ADI-R and ADOS criteria but had an IQ score <35, (3) did not meet full criteria for ASD on the ADI-R.

The most significant SNPs from the chromosome 2 locus were also tested in a collection of 358 multiplex families (‘Mount Sinai’ sample), which have been previously described.23, 33 Similarly, three SNPs from two of the most strongly associated genes in the case–control and family-based analysis on chromosome 7 were genotyped in 62 Caucasian families selected for IBD sharing from a sample of 222 families showing linkage to the same region of chromosome 719 (‘University of Washington’ sample).

Controls used in the primary experiment included 188 DNA samples from UK random blood donors from the ECACC HRC panels,34 sex-matched with the autism case sample. The additional set of 180 controls genotyped in the replication phase included 92 DNAs from ECACC HRC panels, 41 random donors from the UK and 47 random donors from Italy.

The study was reviewed by the relevant local ethics committees.

Genotyping

Single nucleotide polymorphisms for the primary analysis were genotyped using the GoldenGate assay (Illumina, San Diego, CA, USA) on an Illumina BeadStation according to the manufacturer's instructions. BeadArrays were scanned using the BeadArray Reader at 532 and 647 nm. BeadStudio genotyping module (version 3.2.23) was used to generate genotypes.

Genotyping was conducted in two parallel stages for both chromosomal loci. A total of 3072 SNPs were genotyped in each stage using two custom 1536-plex Illumina arrays, one for each chromosome. The regions of interest ranged from 94.246 to 136.661 Mb on chromosome 7 and from 152.305 to 191.605 Mb on chromosome 2 (NCBI Build 36). These intervals were defined using the approximate 1-LOD drop of the linkage peaks on the two chromosomes, based on IMGSAC microsatellite marker data.18

In the first stage of this study, we evaluated the patterns of linkage disequilibrium (LD) and the distribution of haplotype blocks in the CEU genotype data from the HapMap project release 13 (HapMap Phase I data). Genic regions were defined by NCBI Build 34, by merging all RefSeq and UCSC Known Genes, including all exonic, intronic and 3′ UTR sequences, as well as 5 kb upstream of the 5′ end. A total of 1496 tag SNPs on both chromosome 2q and 7q were identified using HaploView35 and the Gabriel algorithm for block definition from LD blocks overlapping all genic regions.

In the second stage of genotyping, we took advantage of the higher-density HapMap Phase II data to better represent genetic variation in regions of lower LD not previously captured by the HapMap Phase I data. We also used the latest genome annotation (NCBI Build 36) to investigate novel genes and ensure comprehensive coverage of all intragenic and putative regulatory regions on both chromosomes. We identified ‘non-genic’ evolutionary conserved regions from PhastCons elements.36 We downloaded SNP genotype data from the CEU population from HapMap release 22, and selected all SNPs in all genic regions and in the top 5% of non-genic PhastCons elements. We also selected all nonsynonymous SNPs with minor allele frequency (MAF) 0.05. We then used the Tagger program from HaploView35 (version 4) to select a second set of 1516 tag SNPs for each chromosomal region. Parameters used for Tagger were r20.75 (chromosome 2) and r20.63 (chromosome 7), minimum MAF of 5%, aggressive tagging and force including SNPs already genotyped in stage 1. We estimated that our two sets of SNPs were able to tag 96 and 85% of intragenic HapMap SNP variation (MAF>0.05) with r2>0.8 on chromosomes 2 and 7, respectively.

Genotypes for 212 SNPs (99 on chromosome 2 and 113 on chromosome 7), previously generated by the AGP using the Affymetrix 10K version 2 SNP array,8 were available on the IMGSAC family sample and were also included in the family-based association analysis.

A total of 50 genome-wide unlinked SNPs were genotyped for detection of population stratification,37 and 10 chromosome X SNP were also included to estimate levels of mistyping. In addition, for regions of high LD, where tagging SNPs captured the most genetic variation, extra SNPs were chosen in case of genotyping failure.

Replication SNP genotyping

Single nucleotide polymorphisms for replication were genotyped using a combination of the Mass Extend iPLEX Gold (Sequenom, San Diego, CA, USA) and TaqMan platforms. A 100% genotyping concordance was observed for two replicate DNA samples genotyped in each experiment. Twenty-five genome-wide SNPs were also genotyped in the IMGSAC-R sample to test for population stratification.

Statistical analysis

Association analysis

We evaluated evidence of association using both ‘frequentist’ and Bayesian statistical approaches.

Primary association analysis of the 5880 SNPs (including the 212 SNPs available from the AGP linkage study8) successfully genotyped in the IMGSAC data set at the two loci was carried out using the PLINK package.38 To extend the amount of information captured by single-marker tests, an additional set of two-marker haplotype tags was devised using the ‘aggressive’ option of the Tagger program39 implemented in HaploView.35 In total, 3526 tests (2959 single-marker tests and 567 haplotype tags) were performed for the chromosome 2 study, and 3380 tests (2921 single-marker tests and 459 haplotype tags) for the chromosome 7 study.

Standard TDT from PLINK was used for family-based analysis, and the Cochran–Armitage trend test (1 degree of freedom) for the case–control analysis. Haplotype-based tests were calculated using PLINK.

Bayesian logistic regression analysis was performed using the GENEBPM algorithm,40, 41 again using both a case–control and family-based approach (see Supplementary Methods). The logistic regression model allowed for additive and dominance effects of unobserved causal variants, a main effect of gender as well as for parent-of-origin effects in the family-based analysis. GENEBPM analyses were performed using a sliding window of five SNPs across each chromosomal region. For comparison with frequentist single-SNP analyses, the GENEBPM algorithm was also applied to each SNP in turn (that is, single-SNP ‘haplotypes’).

Replication analysis

Association analysis of the IMGSAC-R and ND replication data sets was carried out using the UNPHASED package,42 given the presence of a higher proportion of families with missing parents (24%) (Table 1). UNPHASED implements maximum-likelihood-based association analysis for nuclear families and unrelated subjects allowing for missing genotypes and uncertain haplotype phase. In the presence of missing data it has only minor loss of robustness to population stratification and is more powerful than standard TDT.42

Analysis of the combined primary and replication cohorts was also carried out using UNPHASED, again using both a case–control approach and a family-based approach. Only the IMGSAC and IMGSAC-R data sets were combined for the population-based meta-analysis, because appropriate controls were not available for the ND population.

Copy number variation

We used transmission patterns of SNP genotypes within parent–offspring families to detect Mendelian errors consistent with the presence of a deletion. In addition, the clustering of all SNP genotypes was visually examined to identify abnormal clustering patterns or outlying samples that might point to CNVs associated with the autism phenotype. Sequencing was carried out to confirm the presence of microdeletions, CNVs or secondary SNPs.

After exclusion of whole-genome amplified samples, data from both GoldenGate arrays were combined for each region, no-calls were deleted, and run on QuantiSNP version 1.0.43 CNV validation and screening was carried out by multiplex PCR and quantitative multiplex PCR of short fluorescent fragments (QMPSF).44 Positive results were confirmed in a second independent QMPSF assay.

The distal breakpoint of the deletion detected in family 15-0084 was better defined by quantitative PCR (qPCR) of DOCK4 exons 52, 37, 31, 14 and 7, with GAPDH as a reference.

Additional information is available as Supplementary Methods.

Results

Genotyping

A total of 6004 SNPs—3002 in each chromosome region—were genotyped using the Illumina GoldenGate technology. After quality control procedures, we excluded 336 markers for one or more of the following reasons: MAF<0.05, more than 1 Mendelian error, genotyping rate <90%, poor clustering and deviations from Hardy–Weinberg equilibrium (P<0.001) in the control population.

For the 5668 (94%) SNPs that passed quality control, the genotyping efficiency exceeded 99.7%, with an estimated error rate from duplicate SNPs and from heterozygote calls of X chromosome SNPs in males in the order of 2–5 × 10−4. In summary, 2860 SNPs from the chromosome 2q23.3–q32.3 region were successfully genotyped in 559 DNA samples including 126 affected individuals, 245 parents and 188 gender-matched controls from the ECACC collection; 2808 SNPs from the chromosome 7q21.3–q33 region were successfully genotyped in 559 DNA samples including 127 affected individuals, 244 parents and 188 ECACC gender-matched controls. In addition, our family-based analysis included genotypes from 212 SNPs (99 on chromosome 2 and 113 on chromosome 7), which were generated by the AGP using the Affymetrix 10K version 2 SNP array.8

There was no significant difference in the pattern of LD between our sample and the HapMap CEU sample, indicating that the LD structure in the HapMap CEU data can be readily applied to our autism sample. SNPs were selected to capture efficiently the large majority of the currently known variation in all intragenic regions and highly conserved non-genic elements (see Supplementary Methods).

Population stratification

The presence of stratification in a population-based association study that is not suitably accounted for in case–control analysis can lead to an increase in the false-positive error rate. Furthermore, haplotype analyses in family-based association studies are not robust to population stratification if random mating is assumed among parents in the haplotype estimation step.

We tested for population structure in our primary IMGSAC sample using Structure45, 46 software, and testing 50 unlinked genome-wide SNPs. Comparing the fit of the admixture model for K=1, 2 and 3 strata, we found strongest support for a model of no stratification (K=1) in both of the following groups of individuals: (1) probands, controls and HapMap CEU founders; and (2) parents and HapMap CEU founders. Similarly, no evidence of stratification was detected in the combined IMGSAC primary and IMGSAC-R sample, using 25 unlinked genome-wide SNP markers. These results reassure us that no strong population stratification is present in our IMGSAC primary and IMGSAC-R sample.

Association analysis

The results of the case–control (Cochran–Armitage trend test) and family-based analysis (TDT) are shown in Figure 1 and summarized in Table 2.

Figure 1
figure 1

Graphical representation of chromosome 2 and 7 association results. −Log10 P-values are plotted against the chromosome position. (a) P-values obtained for single markers (Cochran–Armitage trend test) and 2-SNP haplotype case–control association (PLINK). (b) P-values for single-marker TDT and 2-SNP haplotype TDT.

Table 2 Summary of primary association results

Chromosome 2 association results

Three SNPs in the NOSTRIN gene provided the strongest association in the case–control analysis (rs7583629, P=3.2 × 10−5; rs829957, P=9.0 × 10−5; rs482435, P=1.4 × 10−4), followed by rs1020626 (P=3.8 × 10−4) in the FAM130A2 gene.

For the TDT analysis the strongest results came from SNPs in the ZNF533 gene (rs11885327, P=8.0 × 10−4; rs1964081, P=1.4 × 10−3), and an SNP in the UPP2 gene (rs6709528, P=8.0 × 10−4).

Single-marker logistic regression analysis provided a similar ranking of results. In the case–control analysis the most strongly associated SNP rs7583629 in NOSTRIN provided a log10 Bayes factor (logBF) of 2.9, whereas in the family-based analysis the top signal was for rs1139 in the ZNF533 gene (logBF=1.7). GENEBPM multimarker analysis using 5-SNP sliding windows (Supplementary Figure S1) showed increased evidence in favor of association for the NOSTRIN locus (logBF=3.2) in the case–control analysis, but did not identify additional interesting signals. Family-based multimarker analysis revealed an additional association signal with a haplotype spanning 75 kb in the METTL8 gene (logBF=2.3).

Chromosome 7 association results

The strongest signal for the case–control (trend test) analysis was from IMMP2L (rs12537269, P=1.2 × 10−4; rs1528039, 6.3 × 10−4) and just upstream of SMO (rs6962740, P=3.4 × 10−4). The TDT test implicated Plexin A4 (PLXNA4; rs4731863, P=1.0 × 10−4) and cut-like homeobox 1 isoform b (CUX1; rs875659, P=2.0 × 10−4).

Single-marker logistic regression analysis provided a similar ranking of results in the case–control analysis with rs12537269 (logBF=2.9) in IMMP2L showing the most significance. In the family-based analysis, the most significant result was seen for rs4730037 in LHFPL3 (logBF=2.1), closely followed by rs4731863 in PLXNA4 (logBF=2.0). Moreover, GENEBPM revealed a parent-of-origin effect in the IMMP2L locus, with increased risk for causal variants inherited from the father compared to those inherited from the mother. For this reason we investigated SNPs in IMMP2L by parent-specific TDT, which revealed a P-value of 0.01 for rs2030781, with a transmitted/untransmitted allele ratio of 31:14 for paternal transmissions (Table 2).

GENEBPM multimarker analysis using 5-SNP sliding windows showed increased evidence of association for the IMMP2L locus in the case–control analysis (logBF=2.9) and for PLXNA4 (logBF=2.9) in the family-based analysis, but did not identify additional interesting signals (Supplementary Figure S1).

Analysis of the LD landscape across the AUTS1 region, using both HapMap (CEU) and data from the 127 probands used in our primary sample, indicated that the six associated SNPs in IMMP2L (Table 2) are all within a single block of LD, and thus likely to be indexing the same effect. In contrast, the modest association seen in the first intron of the neighboring DOCK4 gene was in a separate block of LD.

Replication

We attempted replication of 56 SNPs (28 on each chromosome) that attained the most significant association results in primary case–control and TDT analyses (Table 3; Supplementary Table S1). The replication population consisted of the IMGSAC-R and the ND collections, including 390 affected individuals (see Table 1; Materials and methods for a description of samples). Family-based analysis of the replication sample showed significant overtransmission of the common allele of SNP rs2217262 in the DOCK4 gene (P=9.2 × 10−4, OR=2.28, confidence interval 1.37–3.77) (Table 3; Supplementary Table S1). This result remains significant after Bonferroni correction for multiple testing (28 SNPs tested on chromosome 7, P=0.026). The trend toward association of rs2217262 (P=0.029) was also seen in the extended ND sample, which included additional subjects fulfilling broader diagnostic criteria (ND-all, 204 affected subjects; Table 1).

Table 3 Family-based analysis of replication samples using UNPHASED

The remaining SNPs did not show significant replication after correction for multiple testing, and no parent-of-origin effects were seen for rs2030781.

Finally, the 56 SNPs selected for replication were investigated in the combined primary and replication data sets; only 7 SNPs attained uncorrected significance of P<0.001 (Table 4). The DOCK4 SNP rs2217262 reached a nominal significance of P=5.23 × 10−5 in the family-based analysis of all cohorts (IMGSAC primary, IMGSAC-R and ND). In the case–control analysis of the combined IMGSAC collections (421 cases and 368 controls), rs12537269 in IMMP2L achieved the most significant result (P=7.3 × 10−5). Additional loci retaining association evidence in case–control meta-analysis were ZNF533 on chromosome 2, and TSPAN12, FEZF1 and SLC13A1 on chromosome 7.

Table 4 Combined analysis of primary and replication samples

Several SNPs in the most interesting genes from the primary analysis were also tested in two additional family collections, which had previously shown evidence of linkage to the chromosome 2q and 7q loci19, 23, 33 (Supplementary Table S1). Five SNPs in NOSTRIN, ZNF533 and OSBPL6 were tested in a sample of 358 multiplex families (‘Mount Sinai’ cohort),23, 33 but no significant results were obtained. Of the 28, 3 AUTS1 replication SNPs in IMMP2L and CUX1 were genotyped in 62 Caucasian families selected for IBD sharing from 222 families showing linkage to the same region of chromosome 7 (‘University of Washington’ sample),19 again with no evidence for association.

Copy number variation

A Mendelian error in one family for SNP rs7585982 pinpointed a potentially interesting deletion in the UPP2 gene on chromosome 2. The deletion boundaries were defined by sequence analysis of additional SNPs flanking rs7585982. Using long-range PCR followed by sequencing, we refined the deletion to 5897 bp of the UPP2 gene (158 681 612–158 687 508 bp; UCSC Build 36), removing two coding exons (exons 6 and 7) and predicted to cause a frameshift leading to a premature termination codon (Supplementary Figure S2A). This deletion was not present in the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation/), suggesting it could be an autism-specific CNV. We screened the same sample used for the SNP association experiment (126 cases and 188 controls) for the presence of this deletion using multiplex PCR (Supplementary Figure S2B). The frequency of the deletion was not significantly different between cases and controls (1.6 and 3.2%, respectively, P=0.2). To investigate if the deletion segregates with the ASD phenotype, we also screened 265 sib-pair families from the IMGSAC collection, including relatives of the 126 cases. Of these, we found 30 families with a parent carrying the deleted allele, and in only 13 families was it transmitted to affected children (in 5 families to both affected siblings and in 8 families to a single affected individual). These results suggest that the UPP2 deletion is not involved in autism susceptibility. The coding sequence of UPP2 was also sequenced in 47 unrelated subjects, including 12 probands carrying the deletion of exons 6 and 7; no novel coding variants were identified, except one silent change in exon 4 in only one individual.

By combining data from both SNP arrays for each candidate region, a sufficient SNP density was achieved to carry out copy number analysis on these samples using QuantiSNP.43 We detected 17 CNVs in seven regions of chromosome 7 and 6 CNVs in five regions of chromosome 2 (Supplementary Table S2). For the chromosome 7 analysis, an 800 kb duplication was detected in family 13-3023 that was transmitted from father to proband (Supplementary Figure S3). This duplication includes two genes: IMMP2L and DOCK4. Another duplication overlapping EMID2 and RABL5 was detected in three families where it was transmitted from mother to proband, whereas a smaller duplication containing only EMID2 was detected in a father, but not transmitted, and in one control. A third CNV in EXOC4 was detected as a nontransmitted loss in a father and as a gain (four copies) in a control.

On chromosome 2, five duplications and one deletion were detected in parents and a single control, but never transmitted to an affected child (Supplementary Table S2).

Most of the identified CNVs are well represented in the DGV, suggesting that they do not have a major function in autism susceptibility. However, the duplication involving IMMP2L and DOCK4 warranted further analysis, as it involved two adjacent genes showing possible SNP association with autism. Therefore we developed a QMPSF assay able to simultaneously test CNVs in exons 2, 3 and 6 of IMMP2L, exon 4 of LRRN3 and the last exon of DOCK4 (number 52). We validated the duplication in family 13-3023, identified by QuantiSNP, and verified that it is transmitted from the father to the affected son, but it was not transmitted to the other affected sib or to an unaffected sib. Screening of 475 UK controls and 285 IMGSAC multiplex families with 487 affected individuals was then carried out using the QMPSF assay, to check if CNVs in these genic regions segregated with the autism phenotype in families and/or have a higher frequency in cases than controls. We identified six additional deletions of different length, of which some were transmitted. One deletion disrupted exons 2 and 3 of IMMP2L and the last exon of DOCK4, and was transmitted from the mother to both affected sons, as well as to a daughter, who did not have an ASD. Both the carrier mother and daughter were reported to have dyslexia. qPCR indicated that the deletion distal breakpoint is located between exons 31 and 14 of DOCK4 (Supplementary Figure S4). Two smaller deletions were transmitted from the parent to only one of their affected children, one was found only in the father but not transmitted and the other two were found in controls. The relative length and position of the CNVs identified are depicted in Figure 2.

Figure 2
figure 2

Summary of IMMP2L and DOCK4 copy number variants (CNVs). Fragments tested by QMSPF are shown as red bars at the top. CNVs from the Database of Genomic Variants (DGV) are shown as orange bars. Deletions and duplications identified in affected individuals and in parents or controls are depicted at the bottom. Dashed and continuous lines indicate the maximum and minimum length of the CNVs, respectively. The distal breakpoint of the deletion in pedigree 15-0084 was defined by qPCR. The distal breakpoint of the duplication in pedigree 13-3023 was not defined precisely.

Discussion

Several linkage studies have suggested that chromosomes 2q and 7q may harbor one or more genes contributing to the risk for developing an ASD. Here, we have presented a comprehensive high-density SNP genotyping, association and CNV study covering the 2q23.3–q32.3 and 7q21.3–q33 chromosome regions. We have tested more than 3000 SNPs in each region, covering all known genes, as well as in highly conserved non-genic sequences.

The complementary case–control and family-based approach taken in our study allowed us to extract the maximum information from our sample, taking into consideration the advantages and disadvantages of the two different approaches. Case–control studies are more powerful compared to family-based approaches, but are sensitive to the presence of population stratification. Structure analysis using 50 genome-wide SNPs did not reveal strong population stratification, although we cannot exclude that undetected low levels may be present. Family-based approaches are more robust to confounding by population stratification and in addition they enable testing for parent-of-origin effects.

Although the strongest signals identified by the two approaches did not coincide, comparison of the results led us to pinpoint the most interesting loci supported by both methods, albeit with different strength. In addition, consistency of the results obtained by frequentist and Bayesian approaches suggested that our strongest signals are independent of the analysis method.

Primary association analysis of the chromosome 2 region identified the most interesting results in NOSTRIN, UPP2 and ZNF533. NOSTRIN encodes the nitric oxide synthase trafficker. Interestingly, the nitric oxide signaling pathway has been recently shown to be overrepresented in genes disrupted by CNVs in schizophrenia.47 However, the NOSTRIN association was stronger in the case–control analysis with only minor support from the TDT, and it was not confirmed in the replication sample or in the combined meta-analysis, suggesting that it might represent a false-positive result.

Similarly, the ZNF533 association was not replicated, however rs7590028 remained one of the strongest signals in case–control combined analysis of IMGSAC samples. ZNF533 encodes a protein containing four matrin-type zinc fingers and is highly conserved in evolution. Given its putative nuclear location, it is thought to act as a repressor of transcription, although no specific targets are currently known. ZNF533 is widely expressed in adult tissues, including brain. Expression of all isoforms in fetal brain was confirmed by reverse transcriptase–PCR (data not shown). Deletions including ZNF533 have been described in several patients with a neurological phenotype including mental retardation,48, 49 and other zinc-finger genes have also been implicated in mental retardation cases.50, 51, 52 The zinc-finger gene ZNF804A was recently identified as the strongest result in a genome-wide association study of schizophrenia and bipolar disorder,53 suggesting that they may act as transcription regulators in a wide range of human cognitive processes.

On chromosome 7, the most significant association result from the primary cohort was in the IMMP2L gene. Although SNPs in this gene failed to replicate in independent samples, the IMMP2L intronic SNP rs12537269 achieved the strongest result in the case–control meta-analysis of the IMGSAC sample (P=7.3 × 10−5). This gene encodes an inner mitochondrial membrane protease-like protein and is a plausible candidate for autism, because it was previously reported to be disrupted in an individual with Tourette syndrome, a complex neuropsychiatric disorder showing phenotypic overlap with ASDs.54 Moreover, IMMP2L contains a neuronal leucine-rich repeat gene (LRRN3) nested within its large third intron. The expression profile of LRRN3 also makes it an interesting candidate gene for autism, as it is most highly expressed in fetal brain. Studies in Drosophila demonstrate that many members of the LRR family provide an essential role in target recognition, axonal pathfinding and cell differentiation during neural development,55 and murine studies suggest these LRR proteins could have similar functions in mammalian neural development.56

The only SNP that achieved significant replication, after Bonferroni correction for multiple testing, is rs2217262 in the neighboring gene DOCK4, also a good autism candidate. This gene encodes a protein that activates Rac GTPase and is often deleted during tumor progression.57 A recent study in rats indicates that DOCK4 is predominantly expressed in the hippocampus as well as in the lung.58 This study further demonstrated that in cultured hippocampal neurons, DOCK4 is upregulated at the same time as dendrites start growing, and that knockdown of this gene by RNA interference results in impaired dendritic morphogenesis.

The association result for rs2217262 indicates that the common allele in the population is associated with increased risk for autism, or the minor allele is a ‘protective’ variant. It has been shown that in presence of missing data, SNPs with a low MAF may show a bias in TDT, resulting in artificial overtransmission of the common allele.59 This problem is not likely to apply to rs2217262, as this association was supported also by case–control analysis.

Although only the rs2217262 association was confirmed by replication analysis, suggesting that the other results may represent false positives, this polymorphism (with MAF only about 5%) would not alone account for the linkage signal seen at AUTS1 in the IMGSAC sample. It is thus possible that multiple loci might contribute to the overall linkage seen for this region, and that the other significant SNPs from primary analysis may in reality be true signals but with lower OR, which our replication study was underpowered to detect. We do recognize that several limitations may have affected our replication sample. The primary sample was composed of trios selected from multiplex families based on IBD sharing, thereby more likely to be enriched for susceptibility alleles. By contrast, the replication population was a more heterogeneous sample, not preselected on linkage, and was mostly composed of singleton families. Power calculation suggested that our replication sample (IMGSAC-R and ND) should give us sufficient power to replicate the most significant primary results. However, the well-known ‘winners curse’ theory also suggests that the effect sizes from the initial study may have been overestimated, thus requiring a much larger sample for replication. We did not detect presence of structure in the combined IMGSAC primary and IMGSAC-R samples, but it is possible that heterogenity may be present among the different samples used in this study (ND, Mount Sinai and University of Washington). This could have also contributed to the lack of replication, as could have gene–environment interactions, when different environmental exposures are present between population samples.

De novo and/or inherited CNVs are emerging as important causes of ASDs and other complex disorders.8, 11, 12, 13 Hence we exploited our dense SNP genotyping data to mine for structural variants. The most interesting discovery is the occurrence of deletions and duplications in four independent families in the IMMP2L/DOCK4 locus, given the coincident SNP association also seen for these genes. A maternal deletion was transmitted to both affected sons and the unaffected daughter in family 15-0084. In all other instances (two deletions and one duplication) the second affected sib did not inherit the CNV. Interestingly, the maternally segregating deletion extends to the 3′ end of the DOCK4 gene, whereas the non-segregating deletions or those identified in controls and in the DGV were limited to IMMP2L. Taken together, these data seem to suggest that a copy number loss of DOCK4 may influence susceptibility to ASDs, whereas duplications may not be damaging. The effect of DOCK4 deletions might be less penetrant in women because the mother and the unaffected daughter also carried the deletion. Larger studies will be needed to confirm this hypothesis.

The predominantly gene-based nature of our study represents a possible limitation, as we may have missed susceptibility alleles in intergenic regions. Recent findings from the ENCODE Consortium emphasize the importance of looking at noncoding sequence, as several functional elements in the genome seem to be in these regions.60 We attempted to minimize this limitation by including several SNPs in non-genic evolutionary conserved elements.

Our study also suggests that no common variants of large effect size are present within genic regions at AUTS1 and AUTS5 and highlights the importance of very large sample sizes for identification of robust associations and rare CNVs with sufficient power for statistical significance. Evidence from recent genome-wide association studies for various disorders clearly shows that effect sizes for loci contributing to complex traits are generally lower than those predicted a few years ago.61 Several whole-genome association and CNV studies for autism are currently in progress by large consortia, and it will be interesting to see if any of the genes highlighted by this study are also identified by these extensive studies.

It is possible that rare variants, both point mutations and CNVs, may account for a larger fraction of the overall genetic risk in complex psychiatric disorders than previously assumed. The present study was not designed to assess the contribution of rare sequence variants and our results do not preclude that the chromosome 2q and 7q linkage regions may harbor rare variation showing allelic heterogeneity across families, which may require resequencing to uncover.

The inconclusive findings identified with this study reflect the status of the field of autism genetics and suggest that classical approaches such as linkage and association analyses alone may not be sufficient to deal with the genetic and phenotypic heterogeneity seen in autism. One recent study of note used homozygosity mapping to uncover a number of large homozygous deletions in consanguineous pedigrees, highlighting the utility of this approach for heterogeneous disorders like autism.10 Another successful study found linkage to 15q13.3–q14 in a subset of families with IQ 70, suggesting that the use of informative subphenotypes to define homogeneous sets of ASD families could be very important in detecting susceptibility loci involved in autism.62 Finally, another report indicated that level of somatic CNVs between MZ twins may be higher than expected.63 If confirmed, this finding could be a powerful tool for identification of autism susceptibility loci in MZ twins with a discordant phenotype. We believe a combination of these (and other) novel approaches, together with traditional methods will be required to uncover all the genes and biological pathways leading to autism.

In summary, the present high-density SNP association and CNV screen have provided evidence that variants in the IMMP2L/DOCK4 locus on chromosome 7 and in ZNF533 on chromosome 2 may increase susceptibility to ASDs. Association of the common allele of SNP rs2217262 in DOCK4 was supported by an independent replication, whereas the associations in IMMP2L and ZNF533 are not sufficiently significant in the context of multiple testing and warrant further studies.