Main

The International Human Genome Sequencing Consortium (IHGSC) used a hierarchical mapping and sequencing strategy to construct the working draft of the human genome. This clone-based approach involves generating an overlapping series of clones that covers the entire genome. Each clone is ‘fingerprinted’ on the basis of the pattern of fragments generated by restriction enzyme digestion1,2. Clones are then selected for shotgun sequencing and the whole genome sequence is reconstructed by map-guided assembly of overlapping clone sequences3.

The availability of the whole-genome clone-based map assisted the sequencing of the human genome in many respects. The fingerprinted BAC map made it possible to select clones for sequencing that would ensure comprehensive coverage of the genome and reduce sequencing redundancy. In addition, the challenge of sequence assembly was minimized by restricting random shotgun sequencing to individual clones. Furthermore, the clone-based map also enabled the identification of large segments of the genome that are repeated, thereby simplifying the assembly. Many IHGSC centres had developed chromosomal maps and resources that were not integrated, so it was essential to have a unifying genome map to enable localization of clones, with respect to previously sequenced clones, before they were sequenced. The accurate fingerprinting and sizing of each clone enabled us to verify the accuracy of shotgun sequence4 assembly of each clone.

The human genome presented unique challenges for the development of a clone-based physical map. Its size of 3.2 gigabases (Gb), which is 25 times as large as any previously mapped genome, meant that proportionately more clones had to be analysed. Its greater complexity also made it more difficult to distinguish true overlaps, which was further complicated by the repeat-rich nature of the genome. Early efforts to construct clone-based regional and even chromosomal physical maps of the human genome using cosmid libraries derived from isolated human chromosomes met with limited success5,6. By contrast, maps based on sequence-tagged site (STS) landmarks provided greater coverage of the genome7,8,9, as did genetic maps based on variations in simple sequence repeats in STS landmarks10,11. The development of P1-artificial chromosome (PAC)12 and bacterial artificial chromosome (BAC)13 cloning systems was pivotal to the success of the whole-genome map. They provided larger inserts, more stable clones and better coverage of the genome.

Clone-based maps similar to that described here have been important in the sequencing of most large genomes, including those of Saccharomyces cerevisiae1, Caenorhabditis elegans2 and Arabidopsis thaliana14. A clone-based map also contributed to the sequencing of the Drosophila melanogaster genome15,16 and a combined mapping and sequencing strategy is being applied to the mouse genome17,18. This work illustrates the benefit of using the clone-based map in the assembly of the human genome sequence.

Construction of the whole-genome BAC map

The pilot phase of the sequencing project began in 1995, at which time efforts were renewed to develop clone-based maps covering specific regions of the genome. To construct these regional maps, we screened PAC and BAC clones for STS markers, fingerprinted the positive clones, integrated them into the existing maps, and selected the largest, intact clones with minimal overlap for sequencing.

To keep pace with the ramping up of the sequencing effort in 1998, the ongoing efforts to construct the whole-genome BAC map were increased approximately tenfold. The whole-genome BAC map was constructed in several steps. First we collected fingerprint data for a large sample of random clones from a genome-wide BAC library. We then assembled the BAC map, first by using the fingerprint data to cluster highly related clones automatically, then by further refining them manually, and last by merging contigs with related clones at their ends. Finally, in parallel with construction of the BAC map, we mapped the chromosomal positions of individual clones on the basis of landmarks from existing landmark maps.

Fingerprinting the BAC clones

In October 1998, we began fingerprinting 300,000 BACs from the RPCI-11 library19 (http://www.chori.org/bacpac/). Redundancy of sampling was vital to achieve high continuity in the final map14. Assuming an average BAC insert size of 150,000 base pairs (bp) and a genome size of 3.2 Gb, this level of fingerprinting would provide roughly 15-fold coverage of the genome. The library was derived from male DNA, providing full coverage of all 24 human chromosomes but with half as much coverage of the sex chromosomes as of the autosomes. Our experience with the library found it to be of high quality with uniformly large-insert clones, few non-recombinant clones and little cross-contamination of source plates. The RPCI-11 library was one of the first libraries to meet the informed consent criteria in accordance with the NHGRI policy for the Use of Human Subjects in Large Scale Sequencing (http://www.nhgri.nih.gov/Grant_info/Funding/Statements/RFA/Human_subjects.html).
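
The coverage estimate above is simple arithmetic, and it can be checked with a short Python sketch. The function name is illustrative and not part of the original work; the inputs are the figures quoted in the text. The result is close to 14-fold, which the text rounds to roughly 15-fold.

```python
def fold_coverage(n_clones: int, mean_insert_bp: float, genome_bp: float) -> float:
    """Expected clone coverage, in genome equivalents, of a random library sample."""
    return n_clones * mean_insert_bp / genome_bp


# Figures quoted in the text: 300,000 BACs, 150,000-bp inserts, 3.2-Gb genome.
autosomal = fold_coverage(300_000, 150_000, 3.2e9)
print(f"autosomes: {autosomal:.1f}-fold")  # prints "autosomes: 14.1-fold"

# A male-derived library carries one X and one Y per diploid genome,
# so each sex chromosome is sampled at half the autosomal depth.
print(f"X or Y:    {autosomal / 2:.1f}-fold")
```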

To meet the goal of fingerprinting 300,000 BAC clones in one year, we devised a tandem 121-lane agarose gel format, allowing the simultaneous electrophoresis of 50 standard ‘marker’ DNA lanes and of 192 BAC restriction digests (Fig. 1), thereby reducing the number of gels, without loss of restriction fragment size accuracy or fidelity of clone tracking (see Supplementary Information). With these and other improvements in the fingerprinting technology and resources, we increased throughput tenfold to process more than 20,000 fingerprints (which equates to approximately onefold clone coverage of the human genome) each week. We also sampled clones from the RPCI-13 and CT-C/D1 BAC libraries, which were constructed using a different restriction enzyme (Table 1). This provided differential sampling of the genome, given the different distribution of the restriction enzyme sites within the genome. In addition, the RPCI-13 library is derived from female DNA, which improves the representation of the X chromosome in the whole-genome BAC map.

Figure 1: Example of the improved high-throughput fingerprint gel.

BAC DNAs are digested with HindIII and visualized on a SYBR-green-stained 1% agarose gel. Every fifth lane contains a mixture of marker DNAs; the sizes of selected marker fragments are indicated. 0, origin of fragment migration.

Table 1 Sources of clones used

Assembling the BAC map

By the end of 1999, with the fingerprint data on the BAC clones entered into an FPC database20,21,22,23 (http://www.sanger.ac.uk/Software/fpc/), we were ready to construct the initial fingerprint assembly that would form the basis for further work on the map. We experimented with various strategies for automated assembly that would be as complete and as consistent as possible (see Supplementary Information).

First, we edited the fingerprint data itself. In early tests of assembly, we found that the variability in the mobility of small fragments (< 600 bp) led to artefactually low estimates of overlaps between clones. We therefore removed fragments smaller than 600 bp before assembly. Similarly, variability in estimating band numbers in ‘multiplets’ (instances where more than one fragment is located at nearly the same position on the gel) also caused problems. To reduce the variability between the number of bands called in these multiplet situations and thus increase the reliability with which related clones are correctly overlapped, these fragments were collapsed to only a single band in the resulting fingerprint. This ‘sanitizing’ process resulted in clusters of increased reliability.
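
The two sanitizing steps can be sketched as follows. This is a minimal illustration, not the code used for the map: the function name is hypothetical, fragments are represented by their sizes in bp rather than by gel mobility, and the 1% window used to recognize a multiplet is an assumed value chosen only for the example.

```python
def sanitize_fingerprint(sizes_bp, min_size_bp=600, multiplet_rel_tol=0.01):
    """Drop small fragments and collapse near-coincident bands to one band.

    multiplet_rel_tol is an assumed parameter: bands within this relative
    distance of the first band of a group are treated as one multiplet.
    """
    kept = []
    for size in sorted(s for s in sizes_bp if s >= min_size_bp):
        # Compare against the first band of the current group so that a
        # long run of closely spaced bands cannot drift indefinitely.
        if kept and size - kept[-1] <= multiplet_rel_tol * kept[-1]:
            continue
        kept.append(size)
    return kept


raw = [450, 590, 1200, 1205, 1209, 3400, 8000, 8050]
print(sanitize_fingerprint(raw))  # prints [1200, 3400, 8000]
```

Collapsing a multiplet discards copy-number information, which is the trade-off the text describes: fewer bands per clone, but band counts that are reproducible between clones.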

Second, we evaluated the impact of varying the threshold for the ‘overlap statistic’, which is a measure of clone similarity, and the tolerance for accepting two bands from different clones as the same. We compared the clusters obtained for consistency with known regions and with other mapping data for the fingerprinted clones (primarily radiation hybrid chromosomal localization data from the Stanford Human Genome Center (SHGC)). The parameters finally used (overlap statistic of 3 × 10⁻¹² or about 75% clone overlap and 0.7 mm tolerance) balanced the total number of clusters (which decreased with less stringent parameters) and the number of chimaeric clusters (which decreased with more stringent parameters). The automated assembly of 283,287 BAC clones resulted in 7,133 clusters containing 93% of all fingerprints in the database (Table 2). The remaining unincorporated clones (singletons) were excluded, as they contained too few bands to be included by automated assembly under these conditions or simply had no closely related clones. These latter clones included artefacts such as clones that had rearranged or had poor quality data, as well as rare clones representing poorly sampled portions of the genome.
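
To make the roles of the tolerance and the overlap statistic concrete, the sketch below counts bands shared by two clones within a tolerance and then computes a binomial tail probability that this many matches would arise by coincidence. It is one plausible formulation, offered as an assumption; it is not claimed to reproduce the exact calculation in FPC. The function names, the 200-mm gel length and the example band positions are all hypothetical; only the 0.7 mm tolerance and the 3 × 10⁻¹² cutoff come from the text.

```python
from math import comb


def count_shared_bands(bands_a, bands_b, tolerance):
    """Greedy one-to-one matching of band positions within +/- tolerance."""
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def coincidence_probability(n_low, n_high, shared, tolerance, gel_length):
    """Chance that at least `shared` of n_low bands match n_high bands by accident."""
    # Probability that one band falls within tolerance of any of n_high bands.
    p_band = 1.0 - (1.0 - 2.0 * tolerance / gel_length) ** n_high
    return sum(
        comb(n_low, m) * p_band**m * (1.0 - p_band) ** (n_low - m)
        for m in range(shared, n_low + 1)
    )


# Hypothetical band positions in mm; clone_b is offset so that 15 bands coincide.
clone_a = [5.0 * k for k in range(1, 21)]
clone_b = [5.0 * k + 0.3 for k in range(6, 26)]
shared = count_shared_bands(clone_a, clone_b, tolerance=0.7)
score = coincidence_probability(20, 20, shared, tolerance=0.7, gel_length=200.0)
print(shared, score, score <= 3e-12)
```

A smaller score means a less likely coincidence, so lowering the cutoff makes clustering more stringent, consistent with the trade-off between cluster number and chimaeric clusters described above.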

Table 2 Status of FPC database after automated assembly and manual editing

As fingerprints from new clones were added after the initial assembly, there was a disproportionate increase in the number of singletons (Table 2). These new data were only incorporated into existing clusters or contigs if they added needed depth or helped to join contigs. We noted a further increase in singletons as new libraries were sampled (particularly from the RPCI-13 and CT-C/D1 libraries). One possible explanation is that these new libraries encompass regions of the genome not represented in the initial RPCI-11 library.

Most clones (97.5%) in the current whole-genome BAC map are derived from RPCI-11 (272,027/69.2%), RPCI-13 (59,051/14.9%) and CT-C/D1 (52,725/13.3%) (Table 1). Although only about two-thirds of the fingerprint data are derived from DNA from a single individual, we did not experience any problems in assembly arising from polymorphisms between the individuals from whom the DNA was obtained.

Achieving map continuity

The goals of the manual editing were to refine the ordering of the clones within clusters to create contigs, to disassemble larger chimaeric contigs (representing clusters of two or more sets of non-overlapping clones) and to join contigs. This process involved first editing the fingerprint assemblies (using the tools encapsulated in FPC) to ensure that every clone within a contig was properly situated with respect to its most highly related neighbours, defined by fingerprint similarity14 (see Supplementary Information). About 600 chimaeric clusters were identified and disassembled. To identify potential joins, we then used clones at the extreme ends of each contig to query the FPC database at a lower required fingerprint overlap stringency (overlap statistic of 1 × 10⁻⁸ or about 50% clone overlap) than was used during initial assembly. Joins were incorporated into the map if the fingerprinting data were logically consistent with the proposed map order (Fig. 2).

Figure 2: Example contig from the whole-genome BAC map.

Portion of contig shown is localized to chromosomal region 8q21, composed of 836 BAC clones ordered by restriction fingerprint mapping. a, Contig summary information. Only 287 of the 836 clones are displayed. Redundant clones are ‘buried’ in their parent clones as indicated by the + and * clone name suffixes (see (c)). The contig contains 193 markers; 77 clones have been selected for sequencing. Contig length: 1,552 unique restriction fragments (6 Mb). b, Markers associated with clones in the display. Green: specifically associated with clone N0363E06 (aqua in c). There are 69,507 markers currently in the database associated with clones, largely by ePCR. Only one marker of the 62 shown is inconsistent with the 8q21 localization of this contig (D17S978, red underline). This is probably not a unique marker in the genome as the clone with which it is associated also contains several chromosome 8 markers. c, Partial display of the contig, showing 112 of the 287 clones visible with this view. Blue, example clones selected for sequencing. These clones were believed to overlap as they shared several restriction fragments; overlaps have been confirmed by working draft sequence. d, Data associated with the clones in c. FISH data (for example JMF-8q21.1) is consistent, except for one clone (N005M18, 9q22, red underline), probably owing to a clone-tracking error (the placement of the associated accessioned sequence (AC022821) in this location is supported by sequence overlap with surrounding clone sequences (clone N0813B08, AC069139)). Chromosomal localization of clones using STSs derived from BAC end sequences (for example, COX_8) is also consistent, with one exception (N0028G16, chromosome 14 COX_14, red underline), probably owing to incorrect association of an end sequence with a clone name in the BAC end sequence database. GenBank accession numbers are indicated.
Sequences were mapped to the associated clone using in silico restriction digests, BAC end sequences and sequence overlap. Around 11.5% of the accessioned sequences have an incorrect clone name in their GenBank record, so proper placement of the sequence relative to the physical map was achieved in this manner. N00792N11 and N0096I13 are associated with accessioned sequences AC026617 and AF181449, respectively. The incorrect clone name referenced in their sequence records is indicated. e, Markers associated with the GeneMap’99 radiation hybrid map. Several are associated with clones in this contig (c), further positioning this contig within the genome.

The most notable effect of the intensive editing was the greater than fivefold reduction in total contigs, from a high of 7,700 contigs after chimaeric contigs had been disassembled, to 1,246 by the 7 October 2000 data freeze of the draft genome sequence3 (Table 2). The longest contig in this set encompasses more than 60 Mb of draft genome sequence and the mean contig size is estimated to be around 2.9 Mb. At the time of writing, the number of contigs had fallen further, to just 965 contigs.

As the contigs became accurately positioned and oriented with respect to one another (see below) and with the emergence of the draft sequence, end clones of adjacent contigs with overlapping sequence were recognized. After inspection of the sequence overlap to rule out shared sequence resulting from internal repeated segments, about half of the candidate joins were well supported by the fingerprint data and were integrated into the map. Another 62 had unconvincing evidence of overlap based on fingerprints but were tagged as overlapping on the basis of sequence alone.

The contigs appeared to be appropriately distributed among the chromosomes on the basis of the expected size of the chromosomes. The number of contigs per chromosome varies with the size of the chromosomes and the efforts made at closure (Table 3). Chromosomes 6, 7, 13, 14, 15, 20 and Y have relatively few remaining gaps, with 21, 29, 15, 21, 19, 10 and 8 contigs, respectively.

Table 3 Chromosomal assignment of contigs

Integration of other map data

To increase the utility of the whole-genome BAC map, we incorporated various map data to anchor the contigs along the 24 chromosomes. Using selected markers from the CEPH Généthon genetic map10, the GeneMap’99 genome-wide radiation hybrid map (http://www.ncbi.nlm.nih.gov/genemap/)24,25,26 and from plasmid library sequences prepared from flow-sorted chromosomes (Sanger Centre, unpublished data), we hybridized 13,695 markers against colony filter replicas of the RPCI-11 library. This enabled us to position 96,283 different BAC clones as genome anchor points for the contigs.

In addition, because the RPCI-11 library was used for other genome initiatives, much additional marker information was available from other laboratories. Importantly, 9,018 STSs derived from BAC end sequences were assigned to chromosomes (D. R. Cox, unpublished data), with many of these selected deliberately because they came from clones in unlocalized contigs. Although over 15% of the available BAC end sequences of the RPCI-11 library are reported to be apparently mislabelled with respect to the microtitre well address from which they originated27, two or more independently derived BAC end positions reliably yielded the correct chromosomal assignment of many contigs. In addition, chromosomal assignment and integration of cytogenetic map positions were achieved by using 3,412 BACs mapped by fluorescence in situ hybridization (FISH)28.

As the working draft sequence accumulated, known markers within the sequence were readily identified by electronic PCR (ePCR), a program that searches sequence for STSs by identifying the associated primer sequences in the correct orientation and with correct spacing29. These data were incorporated into the FPC database. The combined ePCR and hybridized data sets contained 69,507 markers, including 1,659 polymorphic markers from the Généthon genetic map. We primarily used GeneMap’99 for further anchoring and ordering of contigs, as it has a substantial marker set (> 50,000), is well integrated with the Généthon genetic map and provides local ordering at < 1 Mb resolution. Once sequenced clones could be reliably associated with the fingerprinted clones3, we could use the marker content of sequenced clones determined by ePCR to order and orient contigs more reliably. We used markers found on any of six maps (Généthon genetic map, Marshfield genetic map11, WIBR YAC STS-content (http://carbon.wi.mit.edu:8000/cgi-bin/contig/phys_map), GeneMap’99, SHGC G3 radiation hybrid map (http://www-shgc.stanford.edu/Mapping/rh/index.html) and NCBI framework map (http://www.ncbi.nlm.nih.gov/genome/guide/)) to orient contigs with respect to the majority consensus of all maps examined.
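
The principle behind ePCR, as described above, can be illustrated with a short sketch that looks for a forward primer followed by the reverse complement of the reverse primer at a plausible spacing. This is not the ePCR program itself: the function names, the size margin and the toy sequence are assumptions made for illustration, only exact primer matches are considered, and only one strand of the target is searched.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_sts(sequence, forward_primer, reverse_primer, expected_size, size_margin=50):
    """Return (start, product_length) for primer pairs that face each other
    with a product length within size_margin of expected_size."""
    sequence = sequence.upper()
    fwd = forward_primer.upper()
    # On the searched strand the reverse primer appears as its reverse complement.
    rev_site = reverse_complement(reverse_primer.upper())
    hits = []
    start = sequence.find(fwd)
    while start != -1:
        end_site = sequence.find(rev_site, start + len(fwd))
        while end_site != -1:
            product = end_site + len(rev_site) - start
            if abs(product - expected_size) <= size_margin:
                hits.append((start, product))
            end_site = sequence.find(rev_site, end_site + 1)
        start = sequence.find(fwd, start + 1)
    return hits


# Toy target: forward primer at offset 2, 20-base spacer, reverse primer site.
target = "GG" + "ACGTAC" + "A" * 20 + reverse_complement("CCGGAA") + "AA"
print(find_sts(target, "ACGTAC", "CCGGAA", expected_size=32, size_margin=5))
```

Requiring both orientation and spacing is what keeps a short primer sequence that occurs by chance elsewhere in the genome from producing a false marker assignment.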

Integration of specific mapping efforts

We integrated regional map data into the whole-genome BAC map from other genome centres (see list at http://www.nhgri.nih.gov), which enriched the map and helped in the selection of clones for sequencing as it minimized redundancy and improved coverage. The regional mapping data included those for chromosomes 12 (ref. 30), 14 (ref. 31) and Y (ref. 32), and 1, 6, 9, 10, 13, 20, 22 and X (ref. 33). We also integrated mapping data for chromosome 19 (Lawrence Livermore National Laboratory, http://www-bio.llnl.gov/bbrp/genome/html/chrom_map.html) and a 20-Mb segment of chromosome 15 (University of Washington). Telomeric contigs were identified and positioned where possible, as described elsewhere in this issue34.

Some mapping efforts employed clone resources other than the RPCI-11 BAC library. In these cases, clones were sent by these centres, fingerprinted at the Washington University Genome Sequencing Center (WUGSC) and incorporated into the whole-genome BAC map and FPC database. These clones included those from regions of chromosomes 5 (J. Cheng), 8 (A. Rosenthal and N. Shimizu35), 11 (Y. Sakaki) and 17 (J. Ramser). In addition, we used computer-generated restriction digests, or in silico digests, of sequences in GenBank to incorporate these clones into the whole-genome BAC map.
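
An in silico digest of the kind mentioned above can be sketched in a few lines. HindIII recognizes AAGCTT and cuts after the first A; everything else here is a simplifying assumption for illustration: the function name is hypothetical, the sequence is treated as a linear insert, and vector sequence is ignored.

```python
def in_silico_digest(sequence, site="AAGCTT", cut_offset=1, min_size_bp=0):
    """Return sorted fragment sizes from cutting a linear sequence at each site.

    The defaults model HindIII (A^AGCTT). min_size_bp lets the caller drop
    fragments too small to be sized reliably on a gel.
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    sizes = [right - left for left, right in zip(bounds, bounds[1:])]
    return sorted(s for s in sizes if s >= min_size_bp)


toy = "C" * 10 + "AAGCTT" + "G" * 20 + "AAGCTT" + "C" * 5
print(in_silico_digest(toy))  # prints [10, 11, 26]
```

The resulting list of fragment sizes plays the same role as a gel-derived fingerprint, which is what allows sequenced clones to be compared against fingerprinted ones.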

Accuracy of chromosomal positions

As an independent assessment of the accuracy of assigning a chromosomal position to contigs, we used aliquots of the BAC DNA from 96 fingerprinted clones (RPCI-11, clones 512M01–512O24) as FISH probes to metaphase chromosomes (see Methods). Of the 96 BACs examined, 87 were successfully assigned to a single chromosome band. The remaining clones either failed to label (six) or were associated with multiple chromosome bands (three). The chromosomal localization of 82 (94%) of the mapped BACs agreed unambiguously with the derived chromosomal assignment, based on STS content, of the contig into which the corresponding fingerprint had assembled. A single BAC mapped to one of the two positions that were equally well supported by the marker content of its associated contig. The remaining four BACs were associated with fingerprints in contigs that had no mapped marker content and thus were not previously localized. In summary, the FISH mapping data did not conflict with any of the chromosomal assignments of the contigs examined.

In addition, we selected a minimal tiling path of eight clones from a random contig. DNA remaining from the fingerprinting of these clones was used for FISH mapping. All eight clones co-localized to chromosomal segment 8q21.1. This was consistent with other FISH data (B. Trask; 8q21.1), radiation hybrid data (D. R. Cox; chromosome 8) and ePCR of 12 markers mapping to chromosome 8 (Fig. 2).

Accuracy of clone order

The integration of independent map data and the emerging sequence information enabled us to monitor the fidelity of the developing map. We regularly checked that the predicted clone order was reflected in the overlaps of the sequenced clones. The ongoing assignment of chromosomally positioned markers to clones and contigs provided a useful check for possible false joins between unrelated contigs. These checks for clone order and contig fidelity were carried out much more extensively once the draft genome sequence was assembled and additional marker data incorporated. Overall, local clone order agreed with the overlaps demonstrated by sequence.

Comparison of the chromosome 12 STS-content BAC map30 with the fingerprint BAC map of the same chromosome provided an important test of the accuracy of clone ordering. The two maps were derived independently, but used the same RPCI-11 library. The maps are consistent in clone ordering and provide complementary resources: the chromosome 12 STS-content BAC map provides more accurate contig anchoring and orientation, and our map provides more depth of clone coverage. Furthermore, these maps, while sharing some gaps, largely closed gaps for each other, underscoring the benefit of the complementary mapping strategies. After integration, the resulting map consisted of four contigs on the short arm and 34 on the long arm, and this has been further reduced to 20 contigs by continued gap closure methods30.

Duplications and repeats

Two problematic aspects of the genome still need to be resolved: large (> 150 kilobase (kb)) recently duplicated segments and smaller tandemly repeated sequences extending for > 100 kb. Analysis of the total clone population shows that about 1% of clones have unusually high numbers of closely related clones (>75% shared bands), indicative of large repeated sequences. In some cases, minor differences in band patterns have allowed some complex repeats to be tentatively teased apart, but many of these have yet to be investigated in detail at the sequence level (an exception is the Y chromosome32). In other cases, only more complete and finished sequence will clarify mapping data for these regions.

The presence of extensive smaller tandemly repeated sequences (which sometimes are not even successfully cloned) results in clones that resemble small-insert and badly deleted clones, which we avoided including in the map. However, unlike the small and deleted clones, tandem repeats are present in multiple independent clones that display a similar fingerprint pattern. Sequence analysis of some of these repeat sequences shows that they are related to centromeric and ribosomal repeats, among others.

Gaps in the map

The remaining gaps, currently fewer than 1,000, are likely to stem from a variety of causes. There may be overlaps between end clones that are too small to be detected by fingerprints and which will only be recognized once the end clones are sequenced. Gaps can also arise because of misassemblies in the map, particularly where a duplicated segment is inappropriately designated to represent just one region. Some gaps may be detected from analysis of other BAC DNA libraries constructed using different restriction enzymes. Other gaps may arise simply because clones are not recovered at sufficient frequencies in BAC or PAC large insert libraries—clones spanning these gaps could potentially be detected in YAC libraries or might need to be recovered using special approaches.

Coverage of the genome

To estimate the fraction of the genome that was represented in the whole-genome map, we analysed chromosomes 21 and 22. Using in silico digest methods, we estimated the coverage of the fingerprint map encompassing finished chromosomes 21 (ref. 36) and 22 (ref. 37). Simulated 175-kb clones were created and digested in silico from the contiguous sequences for these chromosomes; each clone overlapped by 40%. We compared these digested simulated clones against the FPC database at high fingerprint overlap stringency. For chromosome 21, 316 simulated clones were created, of which 315 had at least 15 HindIII restriction fragments; clones containing fewer bands are difficult to compare. Of the 315 simulated fingerprints, 309 (98%) matched a related clone in the whole-genome BAC fingerprint FPC database. Similarly, for chromosome 22, 308 simulated clones were created. Of those, 303 had more than 15 HindIII restriction fragments and, when compared to the entire FPC fingerprint database, 297 (98%) found a related clone. This analysis was repeated using a 210-kb in silico clone size with 90% overlap, with similar results. Each of these chromosomes has four sequence gaps that are estimated to encompass 1.6% of the chromosome; therefore, the confirmed clone coverage of the euchromatic region of these chromosomes is around 96%. Collectively, these chromosomes represent approximately 3% of the genome. It is probably reasonable to extrapolate that this level of clone coverage will be found throughout the genome.
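
The tiling of simulated clones and the final coverage arithmetic can be sketched as follows. The function names are hypothetical, and the tiling rule (fixed step, no partial clone at the chromosome end) is an assumption, so the sketch is not expected to reproduce the exact clone counts reported above.

```python
def tile_clones(chrom_length_bp, clone_size_bp=175_000, overlap_percent=40):
    """Return (start, end) windows of fixed size, each overlapping the
    previous one by overlap_percent of its length."""
    step = clone_size_bp * (100 - overlap_percent) // 100
    last_start = max(chrom_length_bp - clone_size_bp, 0)
    return [(s, s + clone_size_bp) for s in range(0, last_start + 1, step)]


def estimated_coverage(matched_fraction, gap_fraction):
    """Clone coverage after discounting sequence gaps that could not be tested."""
    return matched_fraction * (1.0 - gap_fraction)


clones = tile_clones(1_000_000)
print(len(clones), clones[:2])

# 98% of simulated clones matched; sequence gaps span about 1.6% of each chromosome.
print(f"{estimated_coverage(0.98, 0.016):.1%}")  # prints "96.4%"
```

Each window would then be digested in silico and compared against the fingerprint database; the second function simply reproduces the step from 98% matched clones to about 96% confirmed coverage.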

Clone selection for sequencing

The whole-genome BAC map was, and continues to be, used to select clones for sequencing. We devised algorithms for automatic high-throughput selection of BACs, which specifically choose clones from contigs lacking sequenced clones (‘seed clones’) and clones that extend from already selected clones (see Supplementary Information). We took several issues into account when developing these programs. First, we had to devise methods compatible with a constantly and rapidly evolving map, which had considerable new information added to it each week. Second, we had to avoid clones representing genomic regions already sequenced from libraries other than RPCI-11. Third, we wished to select only clones that were not deleted or otherwise rearranged and that thus faithfully represented the underlying genome.

The fingerprint map was used initially to identify nonredundant seed clones for sequencing when only a small portion of the RPCI-11 clones had been fingerprinted. As described above, we used all available forms of mapping data to localize the clones, and thus the contig. Once an appropriate contig was identified, the program looked for the largest clone (smaller than 225 kb to avoid artefacts) in the contig. The program also checked the fidelity of the clone by comparing its bands against other clones. We avoided end clones, as they inevitably had bands that could not be confirmed. In addition, a clone registry was developed (NCBI) to track clones selected for sequencing by any centre, and contigs with these clones were also avoided, as were contigs containing clones with similarities to other clones in GenBank as detected by in silico digest.

The next step in automated clone selection was to extend progressively from the seed clones using tools to search for appropriately overlapping clones. Neighbouring clones were evaluated using the overlap statistic to provide a tentative clone order. Clones within a specified range of overlap statistic were evaluated for the total size of shared bands. The amount of acceptable overlap was also specified (typically 25%). Any candidate in turn was evaluated against an intermediately positioned clone to ensure that the overlap was genuine and was compared to existing data using the clone registry and in silico digests to avoid redundancy.
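
The extension step can be illustrated with a deliberately simplified sketch. It is not the selection program described in the Supplementary Information: the function names, the 50-bp matching tolerance, the 50% upper limit and the rule of choosing the candidate closest to the 25% target are all assumptions, and the checks against an intermediate clone, the clone registry and in silico digests are omitted.

```python
def shared_fraction(seed_sizes, candidate_sizes, tol_bp=50):
    """Fraction of the candidate's total fragment size that is also in the seed."""
    remaining = sorted(seed_sizes)
    shared = 0
    for size in sorted(candidate_sizes):
        for k, other in enumerate(remaining):
            if abs(size - other) <= tol_bp:
                shared += size
                del remaining[k]  # each seed band may be matched only once
                break
    return shared / sum(candidate_sizes)


def pick_extension(seed_sizes, candidates, target=0.25, max_overlap=0.5):
    """Choose the overlapping candidate whose shared fraction is nearest target."""
    best_name, best_gap = None, None
    for name, sizes in candidates.items():
        frac = shared_fraction(seed_sizes, sizes)
        if 0 < frac <= max_overlap:
            gap = abs(frac - target)
            if best_gap is None or gap < best_gap:
                best_name, best_gap = name, gap
    return best_name


seed = [1000, 2000, 3000, 4000]
candidates = {
    "A": [3000, 4000, 5000, 6000, 10000],  # shares 7,000 of 28,000 bp
    "B": [1000, 2000, 3000, 4000, 500],    # almost entirely redundant
    "C": [7000, 8000],                     # no detectable overlap
}
print(pick_extension(seed, candidates))  # prints "A"
```

Measuring overlap by the summed size of shared bands, rather than by band count alone, is what lets the acceptable overlap be expressed as a fraction of clone length.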

These automated tools were used until late January 2000, when the manually evaluated contigs became available, allowing the selection of minimal tiling paths based on these clone orders. In the course of generating the working draft, more than 10,000 BAC clones were selected for the sequencing pipelines at the WUGSC, Whitehead Institute for Biomedical Research and the Stanford Genome and Technology Center using these tools and the evolving whole-genome BAC map. A check of 518 overlaps between finished clones selected both manually and through the automated methods at WUGSC shows that they have an average overlap of 47.5 kb with their neighbours, or about 28%: this is an acceptable degree of overlap, given the relatively dense seeding that occurred, and the importance placed on achieving coverage.

Sequence map of the human genome

Although the whole-genome BAC map was constructed primarily to exploit the coverage of high-redundancy BAC libraries for use in sequencing the human genome, it has served to integrate the sequences in GenBank38 with the physical map. This was needed to guide the long-range assembly of the working draft sequence and to identify all remaining gaps in this sequence map so that spanning clones could be selected. By using in silico digests to generate fragment size information that could be compared to the fingerprints in the FPC database, virtually all except for short sequences (such as individual cosmids) in GenBank were positioned onto the whole-genome BAC map (see Methods). Additional information, such as BAC end sequence alignment and clone sequence overlap, was used to augment the in silico digest placement (if needed) and, in some cases, multiple sequences were positioned as part of larger assemblies of overlapping sequences (NT segments, NCBI). From these analyses, we determined that as many as 11.5% of the sequences in GenBank had incorrect clone names referenced in their GenBank records, probably owing mostly to data tracking and clone retrieval errors at the genome centres. A consequence of these naming errors was that many contigs contained clones associated with multiple markers determined by ePCR that mapped collectively to a single region of the genome, but were inconsistent with the remaining clone-to-marker associations in the contig. This was a direct result of incorrect clone names being associated with sequences and hence, incorrect assignment of the markers to those clones in the FPC database. Mapping of sequences to the clone map corrected the naming errors and resolved seemingly out of place ePCR markers once the sequence in which they were detected was properly situated within the ordered contigs. 
We have found that the correct clone can be retrieved 95% of the time using the whole-genome BAC map, as judged by comparing the fingerprint obtained for the retrieved clone with that in the database. Some clones could not be retrieved owing to growth failures, and others represent data-tracking errors within the fingerprint set. The high level of redundancy of the whole-genome BAC map allows a substitute clone to be readily selected to replace the 5% of clones that are not recovered on the first attempt.

Once the sequences were aligned to the whole-genome BAC fingerprint map, we used these data as a foundation for determining a nonredundant sequence path across the genome. The map order and placement of the sequences with respect to the whole-genome BAC map were considered in the sequence assembly to minimize errors due to potential false assignment of overlaps between related but not identical sequences. The BAC map placements were used as a localization guide only and did not completely constrain the sequence assembly, to avoid any propagation of errors and imprecision of clone placement. The analysis of markers identified within the genome sequence enables a detailed comparison of the whole-genome BAC map with other established landmark-content, radiation hybrid and genetic maps3. There was overall agreement between the sequence assembly that overlays the whole-genome BAC map and other existing maps, with local exceptions. In most instances, these local disagreements indicated the need simply to reverse the current orientation of the underlying BAC contig, and this has been done in the present version of the map.

Conclusions

The whole-genome BAC map allowed the integration of a range of data, including FISH cytogenetic clone localizations, landmark data obtained by PCR and hybridization screening, clones from other libraries with associated map data, and working draft and finished clone sequence and associated ePCR landmarks. New data will continue to be incorporated into this growing database. The entire FPC database of the human genome BAC fingerprint map can be obtained from http://genome.wustl.edu/gsc/human/Mapping/index.shtml. A searchable AceDb39 version of the whole-genome BAC map is also accessible at http://genome.wustl.edu/gsc/Search/db.shtml, and an overview of the map is available as Supplementary Information.

This clone-based map has been vital for the accurate assembly of the human genome sequence3. The BAC clones comprising the clone-based map also provide an integrated resource for analysis of chromosome structure, comparative genome hybridization40 and functional genetics, including gene inactivation41. Together, the human genome clone map and the anchored sequence map provide synergistic resources for future analysis of the human genome.

Methods

Regional approach to large-scale physical map construction

The general approach involved screening genomic BAC and PAC libraries by PCR or by probe hybridization using overgo probes42 to identify clones corresponding to specific STS markers. Overgo probes are made by filling in the single-stranded overhangs of two overlapping oligonucleotides using radiolabelled nucleotides and Klenow polymerase. Typically, we used two 24-mers overlapping by 8 bp to generate a radiolabelled double-stranded 40-mer. Overgo probes were arranged in three-dimensional arrays with six probes on each axis (giving 216 probes per array). A five-directional pooling strategy allowed resolution of 80–90% of all markers with only 30 hybridizations. More than 25,000 human and mouse markers have been associated with BACs using this probe type at the WUGSC (J. McPherson). Once marker-positive clones had been identified, fingerprints were generated from them using HindIII restriction enzyme digests with fragment separation on 1% agarose gels43. The gel images were analysed using Image (http://www.sanger.ac.uk/Software/Image/)20,21, and the fingerprints were examined manually within FPC to build contigs and to select clones for sequencing that span contigs. Manual editing of the automated band calls was required because of inconsistencies in band identification.
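
The pooling arithmetic above (216 probes per array, 30 hybridizations) can be illustrated with a short sketch. The three axis directions follow from the cube layout, but the text does not specify the two additional directions; the diagonal pools used here, and all function names, are assumptions made purely for illustration.

```python
from itertools import product

SIDE = 6  # probes per axis, as stated in the text

# Five pooling directions. The axis directions come from the cube layout;
# the two diagonal directions are ASSUMED and may differ from those used.
DIRECTIONS = {
    "x": lambda x, y, z: x,
    "y": lambda x, y, z: y,
    "z": lambda x, y, z: z,
    "diag_xy": lambda x, y, z: (x + y) % SIDE,  # assumed
    "diag_yz": lambda x, y, z: (y + z) % SIDE,  # assumed
}

PROBES = list(product(range(SIDE), repeat=3))  # 6 * 6 * 6 = 216 positions


def pools_for(probe, directions=DIRECTIONS):
    """Return the (direction, index) pools that contain one probe."""
    return {(name, f(*probe)) for name, f in directions.items()}


def positive_pools(true_probes, directions=DIRECTIONS):
    """Pools that would light up for a clone hit by the given probes."""
    positives = set()
    for probe in true_probes:
        positives |= pools_for(probe, directions)
    return positives


def decode(positives, directions=DIRECTIONS):
    """Candidate probes: those whose every pool is positive."""
    return [p for p in PROBES if pools_for(p, directions) <= positives]


all_pools = set().union(*(pools_for(p) for p in PROBES))
print(len(PROBES), len(all_pools))  # 216 30

# A clone hit by two probes is ambiguous with axis pools alone,
# but the extra directions remove the false candidates in this case.
hits = [(0, 0, 0), (1, 2, 3)]
axes_only = {k: DIRECTIONS[k] for k in "xyz"}
print(len(decode(positive_pools(hits, axes_only), axes_only)))  # 8
print(decode(positive_pools(hits)))  # [(0, 0, 0), (1, 2, 3)]
```

Each direction contributes six pools, so five directions give the 30 hybridizations quoted above. Some combinations of hits remain ambiguous even with five directions, which is consistent with resolution falling short of 100%.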

Fluorescence in situ hybridization

Probes were generated from aliquots of the same BAC DNA that was used for the HindIII fingerprints. Labelling was carried out with the Prime-it Fluor labelling kit (Stratagene), which incorporates fluor-12-dUTP into the probe fragments by random priming. Probes were hybridized to chromosome spreads on slides with competitor DNA present. Slides were processed essentially according to standard methods44. Data were collected and analysed using a Zeiss Axiophot microscope equipped with the Genus camera setup and software (Applied Imaging Corporation).

Integration of sequenced clones using synthetic fingerprints from in silico digests

BAC-sized clones were simulated from finished contiguous sequenced regions of DNA. The sequences were cut into 175-kb fragments, each overlapping the previous fragment by 40%. Bands smaller than 600 bp were then removed from consideration to be consistent with the fingerprint data. Fingerprint data were converted from mobilities to sizes, so that clones from the fingerprinting effort could be compared directly to sequenced clones from any library or group (when comparing size data, the FPC tolerance variable was set to 10). For clones that were not finished, each contig of the sequence was digested and all end fragments were removed. The remaining fragments were combined to create an in silico digest for unfinished clones.
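
The in silico digest procedure above can be sketched as follows. This is a minimal illustration rather than the pipeline that was used: the function names are hypothetical, a complete HindIII digest of a linear sequence is assumed (recognition site AAGCTT, cut after the first A), and vector sequence is ignored.

```python
HINDIII_SITE = "AAGCTT"
CUT_OFFSET = 1   # HindIII cuts A^AGCTT
MIN_BAND = 600   # bands smaller than this are dropped, as in the text


def digest(seq, site=HINDIII_SITE, offset=CUT_OFFSET):
    """Fragment lengths from a complete digest of a linear sequence."""
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def simulated_clones(seq, size=175_000, overlap_pct=40):
    """Yield (start, subsequence) windows overlapping by overlap_pct."""
    step = size * (100 - overlap_pct) // 100
    start = 0
    while True:
        yield start, seq[start:start + size]
        if start + size >= len(seq):
            break
        start += step


def fingerprint_finished(seq):
    """Sorted band sizes for a finished sequence, small bands removed."""
    return sorted(f for f in digest(seq) if f >= MIN_BAND)


def fingerprint_unfinished(contigs):
    """Digest each contig, discard its two end fragments, pool the rest."""
    bands = []
    for contig in contigs:
        bands.extend(digest(contig)[1:-1])  # ends are not true fragments
    return sorted(b for b in bands if b >= MIN_BAND)


# Toy sequence with three HindIII sites
toy = "C" * 1000 + "AAGCTT" + "G" * 700 + "AAGCTT" + "C" * 100 + "AAGCTT" + "G" * 2000
print(digest(toy))                    # [1001, 706, 106, 2005]
print(fingerprint_finished(toy))      # [706, 1001, 2005]
print(fingerprint_unfinished([toy]))  # [706]
```

End fragments of unfinished contigs are discarded because their true lengths are unknown until the neighbouring contig is joined; a contig with fewer than two sites therefore contributes no bands.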