Identification of six novel genes by experimental validation of GeneMachine predicted genes
Introduction
Near-complete genome sequences are available for several organisms, including Caenorhabditis elegans, Saccharomyces cerevisiae, and Drosophila melanogaster. With the recent successful completion of the human genome project making >90% of the sequence available to researchers worldwide (Lander et al., 2001), we have entered the transcriptome and proteome era. The major emphasis during this phase is on identifying all genes encoded by the human genome and assigning their functions. Preliminary analysis of the draft sequence suggests that the human genome encodes approximately 30,000–40,000 genes (Lander et al., 2001; Venter et al., 2001). However, GenBank currently contains sequences for only ∼16,000 of these genes. Although over 4 million ESTs have been sequenced, we still lack even partial sequences for many genes. This poses a real problem when examining disease gene candidate regions, where a complete gene content map is critical. The identification of genes from finished and unfinished human genome sequence is therefore critically important. Molecular techniques for the identification of coding sequences (e.g. cDNA selection, exon trapping, expressed sequence tags) are extremely laborious and time consuming. Hence, in silico gene mapping and gene identification have become the methods of choice for many laboratories seeking insight into the gene content of genomic regions implicated in disease. Although in silico methods for gene identification are constantly improving, the false-positive rate is still high. It is difficult to select the best candidates from predicted genes, because the selection criteria and the limitations of gene identification methods are not well established. Although the whole human genome sequence has been analyzed and annotated, many genes are still either completely missed or incorrectly predicted, as shown by Wiemann et al. (2001) in their analysis of chromosome 21.
The goals of our study were to identify novel candidate genes, to evaluate the performance of in silico gene identification methods, and to establish criteria for selecting the best candidates from genes predicted from either finished or draft sequences. To this end, we performed computational and experimental analysis of approximately 3 Mb of human genomic sequence from chromosome 1. Roughly half of the sequence is from 1q25, a region encompassing the hereditary prostate cancer (HPC1) locus (Smith et al., 1996). We have previously reported a physical map of the HPC1 region (Carpten et al., 2000) and the cloning of several novel transcripts (Sood et al., 2001). As part of an ongoing effort to identify the HPC1 gene, we analyzed the available genomic sequences to identify novel candidate genes. In addition, we analyzed sequences from other regions of chromosome 1, since abnormalities of this chromosome are involved in many cancers, including malignant melanoma. Several chromosome 1 rearrangement breakpoints involved in the pathogenesis of melanoma have been reported (Smedley et al., 2000).
Sequence analysis
Human chromosome 1 sequences representing 20 BAC clones were downloaded from GenBank. Of the 20 clones, three were finished and 17 were in the assembly phase, so each of these 17 clones consisted of several sequence contigs (Table 1).
Sequence analysis was performed using an analysis and annotation tool called GeneMachine (Makalowska et al., 2001). GeneMachine is a suite of Perl programs and modules, each of which runs a gene prediction or similarity search program, parses the resulting output, and
Gene/exon prediction
We used GeneMachine, a software suite comprising four exon/gene prediction programs, each utilizing a different gene identification algorithm and model, to predict individual exons or partial genes. Since most of our sequence was in relatively short contigs from unfinished sequences, we expected only fragments of genes rather than entire gene sequences; prediction is more difficult when the genes in question have incomplete genomic context. In such a situation, criteria for selecting
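Combining the output of several prediction programs can be illustrated with a minimal sketch. The selection criteria are cut off in the text above, so the rule used here (counting how many distinct programs predict an overlapping exon) is an assumption for illustration, not the authors' method. The program labels `prog_a` to `prog_d`, the `Exon` record, and the coordinates are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one predicted exon: which program called it and where.
Exon = namedtuple("Exon", ["program", "start", "end"])


def overlaps(a, b):
    """Return True if two exons share at least one base (inclusive coordinates)."""
    return a.start <= b.end and b.start <= a.end


def support_counts(exons):
    """For each exon, count the distinct programs with an overlapping prediction.

    The exon's own program is included, so the minimum count is 1.
    """
    counts = []
    for exon in exons:
        programs = {other.program for other in exons if overlaps(exon, other)}
        counts.append((exon, len(programs)))
    return counts


# Made-up coordinates on a single contig, for illustration only.
predictions = [
    Exon("prog_a", 100, 250),
    Exon("prog_b", 120, 250),
    Exon("prog_c", 110, 240),
    Exon("prog_d", 900, 1000),
]

for exon, n in support_counts(predictions):
    print(exon.program, exon.start, exon.end, "supported by", n, "program(s)")
```

Under this assumed rule, an exon called by several programs would rank above one called by a single program. Any real cutoff would need to be checked against experimentally validated genes.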
Discussion
One of the most critical steps in examining a disease gene candidate region is to identify all genes and to build a saturated transcript map. Recently, transcript identification has moved from the wet lab to in silico approaches, now that a nearly complete human genome sequence is available. A variety of programs, utilizing different algorithms and models, have been developed. Here we present the identification of novel genes using the sequence analysis tool GeneMachine, a software suite
Acknowledgements
The authors would like to thank Jennifer Shehadeh for assistance with the artwork and Tracy Moses for outstanding technical support. This work was supported through the National Human Genome Research Institute Intramural Program.
References (28)
- et al. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. (1997)
- et al. Evaluation of gene structure prediction programs. Genomics (1996)
- A 6-Mb high-resolution physical and transcription map encompassing the hereditary prostate cancer 1 (HPC1) region. Genomics (2000)
- Genomic scrap yard: how genomes utilize all that junk. Gene (2000)
- et al. Alu sequences in the coding regions of mRNA: a source of protein variability. Trends Genet. (1994)
- et al. Computational and experimental analysis identifies many novel human genes. Biochem. Biophys. Res. Commun. (2000)
- et al. Criteria for gene identification and features of genome organization: analysis of 6.5 Mb of DNA sequence from human chromosome 21. Gene (2000)
- et al. Cloning and characterization of 13 novel transcripts and the human RGS8 gene from the 1q25 region encompassing the hereditary prostate cancer (HPC1) locus. Genomics (2001)
- et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
- Predictive methods using DNA sequences
- Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet.
- Assessment of protein coding measures. Nucleic Acids Res.
- Large-scale sequencing of two regions in human chromosome 7q22: analysis of 650 kb of genomic sequence around the EPO and CUTL1 loci reveals 17 genes. Genome Res.
- An assessment of gene prediction accuracy in large DNA sequences. Genome Res.