Elsevier

Gene

Volume 284, Issues 1–2, 6 February 2002, Pages 203-213
Identification of six novel genes by experimental validation of GeneMachine predicted genes

https://doi.org/10.1016/S0378-1119(01)00897-6

Abstract

In silico gene identification from finished and unfinished human genome sequence has become critically important in many projects seeking insight into the gene content of genomic regions implicated in disease. To establish limitations and criteria for in silico gene identification, and to identify novel genes of potential relevance to human prostate cancer and melanoma, 3 Mb of chromosome 1 sequence was analyzed using GeneMachine. This program is a software suite comprising sequence similarity programs and four gene identification programs. A total of 49 potential transcripts were identified, of which 37 were chosen for experimental validation. We verified 16 of the predicted genes by experimental analysis. Comparing the predicted transcripts with their cloned forms helped to refine the predicted gene models and to identify splice variants for several of them. Although sequences matching ten of our verified genes have recently been deposited in GenBank, the other six remain novel. Our studies support the feasibility of identifying novel genes from regions of interest using draft human genome sequence.
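As a quick check on the counts reported in the abstract, the validation funnel can be worked through in a few lines. This is only an illustrative sketch: the variable and function names are assumptions introduced here, and the percentages are simple ratios of the published counts, not figures reported by the authors.

```python
# Counts taken from the abstract; all names below are illustrative only.
predicted = 49    # potential transcripts derived from GeneMachine output
tested = 37       # transcripts chosen for experimental validation
verified = 16     # predicted genes confirmed experimentally
in_genbank = 10   # verified genes since matched by GenBank entries

# Verified genes with no matching GenBank entry.
novel = verified - in_genbank


def percentage(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


tested_share = percentage(tested, predicted)     # share of predictions tested
validation_rate = percentage(verified, tested)   # share of tested genes verified
novel_share = percentage(novel, verified)        # share of verified genes still novel

print(f"novel genes: {novel}")
print(f"tested: {tested_share}% of predictions")
print(f"verified: {validation_rate}% of tested")
print(f"novel: {novel_share}% of verified")
```

Under these counts, fewer than half of the tested predictions were confirmed, which is in line with the high false-positive rate the Introduction attributes to in silico methods.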

Introduction

Near-complete genome sequence is available for several organisms, including Caenorhabditis elegans, Saccharomyces cerevisiae, and Drosophila melanogaster. With the recent successful completion of the human genome project making >90% of the sequence available to researchers worldwide (Lander et al., 2001), we have entered the transcriptome and proteome era. The major emphasis during this phase is on identifying all genes encoded by the human genome and assigning their functions. Preliminary analysis of the draft sequence suggests that the human genome encodes approximately 30,000–40,000 genes (Lander et al., 2001; Venter et al., 2001). However, GenBank currently contains sequences for only ∼16,000 of these genes. Although over 4 million ESTs have been sequenced, we still lack even partial sequences for many genes. This poses a real problem when examining disease gene candidate regions, where a complete gene content map is critical. The identification of genes from finished and unfinished human genome sequence therefore becomes critically important. Molecular techniques for identifying coding sequences (e.g. cDNA selection, exon trapping, expressed sequence tags) are extremely laborious and time consuming. Hence, in silico gene mapping and gene identification have become the methods of choice for many laboratories seeking insight into the gene content of genomic regions implicated in disease. Although in silico methods for gene identification are constantly improving, the false-positive rate is still high. It is difficult to select the best candidates from predicted genes because the selection criteria and the limitations of gene identification methods are not well established. Although the whole human genome sequence has been analyzed and annotated, many genes are still either completely missed or incorrectly predicted, as shown by Wiemann et al. (2001) in their analysis of chromosome 21.

The goals of our study were to identify novel candidate genes, to evaluate the performance of in silico gene identification methods, and to establish criteria for selecting the best candidates from genes predicted from either finished or draft sequence. We therefore performed computational and experimental analysis of approximately 3 Mb of human genomic sequence from chromosome 1. Roughly half of the sequence is from 1q25, a region encompassing the hereditary prostate cancer (HPC1) locus (Smith et al., 1996). We have previously reported a physical map of the HPC1 region (Carpten et al., 2000) and the cloning of several novel transcripts (Sood et al., 2001). As part of an ongoing effort to identify the HPC1 gene, we analyzed available genomic sequences to identify novel candidate genes. In addition, we analyzed sequences from other regions of chromosome 1, since abnormalities of this chromosome are involved in many cancers, including malignant melanoma. Several chromosome 1 rearrangement breakpoints involved in the pathogenesis of melanoma have been reported (Smedley et al., 2000).

Sequence analysis

Human chromosome 1 sequences representing 20 BAC clones were downloaded from GenBank. Of the 20 clones, three were finished and 17 were in the assembly phase, so each of the latter comprised several sequence contigs (Table 1).

Sequence analysis was performed using an analysis and annotation tool called GeneMachine (Makalowska et al., 2001). GeneMachine is a suite of Perl programs and modules, each of which runs a gene prediction or similarity search program, parses the resulting output, and

Gene/exon prediction

We used GeneMachine, a software suite comprising four exon/gene prediction programs, each utilizing a different gene identification algorithm and model, to predict individual exons or partial genes. Since most of our sequence was in relatively short contigs from unfinished clones, we expected only fragments of genes rather than entire gene sequences; prediction is more difficult when the gene(s) in question have incomplete genomic context. In such a situation, criteria for selecting

Discussion

One of the most critical steps in examining a disease gene candidate region is to identify all of its genes and to build a saturated transcript map. Recently, transcript identification has moved from the wet lab to in silico approaches, now that nearly complete human genome sequence is available. A variety of programs, utilizing different algorithms and models, have been developed. Here we present the identification of novel genes using a sequence analysis tool GeneMachine, a software suite

Acknowledgements

The authors would like to thank Jennifer Shehadeh for assistance with the artwork and Tracy Moses for outstanding technical support. This work was supported through the National Human Genome Research Institute Intramural Program.

GenBank Accession Numbers: AF387611–AF387620.

1 These authors contributed equally.