Regular Article
Evaluation of Gene Structure Prediction Programs
Abstract
We evaluate a number of computer programs designed to predict the structure of protein coding genes in genomic DNA sequences. Computational gene identification is set to play an increasingly important role in the development of the genome projects, as emphasis turns from mapping to large-scale sequencing. The evaluation presented here serves both to assess the current status of the problem and to identify the most promising approaches to ensure further progress. The programs analyzed were uniformly tested on a large set of vertebrate sequences with simple gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The results indicated that the predictive accuracy of the programs analyzed was lower than originally found. The accuracy was even lower when considering only those sequences that had recently been entered and that did not show any similarity to previously entered sequences. This indicates that the programs are overly dependent on the particularities of the examples they learn from. For most of the programs, accuracy in this test set ranged from 0.60 to 0.70 as measured by the Correlation Coefficient (where 1.0 corresponds to a perfect prediction and 0.0 is the value expected for a random prediction), and the average percentage of exons exactly identified was less than 50%. Only those programs including protein sequence database searches showed substantially greater accuracy. The accuracy of the programs was severely affected by relatively high rates of sequence errors. Since the set on which the programs were tested included only relatively short sequences with simple gene structure, the accuracy of the programs is likely to be even lower when used for large uncharacterized genomic sequences with complex structure. 
Although the currently available programs may still be of great use in such cases for pinpointing the regions likely to contain exons, they are far from powerful enough to elucidate the genomic structure of such sequences completely.
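The nucleotide-level and exon-level measures mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the Correlation Coefficient is the standard correlation between actual and predicted coding/non-coding labels (often called the Matthews correlation coefficient); the abstract does not spell out the formula. The function names, the 0-based end-exclusive exon coordinates, and the toy gene below are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def coding_mask(exons, length):
    """Mark coding positions; exons are (start, end) pairs, 0-based, end-exclusive."""
    mask = [False] * length
    for start, end in exons:
        for i in range(start, end):
            mask[i] = True
    return mask


def nucleotide_cc(true_exons, pred_exons, length):
    """Correlation coefficient between actual and predicted coding labels.

    Assumed formula (not given in the abstract):
    CC = (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))
    Returns None when the denominator is zero (CC undefined).
    """
    actual = coding_mask(true_exons, length)
    predicted = coding_mask(pred_exons, length)
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return None
    return (tp * tn - fp * fn) / denom


def exact_exon_fraction(true_exons, pred_exons):
    """Fraction of actual exons whose two boundaries are both predicted exactly."""
    if not true_exons:
        return None
    predicted = set(pred_exons)
    return sum(exon in predicted for exon in true_exons) / len(true_exons)


# Hypothetical 100-nucleotide sequence with two exons; the second one is
# predicted with both boundaries shifted by 5 nucleotides.
true_exons = [(10, 30), (50, 70)]
pred_exons = [(10, 30), (55, 75)]

print(round(nucleotide_cc(true_exons, pred_exons, 100), 3))  # 0.792
print(exact_exon_fraction(true_exons, pred_exons))           # 0.5
```

In this toy case the shifted exon still overlaps most of the true one, so the nucleotide-level coefficient stays fairly high (about 0.79) while only half of the exons count as exactly identified. This shows how the two levels of evaluation can diverge.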
Cited by (582)
Genome annotation: From human genetics to biodiversity genomics
2023, Cell Genomics
Within the next decade, the genomes of 1.8 million eukaryotic species will be sequenced. Identifying genes in these sequences is essential to understand the biology of the species. This is challenging due to the transcriptional complexity of eukaryotic genomes, which encode hundreds of thousands of transcripts of multiple types. Among these, a small set of protein-coding mRNAs play a disproportionately large role in defining phenotypes. Due to their sequence conservation, orthology can be established, making it possible to define the universal catalog of eukaryotic protein-coding genes. This catalog should substantially contribute to uncovering the genomic events underlying the emergence of eukaryotic phenotypes. This piece briefly reviews the basics of protein-coding gene prediction, discusses challenges in finalizing annotation of the human genome, and proposes strategies for producing annotations across the eukaryotic Tree of Life. This lays the groundwork for obtaining the catalog of all genes: the Earth’s code of life.
Optimum window based modified periodicity spectrum method for the detection of protein coding regions in DNA sequences
2023, Digital Signal Processing: A Review Journal
Identifying protein-coding regions with high accuracy in eukaryotic genomes is a challenging task because these regions are distributed discontinuously along the length of DNA sequences. Various frequency-domain algorithms have been designed for the detection of protein-coding regions since the beginning of the twentieth century. The basic function of frequency-domain approaches is to convert the signal from one domain to another, and consequently the probability of losing important information is quite high. In this paper, a modified periodicity spectrum based algorithm (MPSBA) is proposed for the identification of protein-coding regions in eukaryotic genomes. The proposed algorithm requires no domain transformation. Its key contribution is optimization of the window length, which is varied from 27 to 351 in steps of 3 to find the value that maximizes the area under the curve (AUC). To test the applicability of the proposed algorithm, the benchmark sequence F56F11.4 and then the larger data sets HMR195 and BG570 were employed. Recent state-of-the-art algorithms were compared with the proposed algorithm for performance assessment. The results reflect the superiority of the proposed algorithm and its applicability to identifying protein-coding regions of both short and large sizes.
Identification of exon locations in DNA sequences using a fractional digital anti-notch filter
2023, Biomedical Signal Processing and Control
Identification of protein coding region (exon) locations in DNA sequences is a fundamental initial step in genomic signal processing (GSP). Several techniques have already been applied to achieve this challenging task. However, improvements are still needed. Transform-based methods and digital filtering are among the techniques that have been widely used. These techniques exploit the period-3 property of protein coding regions. This paper proposes the application of a narrowband bandpass fractional digital filter to extract more selectively the single frequency component corresponding to the period-3 frequency from DNA sequences. The ideal fractional digital anti-notch filter has an infinite amplitude at the central frequency and two tuning parameters, which may be used to independently adjust the central frequency and the amplitude frequency response. The ideal filter has been approximated and implemented efficiently as an infinite impulse response (IIR) filter. The effectiveness of the proposed method has been assessed in terms of common performance evaluation metrics computed from the results obtained using DNA sequences taken from the National Center for Biotechnology Information (NCBI) and HMR195 datasets, with different numerical transformations including Voss mapping and the electron–ion interaction potential (EIIP) representation. In addition to overcoming the sliding-window-size problem encountered in transform-based methods, the proposed method has demonstrated superiority over existing state-of-the-art methods for exon location identification on benchmark datasets.
SAVMD: An adaptive signal processing method for identifying protein coding regions
2021, Biomedical Signal Processing and Control
The identification of protein coding regions is a major topic of research in the field of gene prediction. A number of digital signal processing (DSP) based approaches, which exploit 3-base periodicity to detect coding regions, have been proposed. From these previously published approaches, we conclude that an effective method or filter for identifying protein coding regions should fulfill three important properties: independence of the window length, an effective and adaptive frequency response, and a fixed basic frequency matching the 3-base periodicity. However, most published approaches cannot satisfy these three points simultaneously, which limits their identification accuracy. In this paper, we propose an adaptive signal processing method, called sinusoidal-assisted variational mode decomposition (SAVMD), for identifying coding regions. The adaptability of SAVMD is reflected in two aspects: (i) the proposed method analyzes numerical sequences without needing any window information; (ii) the spectrum of the period-3 component can be automatically fitted by SAVMD in the Fourier domain. As a result, our proposed method outperforms other DSP-based methods in terms of identification accuracy, which is verified by the experimental results on five benchmark datasets. When processing a dataset in which most sequences contain undetermined nucleotides (UDT), SAVMD performs better than the model-dependent method AUGUSTUS as well as other model-independent methods. In addition, we conduct a comparative analysis of different numerical conversions of DNA sequences using SAVMD. Several conversion methods suited to SAVMD, selected from this experiment, can serve as a reference for applying other time–frequency decomposition methods in the field of gene prediction.
Gene prediction by the noise-assisted MEMD and wavelet transform for identifying the protein coding regions
2021, Biocybernetics and Biomedical Engineering
The analysis of protein coding regions of DNA sequences is one of the most fundamental applications in bioinformatics. A number of model-independent approaches have been developed for differentiating between the protein-coding and non-protein-coding regions of DNA. However, these methods are often based on univariate analysis algorithms, which leads to the loss of joint information among the four nucleotides of DNA. In this article, we introduce a method based on the noise-assisted multivariate empirical mode decomposition (NA-MEMD) and the modified Gabor-wavelet transform (MGWT). The NA-MEMD algorithm, as a multivariate analysis tool, is used to reconstruct the numerical analyzed sequence, since it enables a matched-scale decomposition across all variables and eliminates mode mixing. By virtue of NA-MEMD, the MGWT method achieves a stable improvement in general identification performance. We compare our method with other Digital Signal Processing (DSP) methods on two representative DNA sequences and three benchmark datasets. The results reveal that our method can enhance the spectra of the analyzed sequences and improve the robustness of MGWT to different DNA sequences, thus obtaining higher identification accuracies for protein coding regions than the other applied methods. In addition, a comparative experiment with the model-dependent method AUGUSTUS on the recently proposed benchmark dataset G3PO verifies the superiority of model-independent methods (especially NA-MEMD-MGWT) for identifying coding regions in poor-quality DNA sequences.
The Chain Alignment Problem
2020, Journal of Computational Science
This paper introduces two new combinatorial optimization problems involving strings, namely, the Chain Alignment Problem and a multiple version of it, the Multiple Chain Alignment Problem. For the first problem, a polynomial-time algorithm using dynamic programming is presented. For the second, a proof of its NP-hardness and inapproximability is provided, together with the main ideas of three heuristics proposed for it. The three heuristics are assessed with simulated data, and the applicability of both problems introduced here is attested by their good results when modeling the Gene Identification Problem.
1. To whom correspondence should be addressed at Institut Municipal d'Investigació Mèdica (IMIM), C/Dr. Aiguader 80, E-08003 Barcelona, Spain. Telephone: (343) 221-1009. Fax: (343) 221-3237. E-mail: [email protected].