Abstract
We present a new algorithm for functional site analysis. It is based on four main assumptions: each variation of nucleotide composition makes a different contribution to the overall binding free energy of interaction between a functional site and another molecule; nonfunctioning site-like regions (pseudosites) are absent or rare in genomes; there may be errors in the sample of sites; and nucleotides at different site positions are considered to be mutually dependent. In this algorithm, the site set is divided into subsets, each described by its own consensus. Donor splice sites of human protein-coding genes were analyzed. Comparison with other methods of donor splice site prediction showed that the consensus sequences AG/GU(A,G), G/GUnAG, /GU(A,G)AG, /GU(A,G)nGU, and G/GUA give a more accurate prediction than is achieved by a weight matrix or by the consensus (A,C)AG/GU(A,G)AGU with mismatches. The probability of the first type error, E1, for the obtained consensus set was about 0.05, and the probability of the second type error, E2, was 0.15. The analysis demonstrated that the accuracy of functional site prediction can be improved by taking into account correlations between site positions. The accuracy of prediction using the human consensus sequences was tested on sequences from different organisms. Some differences in consensus sequences were revealed for the plant Arabidopsis sp., the invertebrate Caenorhabditis sp., and the fungus Aspergillus sp. For the yeast Saccharomyces sp., only one conserved consensus, /GUA(U,A,C)G(U,A,C), was revealed (E1 = 0.03, E2 = 0.03). Yeast is therefore a very interesting model for analysis of the molecular mechanisms of splicing.
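As an illustration of how a consensus set of this kind can be applied, the following minimal Python sketch parses the notation used in the abstract and estimates the two error rates on labeled examples. It does not reproduce the authors' algorithm for deriving the consensus set. The function names (`parse_consensus`, `matches_any`, `error_rates`) are hypothetical, and the sketch assumes that `/` marks the exon/intron junction, `n` stands for any nucleotide, E1 is the fraction of true sites that match no consensus, and E2 is the fraction of non-sites that match at least one; the abstract itself does not spell out these definitions.

```python
# Consensus set for human donor splice sites, as listed in the abstract.
HUMAN_CONSENSUSES = ["AG/GU(A,G)", "G/GUnAG", "/GU(A,G)AG", "/GU(A,G)nGU", "G/GUA"]


def parse_consensus(pattern):
    """Turn e.g. 'AG/GU(A,G)' into (list of allowed-base sets, junction offset).

    Assumed notation: '/' = exon/intron junction, '(A,G)' = alternatives,
    'n' = any nucleotide.
    """
    sets, boundary, i = [], 0, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "/":
            boundary = len(sets)          # number of exon positions before the junction
            i += 1
        elif ch == "(":
            close = pattern.index(")", i)
            sets.append(frozenset(pattern[i + 1:close].split(",")))
            i = close + 1
        elif ch == "n":
            sets.append(frozenset("ACGU"))
            i += 1
        else:
            sets.append(frozenset(ch))
            i += 1
    return sets, boundary


def matches(seq, junction, consensus):
    """True if `seq` fits the consensus with the first intron base at index `junction`."""
    sets, boundary = consensus
    start = junction - boundary
    if start < 0 or start + len(sets) > len(seq):
        return False
    return all(seq[start + k] in allowed for k, allowed in enumerate(sets))


def matches_any(seq, junction, patterns=HUMAN_CONSENSUSES):
    """True if the candidate site fits at least one consensus in the set."""
    return any(matches(seq, junction, parse_consensus(p)) for p in patterns)


def error_rates(true_sites, non_sites, patterns=HUMAN_CONSENSUSES):
    """Return (E1, E2) under the assumed definitions stated above.

    Each item of `true_sites` and `non_sites` is a (sequence, junction) pair.
    """
    missed = sum(not matches_any(s, j, patterns) for s, j in true_sites)
    accepted = sum(matches_any(s, j, patterns) for s, j in non_sites)
    e1 = missed / len(true_sites) if true_sites else 0.0
    e2 = accepted / len(non_sites) if non_sites else 0.0
    return e1, e2
```

For example, `matches_any("CAGGUAAGU", 3)` tests the candidate junction between positions 2 and 3 of an RNA-alphabet sequence. Keeping the consensuses as a list of separate patterns, rather than merging them into one degenerate pattern, preserves the dependence between positions that the abstract emphasizes.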
Rogozin, I.B., Milanesi, L. Analysis of donor splice sites in different eukaryotic organisms. J Mol Evol 45, 50–59 (1997). https://doi.org/10.1007/PL00006200
DOI: https://doi.org/10.1007/PL00006200