Analysis of donor splice sites in different eukaryotic organisms

Journal of Molecular Evolution

Abstract

We present a new algorithm for functional site analysis. It rests on four main assumptions: each variation in nucleotide composition makes a different contribution to the overall binding free energy of the interaction between a functional site and another molecule; nonfunctional site-like regions (pseudosites) are absent or rare in genomes; the sample of sites may contain errors; and nucleotides at different site positions are treated as mutually dependent. The algorithm divides the set of sites into subsets, each described by its own consensus. Donor splice sites of human protein-coding genes were analyzed. Comparison with other methods of donor splice site prediction showed that the consensus sequences AG/GU(A,G), G/GUnAG, /GU(A,G)AG, /GU(A,G)nGU, and G/GUA predict sites more accurately than a weight matrix or the consensus (A,C)AG/GU(A,G)AGU with mismatches. For the resulting consensus set, the probability of the first type of error, E1, was about 0.05, and the probability of the second type of error, E2, was 0.15. The analysis showed that the accuracy of functional site prediction can be improved by taking correlations between site positions into account. The accuracy of prediction with the human consensus sequences was then tested on sequences from other organisms. Some differences in consensus sequences were found for the plant Arabidopsis sp., the invertebrate Caenorhabditis sp., and the fungus Aspergillus sp. For the yeast Saccharomyces sp., only one conserved consensus, /GUA(U,A,C)G(U,A,C), was found (E1 = 0.03, E2 = 0.03). Yeast is a very interesting model for analyzing the molecular mechanisms of splicing.
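
The consensus notation used in the abstract can be made concrete with a short sketch. The Python below is an illustration only, not the authors' classification algorithm: it assumes that "/" marks the exon/intron boundary, that a parenthesised list such as (A,G) gives the bases allowed at one position, and that "n" stands for any base. The function names are hypothetical. It simply reports the positions in a sequence where at least one consensus of the human set matches exactly.

```python
# Consensus set for human donor splice sites as listed in the abstract.
# Assumed notation: "/" = exon/intron boundary, "(A,G)" = alternatives, "n" = any base.
HUMAN_DONOR_CONSENSUSES = [
    "AG/GU(A,G)",
    "G/GUnAG",
    "/GU(A,G)AG",
    "/GU(A,G)nGU",
    "G/GUA",
]


def parse_consensus(consensus):
    """Return (list of allowed-base sets, number of exon positions before '/')."""
    positions = []
    exon_len = 0
    i = 0
    while i < len(consensus):
        ch = consensus[i]
        if ch == "/":
            # Everything parsed so far lies on the exon side of the boundary.
            exon_len = len(positions)
            i += 1
        elif ch == "(":
            close = consensus.index(")", i)
            positions.append(set(consensus[i + 1:close].split(",")))
            i = close + 1
        elif ch == "n":
            positions.append(set("ACGU"))
            i += 1
        else:
            positions.append({ch})
            i += 1
    return positions, exon_len


def matches_consensus(rna, boundary, consensus):
    """True if `rna` matches `consensus` with '/' placed just before index `boundary`."""
    positions, exon_len = parse_consensus(consensus)
    start = boundary - exon_len
    if start < 0 or start + len(positions) > len(rna):
        return False
    return all(rna[start + k] in allowed for k, allowed in enumerate(positions))


def predict_donor_sites(seq, consensuses=HUMAN_DONOR_CONSENSUSES):
    """List boundary indices where any consensus matches (DNA input is converted to RNA)."""
    rna = seq.upper().replace("T", "U")
    return [
        b for b in range(len(rna) + 1)
        if any(matches_consensus(rna, b, c) for c in consensuses)
    ]
```

For example, `predict_donor_sites("CCAGGTAAGTCC")` returns `[4]`, the boundary between `CCAG` and `GTAAGT`, because the window `AGGUA` matches AG/GU(A,G). Exact matching against a set of short consensuses is the simplest way to apply such a set; it does not reproduce how the subsets and their consensuses were derived.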

Cite this article

Rogozin, I.B., Milanesi, L. Analysis of donor splice sites in different eukaryotic organisms. J Mol Evol 45, 50–59 (1997). https://doi.org/10.1007/PL00006200

  • DOI: https://doi.org/10.1007/PL00006200
