  • Review Article
Issues in searching molecular sequence databases

Abstract

Sequence similarity search programs are versatile tools for the molecular biologist, frequently able to identify possible DNA coding regions and to provide clues to gene and protein structure and function. While much attention has been paid to the precise algorithms these programs employ and to their relative speeds, a constellation of associated issues is equally important for realizing the full potential of these methods. Here, we consider a number of these issues, including the choice of scoring systems, the statistical significance of alignments, the masking of uninformative or potentially confounding sequence regions, the nature and extent of sequence redundancy in the databases, and network access to similarity search services.
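
As a rough illustration of what "statistical significance of alignments" can mean in practice, the sketch below evaluates the standard Karlin-Altschul estimate for ungapped local alignments, E = K·m·n·exp(−λS), and the corresponding probability of seeing at least one chance hit. The function names are hypothetical, and the values of `lam` and `k` are placeholders chosen for illustration; in real use they depend on the scoring system and residue frequencies and are not taken from this abstract.

```python
import math


def expected_chance_hits(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`.

    Uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S),
    where m and n are the query and database lengths in residues.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder parameters (assumptions, for illustration only).
    e = expected_chance_hits(score=80, query_len=300, db_len=10_000_000,
                             lam=0.3, k=0.1)
    print(f"E = {e:.3g}, P = {p_value(e):.3g}")
```

Because E grows linearly with database size, the same raw score becomes less significant as the database grows, which is one reason redundancy in the databases matters for interpreting search results.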

About this article

Cite this article

Altschul, S., Boguski, M., Gish, W. et al. Issues in searching molecular sequence databases. Nat Genet 6, 119–129 (1994). https://doi.org/10.1038/ng0294-119

  • DOI: https://doi.org/10.1038/ng0294-119
