Regular article
Empirical statistical estimates for sequence similarity searches¹

https://doi.org/10.1006/jmbi.1997.1525

Abstract

The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity scores for protein/protein, DNA/DNA, and protein/translated-DNA comparisons. The accuracy of the statistical estimates is summarized for 54 protein families using FASTA and Smith-Waterman scores. Probability estimates calculated from the distribution of similarity scores are generally conservative, as are probabilities calculated using the Altschul-Gish λ, K, and H parameters. The performance of several alternative methods for correcting similarity scores for library-sequence length was evaluated using 54 protein superfamilies from the PIR39 database and 110 protein families from the Prosite/SwissProt rel. 34 database. Both regression-scaled and Altschul-Gish scaled scores perform significantly better than unscaled Smith-Waterman or FASTA similarity scores. When the Prosite/SwissProt test set is used, regression-scaled scores perform slightly better; when the PIR database is used, Altschul-Gish scaled scores perform best. Thus, length-corrected similarity scores improve the sensitivity of database searches. Statistical parameters that are derived from the distribution of similarity scores from the thousands of unrelated sequences typically encountered in a database search provide accurate estimates of statistical significance that can be used to infer sequence homology.

Introduction

Sequence similarity searches today are the most effective method for exploiting the information in the rapidly growing DNA and protein sequence databases. One of the most dramatic improvements in similarity searching was the introduction of accurate statistical estimates for similarity searches for alignments without gaps in the BLAST sequence comparison package (Altschul et al., 1990). Accurate statistical estimates make it possible to identify automatically sequences that are likely to be homologous (i.e. that share statistically significant similarity because of descent from a common ancestor). In general, if statistically significant similarity is found between two sequences and the similarity does not simply reflect a region with unusual amino acid composition, the sequences are likely to be homologous.

The BLAST package of sequence comparison programs (Altschul et al., 1990, 1994) provides the most widely used similarity searching programs, in part because of its accurate statistical estimates. BLAST uses two parameters, K and λ, to estimate the statistical significance of a high scoring alignment using the formula (Karlin & Altschul, 1990; Altschul et al., 1994; Altschul & Gish, 1996):

$$P(S > x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right)$$

where x is the similarity score, and m and n are the lengths of the two sequences being compared. Unfortunately, the underlying statistical model used by BLAST for high scoring segment pairs is limited to alignment without gaps (Karlin & Altschul, 1990), although scores from several ungapped alignments can be evaluated as well (Karlin & Altschul, 1993). Because sequence alignments between distantly related proteins typically require gaps, and similarity searching with the Smith-Waterman algorithm and the FASTA program (with gaps) can perform better than BLAST on divergent protein families (Pearson, 1995), we sought a general strategy that would provide accurate statistical estimates for alignments with gaps that would work not only for Smith-Waterman scores but also for FASTA protein-protein and DNA-DNA comparisons, and for comparisons between protein sequences and translated DNA (FASTX, TFASTX, TFASTA).
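As a minimal sketch of the formula above (not code taken from the BLAST or FASTA packages), the probability can be evaluated directly from K, λ, and the two sequence lengths. The function name `karlin_altschul_pvalue` and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
import math

def karlin_altschul_pvalue(score, m, n, K, lam):
    """Return P(S > score) = 1 - exp(-K*m*n*exp(-lam*score)).

    score : similarity score x
    m, n  : lengths of the two sequences being compared
    K, lam: the two statistical parameters (illustrative inputs, not fitted here)
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-y) is 1 - exp(-y), but it stays accurate when y is very small
    return -math.expm1(-expected_hits)
```

For instance, `karlin_altschul_pvalue(60, 250, 300, K=0.1, lam=0.3)` evaluates the formula for a score of 60 between sequences of length 250 and 300; these numbers are placeholders, not parameters reported in the paper. When the exponent term is small, the probability is close to Kmn·exp(−λx) itself.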

Here, we evaluate several approaches for calculating the “location” (K) and “scale” (λ) parameters from the distribution of similarity scores from unrelated sequences that are calculated during a sequence database search. We show that statistical estimates for similarity scores that have been scaled to correct for the length-dependence of local similarity scores are very accurate, and that the empirical approach described here provides an internal calibration of the accuracy of the estimates. In addition, we show that length-corrected similarity scores are more effective than raw scores at identifying distantly related members of protein families. These estimation methods have been incorporated into versions 2.0 and 3.0 of the FASTA package of sequence comparison programs.
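The general idea of length correction followed by an extreme value estimate can be sketched as follows. This is an illustrative outline under stated assumptions, not the FASTA implementation: it assumes the expected score of unrelated sequences is linear in the logarithm of library sequence length, fits that line by ordinary least squares, converts a score to a z-score, and evaluates a standardized extreme value (Gumbel) distribution with mean 0 and variance 1. All function names are hypothetical. A real search would also need to keep scores from related sequences out of the fit; that step is omitted here.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_length_regression(lengths, scores):
    """Least-squares fit of score = a + b*ln(length); returns (a, b, residual_sd)."""
    xs = [math.log(n) for n in lengths]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(scores) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / (k - 2))
    return a, b, sd

def evd_pvalue(z):
    """P(Z > z) for an extreme value variate scaled to mean 0 and variance 1."""
    exponent = -math.pi / math.sqrt(6.0) * z - EULER_GAMMA
    return -math.expm1(-math.exp(exponent))

def expectation(score, length, a, b, sd, library_size):
    """Expected number of unrelated library sequences scoring at least this well."""
    z = (score - (a + b * math.log(length))) / sd
    return library_size * evd_pvalue(z)
```

In this sketch the location and scale of the distribution are absorbed into the z-score, so the same `evd_pvalue` function applies to any scoring scheme once the regression has been fitted to that search's own scores.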

Section snippets

Accurate statistical estimates

This paper describes a general method for determining the statistical significance of a local similarity score, based on the distribution of similarity scores obtained from a sequence database search. Current protein and DNA sequence databases contain many tens of thousands of sequences, almost all of which are unrelated to an individual query sequence (even the largest protein families comprise less than 5% of a comprehensive protein database like SwissProt or PIR). Thus, every database search

Discussion

We have examined several strategies for correcting the length-dependence of local protein sequence similarity scores. The default method used by programs in versions 2.0 and 3.0 of the FASTA package (Pearson, 1996), regress1, produces accurate statistical estimates and significantly improves search performance over unscaled similarity scores. In addition, Altschul-Gish scaling and regress2 and regress3 scaling are available in current versions of the programs; however, Altschul-Gish scaling is

Sequence libraries and similarity searching

Searches were performed on the annotated portion (PIR1) of the National Biomedical Research Foundation protein sequence database (Barker et al. (1990), release 39, 31 December 1993, 4,306,189 amino acid residues in 11,982 sequences), augmented as described by Pearson (1995). This older library, and the same set of query sequences, was used to provide consistency with the earlier work. The library has been annotated so that every sequence in the database has been assigned to a protein

Acknowledgements

The author thanks Phil Green for very helpful discussions and code to calculate the regress3 regression-scaled scores. This work was supported by a grant from the National Library of Medicine (LM04969) with additional support from the Digital Equipment Corporation.

References (24)

  • S.F. Altschul et al. Issues in searching molecular sequence databases. Nature Genet. (1994)
  • A. Bairoch. PROSITE: a dictionary of sites and patterns in proteins. Nucl. Acids Res. (1991)
¹ Edited by F. E. Cohen
