Journal of Molecular Biology
Regular article
Empirical statistical estimates for sequence similarity searches¹
Introduction
Sequence similarity searches today are the most effective method for exploiting the information in the rapidly growing DNA and protein sequence databases. One of the most dramatic improvements in similarity searching was the introduction, in the BLAST sequence comparison package, of accurate statistical estimates for alignments without gaps (Altschul et al., 1990). Accurate statistical estimates make it possible to automatically identify sequences that are likely to be homologous (i.e. that share statistically significant similarity because of descent from a common ancestor). In general, if statistically significant similarity is found between two sequences and the similarity does not simply reflect a region with unusual amino acid composition, the sequences are likely to be homologous.
The BLAST package of sequence comparison programs (Altschul et al., 1990, 1994) provides the most widely used similarity searching programs, in part because of its accurate statistical estimates. BLAST uses two parameters, K and λ, to estimate the statistical significance of a high-scoring alignment using the formula (Karlin & Altschul, 1990; Altschul et al., 1994; Altschul & Gish, 1996):

$$P(S \ge x) = 1 - \exp\left(-Kmn\,e^{-\lambda x}\right)$$

where x is the similarity score, and m and n are the lengths of the two sequences being compared. Unfortunately, the underlying statistical model used by BLAST for high-scoring segment pairs is limited to alignments without gaps (Karlin & Altschul, 1990), although scores from several ungapped alignments can be evaluated as well (Karlin & Altschul, 1993). Sequence alignments between distantly related proteins typically require gaps, and similarity searching with the Smith-Waterman algorithm and the FASTA program (with gaps) can perform better than BLAST on divergent protein families (Pearson, 1995). We therefore sought a general strategy that would provide accurate statistical estimates for alignments with gaps, and that would work not only for Smith-Waterman scores but also for FASTA protein-protein and DNA-DNA comparisons and for comparisons between protein sequences and translated DNA (FASTX, TFASTX, TFASTA).
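The extreme-value significance estimate described above can be illustrated with a minimal Python sketch. It assumes the usual Karlin-Altschul form, P(S ≥ x) = 1 − exp(−Kmn·e^(−λx)). The function names, the K and λ values, and the convention of multiplying the probability by the number of library sequences to get an expectation are illustrative assumptions, not values or code taken from BLAST or FASTA.

```python
import math

def karlin_altschul_pvalue(score, m, n, K, lam):
    """P(S >= score) = 1 - exp(-K*m*n*exp(-lam*score))."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-expected_hits)

def database_evalue(pvalue, num_library_sequences):
    """Expected number of library sequences reaching this score by chance
    (assumed convention: pairwise probability times library size)."""
    return pvalue * num_library_sequences

# Hypothetical parameter values, chosen only for illustration.
K_ASSUMED = 0.13
LAMBDA_ASSUMED = 0.32

p = karlin_altschul_pvalue(score=80, m=250, n=300, K=K_ASSUMED, lam=LAMBDA_ASSUMED)
e = database_evalue(p, num_library_sequences=11982)
print(f"P = {p:.3g}, E = {e:.3g}")
```

Because the probability falls off exponentially with the score, small changes in λ shift the estimate by orders of magnitude, which is one reason accurate parameter estimates matter.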
Here, we evaluate several approaches for calculating the “location” (K) and “scale” (λ) parameters from the distribution of similarity scores from unrelated sequences that are calculated during a sequence database search. We show that statistical estimates for similarity scores that have been scaled to correct for the length-dependence of local similarity scores are very accurate, and that the empirical approach described here provides an internal calibration of the accuracy of the estimates. In addition, we show that length-corrected similarity scores are more effective than raw scores at identifying distantly related members of protein families. These estimation methods have been incorporated into versions 2.0 and 3.0 of the FASTA package of sequence comparison programs.
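As a rough illustration of length correction, the sketch below regresses similarity scores on the logarithm of library sequence length, converts each score to a z-score using the residual standard deviation, and maps the z-score to an extreme-value probability. This is a simplified stand-in under stated assumptions, not the estimation code in the FASTA package: the function names, the simulated scores, and the choice of an ordinary least-squares fit are all illustrative.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329
Z_TO_GUMBEL = math.pi / math.sqrt(6.0)  # about 1.2825

def fit_length_regression(scores, lengths):
    """Least-squares fit of score = a + b*ln(length).
    Returns (a, b, residual_sd)."""
    xs = [math.log(n) for n in lengths]
    n_obs = len(scores)
    mean_x = sum(xs) / n_obs
    mean_y = sum(scores) / n_obs
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, scores))
    return a, b, math.sqrt(rss / (n_obs - 2))

def z_score(score, length, a, b, residual_sd):
    """Score expressed in residual standard deviations above the
    value predicted for a library sequence of this length."""
    return (score - (a + b * math.log(length))) / residual_sd

def z_to_pvalue(z):
    """P(Z >= z) for an extreme-value variable with mean 0, variance 1."""
    return -math.expm1(-math.exp(-Z_TO_GUMBEL * z - EULER_GAMMA))

# Simulated "unrelated sequence" scores (hypothetical data, not from the paper).
rng = random.Random(1)

def gumbel_noise(scale):
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -scale * math.log(-math.log(u))

lengths = [rng.randint(50, 2000) for _ in range(5000)]
scores = [10.0 + 5.0 * math.log(n) + gumbel_noise(4.0) for n in lengths]

a, b, sd = fit_length_regression(scores, lengths)
z = z_score(120.0, 300, a, b, sd)
p = z_to_pvalue(z)
print(f"slope = {b:.2f}, z = {z:.1f}, P = {p:.3g}")
```

In a real search the fit would normally exclude the highest-scoring (likely related) sequences, since they would otherwise inflate the estimated variance; that step is omitted here for brevity.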
Accurate statistical estimates
This paper describes a general method for determining the statistical significance of a local similarity score, based on the distribution of similarity scores obtained from a sequence database search. Current protein and DNA sequence databases contain many tens of thousands of sequences, almost all of which are unrelated to an individual query sequence (even the largest protein families comprise less than 5% of a comprehensive protein database like SwissProt or PIR). Thus, every database search
Discussion
We have examined several strategies for correcting the length-dependence of local protein sequence similarity scores. The default method used by programs in versions 2.0 and 3.0 of the FASTA package (Pearson, 1996), regress1, produces accurate statistical estimates and significantly improves search performance over unscaled similarity scores. In addition, Altschul-Gish scaling and regress2 and regress3 scaling are available in current versions of the programs; however, Altschul-Gish scaling is
Sequence libraries and similarity searching
Searches were performed on the annotated portion (PIR1) of the National Biomedical Research Foundation protein sequence database (Barker et al., 1990; release 39, 31 December 1993; 4,306,189 amino acid residues in 11,982 sequences), augmented as described by Pearson (1995). This older library and the same set of query sequences were used to provide consistency with the earlier work. The library has been annotated so that every sequence in the database has been assigned to a protein
Acknowledgements
The author thanks Phil Green for very helpful discussions and code to calculate the regress3 regression-scaled scores. This work was supported by a grant from the National Library of Medicine (LM04969) with additional support from the Digital Equipment Corporation.
References (24)
- et al. Local alignment statistics. Methods Enzymol. (1996)
- et al. A basic local alignment search tool. J. Mol. Biol. (1990)
- et al. Protein sequence database
- et al. FASTA-SWAP and FASTA-PAT: pattern database searches using combinations of aligned amino acids, and a novel scoring theory. J. Mol. Biol. (1996)
- Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol. (1992)
- Rapid and sensitive sequence comparison with FASTP and FASTA
- Effective protein sequence comparison. Methods Enzymol. (1996)
- et al. Dynamic programming algorithms for biological sequence comparison
- et al. Identification of common molecular subsequences. J. Mol. Biol. (1981)
- et al. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. (1993)
- Issues in searching molecular sequence databases. Nature Genet.
- PROSITE: a dictionary of sites and patterns in proteins. Nucl. Acids Res.
1 Edited by F. E. Cohen