Trends in Biochemical Sciences
Forum
Getting the most from PSI–BLAST
Gapped-BLAST and PSI–BLAST
This article concentrates on protein–protein comparison through Gapped-BLAST and PSI–BLAST [1], although other flavours of the algorithm are also available from the NCBI, to which similar messages apply. Before going into detail, it is best to start with a simple description of each program and the associated tools. Despite the similarity in their names and the format of the results they return, Gapped-BLAST and PSI–BLAST should be considered separately by those unfamiliar with the field.
Example of success
With PSI–BLAST, it becomes possible to identify previously ‘difficult’ cases such as exfoliative toxin A from Staphylococcus aureus as a member of the trypsin-like serine proteinase superfamily, even though the sequence identity is only 16%. This protein was, in fact, a target for the 2nd Critical Assessment of Structure Prediction experiment (CASP2), for which proteins likely to have their three-dimensional structures determined by the time of the meeting (held at the end of the experiment) had
The Devil is in the detail
As with many recent sequence comparison methods, BLAST-based programs estimate the statistical significance of each alignment score through an E-value. This can be thought of as the number of false relationships with a score at least as good that one would expect to find by chance in a single search. The limit for safe searching is commonly taken as E = 0.001 (although, at the time of writing, the default threshold for the NCBI BLAST Web server has been loosened, first to 0.002 and most recently to 0.005). So, for a
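The meaning of an E-value threshold is easier to see with a little arithmetic. The sketch below is illustrative only: the function names and the figure of 5000 queries are hypothetical, and it assumes that the E-value statistics are well calibrated and that chance hits follow a Poisson distribution, which is the usual model for BLAST statistics but need not hold for every query.

```python
import math


def expected_false_hits(e_threshold: float, n_searches: int) -> float:
    """Expected number of chance hits at or below the threshold
    across n_searches independent searches (assumes calibrated E-values)."""
    return e_threshold * n_searches


def prob_at_least_one_false_hit(e_value: float) -> float:
    """Probability of seeing one or more chance hits in a single search,
    under the Poisson assumption: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical example: 5000 query sequences, each searched at E = 0.001
n_queries = 5000
threshold = 0.001

total_expected = expected_false_hits(threshold, n_queries)
p_single = prob_at_least_one_false_hit(threshold)

print(f"Expected chance hits over {n_queries} searches: {total_expected:.1f}")
print(f"P(at least one chance hit in one search): {p_single:.6f}")
```

For small E the probability and the E-value are almost equal, which is why a threshold of 0.001 is often read informally as a 0.1% error rate per search.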
What is the real error rate?
During work to annotate a bacterial genome, Huynen et al. [12] used a threshold E-value of 0.001 to assign protein folds to open reading frames, on the assumption that this was a safe threshold. As part of this work, they tested these assumptions using PSI–BLAST, a set of known relationships between proteins of known three-dimensional structure and the NRDB database from the NCBI. They found that the actual false positive rate for these sequences was ∼1.8%. This is effectively 18 times higher
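The size of this discrepancy can be written out as simple arithmetic. This is a minimal sketch, assuming that both the observed figure of about 1.8% and the nominal 0.001 are read as per-query error rates; the function name and the figure of 500 open reading frames are hypothetical.

```python
def excess_error_factor(observed_rate: float, nominal_rate: float) -> float:
    """How many times larger the observed error rate is than the nominal one."""
    if nominal_rate <= 0:
        raise ValueError("nominal_rate must be positive")
    return observed_rate / nominal_rate


observed = 0.018  # ~1.8% false positive rate reported in the benchmark
nominal = 0.001   # rate implied by the E-value threshold

factor = excess_error_factor(observed, nominal)

# Hypothetical annotation project with 500 fold assignments
n_orfs = 500
expected_wrong_nominal = n_orfs * nominal
expected_wrong_observed = n_orfs * observed

print(f"Observed rate is about {factor:.0f} times the nominal rate")
print(f"Expected wrong assignments: {expected_wrong_nominal:.1f} (nominal) "
      f"vs {expected_wrong_observed:.1f} (observed)")
```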
Precalculated data on the Web
When conducting your own searches, successfully navigating the potential pitfalls is down to you. However, many genome resources have recently become available that often provide precalculated data. Well-known academic examples of this include Ensembl and the Golden Path site at UCSC (Table 1), which are revolutionizing the ways that non-specialists gain access to genome data (there are also many other less well publicized sites). With external pressure to keep these sites up-to-date and
Take home messages
Ultimately, the user of these algorithms needs to cast a critical eye over search results and to draw conclusions using their own expertise. Running PSI–BLAST and other computer programs might be gloriously easy, but the results must be interpreted as with any other experiment (albeit, in this case, an in silico experiment). Experimentalists are all too aware of the need to treat experimental results with statistical caution, but are often willing to assume
References (17)
- et al. Iterated profile searches with PSI–BLAST – a tool for discovery in protein databases. Trends Biochem. Sci. (1998)
- et al. Identification of common molecular subsequences. J. Mol. Biol. (1981)
- Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. (1998)
- Benchmarking PSI–BLAST in genome annotation. J. Mol. Biol. (1999)
- Accurate formula for P-values of gapped local sequence and profile alignments. J. Mol. Biol. (2000)
- et al. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. (1993)
- Homology-based fold predictions for Mycoplasma genitalium proteins. J. Mol. Biol. (1998)
- Gapped BLAST and PSI–BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)