Forum
Getting the most from PSI–BLAST

https://doi.org/10.1016/S0968-0004(01)02039-4

Abstract

Most biologists now conduct sequence searches as a matter of course. But how do we know that a relationship predicted by a homology search is a true, rather than false, hit with the same score? Many biologists design their own experiments with exquisite care yet still assume that results from programs with more than 20 adjustable parameters are 100% reliable. This article explains some of the key steps in getting the most from PSI–BLAST, one of the most popular and powerful homology search programs currently available.

Section snippets

Gapped-BLAST and PSI–BLAST

This article concentrates on protein–protein comparison through Gapped-BLAST and PSI–BLAST [1], although other flavours of the algorithm are also available from the NCBI, to which similar messages apply. Before going into detail, it is best to start with a simple description of each program and the associated tools. Despite the similarity in their names and the format of the results they return, Gapped-BLAST and PSI–BLAST should be considered separately by those unfamiliar with the field.

Example of success

With PSI–BLAST, it becomes possible to identify previously ‘difficult’ cases, such as exfoliative toxin A from Staphylococcus aureus, as members of the trypsin-like serine proteinase superfamily, even though the sequence identity is only 16%. This protein was, in fact, a target for the 2nd Critical Assessment of Structure Prediction experiment (CASP2), for which proteins likely to have their three-dimensional structures determined by the time of the meeting (held at the end of the experiment) had

The Devil is in the detail

As with many recent sequence comparison methods, BLAST-based programs estimate the statistical significance of each alignment score through an E-value. This can be thought of as the number of false relationships with a score at least as good that one would expect to find purely by chance in a search. The limit for safe searching is commonly taken as E = 0.001 (although, at the time of writing, the current default threshold for the NCBI BLAST Web server has been loosened, first to 0.002 and most recently to 0.005). So, for a
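To make the E-value thresholds above concrete, the following minimal Python sketch converts an E-value into two related quantities. It assumes the standard Poisson model for chance hits, under which the probability of at least one chance hit is 1 − exp(−E). The function names are illustrative and are not part of any BLAST tool.

```python
import math


def expected_false_hits(e_value: float, n_searches: int = 1) -> float:
    """Expected number of chance hits at or above the score, summed over
    n_searches independent searches (assumption: each search has the same E)."""
    return e_value * n_searches


def prob_at_least_one_false_hit(e_value: float) -> float:
    """Probability of at least one chance hit, assuming a Poisson model:
    P = 1 - exp(-E). expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # The three thresholds mentioned in the text.
    for e in (0.001, 0.002, 0.005):
        p = prob_at_least_one_false_hit(e)
        n = expected_false_hits(e, 1000)
        print(f"E = {e}: P(>=1 chance hit) = {p:.5f}; "
              f"expected chance hits over 1000 searches = {n:.1f}")
```

For small E the probability and the E-value are almost identical, which is why a threshold of 0.001 is often read as roughly one chance hit per thousand searches.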

What is the real error rate?

During work to annotate a bacterial genome, Huynen et al. [12] used a threshold E-value of 0.001 to assign protein folds to open reading frames, on the assumption that this was a safe threshold. As part of this work, they tested these assumptions using PSI–BLAST, a set of known relationships between proteins of known three-dimensional structure and the NRDB database from the NCBI. They found that the actual false positive rate for these sequences was ∼1.8%. This is effectively 18 times higher
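The arithmetic behind the "18 times" figure can be sketched as below. This assumes the comparison is between the observed false positive rate (about 1.8%) and the nominal rate of about 0.1% implied by E = 0.001. That reading is an assumption, because the snippet is truncated. The count of 1000 assignments is purely hypothetical and chosen for illustration.

```python
def error_inflation(observed_rate: float, nominal_rate: float) -> float:
    """How many times higher the observed false positive rate is than
    the nominal rate implied by the E-value threshold."""
    return observed_rate / nominal_rate


def expected_false_assignments(rate: float, n_assignments: int) -> float:
    """Expected number of wrong assignments among n_assignments at a given rate."""
    return rate * n_assignments


if __name__ == "__main__":
    observed = 0.018   # ~1.8%, as reported in the text
    nominal = 0.001    # assumption: rate implied by E = 0.001
    n = 1000           # hypothetical number of fold assignments

    print(f"inflation factor: {error_inflation(observed, nominal):.1f}")
    print(f"expected false assignments at nominal rate: "
          f"{expected_false_assignments(nominal, n):.1f}")
    print(f"expected false assignments at observed rate: "
          f"{expected_false_assignments(observed, n):.1f}")
```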

Precalculated data on the Web

When conducting your own searches, successfully navigating the potential pitfalls is down to you. However, many genome resources have recently become available that often provide precalculated data. Well-known academic examples of this include Ensembl and the Golden Path site at UCSC (Table 1), which are revolutionizing the ways that non-specialists gain access to genome data (there are also many other less well publicized sites). With external pressure to keep these sites up-to-date and

Take home messages

Ultimately, the user of these algorithms needs to cast a critical eye over search results and draw conclusions using their own expertise. Running PSI–BLAST and other computer programs might be gloriously easy, but the results must be interpreted as with any other experiment (albeit, in this case, an in silico experiment). Experimentalists are all too aware of the need to treat experimental results with statistical caution, but are often willing to assume

Cited by (127)

  • Discovery of novel class of histone deacetylase inhibitors as potential anticancer agents

    2021, Bioorganic and Medicinal Chemistry
    Citation excerpt:

    The amino acid sequences of the histone deacetylase domain were retrieved from the UniProt database (HDAC5: Q9UQL6|684-1028 and HDAC9: Q9UKV0|631-978, www.uniprot.org). Blast homology search engine [28] was used to search for sequence homologs. The crystal structure of the catalytic domain of HDAC4 in complex with a hydroxamic acid inhibitor (PDB: 2VQM [29]) was used as the template for both HDAC5 and HDAC9.
