Multiple sequence alignment
Introduction
A multiple sequence alignment (MSA) arranges protein sequences into a rectangular array with the goal that residues in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role. Although these three criteria are essentially equivalent for closely related proteins, sequence, structure and function diverge over evolutionary time, and different criteria may result in different alignments. Manually refined alignments continue to be superior to those produced by purely automated methods; there is therefore a continuing effort to improve the biological accuracy of MSA tools. Additionally, the high computational cost of most naive algorithms motivates improvements in speed and memory usage to accommodate the rapid increase in available sequence data. In this review, we describe the state of the art in MSA software and benchmarking, and offer our recommended procedures for creating multiple alignments from typical types of input data.
Computational approaches to multiple sequence alignment
MSA algorithm development is an active area of research two decades after the first programs were written. The standard computational formulation of the pairwise problem is to identify the alignment that maximizes protein sequence similarity, which is typically defined as the sum of substitution matrix scores for each aligned pair of residues, minus some penalties for gaps. The mathematically — though not necessarily biologically — exact solution can be found in a fraction of a second for a
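The pairwise formulation described above (substitution scores summed over aligned residue pairs, minus gap penalties) can be illustrated with a minimal dynamic-programming sketch. The scoring values here are assumptions chosen for readability: a flat match/mismatch score stands in for a real substitution matrix, and the gap penalty is linear rather than affine. The function name `pairwise_score` is hypothetical and is not taken from any of the programs discussed.

```python
def pairwise_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = 2) -> int:
    """Optimal global alignment score of a and b (Needleman-Wunsch recurrence).

    Assumed scoring: +match for identical residues, +mismatch otherwise,
    and -gap for every gap position (linear gap penalty).
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    # Initially the prefix of a is empty, so b[:j] is aligned entirely to gaps.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] with an empty prefix of b costs i gap positions.
        curr = [-gap * i]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] - gap,       # align a[i-1] with a gap
                curr[j - 1] - gap,   # align b[j-1] with a gap
            ))
        prev = curr
    return prev[-1]
```

The table has (len(a)+1) by (len(b)+1) cells, so time grows with the product of the two sequence lengths; only two rows are kept in memory because the sketch returns the score and not the alignment itself.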
Benchmarks
Validation of an MSA program typically uses a benchmark data set of reference alignments. An alignment produced by the program is compared with the corresponding reference alignment, giving an accuracy score. Alignments of protein structures can be generated without considering sequence and can therefore be used as independent references for sequence-based methods. Unfortunately, multiple structure alignment is also a hard problem, so, in practice, pairwise structure alignments are often used.
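As an illustration of how a test alignment might be compared with a reference, the sketch below computes the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. This pair-based measure is one common style of accuracy score and is assumed here for illustration; the text above does not specify which score a given benchmark uses. The function names, the `-` gap symbol and the list-of-gapped-strings representation are likewise assumptions.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings, one per sequence.
    Each residue is identified by (row index, ungapped position in that row).
    """
    counters = [0] * len(alignment)  # residues consumed so far in each row
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        # Every pair of non-gap residues in this column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs

def pair_accuracy(test, reference):
    """Fraction of reference-aligned residue pairs reproduced by `test`."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)
```

Both alignments must contain the same sequences in the same row order; for example, with reference `["AC-D", "ACED"]` and test `["ACD-", "ACED"]`, two of the three reference pairs are recovered.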
Methods
CLUSTALW [16] was introduced in 1994 and quickly became the method of choice for biologists, as it offered a dramatic gain in alignment sensitivity, combined with speed, over the other tools then available. CLUSTALW is still the most widely used MSA program. However, to the best of our knowledge, no significant improvements have been made to the algorithm since 1994, and several modern methods achieve better performance in accuracy, speed or both.
In the category of global alignment tools that
Choosing a program
There are three main considerations in choosing a program: biological accuracy, execution time and memory usage (Table 1, Table 2). Biological accuracy is generally the most important concern. The most accurate programs according to benchmark tests are MAFFT, MUSCLE, PROBCONS and T-COFFEE. On most benchmarks, PROBCONS achieves the best performance; recent versions of the MAFFT tool achieve comparable results by incorporating consistency-based scoring.
In practice, accuracy claims can be
Future directions
Multiple alignment of protein sequences will remain an important application in the foreseeable future. The number of newly available protein sequences still far outpaces the number of determined protein three-dimensional structures, and therefore sequence homology remains the main method by which to infer protein structure, function, active sites and evolutionary history.
In recent years, protein MSA tools have improved rapidly in both scalability and accuracy. Future improvements are likely to
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
References (39)
Consistency of optimal sequence alignments. Bull Math Biol (1990)
T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol (2000)
Biological Sequence Analysis (1998)
AltAVisT: comparing alternative multiple sequence alignments. Bioinformatics (2003)
Environmental genome shotgun sequencing of the Sargasso Sea. Science (2004)
On the complexity of multiple sequence alignment. J Comput Biol (1994)
Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol (1987)
The Pfam protein families database. Nucleic Acids Res (2004)
COFFEE: an objective function for multiple sequence alignments. Bioinformatics (1998)
Towards integration of multiple alignment and phylogenetic tree construction. J Comput Biol (1997)
DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics
ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res
BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics
OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res
SABmark - a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics
DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics
Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins
CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res