Multiple sequence alignment

https://doi.org/10.1016/j.sbi.2006.04.004

Multiple sequence alignments are an essential tool for protein structure and function prediction, phylogeny inference and other common tasks in sequence analysis. Recently developed systems have advanced the state of the art with respect to accuracy, ability to scale to thousands of proteins and flexibility in comparing proteins that do not share the same domain architecture. New multiple alignment benchmark databases include PREFAB, SABMARK, OXBENCH and IRMBASE. Although CLUSTALW is still the most popular alignment tool to date, recent methods offer significantly better alignment quality and, in some cases, reduced computational cost.

Introduction

A multiple sequence alignment (MSA) arranges protein sequences into a rectangular array with the goal that residues in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role. Although these three criteria are essentially equivalent for closely related proteins, sequence, structure and function diverge over evolutionary time and different criteria may result in different alignments. Manually refined alignments continue to be superior to purely automated methods; there is therefore a continuous effort to improve the biological accuracy of MSA tools. Additionally, the high computational cost of most naive algorithms motivates improvements in speed and memory usage to accommodate the rapid increase in available sequence data. In this review, we describe the state of the art in MSA software and benchmarking, and offer our recommended procedures for creating multiple alignments from typical types of input data.

Section snippets

Computational approaches to multiple sequence alignment

MSA algorithm development is an active area of research two decades after the first programs were written. The standard computational formulation of the pairwise problem is to identify the alignment that maximizes protein sequence similarity, which is typically defined as the sum of substitution matrix scores for each aligned pair of residues, minus some penalties for gaps. The mathematically — though not necessarily biologically — exact solution can be found in a fraction of a second for a
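
The pairwise formulation described above (summed substitution scores minus gap penalties) can be sketched with the classic dynamic programming recurrence. This is a minimal illustration under stated assumptions, not the method of any particular program: the function name `global_alignment_score` is hypothetical, a simple match/mismatch score stands in for a real substitution matrix, and the gap penalty is linear rather than affine.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Assumptions (not from the source): match/mismatch values replace a
    substitution matrix, and each gap position costs a fixed penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACD", "AD"))
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths. Applying the same idea directly to many sequences at once multiplies the table dimensions accordingly, which is one way to see why practical multiple alignment tools rely on heuristics.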

Benchmarks

Validation of an MSA program typically uses a benchmark data set of reference alignments. An alignment produced by the program is compared with the corresponding reference alignment, giving an accuracy score. Alignments of protein structures can be generated without considering sequence and can therefore be used as independent references for sequence-based methods. Unfortunately, multiple structure alignment is also a hard problem, so, in practice, pairwise structure alignments are often used.
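
The comparison step can be illustrated with a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the program's output. This is a sketch of one commonly used measure, assumed here for illustration; the text above does not specify which score a given benchmark uses, and the function names are hypothetical. Both alignments are assumed to contain the same sequences in the same order, with `-` marking gaps.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index within
    the ungapped sequence), so the result does not depend on column numbers.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_index, char in enumerate(column):
            if char != gap:
                residues.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs recovered by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


if __name__ == "__main__":
    reference = ["ACD", "A-D"]
    test = ["ACD", "AD-"]
    print(sum_of_pairs_score(test, reference))
```

When the reference consists of pairwise structure alignments, as the paragraph above notes is common in practice, the same function can be applied to each pair of sequences separately and the results averaged.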

Methods

CLUSTALW [16] was introduced in 1994 and quickly became the method of choice for biologists, because it offered a dramatic improvement in both alignment sensitivity and speed over the tools then available. CLUSTALW is still the most widely used MSA program. However, to the best of our knowledge, no significant improvements have been made to the algorithm since 1994, and several modern methods achieve better accuracy, better speed or both.

In the category of global alignment tools that

Choosing a program

There are three main considerations in choosing a program: biological accuracy, execution time and memory usage (Table 1, Table 2). Biological accuracy is generally the most important concern. The most accurate programs according to benchmark tests are MAFFT, MUSCLE, PROBCONS and T-COFFEE. On most benchmarks, PROBCONS achieves the best performance; recent versions of the MAFFT tool achieve comparable results by incorporating consistency-based scoring.

In practice, accuracy claims can be

Future directions

Multiple alignment of protein sequences will remain an important application in the foreseeable future. The number of newly available protein sequences still far outpaces the number of determined protein three-dimensional structures, and therefore sequence homology remains the main method by which to infer protein structure, function, active sites and evolutionary history.

In recent years, protein MSA tools have improved rapidly in both scalability and accuracy. Future improvements are likely to

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

References (39)

  • B. Morgenstern et al. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics (1998)
  • C.B. Do et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res (2005)
  • J.D. Thompson et al. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics (1999)
  • G.P. Raghava et al. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics (2003)
  • R.C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res (2004)
  • I. Van Walle et al. SABmark - a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics (2005)
  • A.R. Subramanian et al. DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics (2005)
  • J.M. Sauder et al. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins (2000)
  • J.D. Thompson et al. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994)