MUSCLE: a multiple sequence alignment method with reduced time and space complexity
Tóm tắt
In a previous paper, we introduced MUSCLE, a new program for creating multiple alignments of protein sequences, giving a brief summary of the algorithm and showing MUSCLE to achieve the highest scores reported to date on four alignment accuracy benchmarks. Here we present a more complete discussion of the algorithm, describing several previously unpublished techniques that improve biological accuracy and / or computational complexity. We introduce a new option, MUSCLE-fast, designed for high-throughput applications. We also describe a new protocol for evaluating objective functions that align two profiles. We compare the speed and accuracy of MUSCLE with CLUSTALW, Progressive POA and the MAFFT script FFTNS1, the fastest previously published program known to the author. Accuracy is measured using four benchmarks: BAliBASE, PREFAB, SABmark and SMART. We test three variants that offer highest accuracy (MUSCLE with default settings), highest speed (MUSCLE-fast), and a carefully chosen compromise between the two (MUSCLE-prog). We find MUSCLE-fast to be the fastest algorithm on all test sets, achieving average alignment accuracy similar to CLUSTALW in times that are typically two to three orders of magnitude less. MUSCLE-fast is able to align 1,000 sequences of average length 282 in 21 seconds on a current desktop computer. MUSCLE offers a range of options that provide improved speed and / or alignment accuracy compared with currently available programs. MUSCLE is freely available at
http://www.drive5.com/muscle
.
Tài liệu tham khảo
Notredame C: Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 2002, 3(1):131–144.
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792–1797. 10.1093/nar/gkh340
Wang L, Jiang T: On the complexity of multiple sequence alignment. J Comput Biol 1994, 1(4):337–348.
Waterman MS, Smith TF, Beyer WA: Some biological sequence metrics. Adv in Math 1976, 20: 367–387.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208–214.
Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 1984, 20(2):175–186.
Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25(4):351–360.
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680.
Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302(1):205–217. 10.1006/jmbi.2000.4042
Bahr A, Thompson JD, Thierry JC, Poch O: BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res 2001, 29(1):323–326. 10.1093/nar/29.1.323
Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15(1):87–88. 10.1093/bioinformatics/15.1.87
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M: Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 1995, 11(1):13–18.
Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 1996, 264(4):823–838. 10.1006/jmbi.1996.0679
Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059–3066. 10.1093/nar/gkf436
Van Walle I, Lasters I, Wyns L: Align-m–a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 2004.
Schultz J, Copley RR, Doerks T, Ponting CP, Bork P: SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 2000, 28(1):231–234. 10.1093/nar/28.1.231
Ponting CP, Schultz J, Milpetz F, Bork P: SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res 1999, 27(1):229–232. 10.1093/nar/27.1.229
Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 1998, 95(11):5857–5864. 10.1073/pnas.95.11.5857
Vinga S, Almeida J: Alignment-free sequence comparison-a review. Bioinformatics 2003, 19(4):513–523. 10.1093/bioinformatics/btg005
Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins in Atlas of protein sequence and structure,. (Edited by: Dayhoff MO, Ech RV). Maryland: National Biomedical Research Foundation 1978.
Edgar RC: Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res 2004, 32(1):380–385. 10.1093/nar/gkh180
Kimura M: The neutral theory of molecular evolution. Cambridge University Press 1983.
Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4(4):406–425.
Sneath PHA, Sokal RR: Numerical taxonomy. San Francisco: Freeman 1973.
Thompson JD, Higgins DG, Gibson TJ: Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput Appl Biosci 1994, 10(1):19–29.
Henikoff S, Henikoff JG: Position-based sequence weights. J Mol Biol 1994, 243(4):574–578. 10.1016/0022-2836(94)90032-9
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
Gerstein M, Sonnhammer EL, Chothia C: Volume changes in protein evolution. J Mol Biol 1994, 236(4):1067–1078. 10.1016/0022-2836(94)90012-4
Gotoh O: A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 1995, 11(5):543–551.
Edgar RC, Sjolander K: A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 2004, in press.
Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 1996, 12(4):327–345.
Altschul SF: Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 1991, 219(3):555–565.
Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 1992, 8(3):275–282.
Muller T, Spang R, Vingron M: Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 2002, 19(1):8–13.
von Ohsen N, Zimmer R: Improving profile-profile alignment via log average scoring. In: Algorithms in Bioinformatics, First International Workshop, WABI 2001 (Edited by: Gascuel O, Moret BME). Berlin: Springer-Verlag 2001, 2149: 11–26.
Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999, 27(13):2682–2690. 10.1093/nar/27.13.2682
Myers EW, Miller W: Optimal alignments in linear space. Comput Appl Biosci 1988, 4(1):11–17.
Felsenstein J: Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates 2004.
Sauder JM, Arthur JW, Dunbrack RL Jr: Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 2000, 40(1):6–22. 10.1002/(SICI)1097-0134(20000701)40:1<6::AID-PROT30>3.0.CO;2-7
Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University Press 1998.
Setubal J, Meidanis J: Introduction to computational biology. Pacific Grove, California: Brooks/Cole 1997.
Thompson JD, Thierry JC, Poch O: RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 2003, 19(9):1155–1161. 10.1093/bioinformatics/btg133
Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence project: update and current status. Nucleic Acids Res 2003, 31(1):34–37. 10.1093/nar/gkg111
Edgar RC, Sjolander K: COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics 2004, in press.
Holm L, Sander C: Touring protein fold space with Dali/FSSP. Nucleic Acids Res 1998, 26(1):316–319. 10.1093/nar/26.1.316
Yona G, Levitt M: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol 2002, 315(5):1257–1275. 10.1006/jmbi.2001.5293
Pietrokovski S: Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res 1996, 24(19):3836–3845. 10.1093/nar/24.19.3836
Grasso C, Lee C: Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 2004.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235–242. 10.1093/nar/28.1.235