FAMSA: Fast and accurate multiple sequence alignment of huge protein families

Scientific Reports - Tập 6 Số 1
Sebastian Deorowicz1, Agnieszka Debudaj-Grabysz1, Adam Gudyś1
1Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland

Tóm tắt

AbstractRapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.

Từ khóa


Tài liệu tham khảo

Chatzou, M. et al. Multiple sequence alignment modeling: methods and applications. Brief. Bioinform. 10.1093/bib/bbv099 (2015).

Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22), 4673–4680 (1994).

Do, Ch. B., Mahabhashyam, M. S. P., Brudno, M. & Batzoglou, S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15(2), 330–340 (2005).

Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004).

Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205–217 (2000).

Lassmann, T. & Sonnhammer, E. L. L. Kalign—an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6, 298 (2005).

Lassmann, T., Frings, O. & Sonnhammer, E. L. L. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 37, 858–865 (2009).

Wu, S. & Manber, U. Fast text searching: allowing errors. Commun. ACM 35(10), 83–91 (1992).

Muth, R. & Manber, U. Approximate multiple string search in Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, 75–86 (1996).

Deorowicz, S., Debudaj-Grabysz, A. & Gudyś, A. Kalign-LCS—A More Accurate and Faster Variant of Kalign2 Algorithm for the Multiple Sequence Alignment Problem in Man-Machine Interactions 3, AISC 242 (eds Gruca, A. et al.) 495–502 (Springer-Verlag, 2014).

Katoh, K. & Toh, H. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374 (2007).

Katoh, K. & Toh, H. Recent developments in the MAFFT multiple sequence alignment program. Brief. Bioinform. 9(4), 286–298 (2008).

Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).

Blackshields, G., Sievers, F., Shi, W., Wilm, A. & Higgins, D. G. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithm. Mol. Biol. 5(1), 21 (2010).

Nguyen, Np. D., Mirarab, S., Kumar, K. & Warnow, T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 16, 124 (2015).

Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual. Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B, 3C and 3Dhttp://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. (Accessed: 30th June 2016).

Sibson, R. SLINK: An optimally efficient algorithm for the single-link cluster method. Comput. J. 16, 30–34 (1973).

Yamada, K. & Tomii, K. Revisiting amino acid substitution matrices for identifying distantly related proteins. Bioinformatics 30, 317–325 (2014).

Gudyś, A. & Deorowicz, S. QuickProbs 2: towards rapid construction of high-quality alignments of large protein families. Preprint available at: http://arxiv.org/abs/1512.07437 (2015).

Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, J. P. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11), 2469–2471 (1998).

Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40(D1), D281–D288 (2012).

Plyusnin, I. & Holm, L. Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 13, 64 (2012).

Gusfield, D. Algorithms on Strings, Trees and Sequences (Cambridge University Press, 1997).

Hyyrö, H. Bit-parallel LCS-length computation revisited in Proceedings of the 15th Australian Workshop on Combinatorial Algorithms, 16–27 (2004).

Khronos Group, The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl. (Accessed: 30th June 2016).

Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987).

Sokal, R. R. & Michener, C. D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38, 1409–1438 (1958).

Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H. & Zubrzycki, S. Sur la liaison et la division des points d’un ensemble fini. Colloq Math 2, 282–285 (1951).

Wheeler, T. J. & Kececioglu, J. D. Multiple alignment by aligning alignments. Bioinformatics 23(13), i559–i568 (2007).

Edgar, R. C. Optimizing substitution matrix choice and gap parameters for sequence alignment BMC Bioinformatics 10, 396 (2009).

Chakrabarti, S. et al. Refining multiple sequence alignments with conserved core regions. Nucleic Acids Res. 34(9), 2598–2606 (2006).

Liu, Y., Schmidt, B. & Maskell, D. L. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26, 1958–1964 (2010).

Edgar, R. C. QSCORE multiple alignment scoring software. http://www.drive5.com/qscore. (Accessed: 30th June 2016).

Thompson, J. D., Plewniak, F. & Poch, O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15(1), 87–88 (1999).

Raghava, G., Searle, G., Audley, P., Barber, J. & Barton, G. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4(1), 47 (2003).

Walle, I., Lasters, I. & Wyns, L. SABmark—a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21(7), 1267–1268 (2005).

Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

Sievers, F., Dinnen, D., Wilm, A. & Higgins, D. G. Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29, 989–995 (2013).

Gudyś, A. & Deorowicz, S. QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors. PLoS One 9(7), e103051 (2014).

Ye, Y. et al. GLProbs: Aligning Multiple Sequences Adaptively. IEEE/ACM Trans. Comput. Biol. Bioinf. 12, 67–78 (2015).

Boyce, K., Sievers, F. & Higgins, D. G. Simple chained guide trees give high-quality protein multiple sequence alignments. Proc. Nat. Acad. Sci. USA 111(29), 10556–10561 (2014).

Boyce, K., Sievers, F. & Higgins, D. G. Reply to Tan et al.: Differences between real and simulated proteins in multiple sequence alignments. Proc. Nat. Acad. Sci. USA 112(2), E101 (2015).

Tan, G., Gil, M., Löytynoja, A. P., Goldman, N. & Dessimoz, C. Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proc. Nat. Acad. Sci. USA 112, E99–E100 (2015).

Sackin, M. J. “Good” and “bad” phenograms. Syst. Biol. 21(2), 225–226 (1972).

Fox, G., Sievers, F. & Higgins, D. G. Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. Bioinformatics 32(6), 814–820 (2016).