BLAST+: architecture and applications

BMC Bioinformatics - Volume 10 - Pages 1-9 - 2009
Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden
All authors: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.

Abstract

Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user interface of the current command-line applications. We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome-length database sequences. We have also improved the user interface of the command-line applications.
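
The query-chunking idea mentioned above can be illustrated with a minimal Python sketch. It is not the BLAST+ implementation: the function names `split_query` and `to_query_coords`, the chunk size, and the overlap are all hypothetical, chosen only to show how a long query might be cut into overlapping pieces and how a hit found inside one piece could be mapped back to coordinates on the full query.

```python
from typing import Iterator, Tuple


def split_query(seq: str, chunk_size: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs that cover seq.

    Adjacent chunks share `overlap` residues so that an alignment spanning
    a chunk boundary is still fully contained in at least one chunk, as long
    as it is no longer than the overlap. Both parameters are illustrative
    assumptions, not values taken from BLAST+.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield start, seq[start:start + chunk_size]
        if start + chunk_size >= len(seq):
            break  # this chunk reached the end of the query
        start += step


def to_query_coords(offset: int, hit_start: int, hit_end: int) -> Tuple[int, int]:
    """Translate hit coordinates within a chunk to full-query coordinates."""
    return offset + hit_start, offset + hit_end


# Toy usage: a 100-base query split into 40-base chunks overlapping by 10.
query = "ACGT" * 25
chunks = list(split_query(query, chunk_size=40, overlap=10))
offsets = [offset for offset, _ in chunks]
```

Each chunk can then be searched independently, which keeps the per-search working set small. A real implementation would also need to merge duplicate or partial hits that fall in the overlapping regions; that step is omitted here.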
