VSEARCH: a versatile open source tool for metagenomics
Abstract
VSEARCH is an open-source, free-of-charge, multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010), for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.
When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.
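The two-stage search described above can be illustrated with a minimal Python sketch. It is not VSEARCH's implementation: the word length `k=4`, the scoring values, the linear gap penalty, the `max_candidates` cut-off and all function names are assumptions chosen for illustration, and the code is neither vectorised nor multithreaded. It only shows the idea of ranking targets by shared words and then scoring the best candidates with full dynamic-programming global alignment (Needleman and Wunsch, 1970).

```python
def kmers(seq, k):
    """Return the set of distinct words (k-mers) occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(query, target, k=4):
    """Count distinct words present in both sequences (illustrative k)."""
    return len(kmers(query, k) & kmers(target, k))


def global_align_score(a, b, match=2, mismatch=-4, gap=-3):
    """Optimal global alignment score by full dynamic programming.

    Uses a linear gap penalty and hypothetical scoring values; only the
    previous row of the matrix is kept, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def search(query, targets, k=4, max_candidates=2):
    """Rank targets by shared words, then align only the top candidates.

    targets maps a sequence name to its nucleotide string; the name of
    the candidate with the highest global alignment score is returned.
    """
    ranked = sorted(
        targets,
        key=lambda name: shared_words(query, targets[name], k),
        reverse=True,
    )
    candidates = ranked[:max_candidates]
    return max(candidates, key=lambda name: global_align_score(query, targets[name]))
```

For example, calling `search("ACGTACGTAC", {"t1": "ACGTACGTAC", "t2": "TTTTGGGGCC", "t3": "ACGTACCTAC"})` aligns only the two targets that share the most words with the query and never aligns `t2`, which shares none.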
VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or
VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.
Keywords
References
Altschul, 1997, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 25, 3389, 10.1093/nar/25.17.3389
Burge, 2013, Rfam 11.0: 10 years of RNA families, Nucleic Acids Research, 41, D226, 10.1093/nar/gks1005
Caporaso, 2010, QIIME allows analysis of high-throughput community sequencing data, Nature Methods, 7, 335, 10.1038/nmeth.f.303
Cock, 2010, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Research, 38, 1767, 10.1093/nar/gkp1137
DeSantis, 2006, Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB, Applied and Environmental Microbiology, 72, 5069, 10.1128/AEM.03006-05
Eastlake, 2001, US Secure Hash Algorithm 1 (SHA)
Edgar, 2010, Search and clustering orders of magnitude faster than BLAST, Bioinformatics, 26, 2460, 10.1093/bioinformatics/btq461
Edgar, 2013, UPARSE: highly accurate OTU sequences from microbial amplicon reads, Nature Methods, 10, 996, 10.1038/nmeth.2604
Edgar, 2015, Error filtering, pair assembly and error correction for next-generation sequencing reads, Bioinformatics, 31, 3476, 10.1093/bioinformatics/btv401
Edgar, 2011, UCHIME improves sensitivity and speed of chimera detection, Bioinformatics, 27, 2194, 10.1093/bioinformatics/btr381
Fowler, 1991, Fowler / Noll / Vo (FNV) hash
Gailly, 2016, zlib: a massively spiffy yet delicately unobtrusive compression library
Gilbert, 2014, The Earth Microbiome project: successes and aspirations, BMC Biology, 12, 69, 10.1186/s12915-014-0069-1
Gusfield, 1993, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bulletin of Mathematical Biology, 55, 141, 10.1007/BF02460299
He, 2015, Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity, Microbiome, 3, 10.1186/s40168-015-0081-x
Hirschberg, 1975, A linear space algorithm for computing maximal common subsequences, Communications of the ACM, 18, 341, 10.1145/360825.360861
Human Microbiome Project Consortium, 2012, Structure, function and diversity of the healthy human microbiome, Nature, 486, 207, 10.1038/nature11234
Karsenti, 2011, A holistic approach to marine eco-systems biology, PLoS Biology, 9, e1001177, 10.1371/journal.pbio.1001177
Li, 2009, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics, 25, 1754, 10.1093/bioinformatics/btp324
Logares, 2014, The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes, Current Biology, 24, 813, 10.1016/j.cub.2014.02.050
MacCallum, 2009, ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads, Genome Biology, 10, R103, 10.1186/gb-2009-10-10-r103
Mahé, 2014, Swarm: robust and fast clustering method for amplicon-based studies, PeerJ, 2, e593, 10.7717/peerj.593
Masella, 2012, PANDAseq: paired-end assembler for illumina sequences, BMC Bioinformatics, 13, 31, 10.1186/1471-2105-13-31
Myers, 1988, Optimal alignments in linear space, Computer Applications in the Biosciences, 4, 11
Needleman, 1970, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, 48, 443, 10.1016/0022-2836(70)90057-4
Nichols, 2016, Simera: Modelling the PCR Process to Simulate Realistic Chimera Formation, bioRxiv, 10.1101/072447
Quast, 2013, The SILVA ribosomal RNA gene database project: improved data processing and web-based tools, Nucleic Acids Research, 41, D590, 10.1093/nar/gks1219
Rand, 1971, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846, 10.2307/2284239
Rognes, 2011, Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation, BMC Bioinformatics, 12, 221, 10.1186/1471-2105-12-221
Schirmer, 2015, Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform, Nucleic Acids Research, 43, e37, 10.1093/nar/gku1341
Schloss, 2016, Application of a database-independent approach to assess the quality of operational taxonomic unit picking methods, mSystems, 1, e00027, 10.1128/mSystems.00027-16
Schloss, 2009, Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities, Applied and Environmental Microbiology, 75, 7537, 10.1128/AEM.01541-09
Seward, 2016, bzip2 and libbzip2
Song, 2014, New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing, Briefings in Bioinformatics, 15, 343, 10.1093/bib/bbt067
Steffen, 2015, Sustainability. Planetary boundaries: guiding human development on a changing planet, Science, 347, 1259855, 10.1126/science.1259855
Westcott, 2015, De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units, PeerJ, 3, e1487, 10.7717/peerj.1487