VSEARCH: a versatile open source tool for metagenomics

PeerJ - Volume 4 - Page e2584
Torbjørn Rognes1,2, Tomáš Flouri3,4, Ben Nichols5, Christopher Quince5,6, Frédéric Mahé7,8
1Department of Informatics, University of Oslo, Oslo, Norway
2Department of Microbiology, Oslo University Hospital, Oslo, Norway
3Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
4Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
5School of Engineering, University of Glasgow, Glasgow, United Kingdom
6Warwick Medical School, University of Warwick, Coventry, United Kingdom
7Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
8UMR LSTM, CIRAD, Montpellier, France

Abstract

Background

VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.

Methods

When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.
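
The two-stage search described above can be illustrated with a minimal Python sketch. It is not VSEARCH's implementation: the function names, the word length of 8, the scoring values and the linear gap penalty are all assumptions chosen for illustration, and the real tool relies on vectorised, multithreaded code. The sketch first ranks targets by the number of distinct words they share with the query, then scores the best candidates with full dynamic programming (Needleman-Wunsch global alignment).

```python
def kmers(seq, k=8):
    """Return the set of distinct words (k-mers) of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_targets(query, targets, k=8, max_candidates=2):
    """Rank target names by the number of distinct words shared with the query.

    `targets` maps a name to a sequence. Targets sharing no word are dropped.
    The word length and candidate limit are illustrative assumptions.
    """
    qwords = kmers(query, k)
    scored = [(len(qwords & kmers(seq, k)), name) for name, seq in targets.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for shared, name in scored[:max_candidates] if shared > 0]


def global_align_score(a, b, match=2, mismatch=-4, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty).

    Only two rows of the dynamic programming matrix are kept in memory.
    The scoring values are illustrative assumptions.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    query = "ACGTACGTAGCTAGCT"
    targets = {
        "t1": "ACGTACGTAGCTAGCT",
        "t2": "TTTTTTTTTTTTTTTT",
        "t3": "ACGTACGTAGCTAGGA",
    }
    for name in rank_targets(query, targets):
        print(name, global_align_score(query, targets[name]))
```

The word filter is cheap and only decides which targets are worth aligning; the alignment score itself always comes from the exhaustive dynamic programming step, so no optimal alignment is missed among the selected candidates.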

Results

VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0.
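
Two of the simpler operations listed above, reverse complementation and full-length dereplication, can be sketched in a few lines of Python to show what they compute. This is an illustration only, not VSEARCH's code: the function names are hypothetical, and treating upper- and lower-case letters as identical during dereplication is an assumption of this sketch.

```python
from collections import Counter

# Translation table for the four unambiguous nucleotides, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dereplicate_full_length(seqs):
    """Collapse identical full-length sequences and count their abundance.

    Returns (sequence, abundance) pairs, most abundant first; ties are
    broken alphabetically so the output order is deterministic.
    """
    counts = Counter(s.upper() for s in seqs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    reads = ["ACGT", "acgt", "TTTT", "ACGT"]
    for seq, abundance in dereplicate_full_length(reads):
        print(seq, abundance)
    print(reverse_complement("AACG"))
```

Dereplication of this kind is typically run before clustering or chimera detection, because it shrinks the input to its unique sequences while keeping the abundance information those steps can use.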

Discussion

VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.

References

Altschul, 1997, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 25, 3389, 10.1093/nar/25.17.3389

Burge, 2013, Rfam 11.0: 10 years of RNA families, Nucleic Acids Research, 41, D226, 10.1093/nar/gks1005

Caporaso, 2010, QIIME allows analysis of high-throughput community sequencing data, Nature Methods, 7, 335, 10.1038/nmeth.f.303

Cock, 2010, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Research, 38, 1767, 10.1093/nar/gkp1137

DeSantis, 2006, Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB, Applied and Environmental Microbiology, 72, 5069, 10.1128/AEM.03006-05

Eastlake, 2001, US Secure Hash Algorithm 1 (SHA)

Edgar, 2010, Search and clustering orders of magnitude faster than BLAST, Bioinformatics, 26, 2460, 10.1093/bioinformatics/btq461

Edgar, 2013, UPARSE: highly accurate OTU sequences from microbial amplicon reads, Nature Methods, 10, 996, 10.1038/nmeth.2604

Edgar, 2015, Error filtering, pair assembly and error correction for next-generation sequencing reads, Bioinformatics, 31, 3476, 10.1093/bioinformatics/btv401

Edgar, 2011, UCHIME improves sensitivity and speed of chimera detection, Bioinformatics, 27, 2194, 10.1093/bioinformatics/btr381

Fowler, 1991, Fowler / Noll / Vo (FNV) hash

Gailly, 2016, zlib: a massively spiffy yet delicately unobtrusive compression library

Gilbert, 2014, The Earth Microbiome project: successes and aspirations, BMC Biology, 12, 69, 10.1186/s12915-014-0069-1

Gusfield, 1993, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bulletin of Mathematical Biology, 55, 141, 10.1007/BF02460299

He, 2015, Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity, Microbiome, 3, 10.1186/s40168-015-0081-x

Hirschberg, 1975, A linear space algorithm for computing maximal common subsequences, Communications of the ACM, 18, 341, 10.1145/360825.360861

Hubert, 1985, Comparing partitions, Journal of Classification, 2, 193, 10.1007/BF01908075

Human Microbiome Project Consortium, 2012, Structure, function and diversity of the healthy human microbiome, Nature, 486, 207, 10.1038/nature11234

Karsenti, 2011, A holistic approach to marine eco-systems biology, PLoS Biology, 9, e1001177, 10.1371/journal.pbio.1001177

Li, 2009, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics, 25, 1754, 10.1093/bioinformatics/btp324

Logares, 2014, The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes, Current Biology, 24, 813, 10.1016/j.cub.2014.02.050

MacCallum, 2009, ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads, Genome Biology, 10, R103, 10.1186/gb-2009-10-10-r103

Mahé, 2014, Swarm: robust and fast clustering method for amplicon-based studies, PeerJ, 2, e593, 10.7717/peerj.593

Masella, 2012, PANDAseq: paired-end assembler for illumina sequences, BMC Bioinformatics, 13, 31, 10.1186/1471-2105-13-31

Myers, 1988, Optimal alignments in linear space, Computer Applications in the Biosciences, 4, 11

Needleman, 1970, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, 48, 443, 10.1016/0022-2836(70)90057-4

Nichols, 2016, Simera: Modelling the PCR Process to Simulate Realistic Chimera Formation, bioRxiv, 10.1101/072447

Quast, 2013, The SILVA ribosomal RNA gene database project: improved data processing and web-based tools, Nucleic Acids Research, 41, D590, 10.1093/nar/gks1219

Rand, 1971, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846, 10.2307/2284239

Rivest, 1992, The MD5 message-digest algorithm, 10.17487/rfc1321

Rockström, 2009, A safe operating space for humanity, Nature, 461, 472, 10.1038/461472a

Rognes, 2011, Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation, BMC Bioinformatics, 12, 221, 10.1186/1471-2105-12-221

Schirmer, 2015, Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform, Nucleic Acids Research, 43, e37, 10.1093/nar/gku1341

Schloss, 2016, Application of a database-independent approach to assess the quality of operational taxonomic unit picking methods, mSystems, 1, e00027, 10.1128/mSystems.00027-16

Schloss, 2009, Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities, Applied and Environmental Microbiology, 75, 7537, 10.1128/AEM.01541-09

Seward, 2016, bzip2 and libbzip2

Song, 2014, New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing, Briefings in Bioinformatics, 15, 343, 10.1093/bib/bbt067

Steffen, 2015, Sustainability. Planetary boundaries: guiding human development on a changing planet, Science, 347, 1259855, 10.1126/science.1259855

Westcott, 2015, De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units, PeerJ, 3, e1487, 10.7717/peerj.1487

Zhang, 2014, PEAR: a fast and accurate Illumina Paired-End reAd mergeR, Bioinformatics, 30, 614, 10.1093/bioinformatics/btt593