De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units

PeerJ - Volume 3 - Page e1487
Sarah L. Westcott¹, Patrick D. Schloss¹
¹ Department of Microbiology and Immunology, University of Michigan-Ann Arbor, Ann Arbor, MI, United States.

Abstract

Background. 16S rRNA gene sequences are routinely assigned to operational taxonomic units (OTUs) that are then used to analyze complex microbial communities. Numerous methods have been used to assign 16S rRNA gene sequences to OTUs, leading to confusion over which method is optimal. A recent study suggested that a clustering method should be selected based on its ability to generate stable OTU assignments that do not change as additional sequences are added to the dataset. In contrast, we contend that the quality of the OTU assignments, that is, the ability of the method to properly represent the distances between the sequences, is more important.

Methods. Our analysis implemented six de novo clustering algorithms (single linkage, complete linkage, average linkage, abundance-based greedy clustering, distance-based greedy clustering, and Swarm) as well as the open and closed-reference methods. Using two previously published datasets, we used the Matthews correlation coefficient (MCC) to assess the stability and quality of OTU assignments.

Results. The stability of OTU assignments did not reflect the quality of the assignments. Depending on the dataset being analyzed, the average linkage and the distance and abundance-based greedy clustering methods generated OTUs that were more likely to represent the actual distances between sequences than the open and closed-reference methods. We also demonstrated that for the greedy algorithms VSEARCH produced assignments that were comparable to those produced by USEARCH, making VSEARCH a viable free and open source alternative to USEARCH. Further interrogation of the reference-based methods indicated that when USEARCH or VSEARCH was used to identify the closest reference, the OTU assignments were sensitive to the order of the reference sequences because the reference sequences can be identical over the region being considered. More troubling was the observation that while both USEARCH and VSEARCH have a high level of sensitivity to detect reference sequences, the specificity of those matches was poor relative to the true best match.

Discussion. Our analysis calls into question the quality and stability of OTU assignments generated by the open and closed-reference methods as implemented in the current version of QIIME. This study demonstrates that de novo methods are the optimal approach for assigning sequences to OTUs and that the quality of these assignments needs to be assessed for multiple methods to identify the optimal clustering method for a particular dataset.
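The MCC-based quality measure named in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out how pairs are scored, so the following is an assumption: each pair of sequences counts as a true positive when the pair is within the distance threshold and shares an OTU, a false positive when it shares an OTU but exceeds the threshold, a false negative when it is within the threshold but split across OTUs, and a true negative otherwise. The function name `otu_mcc`, the 0.03 threshold, and the toy distances are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def otu_mcc(distances, otu_of, threshold=0.03):
    """Matthews correlation coefficient over all sequence pairs.

    distances: dict mapping frozenset({a, b}) -> pairwise distance
    otu_of:    dict mapping sequence name -> OTU label
    threshold: distance cutoff (0.03 is an assumed, illustrative value)
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        close = distances[frozenset((a, b))] <= threshold
        same_otu = otu_of[a] == otu_of[b]
        if close and same_otu:
            tp += 1  # similar sequences, correctly grouped
        elif not close and not same_otu:
            tn += 1  # dissimilar sequences, correctly separated
        elif same_otu:
            fp += 1  # dissimilar sequences grouped together
        else:
            fn += 1  # similar sequences split apart
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (a zero marginal).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: four sequences and their pairwise distances.
toy_distances = {
    frozenset(("A", "B")): 0.01,
    frozenset(("A", "C")): 0.02,
    frozenset(("B", "C")): 0.05,
    frozenset(("A", "D")): 0.10,
    frozenset(("B", "D")): 0.12,
    frozenset(("C", "D")): 0.11,
}

# Two candidate OTU assignments for the same sequences.
lumped = {"A": "otu1", "B": "otu1", "C": "otu1", "D": "otu2"}
split = {"A": "otu1", "B": "otu1", "C": "otu2", "D": "otu3"}

if __name__ == "__main__":
    print("lumped:", round(otu_mcc(toy_distances, lumped), 4))
    print("split: ", round(otu_mcc(toy_distances, split), 4))
```

In this toy case neither assignment is perfect, because B and C are each close to A but not to each other; comparing the two scores shows how MCC can rank competing assignments of the same sequences.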
