An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea

ISME Journal - Volume 6, Issue 3 - Pages 610-618 - 2012
Daniel McDonald1, Morgan N. Price2, Julia K. Goodrich1, Eric P. Nawrocki3, Todd Z. DeSantis4, Alexander J. Probst5, Gary L. Andersen5, Rob Knight1,6, Philip Hugenholtz7
1Department of Chemistry & Biochemistry and Biofrontiers Institute, University of Colorado, Boulder, CO, USA
2Lawrence Berkeley National Laboratory, Physical Biosciences Division, Berkeley, CA, USA
3Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, VA, USA
4Department of Bioinformatics, Second Genome Inc., San Bruno, CA, USA
5Lawrence Berkeley National Laboratory, Center for Environmental Biotechnology, Berkeley, CA, USA
6Howard Hughes Medical Institute, Boulder, CO, USA
7Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences and Institute for Molecular Bioscience, St Lucia, Queensland, Australia

Abstract

Reference phylogenies are crucial for providing a taxonomic framework for interpretation of marker gene and metagenomic surveys, which continue to reveal novel species at a remarkable rate. Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated taxonomy based on de novo tree inference. We developed a 'taxonomy to tree' approach for transferring group names from an existing taxonomy to a tree topology, and used it to apply the Greengenes, National Center for Biotechnology Information (NCBI) and cyanoDB (Cyanobacteria only) taxonomies to a de novo tree comprising 408,315 sequences. We also incorporated explicit rank information provided by the NCBI taxonomy into group names (by prefixing rank designations) for better user orientation and classification consistency. The resulting merged taxonomy improved the classification of 75% of the sequences by one or more ranks relative to the original NCBI taxonomy, with the most pronounced improvements occurring in under-classified environmental sequences. We also assessed candidate phyla (divisions) currently defined by NCBI and present recommendations for consolidation of 34 redundantly named groups. All intermediate results from the pipeline, which includes tree inference, jackknifing and transfer of a donor taxonomy to a recipient tree (tax2tree), are available for download. The improved Greengenes taxonomy should provide important infrastructure for a wide range of megasequencing projects studying ecosystems on scales ranging from our own bodies (the Human Microbiome Project) to the entire planet (the Earth Microbiome Project). The implementation of the software can be obtained from http://sourceforge.net/projects/tax2tree/.
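
The rank prefixing mentioned in the abstract can be illustrated with a short sketch. It assumes a seven-rank scheme (kingdom to species) with single-letter prefixes joined to the group name by a double underscore, which is the string format used in Greengenes taxonomy files. The names `RANK_PREFIXES` and `prefix_ranks` are hypothetical and are not taken from tax2tree; the sketch only formats a lineage and does not perform the taxonomy-to-tree transfer.

```python
# Hypothetical illustration of rank-prefixed group names; not tax2tree code.
# Assumed rank order: kingdom, phylum, class, order, family, genus, species.
RANK_PREFIXES = ["k", "p", "c", "o", "f", "g", "s"]


def prefix_ranks(lineage):
    """Return a rank-prefixed lineage string.

    `lineage` is a list of group names ordered from the highest rank down.
    Ranks with no name are kept as empty placeholders (e.g. "g__") so that
    every output string carries the same number of levels.
    """
    if len(lineage) > len(RANK_PREFIXES):
        raise ValueError("lineage has more levels than known ranks")
    # Pad unclassified lower ranks with empty names.
    padded = list(lineage) + [""] * (len(RANK_PREFIXES) - len(lineage))
    return "; ".join(
        f"{prefix}__{name}" for prefix, name in zip(RANK_PREFIXES, padded)
    )


example = prefix_ranks(["Bacteria", "Proteobacteria", "Gammaproteobacteria"])
print(example)
```

Keeping empty placeholders for unassigned ranks makes it immediately visible how deeply a sequence is classified, which is the kind of user orientation the explicit ranks are meant to provide.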
