Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

PLoS Computational Biology - Volume 10, Issue 4 - Page e1003531
Paul J. McMurdie¹, Susan Holmes¹
¹Statistics Department, Stanford University, Stanford, California, United States of America

Abstract

Keywords


References

J Shendure, 2012, The expanding scope of DNA sequencing, Nature Biotechnology, 30, 1084, 10.1038/nbt.2421

J Shendure, 2008, Next-generation DNA sequencing, Nature Biotechnology, 26, 1135, 10.1038/nbt1486

A Mortazavi, 2008, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nature Methods, 5, 621, 10.1038/nmeth.1226

NR Pace, 1997, A molecular view of microbial diversity and the biosphere, Science, 276, 734, 10.1126/science.276.5313.734

KH Wilson, 2002, High-Density Microarray of Small-Subunit Ribosomal DNA Probes, Appl Environ Microbiol, 68, 2535, 10.1128/AEM.68.5.2535-2541.2002

SM Huse, 2008, Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing, PLoS Genetics, 4, e1000255, 10.1371/journal.pgen.1000255

CS Riesenfeld, 2004, Metagenomics: genomic analysis of microbial communities, Annual Review of Genetics, 38, 525, 10.1146/annurev.genet.38.072902.091216

DB Allison, 2006, Microarray Data Analysis: from Disarray to Consolidation and Consensus, Nature Reviews Genetics, 7, 55, 10.1038/nrg1749

JC Marioni, 2008, RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays, Genome Research, 18, 1509, 10.1101/gr.079558.108

J Lu, 2005, Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach, BMC Bioinformatics, 6, 165, 10.1186/1471-2105-6-165

MD Robinson, 2007, Small-sample estimation of negative binomial dispersion, with applications to SAGE data, Biostatistics (Oxford, England), 9, 321, 10.1093/biostatistics/kxm030

Cameron AC, Trivedi P (2013) Regression analysis of count data, volume 53. Cambridge University Press.

S Anders, 2010, Differential expression analysis for sequence count data, Genome Biology, 11, R106, 10.1186/gb-2010-11-10-r106

D Yu, 2013, Shrinkage estimation of dispersion in Negative Binomial models for RNASeq experiments with small sample size, Bioinformatics (Oxford, England), 29, 1275, 10.1093/bioinformatics/btt143

JM Di Bella, 2013, High throughput sequencing methods and analysis for microbiome research, Journal of Microbiological Methods, 95, 401, 10.1016/j.mimet.2013.08.011

N Segata, 2013, Computational meta'omics for microbial community studies, Molecular Systems Biology, 9, 666, 10.1038/msb.2013.22

JA Navas-Molina, 2013, Advancing Our Understanding of the Human Microbiome Using QIIME, Methods in Enzymology, 531, 371, 10.1016/B978-0-12-407863-5.00019-8

JB Hughes, 2005, The application of rarefaction techniques to molecular inventories of microbial diversity, Methods in Enzymology, 397, 292, 10.1016/S0076-6879(05)97017-1

O Koren, 2013, A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets, PLoS Computational Biology, 9, e1002863, 10.1371/journal.pcbi.1002863

HL Sanders, 1968, Marine benthic diversity: A comparative study, The American Naturalist, 102, 243, 10.1086/282541

NJ Gotelli, 2001, Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness, Ecology Letters, 4, 379, 10.1046/j.1461-0248.2001.00230.x

CX Mao, 2005, Estimation of Species Richness: Mixture Models, the Role of Rare Species, and Inferential Challenges, Ecology, 86, 1143, 10.1890/04-1078

C Lozupone, 2005, UniFrac: a new phylogenetic method for comparing microbial communities, Applied and Environmental Microbiology, 71, 8228, 10.1128/AEM.71.12.8228-8235.2005

C Lozupone, 2011, UniFrac: an effective distance metric for microbial community comparison, The ISME Journal, 5, 169, 10.1038/ismej.2010.133

M Hamady, 2008, Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex, Nature Methods, 5, 235, 10.1038/nmeth.1184

Z Liu, 2008, Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers, Nucleic Acids Research, 36, e120, 10.1093/nar/gkn491

M Hamady, 2010, Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data, The ISME Journal, 4, 17, 10.1038/ismej.2009.97

T Yatsunenko, 2012, Human gut microbiome viewed across age and geography, Nature, 486, 222, 10.1038/nature11053

J Caporaso, 2010, QIIME allows analysis of high-throughput community sequencing data, Nature Methods, 7, 335, 10.1038/nmeth.f.303

PD Schloss, 2009, Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities, Applied and Environmental Microbiology, 75, 7537, 10.1128/AEM.01541-09

JA Gilbert, 2009, The seasonal structure of microbial communities in the Western English Channel, Environmental Microbiology, 11, 3132, 10.1111/j.1462-2920.2009.02017.x

PJ McMurdie, 2013, phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data, PLoS ONE, 8, e61217, 10.1371/journal.pone.0061217

ES Charlson, 2010, Disordered microbial communities in the upper respiratory tract of cigarette smokers, PLoS ONE, 5, e15216, 10.1371/journal.pone.0015216

LB Price, 2010, The effects of circumcision on the penis microbiome, PLoS ONE, 5, e8422, 10.1371/journal.pone.0008422

SW Kembel, 2012, Architectural design influences the diversity and structure of the built environment microbiome, The ISME Journal, 6, 1469, 10.1038/ismej.2011.211

GE Flores, 2013, Diversity, distribution and sources of bacteria in residential kitchens, Environmental Microbiology, 15, 588, 10.1111/1462-2920.12036

DW Kang, 2013, Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children, PLoS ONE, 8, e68322, 10.1371/journal.pone.0068322

N Segata, 2012, Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples, Genome Biology, 13, R42, 10.1186/gb-2012-13-6-r42

JR White, 2009, Statistical methods for detecting differentially abundant features in clinical metagenomic samples, PLoS Computational Biology, 5, e1000352, 10.1371/journal.pcbi.1000352

JN Paulson, 2013, Differential abundance analysis for microbial marker-gene surveys, Nature Methods, 10, 1200, 10.1038/nmeth.2658

MD Robinson, 2009, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics (Oxford, England), 26, 139, 10.1093/bioinformatics/btp616

JC Gower, 1966, Some distance properties of latent root and vector methods used in multivariate analysis, Biometrika, 53, 325, 10.1093/biomet/53.3-4.325

Oksanen J, Blanchet FG, Kindt R, Legendre P, O'Hara RB, et al. (2011) vegan: Community Ecology Package. R package version 1.17-10.

M Anderson, 2001, A new method for non-parametric multivariate analysis of variance, Austral Ecology, 26, 32

JR Bray, 1957, An Ordination of the Upland Forest Communities of Southern Wisconsin, Ecological Monographs, 27, 325, 10.2307/1942268

DM Witten, 2011, Classification and clustering of sequencing data using a Poisson model, The Annals of Applied Statistics, 5, 2493, 10.1214/11-AOAS493

CA Lozupone, 2007, Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities, Applied and Environmental Microbiology, 73, 1576, 10.1128/AEM.01996-06

JG Caporaso, 2011, Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample, Proceedings of the National Academy of Sciences, 108, 4516, 10.1073/pnas.1000080107

Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, chapter 2.

A Reynolds, 2006, Clustering rules: A comparison of partitioning and hierarchical clustering algorithms, Journal of Mathematical Modelling and Algorithms, 5, 475, 10.1007/s10852-005-9022-1

Pollard KS, Gilbert HN, Ge Y, Taylor S, Dudoit S (2010) multtest: Resampling-based multiple hypothesis testing. R package version 2.4.0.

Y Benjamini, 1995, Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society Series B (Methodological), 57, 289, 10.1111/j.2517-6161.1995.tb02031.x

Allaire J, Horner J, Marti V, Porte N (2014) markdown: Markdown rendering for R. R package version 0.6.4.

Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2013) cluster: Cluster Analysis Basics and Extensions.

Revolution Analytics (2011) foreach: Foreach looping construct for R. R package version 1.3.2.

Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer New York.

H Wickham, 2011, The split-apply-combine strategy for data analysis, Journal of Statistical Software, 40, 1, 10.18637/jss.v040.i01

H Wickham, 2007, Reshaping data with the reshape package, Journal of Statistical Software, 21, 1, 10.18637/jss.v021.i12

T Sing, 2005, ROCR: visualizing classifier performance in R, Bioinformatics (Oxford, England), 21, 3940, 10.1093/bioinformatics/bti623

R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Hastie TJ, Pregibon D (1992) Generalized linear models. In: Chambers JM, Hastie TJ, editors, Statistical Models in S, Chapman & Hall/CRC, chapter 6.

I Nookaew, 2012, A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae, Nucleic Acids Research, 40, 10084, 10.1093/nar/gks804

J Bullard, 2010, Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments, BMC Bioinformatics, 11, 94, 10.1186/1471-2105-11-94

J Sun, 2013, TCC: an R package for comparing tag count data with robust normalization strategies, BMC Bioinformatics, 14, 219, 10.1186/1471-2105-14-219

C Soneson, 2013, A comparison of methods for differential expression analysis of RNA-Seq data, BMC Bioinformatics, 14, 91, 10.1186/1471-2105-14-91

TJ Hardcastle, 2010, baySeq: empirical Bayesian methods for identifying differential expression in sequence count data, BMC Bioinformatics, 11, 422, 10.1186/1471-2105-11-422

HG Ozer, 2012, DFI: gene feature discovery in RNA-Seq experiments from multiple sources, BMC Genomics, 13 Suppl 8, S11, 10.1186/1471-2164-13-S8-S11

R Bourgon, 2010, Independent filtering increases detection power for high-throughput experiments, Proceedings of the National Academy of Sciences, 107, 9546, 10.1073/pnas.0914005107

A Chao, 2005, A new statistical approach for assessing similarity of species composition with incidence and abundance data, Ecology Letters, 8, 148, 10.1111/j.1461-0248.2004.00707.x

PD Schloss, 2008, Evaluating different approaches that test whether microbial communities have the same structure, The ISME Journal, 2, 265, 10.1038/ismej.2008.5

R Gentleman, 2004, Statistical analyses and reproducible research, Bioconductor Project Working Papers, 1, 1

RD Peng, 2011, Reproducible research in computational science, Science, 334, 1226, 10.1126/science.1213847

DL Donoho, 2010, An invitation to reproducible computational research, Biostatistics (Oxford, England), 11, 385, 10.1093/biostatistics/kxq028

RC Gentleman, 2004, Bioconductor: open software development for computational biology and bioinformatics, Genome Biology, 5, R80, 10.1186/gb-2004-5-10-r80

GD Wu, 2011, Linking long-term dietary patterns with gut microbial enterotypes, Science, 334, 105, 10.1126/science.1208344