A broken promise: microbiome differential abundance methods do not control the false discovery rate

Briefings in Bioinformatics - Volume 20, Issue 1 - Pages 210-221 - 2019
Stijn Hawinkel (1), Federico Mattiello (1), Luc Bijnens (2), Olivier Thas (1)
(1) Department of Mathematical Modelling, Statistics and Bioinformatics at Ghent University, Belgium
(2) Center for Statistics at Hasselt University, Belgium

Abstract

High-throughput sequencing technologies allow easy characterization of the human microbiome, but the statistical methods to analyze microbiome data are still in their infancy. Differential abundance methods aim to detect associations between the abundances of bacterial species and subject grouping factors. The results of such methods are important to identify the microbiome as a prognostic or diagnostic biomarker or to demonstrate efficacy of prodrug or antibiotic drugs. Because of a lack of benchmarking studies in the microbiome field, no consensus exists on the performance of the statistical methods. We have compared a large number of popular methods through extensive parametric and nonparametric simulation as well as real-data shuffling algorithms. The results are consistent across the different approaches and all point to an alarming excess of false discoveries. This raises great doubts about the reliability of discoveries in past studies and imperils the reproducibility of microbiome experiments. To further improve method benchmarking, we introduce a new simulation tool that can generate correlated count data following any univariate count distribution; the correlation structure may be inferred from real data. Most simulation studies discard the correlation between species, but our results indicate that this correlation can negatively affect the performance of statistical methods.
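
The idea of generating correlated counts with arbitrary univariate margins can be illustrated with a Gaussian copula: draw correlated standard normals, map them to uniforms through the normal CDF, then map the uniforms to counts through the quantile function of the chosen count distribution. The sketch below is only an illustration under stated assumptions and is not the authors' tool. The choice of a Gaussian copula, the negative binomial margins, and all function and parameter names (`simulate_counts`, `means`, `sizes`) are assumptions introduced here.

```python
import math
import random
from statistics import NormalDist


def cholesky(corr):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(corr)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(corr[i][i] - s)
            else:
                low[i][j] = (corr[i][j] - s) / low[j][j]
    return low


def nb_pmf(k, mean, size):
    """Negative binomial pmf with the given mean and dispersion parameter 'size'."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def nb_quantile(u, mean, size, max_count=100000):
    """Smallest count k whose cumulative probability reaches u."""
    u = min(u, 1.0 - 1e-9)  # guard against u == 1, which no finite count reaches
    cdf = 0.0
    for k in range(max_count):
        cdf += nb_pmf(k, mean, size)
        if cdf >= u:
            return k
    return max_count


def simulate_counts(n_samples, corr, means, sizes, seed=0):
    """Simulate an n_samples x n_taxa table of correlated negative binomial counts.

    corr  : correlation matrix of the latent Gaussian (hypothetical input,
            e.g. estimated from real data)
    means : per-taxon negative binomial means
    sizes : per-taxon negative binomial dispersion parameters
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    low = cholesky(corr)
    n_taxa = len(means)
    table = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_taxa)]
        # Correlated standard normals: y = L z, so that cov(y) = corr
        y = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n_taxa)]
        # Normal CDF gives uniform margins; the NB quantile gives count margins
        u = [std_normal.cdf(v) for v in y]
        table.append([nb_quantile(u[i], means[i], sizes[i]) for i in range(n_taxa)])
    return table


if __name__ == "__main__":
    corr = [[1.0, 0.8], [0.8, 1.0]]
    counts = simulate_counts(5, corr, means=[20.0, 50.0], sizes=[2.0, 5.0])
    for row in counts:
        print(row)
```

Because only the quantile function depends on the margin, swapping `nb_quantile` for the quantile function of another count distribution leaves the rest of the sketch unchanged. Note that the correlation of the resulting counts is generally somewhat weaker than the correlation of the latent normals.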
