Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis

Biostatistics - Volume 20, Issue 4 - Pages 698-713 - 2019
Zheng-Zheng Tang1, Guanhua Chen2
1Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA and Wisconsin Institute for Discovery, Madison, WI, USA
2Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA

Abstract

There is heightened interest in using high-throughput sequencing technologies to quantify the abundances of microbial taxa and to link these abundances to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power to detect such associations. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. In this article, we develop a new probability distribution, the zero-inflated generalized Dirichlet multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model that links microbial abundances to covariates (e.g. disease status) and develop a fast expectation–maximization algorithm to estimate the model parameters. The derived tests reveal rich patterns of variation in microbial compositions, including differences in mean and in dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.
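To make the idea of a zero-inflated generalized Dirichlet multinomial more concrete, the following is a minimal simulation sketch, not the authors' implementation. The abstract does not spell out the construction, so the code assumes the usual stick-breaking form of the generalized Dirichlet, with each stick-breaking fraction set to exactly zero with some probability. The function name `sample_zigdm` and the parameter names `alpha`, `beta`, and `pi` are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def sample_zigdm(n_reads, alpha, beta, pi, rng):
    """Draw one count vector over len(alpha) + 1 taxa.

    Assumed construction (not taken from the abstract): for taxon j,
    the stick-breaking fraction z_j is 0 with probability pi[j] and
    Beta(alpha[j], beta[j]) otherwise.
    """
    remaining = 1.0  # share of the composition not yet assigned
    props = []
    for a, b, p_zero in zip(alpha, beta, pi):
        # Zero inflation: a structural zero removes taxon j entirely.
        z = 0.0 if rng.random() < p_zero else rng.betavariate(a, b)
        props.append(z * remaining)
        remaining *= 1.0 - z
    props.append(remaining)  # the last taxon takes what is left

    # Multinomial sampling of reads given the latent proportions.
    draws = rng.choices(range(len(props)), weights=props, k=n_reads)
    tally = Counter(draws)
    return [tally.get(j, 0) for j in range(len(props))]


rng = random.Random(0)
counts = sample_zigdm(
    n_reads=1000,
    alpha=[2.0, 1.0, 3.0],
    beta=[5.0, 4.0, 2.0],
    pi=[0.3, 0.0, 1.0],  # third taxon is always a structural zero here
    rng=rng,
)
print(counts)
```

Under these assumptions, each taxon has its own pair of Beta parameters, which is what allows mean and dispersion to vary separately across taxa, while the zero-inflation probabilities produce more zeros than multinomial sampling alone would. In a regression setting these parameters would depend on covariates; that part is omitted here.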
