GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data

PeerJ - Volume 6 - Page e4600
Li Chen (1), James Reeve (2), Lujun Zhang (3), Nancy Y. Ip (2), Xuefeng Wang (4), Jun Chen (5)
(1) Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL, USA
(2) Bioinformatics and Computational Biology Program, University of Minnesota—Rochester, Rochester, MN, USA
(3) College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, Zhejiang, China
(4) Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
(5) Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA

Abstract

Normalization is the first critical step in microbiome sequencing data analysis and is used to account for variable library sizes. Current RNA-Seq-based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of such data, which contain a vast number of zeros due to the physical absence or under-sampling of microbes. Normalization methods that specifically address zero-inflation remain largely undeveloped. Here we propose the geometric mean of pairwise ratios (GMPR), a simple but effective normalization method for zero-inflated sequencing data such as microbiome data. Simulation studies and analyses of real datasets demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.
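
The abstract names the method but does not spell out its steps. The following is a minimal Python sketch of one way to compute size factors as a geometric mean of pairwise ratios. It assumes that the ratio between two samples is the median of count ratios over the taxa that are nonzero in both, and that each sample's size factor is the geometric mean of its ratios against all samples, itself included. The function name `gmpr_size_factors` and the `min_shared` parameter are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import numpy as np


def gmpr_size_factors(counts, min_shared=1):
    """Sketch of size factors from a geometric mean of pairwise ratios.

    counts: array of shape (n_taxa, n_samples) with non-negative counts.
    min_shared: hypothetical threshold; a pair of samples is skipped when
        fewer than this many taxa are nonzero in both samples.
    Returns an array of length n_samples (NaN where no pair qualifies).
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[1]
    nonzero = counts > 0
    size_factors = np.full(n_samples, np.nan)

    for i in range(n_samples):
        log_ratios = []
        for j in range(n_samples):
            # Use only taxa observed in both samples, so zeros never
            # enter a ratio. The pair (i, i) contributes a ratio of 1.
            shared = nonzero[:, i] & nonzero[:, j]
            if shared.sum() < max(min_shared, 1):
                continue
            ratio = np.median(counts[shared, i] / counts[shared, j])
            log_ratios.append(np.log(ratio))
        if log_ratios:
            # Geometric mean, computed on the log scale.
            size_factors[i] = np.exp(np.mean(log_ratios))
    return size_factors


# Toy data: three samples whose nonzero counts scale as 1 : 2 : 4,
# each with one taxon missing (a zero).
toy_counts = np.array([
    [1, 2, 4, 0, 8],
    [2, 4, 8, 6, 0],
    [4, 0, 16, 12, 32],
]).T  # transpose to taxa x samples

toy_size_factors = gmpr_size_factors(toy_counts)
normalized = toy_counts / toy_size_factors  # divide each sample by its factor
```

In this toy example the size factors come out in the ratio 1 : 2 : 4 despite the zeros, because each pairwise ratio is computed only from taxa present in both samples. Keeping the self-comparison in the geometric mean matters in this sketch: leaving it out would distort the ratios between size factors.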
