Managing batch effects in microbiome data

Briefings in Bioinformatics - Volume 21, Issue 6 - Pages 1954-1970 - 2020
Yiwen Wang1, Kim‐Anh Lê Cao1
1Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, 3052, Australia

Abstract

Microbial communities have been increasingly studied in recent years to investigate their role in ecological habitats. However, microbiome studies are difficult to reproduce or replicate, as they may suffer from confounding factors that are unavoidable in practice and originate from biological, technical or computational sources. In this review, we define batch effects as unwanted variation introduced by confounding factors that are not related to any factors of interest. Computational and analytical methods are required to remove or account for batch effects. However, inherent characteristics of microbiome data (e.g. sparse, compositional and multivariate) challenge the development and application of batch effect adjustment methods, whether they account for or correct for batch effects. We present commonly encountered sources of batch effects, which we illustrate in several case studies. We discuss the limitations of current methods, which often rely on assumptions that are not met because of the peculiarities of microbiome data. We provide practical guidelines for assessing the efficiency of the methods based on visual and numerical outputs, and a thorough tutorial to reproduce the analyses conducted in this review.
