A robust knockoff filter for sparse regression analysis of microbiome compositional data
Computational Statistics - Pages 1-18 - 2022
Abstract
Microbiome data analysis often relies on identifying a subset of potential biomarkers associated with a clinical outcome of interest. Robust ZeroSum regression, an elastic-net penalized compositional regression built on the least trimmed squares estimator, is a variable selection procedure that copes with the high dimensionality of these data and their compositional nature while remaining robust to outliers. The need to discover “true” effects and to improve the quality and reproducibility of clinical research has motivated us to propose a two-step robust compositional knockoff filter procedure. From the many measured features, it selects the biomarkers that have a nonzero effect on the response while controlling the expected fraction of false positives. We demonstrate the effectiveness of our proposal in an extensive simulation study and illustrate its usefulness in an application to intestinal microbiome analysis.
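As a rough illustration of how a knockoff filter controls the expected fraction of false positives, the sketch below implements the generic knockoff+ selection rule of Barber and Candès (2015), not the robust compositional procedure proposed in this paper. It assumes that a vector of feature statistics W has already been computed, where a large positive W_j suggests that feature j is more important than its knockoff copy. The function names, the example values of W, and the target level q are hypothetical.

```python
import numpy as np


def knockoff_plus_threshold(W, q):
    """Return the knockoff+ threshold for statistics W at target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns np.inf when no candidate satisfies the bound.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes, in ascending order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Large negative statistics serve as an estimate of the false positives.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf


def knockoff_select(W, q):
    """Return the indices of the features selected at target FDR level q."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)


# Hypothetical statistics for ten features (illustrative values only).
W_example = [5.0, 4.0, 3.0, 2.5, -1.0, 2.0, 0.5, -0.3, 1.5, 6.0]
selected = knockoff_select(W_example, q=0.2)
print(selected)
```

In this toy example the threshold at q = 0.2 is 1.5, so every feature with W_j of at least 1.5 is selected. With a stricter level such as q = 0.1, no threshold meets the bound and nothing is selected, which reflects the conservative "+1" in the numerator of the knockoff+ rule.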