Genomic Prediction Models for Count Data

Osval A. Montesinos-López1, Abelardo Montesinos-López2, Paulino Pérez-Rodríguez3, Kent Eskridge4, Xinyao He5, Philomin Juliana6, Pawan Singh5, José Crossa7
1Biometrics and Statistics Unit of the International Maize and Wheat Improvement Center (CIMMYT), México, México
2Departamento de Estadística, Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Guanajuato, México
3Colegio de Postgraduados, Montecillos, México
4Department of Statistics at the University of Nebraska, Lincoln, USA
5Global Wheat Breeding Program of CIMMYT, México, México
6Plant Breeding & Genetics, Cornell University, Ithaca, USA
7Biometrics and Statistics Unit of CIMMYT, Texcoco, México

Abstract

Whole genome prediction models are useful tools for breeders when selecting candidate individuals early in life for rapid genetic gains. However, most prediction models developed so far assume that the response variable is continuous and that its empirical distribution can be approximated by a Gaussian model. A few models have been developed for ordered categorical phenotypes, but there is a lack of genomic prediction models for count data. There are well-established regression models for count data that cannot be used for genomic-enabled prediction because they were developed for a large sample size (n) and a small number of parameters (p); however, the rule in genomic-enabled prediction is that p is much larger than the sample size n. Here we propose a Bayesian mixed negative binomial (BMNB) regression model for counts, and we present the conditional distributions necessary to efficiently implement a Gibbs sampler. The proposed Bayesian inference can be implemented routinely. We evaluated the proposed BMNB model together with a Poisson model, a Normal model with untransformed response, and a Normal model with transformed response using a logarithm, and applied them to two real wheat datasets from the International Maize and Wheat Improvement Center. Based on the criteria used for assessing genomic prediction accuracy, results indicated that the BMNB model is a viable alternative for analyzing count data.
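One common reason to prefer a negative binomial model over a Poisson model for counts is overdispersion: a Poisson distribution forces the variance to equal the mean, while a negative binomial distribution lets the variance exceed it. The abstract does not give the BMNB model's parameterization, so the minimal sketch below assumes a standard mean/dispersion form (mean `mu`, dispersion `r`, variance `mu + mu**2 / r`); the function names and the values `mu = 5`, `r = 2` are illustrative only and are not taken from the paper.

```python
import math


def nb_pmf(y, mu, r):
    """Negative binomial pmf with mean mu and dispersion r (assumed parameterization)."""
    p = r / (r + mu)  # success probability implied by mu and r
    log_pmf = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def poisson_pmf(y, mu):
    """Poisson pmf with mean mu, computed on the log scale for stability."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def mean_and_variance(pmf, upper=500):
    """Mean and variance of a count distribution by summing its pmf up to `upper`."""
    mean = sum(y * pmf(y) for y in range(upper))
    second_moment = sum(y * y * pmf(y) for y in range(upper))
    return mean, second_moment - mean ** 2


mu, r = 5.0, 2.0  # illustrative values, not from the paper
nb_mean, nb_var = mean_and_variance(lambda y: nb_pmf(y, mu, r))
pois_mean, pois_var = mean_and_variance(lambda y: poisson_pmf(y, mu))

print(f"negative binomial: mean={nb_mean:.3f}, variance={nb_var:.3f}")
print(f"Poisson:           mean={pois_mean:.3f}, variance={pois_var:.3f}")
```

Both distributions share the same mean here, but the negative binomial variance is `mu + mu**2 / r`, so smaller values of `r` give heavier overdispersion and large `r` approaches the Poisson case.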
