Genomic Prediction in Animals and Plants: Simulation of Data, Validation, Reporting, and Benchmarking

Genetics - Volume 193, Issue 2 - Pages 347-365 - 2013
Hans D. Daetwyler (1), M.P.L. Calus (2), Ricardo Pong-Wong (3), Gustavo de los Campos (4), John M. Hickey (5, 6)
(1) Biosciences Research Division, Department of Primary Industries, Bundoora 3083, Victoria, Australia
(2) Animal Breeding and Genomics Centre, Wageningen University Research Livestock Research, 8200 AB Lelystad, The Netherlands
(3) The Roslin Institute, Royal Dick School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG Scotland, United Kingdom
(4) Department of Biostatistics, School of Public Health, University of Alabama, Birmingham, Alabama 35294
(5) Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), 06600 Mexico, D.F., Mexico
(6) School of Environmental and Rural Science, University of New England, Armidale 2351, New South Wales, Australia

Abstract

The genomic prediction of phenotypes and breeding values in animals and plants has developed rapidly into its own research field. Results of genomic prediction studies are often difficult to compare because data simulation varies, real or simulated data are not fully described, and not all relevant results are reported. In addition, some new methods have been compared only in limited genetic architectures, leading to potentially misleading conclusions. In this article we review simulation procedures, discuss validation and reporting of results, and apply benchmark procedures for a variety of genomic prediction methods in simulated and real example data. Plant and animal breeding programs are being transformed by the use of genomic data, which are becoming widely available and cost-effective to predict genetic merit. A large number of genomic prediction studies have been published using both simulated and real data. The relative novelty of this area of research has made the development of scientific conventions difficult with regard to description of the real data, simulation of genomes, validation and reporting of results, and forward-in-time methods. In this review article we discuss the generation of simulated genotype and phenotype data, using approaches such as the coalescent and forward-in-time simulation. We outline ways to validate simulated data and genomic prediction results, including cross-validation. The accuracy and bias of genomic prediction are highlighted as performance indicators that should be reported. We suggest that a measure of relatedness between the reference and validation individuals be reported, as its impact on the accuracy of genomic prediction is substantial. A large number of methods were compared in example simulated and real (pine and wheat) data sets, all of which are publicly available.
In our limited simulations, most methods performed similarly in traits with a large number of quantitative trait loci (QTL), whereas in traits with fewer QTL variable selection did have some advantages. In the real data sets examined here all methods had very similar accuracies. We conclude that no single method can serve as a benchmark for genomic prediction. We recommend comparing accuracy and bias of new methods to results from genomic best linear prediction and a variable selection approach (e.g., BayesB), because, together, these methods are appropriate for a range of genetic architectures. An accompanying article in this issue provides a comprehensive review of genomic prediction methods and discusses a selection of topics related to application of genomic prediction in plants and animals.
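
The validation ideas summarized above (cross-validation, with accuracy and bias as the reported indicators) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code or simulation design: genotypes are drawn as independent SNPs (no linkage disequilibrium, unlike coalescent or forward-in-time simulation), the heritability `h2` is treated as known, and ridge regression on markers stands in for a genomic BLUP-type method. The function names `simulate`, `ridge_predict`, and `cross_validate` are hypothetical. Accuracy is taken as the correlation between predictions and simulated true breeding values, and bias as the slope of the regression of true on predicted values (a slope near 1 suggests little bias).

```python
import numpy as np

def simulate(n=300, m=200, n_qtl=20, h2=0.5, seed=0):
    """Simulate 0/1/2 genotypes, true breeding values (TBV), and phenotypes.

    Assumption: SNPs are independent draws, so there is no LD structure.
    """
    rng = np.random.default_rng(seed)
    freq = rng.uniform(0.1, 0.9, size=m)            # allele frequencies
    X = rng.binomial(2, freq, size=(n, m)).astype(float)
    effects = np.zeros(m)
    qtl = rng.choice(m, size=n_qtl, replace=False)  # loci with nonzero effect
    effects[qtl] = rng.normal(size=n_qtl)
    tbv = X @ effects
    tbv = (tbv - tbv.mean()) / tbv.std()            # scale genetic variance to 1
    noise_sd = np.sqrt((1.0 - h2) / h2)             # gives the requested h2
    y = tbv + rng.normal(scale=noise_sd, size=n)
    return X, tbv, y

def ridge_predict(X_ref, y_ref, X_val, lam):
    """Ridge regression on centered markers, solved in its dual (n x n) form."""
    mu_x = X_ref.mean(axis=0)
    Z_ref, Z_val = X_ref - mu_x, X_val - mu_x
    mu_y = y_ref.mean()
    K = Z_ref @ Z_ref.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_ref)), y_ref - mu_y)
    return mu_y + Z_val @ (Z_ref.T @ alpha)

def cross_validate(X, tbv, y, h2=0.5, k=5, seed=1):
    """Return one (accuracy, bias slope) row per validation fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    rows = []
    for val in folds:
        ref = np.setdiff1d(np.arange(len(y)), val)
        # Assumed shrinkage: residual variance over per-marker effect variance.
        lam = (1.0 - h2) / h2 * X[ref].var(axis=0).sum()
        pred = ridge_predict(X[ref], y[ref], X[val], lam)
        accuracy = np.corrcoef(pred, tbv[val])[0, 1]
        slope = np.cov(tbv[val], pred)[0, 1] / np.var(pred, ddof=1)
        rows.append((accuracy, slope))
    return np.array(rows)

if __name__ == "__main__":
    X, tbv, y = simulate()
    res = cross_validate(X, tbv, y)
    print("mean accuracy:", res[:, 0].mean())
    print("mean bias slope:", res[:, 1].mean())
```

With real data the true breeding values are unknown, so the same two quantities would be computed against phenotypes or other proxies, and random folds like these do not control the relatedness between reference and validation individuals that the abstract asks authors to report.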
