OpenMendel: a cooperative programming project for statistical genetics

Hua Zhou1, Janet S. Sinsheimer2, Douglas M. Bates3, Benjamin B. Chu4, Christopher A. German1, Sarah S. Ji1, Kevin L. Keys5, Ju-Hyun Kim1, Seyoon Ko6, Gordon Mosher7, Jeanette C. Papp2, Eric M. Sobel2, Jing Zhai8, Jin Zhou8, Kenneth Lange4
1Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, USA
2Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, USA
3Department of Statistics, University of Wisconsin, Madison, USA
4Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, USA
5Department of Medicine, University of California, San Francisco, USA
6Department of Statistics, Seoul National University, Seoul, South Korea
7Departments of Statistics and Computer Science, University of California, Riverside, USA
8Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, USA

Abstract

Keywords

