Machine Learning for Detecting Gene-Gene Interactions

Brett A. McKinney1,2, David M. Reif2,1, Marylyn D. Ritchie1, Jason H. Moore3,4,5,6,2
1Department of Molecular Physiology and Biophysics, Center for Human Genetics Research, Vanderbilt University Medical School, Nashville, USA
2Computational Genetics Laboratory, Department of Genetics, Dartmouth Medical School, Lebanon, USA
3Department of Community and Family Medicine, Dartmouth Medical School, Lebanon, USA
4Department of Biological Sciences, Dartmouth College, Hanover, USA
5Department of Computer Science, University of New Hampshire, Durham, USA
6Department of Computer Science, University of Vermont, Burlington, USA

Abstract

Complex interactions among genes and environmental factors are known to play a role in the aetiology of common human diseases. A growing body of evidence suggests that complex interactions are 'the norm' and that, rather than amounting to a small perturbation of classical Mendelian genetics, interactions may be the predominant effect. Traditional statistical methods are not well suited to detecting such interactions, especially when the data are high dimensional (many attributes or independent variables) or when interactions occur among more than two polymorphisms. In this review, we discuss machine-learning models and algorithms for identifying and characterising susceptibility genes in common, complex, multifactorial human diseases. We focus on the following machine-learning methods that have been used to detect gene-gene interactions: neural networks, cellular automata, random forests, and multifactor dimensionality reduction. We conclude with some ideas about how these and other methods can be integrated into a comprehensive and flexible framework for data mining and knowledge discovery in human genetics.
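To make the point about interactions without main effects concrete, the following is a minimal sketch and not the authors' implementation. It builds a hypothetical, deterministic toy data set in which disease status depends only on an XOR-like combination of two SNPs, so each SNP taken alone shows the same case rate for every genotype. It then applies a simplified pooling step in the spirit of multifactor dimensionality reduction: each multilocus genotype cell is labelled high risk when its case:control ratio exceeds the overall ratio. All function names, the genotype coding (0, 1, 2), and the penetrance rule are assumptions made for illustration; cross-validation and permutation testing, which a real analysis would need, are omitted.

```python
from collections import Counter
from itertools import combinations, product


def make_toy_data():
    """Hypothetical data: three SNPs coded 0/1/2 in 1:2:1 proportions.

    Case status (1) depends only on SNP 0 and SNP 1: a subject is a case
    when exactly one of the two loci is heterozygous. SNP 2 is noise.
    """
    weights = {0: 1, 1: 2, 2: 1}
    rows = []
    for a, b, c in product(range(3), repeat=3):
        status = int((a == 1) != (b == 1))
        count = weights[a] * weights[b] * weights[c]
        rows.extend([((a, b, c), status)] * count)
    return rows


def single_snp_case_rates(rows, snp):
    """Fraction of cases for each genotype of one SNP (marginal effect)."""
    cases, totals = Counter(), Counter()
    for genotypes, status in rows:
        totals[genotypes[snp]] += 1
        cases[genotypes[snp]] += status
    return {g: cases[g] / totals[g] for g in sorted(totals)}


def mdr_accuracy(rows, snps):
    """Training accuracy of a simplified MDR-style high/low risk model."""
    cases, controls = Counter(), Counter()
    for genotypes, status in rows:
        cell = tuple(genotypes[s] for s in snps)
        if status:
            cases[cell] += 1
        else:
            controls[cell] += 1
    # Overall case:control ratio is the high-risk threshold.
    threshold = sum(cases.values()) / sum(controls.values())
    correct = 0
    for genotypes, status in rows:
        cell = tuple(genotypes[s] for s in snps)
        high_risk = cases[cell] > threshold * controls[cell]
        correct += int(high_risk == bool(status))
    return correct / len(rows)


if __name__ == "__main__":
    data = make_toy_data()
    for snp in range(3):
        print("SNP", snp, "case rates:", single_snp_case_rates(data, snp))
    for pair in combinations(range(3), 2):
        print("SNP pair", pair, "accuracy:", mdr_accuracy(data, pair))
```

In this toy setting every single-SNP case rate is 0.5, so a one-locus-at-a-time test has nothing to find, while the pooled two-locus model for SNPs 0 and 1 classifies every subject correctly and the other pairs do no better than chance.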

Keywords

#machine learning #gene interactions #common diseases #human genetics #data analysis
