ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples
BMC Bioinformatics - 2011
Abstract
Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, identifying the actual disease genes among those candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates by exploiting the wealth of information about genes that is available in various databases. We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. ProDiGe thus implements a new machine learning paradigm for gene prioritization, which could help identify new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.
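To make the idea of learning from positive and unlabeled examples more concrete, the sketch below ranks unlabeled items with a generic bagging scheme: in each round a random subset of the unlabeled set is treated as provisional negatives, a classifier is trained against the known positives, and the remaining unlabeled items are scored. This is an illustration under stated assumptions, not ProDiGe's actual implementation. The function name `pu_bagging_scores`, the toy two-dimensional feature vectors, and the nearest-centroid scorer (swapped in for a real kernel classifier to keep the example dependency-free) are all hypothetical, and the multitask sharing across diseases is not modelled.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pu_bagging_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag scores for unlabeled items (higher = more positive-like).

    Hypothetical helper: each round draws a random bag of unlabeled items,
    treats it as the negative class, and scores the items left out of the bag.
    """
    if not positives or len(unlabeled) < 2:
        raise ValueError("need at least one positive and two unlabeled items")
    rng = random.Random(seed)
    # Bag size: as many provisional negatives as positives, but always
    # leave at least one unlabeled item out of the bag to be scored.
    k = min(len(positives), len(unlabeled) - 1)
    pos_c = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        bag = set(rng.sample(range(len(unlabeled)), k))
        neg_c = centroid([unlabeled[i] for i in bag])
        # Linear nearest-centroid rule: project onto the direction from the
        # negative centroid to the positive centroid, centred at the midpoint.
        w = [p - q for p, q in zip(pos_c, neg_c)]
        mid = [(p + q) / 2 for p, q in zip(pos_c, neg_c)]
        for i, x in enumerate(unlabeled):
            if i in bag:
                continue  # only score items not used as negatives this round
            totals[i] += sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mid))
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy data (made up): three known positives and six unlabeled items,
# one of which (index 2) resembles the positives.
positives = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
unlabeled = [[0.0, 0.1], [0.1, -0.1], [0.9, 1.1],
             [-0.2, 0.0], [0.1, 0.2], [0.0, -0.2]]
scores = pu_bagging_scores(positives, unlabeled)
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Averaging over many random bags limits the damage done when a hidden positive happens to be drawn as a provisional negative, which is the main practical difficulty when no reliable negative examples exist.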