Learning from biomedical linked data to suggest valid pharmacogenes

Journal of Biomedical Semantics - Tập 8 - Trang 1-12 - 2017
Yassine Marzougui1,2, Ndeye Coumba Ndiaye3, Sébastien Da Silva1, Patrice Ringot1, Adrien Coulet1, Kevin Dalleau1
1LORIA (CNRS, Inria Nancy-Grand Est, University of Lorraine), Campus Scientifique, Nancy, France
2Ecole nationale supérieure des mines de Nancy, Campus Artem, Nancy, France
3UMR U1122 IGE-PCV (INSERM, University of Lorraine), Nancy, France

Tóm tắt

A standard task in pharmacogenomics research is identifying genes that may be involved in drug response variability, i.e., pharmacogenes. Because genomic experiments tended to generate many false positives, computational approaches based on the use of background knowledge have been proposed. Until now, only molecular networks or the biomedical literature were used, whereas many other resources are available. We propose here to consume a diverse and larger set of resources using linked data related either to genes, drugs or diseases. One of the advantages of linked data is that they are built on a standard framework that facilitates the joint use of various sources, and thus facilitates considering features of various origins. We propose a selection and linkage of data sources relevant to pharmacogenomics, including for example DisGeNET and Clinvar. We use machine learning to identify and prioritize pharmacogenes that are the most probably valid, considering the selected linked data. This identification relies on the classification of gene–drug pairs as either pharmacogenomically associated or not and was experimented with two machine learning methods –random forest and graph kernel–, which results are compared in this article. We assembled a set of linked data relative to pharmacogenomics, of 2,610,793 triples, coming from six distinct resources. Learning from these data, random forest enables identifying valid pharmacogenes with a F-measure of 0.73, on a 10 folds cross-validation, whereas graph kernel achieves a F-measure of 0.81. A list of top candidates proposed by both approaches is provided and their obtention is discussed.

Tài liệu tham khảo

Xie HG, Frueh FW. Pharmacogenomics steps toward personalized medicine. Personalized Med. 2005; 2(4):325–7. Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics. 2010; 11(10):1467–89. Whirl-Carrillo M, et al. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther. 2012; 92(4):414–17. Ioannidis JPA. To replicate or not to replicate: The case of pharmacogenetic studies. Circ Cardiovasc Genet. 2013; 6:413–8. Zineh I, Pacanowski M, Woodcock J. Pharmacogenetics and coumarin dosing? Recalibrating expectations. N Engl J Med. 2013; 369:2273–5. Bizer C, Heath T, Berners-Lee T. Linked data - the story so far. Int J Semantic Web Inf Syst. 2009; 5(3):1–22. Antezana E, Kuiper M, Mironov V. Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform. 2009; 10(4):392–407. Callahan A, Cruz-Toledo J, Ansell P, Dumontier M. Bio2rdf release 2: Improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the 10th European Semantic Web Conference, ESWC 2013. Lecture Notes in Computer Science 7882. Springer: 2013. p. 200–12. Jupp S, Malone J, Bolleman J, Brandizi M, Davies M, Garcia LJ, Gaulton A, Gehant S, Laibe C, Redaschi N, Wimalaratne SM, Martin MJ, Novère NL, Parkinson HE, Birney E, Jenkinson AM. The EBI RDF platform: linked open data for the life sciences. Bioinformatics. 2014; 30(9):1338–9. Kinjo AR, et al. Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res. 2012; 40(Database issue):D453–60. Samwald M, Jentzsch A, Bouton C, Kallesøe CS, Willighagen E, Hajagos J, Marshall MS, Prud’hommeaux E, Hassenzadeh O, Pichler E, Stephens S. Linked open drug data for pharmaceutical research and development. J Cheminformatics. 2011; 3:19. Accessed 07 June 2016. Good BM, Wilkinson MD. The life sciences semantic web is full of creeps!Brief Bioinform. 2006; 7(3):275–86. Marshall MS, Boyce R, Deus HF, Zhao J, Willighagen EL, Samwald M, Pichler E, Hajagos J, Prud’hommeaux E, Stephens S. Emerging practices for mapping and linking life sciences data using RDF — A case series. Web Semant Sci Serv Agents World Wide Web. 2012; 14:2–13. Accessed 07 June 2016. PharmGKB. Levels of evidence of annotations. https://www.pharmgkb.org/page/clinAnnLevels. Accessed 1 June 2016. Bio, 2RDF project. PharmGKB endpoint. http://cu.pharmgkb.bio2rdf.org/sparql. Accessed 1 June 2016. Wishart DS, Knox C, Guo A, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008; 36(Database-Issue):901–6. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. Clinvar: public archive of relationships among sequence variation and human phenotype.Nucleic Acids Res. 2014; 42(Database-Issue):980–5. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010; 6(1):343. Kuhn M, Letunic I, Jensen LJ, Bork P. The sider database of drugs and side effects. Nucleic Acids Res. 2016; 44(D1):D1075–9. Kuhn M, Letunic I, Jensen LJ, Bork P. The SIDER database of drugs and side effects. Nucleic Acids Res. 2016; 44(Database-Issue):1075–9. doi:10.1093/nar/gkv1075. Wagner AH, Coffman AC, Ainscough BJ, Spies NC, Skidmore ZL, Campbell KM, Krysiak K, Pan D, McMichael JF, Eldred JM, Walker JR, Wilson RK, Mardis ER, Griffith M, Griffith OL. Dgidb 2.0: mining clinically relevant drug-gene interactions. Nucleic Acids Res. 2016; 44(Database-Issue):1036–44. Piñero J, Queralt-Rosinach N, Bravo À, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. Disgenet: a discovery platform for the dynamical exploration of human diseases and their genes. Database. 2015; 2015:bav028. Samwald M, Coulet A, Huerga I, Powers RL, Luciano JS, Freimuth RR, Whipple F, Pichler E, Prud’hommeaux E, Dumontier M, Marshall MS. Semantically enabling pharmacogenomic data for the realization of personalized medicine. Pharmacogenomics. 2012; 13(2):201–12. Hoehndorf R, Dumontier M, Gkoutos GV. Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012; 28(16):2169–75. Coulet A, Garten Y, Dumontier M, Altman RB, Musen MA, Shah NH. Integration and publication of heterogeneous text-mined relationships on the semantic web. J Biomed Semant. 2011; 2(S-2):10. Bicer V, Tran T, Gossen A. Relational kernel machines for learning from graph-structured RDF data. In: Proceedings of the 8th Extended Semantic Web Conference, Part I, ESWC 2011. Lecture Notes in Computer Science 6643. Springer: 2011. p. 47–62. Huang Y, Tresp V, Bundschus M, Rettinger A, Kriegel H. Multivariate prediction for learning on the semantic web. In: Proceedings of the 20th International Conference on Inductive Logic Programming, ILP 2010. Lecture Notes in Computer Science 7489. Springer: 2010. p. 92–104. Thor A, Anderson P, Raschid L, Navlakha S, Saha B, Khuller S, Zhang XN. Link Prediction for Annotation Graphs Using Graph Summarization. In: Proceedings of the 10th International Conference on The Semantic Web - Volume Part I ISWC’11. Springer: 2011. p. 714–29. Lösch U, Bloehdorn S, Rettinger A. Graph kernels for RDF data. In: Proceedings of the 9th Extended Semantic Web Conference, ESWC 2012. Lecture Notes in Computer Science 7295. Springer: 2012. p. 134–48. de Vries GKD. A fast approximation of the weisfeiler-lehman graph kernel for RDF data. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part I, ECML PKDD 2013. Lecture Notes in Computer Science 8188. Springer: 2013. p. 606–21. Brenninkmeijer CYA, Dunlop I, Goble CA, Gray AJG, Pettifer S, Stevens R. Computing identity co-reference across drug discovery datasets. In: Proceedings of the 6th International Workshop on Semantic Web Applications and Tools for Life Sciences, SWAT4LS 2013. CEUR Workshop Proceedings 114. Volz J, Bizer C, Gaedke M, Kobilarov G. Discovering and maintaining links on the web of data. In: Proceedings of the 8th International Semantic Web Conference, ISWC 2009. Lecture Notes in Computer Science 5823. Springer: 2009. p. 650–65. Heim P, Lohmann S, Stegemann T. Interactive relationship discovery via the semantic web. In: Proceedings of the 7th Extended Semantic Web Conference, Part I, ESWC 2010. Lecture Notes in Computer Science 6088. Springer: 2011. p. 303–17. de Vries GKD, de Rooij S. Substructure counting graph kernels for machine learning from RDF data. J Web Sem. 2015; 35:71–84. Kondor R, Lafferty JD. Diffusion kernels on graphs and other discrete input spaces. In: Proceedings of the 19th International Conference on Machine Learning, ICML 2002. Morgan Kaufmann: 2002. p. 315–22. Data, 2Semantics. Mustard – machine learning using svms to analyse rdf data, under mit licence. https://github.com/Data2Semantics/mustard. Accessed 01 June 2016. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: PSB: 2012. p. 410–21. http://psb.stanford.edu/psb-online/proceedings/psb2012/percha.pdf. Dalleau K, Ndiaye NC, Coulet A. Suggesting valid pharmacogenes by mining linked data. In: Proceedings of the 8th Semantic Web Applications and Tools for Life Sciences International Conference, SWAT4LS 2015. CEUR Workshop Proceedings 1546: 2015. p. 49–58. Hansen N, Brunak S, Altman R. Generating genome-scale candidate gene lists for pharmacogenomics. Clin Pharmacol Ther. 2009; 86(2):183–9. Garten Y, Tatonetti NP, Altman RB. Improving the prediction of pharmacogenes using text-derived gene-drug relationships. In: PSB: 2010. p. 305–14. http://psb.stanford.edu/psb-online/proceedings/psb10/garten.pdf. Funk CS, Hunter LE, Cohen KB. Combining heterogenous data for prediction of disease related and pharmacogenes. In: Pacific Symposium on Biocomputing: 2014. p. 328–39. Dumontier M, Villanueva-Rosales N. Towards pharmacogenomics knowledge discovery with the semantic web. Brief Bioinform. 2009; 10(2):153–63. Coulet A, Smail-Tabbone M, Napoli A, Devignes MD. Ontology-based knowledge discovery in pharmacogenomics. Adv Exp Med Biol. 2011; 696:357–66. DisGeNET endpoint. http://rdf.disgenet.org/sparql/. Accessed 01 June 2016. Bio, 2RDF project. SIDER endpoint. http://cu.sider.bio2rdf.org/sparql. Accessed 01 June 2016. Bio, 2RDF project. DrugBank endpoint. http://cu.drugbank.bio2rdf.org/sparql. Accessed 01 June 2016. Imanishi T, Nakaoka H. Hyperlink Management System and ID Converter System: enabling maintenance-free hyperlinks among major biological databases. Nucleic Acids Res. 2009; 37(Web Server issue):17–22. Accessed 07 June 2016. Dalleau K. biojp2rdf – a tool to rdfize biodb.jp data, under mit licence. https://github.com/KevinDalleau/biojp2rdf. Accessed 01 June 2016. Zeng K, Bodenreider O, Kilbourne J, Nelson S. Rxnav: Towards an integrated view on drug information. In: Proceedings of the 12th World Congress on Health (Medical) Informatics, MEDINFO 2007. IOS Press 129: 2007. p. 386. Breiman L. Random Forests. Mach Learn. 2001; 45(1):5–32. Accessed 2016-06-06. Leistner C, Saffari A, Bischof H. MIForests: Multiple-instance learning with randomized trees. In: Proceedings of the 11th European Conference on Computer Vision, Part IV, ECCV 2010. Lecture Notes in Computer Science 6316. Springer: 2010. p. 29–42. Amores J. Multiple instance classification: Review, taxonomy and comparative study. Artif Intell. 2013; 201:81–105. 20-top candidate pharmacogenes, highlighted by our graph Random Forest classifier. https://members.loria.fr/ACoulet/files/pgxlod/rf_20.csv. Accessed 11 Apr 2017. Cawley GC, Talbot NL. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res. 2010; 11:2079–107. 20-top candidate pharmacogenes, highlighted by our graph kernel / svm classifier. https://members.loria.fr/ACoulet/files/pgxlod/gk_20.csv. Accessed 01 June 2016. Mayo C, Bertran-Alamillo J, Molina-Vila MA, Giménez-Capitán A, Costa C, Rosell R. Pharmacogenetics of EGFR in lung cancer: perspectives and clinical applications. Pharmacogenomics. 2012; 13(7):789–802. de Mello RA, Madureira P, Carvalho LS, Araújo A, O’Brien M, Popat S. EGFR and KRAS mutations, and ALK fusions: current developments and personalized therapies for patients with advanced non-small-cell lung cancer. Pharmacogenomics. 2013; 14(14):1765–77. Okabe T, Okamoto I, Tsukioka S, Uchida J, Iwasa T, Yoshida T, Hatashita E, Yamada Y, Satoh T, Tamura K, Fukuoka M, Nakagawa K. Synergistic antitumor effect of S-1 and the epidermal growth factor receptor inhibitor gefitinib in non-small cell lung cancer cell lines: role of gefitinib-induced down-regulation of thymidylate synthase. Mol Cancer Ther. 2008; 7(3):599–606. Kim HK, Choi IJ, Kim CG, Kim HS, Oshima A, Yamada Y, Arao T, Nishio K, Michalowski A, Green JE. Three-gene predictor of clinical outcome for gastric cancer patients treated with chemotherapy. Pharmacogenomics J. 2012; 12(2):119–27. Accessed 07 June 2016. Sato Y, Yamamoto N, Kunitoh H, Ohe Y, Minami H, Laird NM, Katori N, Saito Y, Ohnami S, Sakamoto H, Sawada JI, Saijo N, Yoshida T, Tamura T. Genome-wide association study on overall survival of advanced non-small cell lung cancer patients treated with carboplatin and paclitaxel. J Thorac Oncol Off Publ Int Assoc Study Lung Cancer. 2011; 6(1):132–8. Saigo H, Nowozin S, Kadowaki T, Kudo T, Tsuda K. gboost: a mathematical programming approach to graph classification and regression. Mach Learn. 2009; 75(1):69–89. Yan X, Han J. gspan: Graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM 2002 IEEE Computer Society: 2002. p. 721–4. Marzougui Y. pgx-lod-mining – adapting mustard to pgx linked data. https://github.com/yassmarzou/pgx-lod-mining. Accessed 28 Oct 2016.