Using random forests for assistance in the curation of G-protein coupled receptor databases

Springer Science and Business Media LLC - Volume 16 - Pages 1-21 - 2017

Aleksei Shkurin¹,², Alfredo Vellido¹

¹Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona, Spain

²Technology, Communication and Transport Department, Mikkeli University of Applied Sciences, Mikkeli, Finland

Abstract

Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structures, such discrimination must rely on the receptors' primary amino acid sequences.

We are interested not so much in achieving maximum sub-family discrimination accuracy using quantitative methods as in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers.

Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each sub-family and transformation were obtained, as well as an overall shortlist of those cases that were consistently misclassified across transformations. The latter should be referred to experts for further investigation as a data curation task.

The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of sequence misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
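The idea of defining misclassification consistency through the votes of the base trees can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two thresholds (`min_error`, `min_share`), the sequence identifiers and the sub-family labels `A`, `B`, `C` are all hypothetical, and the per-tree votes are toy data rather than the output of a trained random forest. For each sequence, the sketch measures how often the trees vote for a wrong sub-family and how concentrated those wrong votes are on a single sub-family.

```python
from collections import Counter


def vote_consistency(tree_votes, true_label):
    """Summarise the base-tree votes for one sequence.

    tree_votes: list of predicted sub-family labels, one per tree.
    true_label: the sub-family label recorded in the database.
    Returns (error_rate, top_wrong_label, top_wrong_share), where
    top_wrong_share is the fraction of wrong votes that went to the
    most-voted wrong sub-family.
    """
    wrong = Counter(v for v in tree_votes if v != true_label)
    n_wrong = sum(wrong.values())
    if n_wrong == 0:
        return 0.0, None, 0.0
    top_label, top_count = wrong.most_common(1)[0]
    return n_wrong / len(tree_votes), top_label, top_count / n_wrong


def shortlist(votes_by_id, labels_by_id, min_error=0.75, min_share=0.75):
    """Flag sequences that are often misclassified, mostly to one sub-family.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    flagged = []
    for seq_id, votes in votes_by_id.items():
        error, wrong_label, share = vote_consistency(votes, labels_by_id[seq_id])
        if error >= min_error and share >= min_share:
            flagged.append((seq_id, wrong_label, error, share))
    # Most frequently and most consistently misclassified sequences first.
    return sorted(flagged, key=lambda row: (-row[2], -row[3]))


# Toy votes from a hypothetical 10-tree ensemble; all sequences are labelled "A".
votes_by_id = {
    "seq_1": ["B"] * 8 + ["A"] * 2,               # often wrong, always to "B"
    "seq_2": ["A"] * 9 + ["C"],                   # rarely wrong
    "seq_3": ["B"] * 4 + ["C"] * 4 + ["A"] * 2,   # often wrong, but split
}
labels_by_id = {"seq_1": "A", "seq_2": "A", "seq_3": "A"}

flagged = shortlist(votes_by_id, labels_by_id)
print(flagged)
```

With these toy votes only `seq_1` is flagged: `seq_2` is seldom misclassified, and `seq_3` is misclassified often but its wrong votes are split evenly between two sub-families, so it does not meet the "almost always to the same wrong sub-family" criterion.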

