Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics applications

Springer Science and Business Media LLC - Volume 4 - Pages 1-10 - 2009
Piero Fariselli¹, Castrense Savojardo¹, Pier Luigi Martelli¹, Rita Casadio¹
¹ Biocomputing Group, University of Bologna, Bologna, Italy

Abstract

Discriminative models are designed to naturally address classification tasks. However, some applications require the inclusion of grammar rules, and in these cases generative models, such as Hidden Markov Models (HMMs) and Stochastic Grammars, are routinely applied. We introduce Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) as an extension of Hidden Conditional Random Fields (HCRFs). GRHCRFs, while preserving the discriminative character of HCRFs, can assign labels in agreement with the production rules of a defined grammar. The main novelty of GRHCRFs is that prior knowledge of the problem can be included in HCRFs by means of a defined grammar. Our current implementation supports regular grammar rules. We test our GRHCRF on a typical biosequence labeling problem: the prediction of the topology of prokaryotic outer-membrane proteins. On this task the GRHCRF performs better than CRF models of the same complexity, indicating that GRHCRFs can be useful tools for biosequence analysis applications. GRHCRF software is available under the GPLv3 licence at http://www.biocomp.unibo.it/~savojard/biocrf-0.9.tar.gz
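
The idea of assigning labels "in agreement with the production rules of a defined grammar" can be illustrated with a small decoding example. What follows is a minimal sketch, not the authors' implementation and not the biocrf software: the four-state regular grammar, the state names, the label alphabet (`i`, `t`, `o`) and the per-position scores are all assumptions chosen for illustration, and the sketch covers only decoding (a Viterbi search restricted to grammar-allowed transitions), not the training of a hidden-variable CRF.

```python
NEG_INF = float("-inf")

# Hypothetical toy regular grammar (an assumption, not the paper's model):
# inner loop -> strand going out -> outer loop -> strand going in -> inner loop.
STATES = ("inner", "strand_up", "outer", "strand_down")
# Several hidden states may share one observable label.
LABEL_OF = {"inner": "i", "strand_up": "t", "outer": "o", "strand_down": "t"}
# Allowed transitions encode the production rules of the toy grammar.
ALLOWED = {
    "inner": ("inner", "strand_up"),
    "strand_up": ("strand_up", "outer"),
    "outer": ("outer", "strand_down"),
    "strand_down": ("strand_down", "inner"),
}
START_STATES = ("inner",)


def constrained_viterbi(label_scores):
    """Return the best-scoring state path that obeys ALLOWED.

    label_scores is a list with one dict per sequence position,
    mapping each label to a log-domain score.
    """
    if not label_scores:
        raise ValueError("label_scores must contain at least one position")
    n = len(label_scores)
    best = [dict.fromkeys(STATES, NEG_INF) for _ in range(n)]
    back = [dict.fromkeys(STATES) for _ in range(n)]
    for s in START_STATES:
        best[0][s] = label_scores[0][LABEL_OF[s]]
    for t in range(1, n):
        for prev in STATES:
            if best[t - 1][prev] == NEG_INF:
                continue  # prev is unreachable under the grammar
            for cur in ALLOWED[prev]:  # forbidden transitions are never tried
                score = best[t - 1][prev] + label_scores[t][LABEL_OF[cur]]
                if score > best[t][cur]:
                    best[t][cur] = score
                    back[t][cur] = prev
    last = max(STATES, key=lambda s: best[-1][s])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hand-picked scores (illustrative only).
label_scores = [
    {"i": 2.0, "t": 0.0, "o": 0.0},
    {"i": 0.0, "t": 1.0, "o": 2.0},
    {"i": 0.0, "t": 1.0, "o": 2.0},
    {"i": 0.0, "t": 2.0, "o": 0.0},
    {"i": 2.0, "t": 0.0, "o": 0.0},
]

# Position-wise argmax ignores the grammar and jumps from "i" straight to "o".
greedy = [max(scores, key=scores.get) for scores in label_scores]
states = constrained_viterbi(label_scores)
labels = [LABEL_OF[s] for s in states]

print("".join(greedy))  # iooti
print("".join(labels))  # itoti
```

In this toy run the unconstrained labeling places an outer loop directly after an inner loop, which the grammar forbids, while the constrained decoder inserts the strand that must separate them. Keeping two hidden states for the single label `t` is what lets a regular grammar distinguish the two strand directions.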
