Profiles and Majority Voting-Based Ensemble Method for Protein Secondary Structure Prediction

Hafida Bouziane¹, Belhadri Messabih¹, Abdallah Chouarfia¹
¹ Department of Computer Science, USTO-MB University, BP 1505 El Mnaouer, Oran, Algeria.

Abstract

Machine learning techniques have been widely applied to the problem of predicting protein secondary structure from the amino acid sequence, and they have achieved substantial success in this research area. Many methods have been used, including k-Nearest Neighbors (k-NNs), Hidden Markov Models (HMMs), Artificial Neural Networks (ANNs) and, more recently, Support Vector Machines (SVMs). Today, the main goal remains to improve the prediction quality of the secondary structure elements. Prediction accuracy has improved continuously over the years, especially through hybrid or ensemble methods and through the incorporation of evolutionary information in the form of profiles extracted from alignments of multiple homologous sequences. In this paper, we investigate how best to combine k-NNs, ANNs and Multi-class SVMs (M-SVMs) to improve secondary structure prediction of globular proteins. We apply an ensemble method that combines the outputs of two feed-forward ANNs, one k-NN and three M-SVM classifiers. Ensemble members are combined using two variants of the majority voting rule, and a heuristic-based filter is then applied to refine the prediction. To measure how much the ensemble improves on the individual classifiers that compose it, we evaluated the proposed system on the two widely used benchmark datasets RS126 and CB513 using cross-validation tests, with PSI-BLAST position-specific scoring matrix (PSSM) profiles as inputs. The experimental results show that the proposed system yields significant performance gains over the best individual classifier.
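The combination step can be illustrated with a minimal sketch. The abstract does not specify which two variants of majority voting are used, so the sketch assumes, purely for illustration, an unweighted (plurality) vote and a weighted vote. The three-state alphabet H/E/C, the function names, the weights and the tie-breaking priority are all hypothetical choices, not the authors' implementation.

```python
from collections import Counter

# Assumed three-state alphabet: helix (H), strand (E), coil (C).
# The order also serves as a hypothetical tie-breaking priority.
STATES = ("H", "E", "C")


def plurality_vote(votes):
    """Return the state chosen by the most classifiers for one residue.

    Ties are resolved by the priority order in STATES (an assumption).
    """
    counts = Counter(votes)
    best = max(counts.values())
    return next(s for s in STATES if counts.get(s, 0) == best)


def weighted_vote(votes, weights):
    """Return the state with the largest summed classifier weight.

    Weights could be, for example, each classifier's validation accuracy;
    the abstract does not say how (or whether) members are weighted.
    """
    totals = {s: 0.0 for s in STATES}
    for label, weight in zip(votes, weights):
        totals[label] += weight
    best = max(totals.values())
    return next(s for s in STATES if totals[s] == best)


def combine(predictions, weights=None):
    """Combine per-classifier predictions into one predicted string.

    predictions: list of equal-length strings over STATES, one per classifier.
    weights: optional list with one weight per classifier.
    """
    per_residue = zip(*predictions)
    if weights is None:
        return "".join(plurality_vote(v) for v in per_residue)
    return "".join(weighted_vote(v, weights) for v in per_residue)


# Six hypothetical member outputs for a four-residue fragment.
preds = ["HHEC", "HHEC", "HEEC", "CHEC", "CEEH", "CEEC"]
unweighted = combine(preds)
weighted = combine(preds, weights=[1, 1, 1, 2, 2, 2])
```

With six members, three-to-three ties are possible under the unweighted rule, which is why the sketch needs an explicit tie-breaking convention. A weighted variant is one way to reduce such ties.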
