Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features
Tóm tắt
Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, still there is a need for further improvement. Besides, most of the approaches are species-specific and hence it is required to develop approaches compatible across species. Each splice site sequence was transformed into a numeric vector of length 49, out of which four were positional, four were dependency and 41 were compositional features. Using the transformed vectors as input, prediction was made through support vector machine. Using balanced training set, the proposed approach achieved area under ROC curve (AUC-ROC) of 96.05, 96.96, 96.95, 96.24 % and area under PR curve (AUC-PR) of 97.64, 97.89, 97.91, 97.90 %, while tested on human, cattle, fish and worm datasets respectively. On the other hand, AUC-ROC of 97.21, 97.45, 97.41, 98.06 % and AUC-PR of 93.24, 93.34, 93.38, 92.29 % were obtained, while imbalanced training datasets were used. The proposed approach was found comparable with state-of-art splice site prediction approaches, while compared using the bench mark NN269 dataset and other datasets. The proposed approach achieved consistent accuracy across different species as well as found comparable with the existing approaches. Thus, we believe that the proposed approach can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named as ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites by the users and is freely available at
http://cabgrid.res.in:8080/HSplice
.
Tài liệu tham khảo
Golam Bari ATM, Reaz MR, Jeong BS. Effective DNA encoding for splice site prediction using SVM. MATCH Commun Math Comput Chem. 2014;71:241–58.
Kamath U, De Jong K, Shehu A. Effective automated feature construction and selection for classification of biological sequences. PLoS ONE. 2014;9(7):e99982. doi:10.1371/journal.pone.0099982.
Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G. Accurate splice site prediction using support vector machines. BMC Bioinform. 2007;8(Suppl 10):S7.
Malousi A, Chouvarda I, Koutkias V, Kouidou S, Maglaveras N. SpliceIT: a hybrid method for splice signal identification based on probabilistic and biological inference. J Biomed Inform. 2010;43:208–17.
Wei D, Zhang H, Wei Y, Jiang Q. A novel splice site prediction method using support vector machine. J Comput Inform Syst. 2013;920:8053–60.
Meher PK, Sahu TK, Rao AR, Wahi SD. A statistical approach for 5′ splice site prediction using short sequence motif and without encoding sequence data. BMC Bioinform. 2014;15:362.
Baten A, Halgamuge SK, Chang B, Li J. Splice site identification using probabilistic parameters and SVM classification. BMC Bioinform. 2006;7:1–15.
Huang J, Li T, Chen K, Wu J. An approach of encoding for prediction of splice sites using SVM. Biochemie. 2006;88:923–9.
Rätsch G, Sonnenburg S. Accurate splice site detection for caenorhabditis elegans. In: Schölkopf KT, Vert JP, editors. Kernel methods in computational biology. Cambridge: MIT Press; 2004.
Rätsch G, Sonnenburg S, Schölkopf B. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics. 2005;21(Suppl 1):369–77.
Zhang X, Lee J, Chasin LA. The effect of nonsense codons on splicing: a genomic analysis. RNA. 2006;9:637–9.
Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984;12:505–19.
Zhang M, Marr T. A weight array method for splicing signal analysis. Comput Appl Biosci. 1993;9(5):499–509.
Senapathy P, Shapiro MB, Harris NL. Splice junctions, branch point sites and exons: sequence statistics, identification, and applications to genome project. Meth Enzymol. 1990;183:252–78.
Baten A, Halgamuge SK, Chang B. Fast splice site detection using information content and feature reduction. BMC Bioinform. 2008;8:1–12.
Pollastro P, Rampone S. HS3D: homosapiens splice site data set. Nucleic Acids Res. 2003, Annual Database Issue.
Reese MG, Eeckman FH, Kulp D, Haussler D. Improved splice site detection in genie. J Comput Biol. 1997;43:311–23.
De Bona F, Ossowski S, Schneeberger K, Rätsch G. Optimal splice alignments of short sequence reads. Bioinformatics. 2008;24:174–80.
Bins J. Feature selection of huge feature sets in the context of computer vision. Ph.D. thesis. Colorado State University; 2000.
Neumann J, Schnorr C, Steidl G. Combined SVM-based feature selection and classification. Mach Learn. 2005;61(1–3):129–50.
Dror G, Sorek R, Shamir R. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics. 2004;21(7):897–901.
Vapnik VN. The nature of statistical learning theory. New York: Springer; 1998.
Noble WS. Support vector machine applications in computational biology. In: Scho¨lkopf B, Tsuda K, Vert JP, editors. Kernel methods in computational biology. Cambridge: MIT Press; 2004. p. 71–92.
Tech M, Pfeifer N, Morgenstein B, Meinicke P. TICO: a tool for improving predictions of prokaryotic translation initiation sites. Bioinformatics. 2005;21:3568–9.
Jiang B, Zhang MQ, Zhang X. OSCAR: one-class SVM for accurate recognition of ciselements. Bioinformatics. 2007;23:2823–38.
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F, Chang CC, Lin CC. Misc functions of the department of statistics, TU Wien. R Package. 2012; 6-1
Henderson J, Salzberg S, Fasman KH. Finding genes in DNA with a hidden Markov model. J Comput Biol. 1992;4:127–41.
Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997;30:1145–59.
Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: ML ’06 Proceedings of the 23rd international conference on machine learning. New York; 2006. p 233–40.
Li JL, Wang LF, Wang HY, Bai LY, Yuan ZM. High-accuracy splice site prediction based on sequence component and position features. Genet Mol Res. 2012;113:3432–51.