Interpol: An R package for preprocessing of protein sequences

BioData Mining - Volume 4 - Pages 1-6 - 2011
Dominik Heider¹, Daniel Hoffmann¹
¹ Department of Bioinformatics, Center for Medical Biotechnology, University of Duisburg-Essen, Essen, Germany

Abstract

Most machine learning techniques in current use require input data of fixed dimensionality. Real input data frequently violate this requirement: DNA and protein sequences, for example, often differ in length because of insertions and deletions. Moreover, performance in classification and regression often improves when amino acids are encoded numerically rather than with the commonly used sparse encoding. The software "Interpol" encodes amino acid sequences as numerical descriptor vectors using a database of currently 532 descriptors (mainly from AAindex), and normalizes sequences to a uniform length with one of five linear or non-linear interpolation algorithms. Interpol is distributed as an open-source, platform-independent R package. It is typically used to preprocess amino acid sequences for classification or regression. The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, and in many cases it will improve their performance in classification and regression.
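The two preprocessing steps described in the abstract, numerical encoding followed by length normalization, can be illustrated with a minimal sketch. This is not the Interpol R API: the function names `encode` and `interpolate_linear` are hypothetical, the Kyte-Doolittle hydropathy scale stands in for a single descriptor, and only plain linear interpolation is shown out of the five methods the abstract mentions.

```python
# Illustrative sketch only; names and structure are assumptions,
# not taken from the Interpol package.

# Kyte-Doolittle hydropathy index, used here as one example descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Map each amino acid letter to its numerical descriptor value."""
    return [scale[aa] for aa in sequence.upper()]


def interpolate_linear(values, target_len):
    """Resample a numeric vector to target_len points by linear interpolation.

    The first and last values are preserved; intermediate points are
    placed at evenly spaced positions along the original vector.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot interpolate an empty vector")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if n == 1 or target_len == 1:
        return [values[0]] * target_len
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # position on the original index axis
        lo = int(pos)                         # left neighbour
        hi = min(lo + 1, n - 1)               # right neighbour, clamped at the end
        frac = pos - lo                       # distance from the left neighbour
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


# Two sequences of different length become vectors of the same length.
short = interpolate_linear(encode("MKV"), 5)
longer = interpolate_linear(encode("MKVLAAGIT"), 5)
```

Both `short` and `longer` have five elements, so they can be passed to a learner that expects fixed-dimensional input. Stretching a short sequence and compressing a long one use the same routine; only the ratio of original to target length changes.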
