Fast subcellular localization by cascaded fusion of signal-based and homology-based methods

Springer Science and Business Media LLC - Volume 9 - Pages 1-12 - 2011
Man-Wai Mak1, Wei Wang1, Sun-Yuan Kung2
1Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong
2Department of Electrical Engineering, Princeton University, USA

Abstract

The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. This paper proposes reducing the computational burden of alignment-based approaches to subcellular localization prediction through a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. The sequences are then truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular locations are subsequently predicted by a profile-to-profile alignment support vector machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA). Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method combines the strengths of signal-based and homology-based approaches and attains an accuracy comparable to that achieved with full-length sequences. Analysis of the profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without a significant loss in subcellular localization accuracy. PDA was also found to require a shorter training time than the conventional SVM. We suggest that the method will be valuable for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations of new algorithms that involve pairwise alignments.
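
The truncation step of the cascade can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy predictor, and the example sequence are hypothetical, and the sketch assumes that the segment retained is the N-terminal part up to the predicted cleavage site (where sorting signals reside) and that a sequence with no usable prediction is kept at full length; the abstract specifies neither choice. The cell count is only a rough proxy for pairwise alignment cost, which for dynamic-programming alignment grows with the product of the two sequence lengths.

```python
from typing import Callable, Dict


def truncate_at_cleavage_site(sequence: str, cleavage_pos: int) -> str:
    """Keep residues 1..cleavage_pos (1-based), i.e. the N-terminal segment.

    Assumption: an out-of-range position means "no usable prediction",
    in which case the full-length sequence is returned unchanged.
    """
    if not 0 < cleavage_pos <= len(sequence):
        return sequence
    return sequence[:cleavage_pos]


def cascade_truncate(
    sequences: Dict[str, str],
    predict_cleavage_site: Callable[[str], int],
) -> Dict[str, str]:
    """Shorten every sequence before it is handed to profile creation.

    `predict_cleavage_site` is a stand-in for any cleavage site predictor;
    it takes a sequence and returns a 1-based cleavage position.
    """
    return {
        name: truncate_at_cleavage_site(seq, predict_cleavage_site(seq))
        for name, seq in sequences.items()
    }


def alignment_cells(len_a: int, len_b: int) -> int:
    """Size of the dynamic-programming matrix for one pairwise alignment."""
    return len_a * len_b


if __name__ == "__main__":
    # Toy data and a toy predictor that always reports position 5.
    toy = {"p1": "MKTLLLAVAVSAQAAPE"}
    shortened = cascade_truncate(toy, lambda seq: 5)
    full_len = len(toy["p1"])
    short_len = len(shortened["p1"])
    print(shortened["p1"], alignment_cells(full_len, full_len),
          alignment_cells(short_len, short_len))
```

In a full pipeline, the shortened sequences would then be passed to PSI-BLAST for profile creation and to the profile-alignment classifier; those stages are outside the scope of this sketch.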
