Fast subcellular localization by cascaded fusion of signal-based and homology-based methods
Abstract
The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. This paper proposes reducing the computational burden of alignment-based approaches to subcellular localization prediction through a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. The sequences are then truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular locations are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA). Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method combines the strengths of signal-based and homology-based approaches and attains an accuracy comparable to that achieved with full-length sequences. Analysis of the profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without a significant reduction in subcellular localization accuracy. PDA was also found to require a shorter training time than the conventional SVM. We expect the method to be useful for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations of new algorithms that involve pairwise alignments.
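The truncation step at the front of the cascade can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `predict_cleavage_site` callable, and the toy fixed-position predictor are hypothetical, and the sketch assumes that the retained "informative segment" is the N-terminal portion up to the predicted cleavage site. In a real pipeline the shortened sequences would then be passed to PSI-BLAST for profile creation, which is not shown here.

```python
from typing import Callable, Dict


def truncate_sequences(
    sequences: Dict[str, str],
    predict_cleavage_site: Callable[[str], int],
    min_length: int = 1,
) -> Dict[str, str]:
    """Shorten each sequence at its predicted cleavage site.

    Assumption (not stated in the abstract): the N-terminal segment
    seq[:site] is the part that is kept for profile creation.
    """
    shortened = {}
    for name, seq in sequences.items():
        site = predict_cleavage_site(seq)
        # Clamp the predicted position so the slice is never empty
        # and never runs past the end of the sequence.
        site = max(min_length, min(site, len(seq)))
        shortened[name] = seq[:site]
    return shortened


def fixed_position_predictor(seq: str) -> int:
    """Hypothetical stand-in for a real cleavage site predictor."""
    return 5


# Toy sequences, for illustration only.
demo = {"P1": "MKTLLLAVAVSAQ", "P2": "MAS"}
shortened = truncate_sequences(demo, fixed_position_predictor)
```

Passing the predictor as a callable keeps the truncation logic independent of the particular cleavage site predictor, so the placeholder can be swapped for a real one without changing the rest of the sketch.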