RRCRank: a fusion method using rank strategy for residue-residue contact prediction
Abstract
In structural biology, protein residue-residue contacts play a crucial role in protein structure prediction. Predicted residue-residue contacts can effectively constrain the conformational search space, which is significant for de novo protein structure prediction. Over the last few decades, researchers have developed various methods to predict residue-residue contacts, and in recent years fusion methods have achieved particularly strong performance. In this work, a novel fusion method based on a ranking strategy is proposed to predict contacts. Unlike the traditional regression or classification strategies, the contact prediction task is treated as a ranking task. Two kinds of features are first extracted, from correlated-mutation methods and from ensemble machine-learning classifiers, and the proposed method then uses a learning-to-rank algorithm to predict the contact probability of each residue pair.
First, we perform two benchmark tests of the proposed fusion method (RRCRank), on the CASP11 and CASP12 datasets respectively. The test results show that RRCRank outperforms other well-developed methods, especially for medium- and short-range contacts. Second, to verify the superiority of the ranking strategy, we predict contacts with the traditional regression and classification strategies using the same features as the ranking strategy. Compared with these two traditional strategies, the proposed ranking strategy performs better for all three contact types (short, medium, and long range), in particular for long-range contacts. Third, RRCRank is compared with several state-of-the-art methods from CASP11 and CASP12. The results show that RRCRank achieves comparable prediction precision and outperforms three of those methods on most assessment metrics.
A learning-to-rank algorithm is introduced to develop a novel rank-based method for protein residue-residue contact prediction, and extensive assessment shows that it achieves state-of-the-art performance.
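To make the ranking formulation concrete, the following is a minimal sketch of pairwise learning-to-rank applied to residue pairs. It is an illustration under stated assumptions, not the RRCRank implementation: the abstract does not say which learning-to-rank algorithm is used, so a linear ranker trained with a hinge loss on feature differences (in the spirit of RankSVM) is assumed here. The two features per residue pair, the toy values, and all function names are hypothetical.

```python
import random

def pairwise_differences(features, labels):
    """Build training pairs: (contact features) minus (non-contact features)."""
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    return [[a - b for a, b in zip(p, n)] for p in pos for n in neg]

def train_linear_ranker(diffs, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Minimise an L2-regularised hinge loss max(0, 1 - w.d) by subgradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(diffs[0])
    order = list(range(len(diffs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            d = diffs[i]
            margin = sum(wk * dk for wk, dk in zip(w, d))
            for k in range(len(w)):
                # Hinge subgradient is -d[k] only while the pair violates the margin.
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

def rank_pairs(w, features):
    """Return residue-pair indices sorted from highest to lowest ranking score."""
    scores = [sum(wk * fk for wk, fk in zip(w, f)) for f in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy data: each residue pair has two features, standing in for
# a correlated-mutation score and an ensemble-classifier probability.
toy_features = [[0.9, 0.8], [0.7, 0.9], [0.2, 0.1], [0.3, 0.4], [0.1, 0.2], [0.8, 0.6]]
toy_labels = [1, 1, 0, 0, 0, 1]  # 1 = contact, 0 = non-contact

weights = train_linear_ranker(pairwise_differences(toy_features, toy_labels))
ranking = rank_pairs(weights, toy_features)
top3 = ranking[:3]  # the three residue pairs ranked most likely to be in contact
```

The design point is that the model is trained only on the relative order of contacting and non-contacting pairs, not on absolute labels or distances, which is what distinguishes a ranking strategy from the regression and classification strategies it is compared against. In a real setting the pairs would be formed within each protein separately, and the ranking scores would still need a calibration step before they could be read as contact probabilities.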