Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map
Tóm tắt
Protein solubility is significant in producing new soluble proteins that can reduce the cost of biocatalysts or therapeutic agents. Therefore, a computational model is highly desired to accurately predict protein solubility from the amino acid sequence. Many methods have been developed, but they are mostly based on the one-dimensional embedding of amino acids that is limited to catch spatially structural information. In this study, we have developed a new structure-aware method GraphSol to predict protein solubility by attentive graph convolutional network (GCN), where the protein topology attribute graph was constructed through predicted contact maps only from the sequence. GraphSol was shown to substantially outperform other sequence-based methods. The model was proven to be stable by consistent
$${\text{R}}^{2}$$
of 0.48 in both the cross-validation and independent test of the eSOL dataset. To our best knowledge, this is the first study to utilize the GCN for sequence-based protein solubility predictions. More importantly, this architecture could be easily extended to other protein prediction tasks requiring a raw protein sequence.
Tài liệu tham khảo
Habibi N, Hashim SZM, Norouzi A, Samian MR (2014) A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. BMC Bioinform 15(1):134
Chan W-C, Liang P-H, Shih Y-P, Yang U-C, Lin W-C, Hsu C-N (2010) Learning to predict expression efficacy of vectors in recombinant protein production. BMC bioinform 11(S1):S21
Samak T, Gunter D, Wang Z: Prediction of protein solubility in E. coli. In: 2012 IEEE 8th international conference on E-science. New York: IEEE; 2012. p. 1–8.
Fang Y, Fang J (2013) Discrimination of soluble and aggregation-prone proteins based on sequence information. Mol BioSyst 9(4):806–811
Agostini F, Vendruscolo M, Tartaglia GG (2012) Sequence-based prediction of protein solubility. J Mol Biol 421(2–3):237–241
Madhavan A, Sindhu R, Binod P, Sukumaran RK, Pandey A (2017) Strategies for design of improved biocatalysts for industrial applications. Biores Technol 245:1304–1313
Tjong H, Zhou H-X (2008) Prediction of protein solubility from calculation of transfer free energy. Biophys J 95(6):2601–2609
De Simone A, Dhulesia A, Soldi G, Vendruscolo M, Hsu STD, Chiti F, Dobson CM (2011) Experimental free energy surfaces reveal the mechanisms of maintenance of protein solubility. Proc Natl Acad Sci 108(52):21057–21062
Hou Q, Kwasigroch JM, Rooman M, Pucci F (2020) SOLart: a structure-based method to predict protein solubility and aggregation. Bioinformatics 36(5):1445–1452
Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D (2012) PROSO II–a new method for protein solubility prediction. FEBS J 279(12):2192–2200
Magnan CN, Randall A, Baldi P (2009) SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics 25(17):2200–2207
Huang H-L, Charoenkwan P, Kao T-F, Lee H-C, Chang F-L, Huang W-L, Ho S-J, Shu L-S, Chen W-L, Ho S-Y (2012) Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC bioinform 13:S3
Suykens JAK (2002) Least squares support vector machines. World Scientific, Singapore
Rawi R, Mall R, Kunji K, Shen C-H, Kwong PD, Chuang G-Y (2018) PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics 34(7):1092–1098
Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J (2017) Protein–Sol: a web tool for predicting protein solubility from sequence. Bioinformatics 33(19):3098–3100
LeCun Y. LeNet-5, convolutional neural networks. 2015; 20(5):14. http://yann.lecun.com/exdb/lenet.
Khurana S, Rawi R, Kunji K, Chuang G-Y, Bensmail H, Mall R (2018) DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics 34(15):2605–2613
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial networks. https://arxiv.org/abs/1406.2661
Han X, Zhang L, Zhou K, Wang X (2019) ProGAN: Protein solubility generative adversarial nets for data augmentation in DNN framework. Comput Chem Eng 131:106533
Rao R, Bhattacharya N, Thomas N, Duan Y, Chen P, Canny J, Abbeel P, Song YS (2019) Evaluating protein transfer learning with TAPE. Adv Neural Inf Process Syst 32:9689–9701
Heinzinger M, Elnaggar A, Wang Y, Dallago C, Nechaev D, Matthes F, Rost B (2019) Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform 20(1):723
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in neural information processing systems; 2017. p. 5998–6008. https://arxiv.org/abs/1706.03762
Chen S, Sun Z, Lin L, Liu Z, Liu X, Chong Y, Lu Y, Zhao H, Yang Y (2019) To improve protein sequence profile prediction through image captioning on pairwise residue distance map. J Chem Inf Model 60(1):391–399
Zheng S, Li Y, Chen S, Xu J, Yang Y (2020) Predicting drug–protein interaction using quasi-visual question answering system. Nat Mach Intell 2(2):134–140
Gligorijević V, Barot M, Bonneau R (2018) deepNF: deep network fusion for protein function prediction. Bioinformatics 34(22):3873–3881
Zamora-Resendiz R, Crivelli S. Structural learning of proteins using graph convolutional neural networks. bioRxiv; 2019. p. 610444.
Gligorijevic V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, Chandler C, Taylor15 BC, Fisk10 IM, Vlamakis H. Structure-based protein function prediction using graph convolutional networks. https://www.biorxiv.org/content/10.1101/786236v2.abstract
Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin AM (2018) Assessment of contact predictions in CASP12: co-evolution and deep learning coming of age. Proteins Struct Funct Bioinform 86:51–66
Wang S, Sun S, Xu J (2018) Analysis of deep learning methods for blind protein contact prediction in CASP12. Proteins Struct Funct Bioinform 86:67–77
Adhikari B, Hou J, Cheng J (2018) DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 34(9):1466–1472
Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y (2018) Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 34(23):4039–4045
Niwa T, Ying B-W, Saito K, Jin W, Takada S, Ueda T, Taguchi H (2009) Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc Natl Acad Sci 106(11):4201–4206
Han X, Wang X, Zhou K (2019) Develop machine learning-based regression predictive models for engineering protein solubility. Bioinformatics 35(22):4640–4646
Shimizu Y, Kanamori T, Ueda T (2005) Protein synthesis by pure translation systems. Methods 36(3):299–304
Li Z, Yang Y, Zhan J, Dai L, Zhou Y. Energy functions in de novo protein design: current challenges and future prospects. 2013. https://www.annualreviews.org/doi/full/10.1146/annurev-biophys-083012-130315
Mount DW (2008) Using BLOSUM in sequence alignments. Cold Spring Harb Protoc 2008(6):pdb.top39
Taherzadeh G, Zhou Y, Liew AWC, Yang Y (2016) Sequence-based prediction of protein–carbohydrate binding sites using support vector machines. J Chem Inf Model 56(10):2115–2122
Meiler J, Müller M, Zeidler A, Schmäschke F (2001) Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Mol Model Annu 7(9):360–369
Narjeskhatoon Habibi1* SZMH, ANaMRS, 3,4: A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. 2014. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-134
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402
Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M (2017) Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 45(D1):D170–D176
Heffernan R, Yang Y, Paliwal K, Zhou Y (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 33(18):2842–2849
Emerson IA, Amala A (2017) Protein contact maps: a binary depiction of protein 3D structures. Phys A 465:782–791
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint. arXiv:1609.02907; 2016.
Zheng S, Yan X, Yang Y, Xu J (2019) Identifying structure–property relationships through SMILES syntax analysis with self-attention mechanism. J Chem Inf Model 59(2):914–923
Lin Z, Feng M, Santos CNd, Yu M, Xiang B, Zhou B, Bengio Y. A structured self-attentive sentence embedding. arXiv preprint . arXiv:1703.03130; 2017.
Kingma DP, Ba J: Adam. A method for stochastic optimization. arXiv preprint . arXiv:1412.6980; 2014.