Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map

Springer Science and Business Media LLC - Volume 13 - Pages 1-10 - 2021
Jianwen Chen (1), Shuangjia Zheng (1), Huiying Zhao (2), Yuedong Yang (1, 3)
(1) School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
(2) Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China
(3) Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Guangzhou, China

Abstract

Protein solubility matters when producing new soluble proteins, which can reduce the cost of biocatalysts or therapeutic agents. A computational model that accurately predicts protein solubility from the amino acid sequence is therefore highly desirable. Many methods have been developed, but most rely on a one-dimensional embedding of amino acids, which is limited in its ability to capture spatial structural information. In this study, we developed GraphSol, a new structure-aware method that predicts protein solubility with an attentive graph convolutional network (GCN), where the protein topology attribute graph is constructed from contact maps predicted from the sequence alone. GraphSol was shown to substantially outperform other sequence-based methods. The model was shown to be stable, with a consistent $R^{2}$ of 0.48 in both the cross-validation and the independent test on the eSOL dataset. To the best of our knowledge, this is the first study to use a GCN for sequence-based protein solubility prediction. More importantly, this architecture could be easily extended to other protein prediction tasks that start from a raw protein sequence.
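To make the idea of a GCN over a predicted contact map concrete, the following is a minimal sketch and not the GraphSol implementation. It assumes a contact-probability matrix and per-residue feature vectors are already available, thresholds the contacts at 0.5, applies two graph convolutions with the symmetric normalization of Kipf and Welling, pools residues with a single softmax attention vector, and maps the result to a score in (0, 1) with a sigmoid. The threshold, layer sizes, single attention head, sigmoid output, random weights, and all function and parameter names are illustrative assumptions. Plain Python lists are used only to keep the example self-contained; a real model would use a tensor library and trained weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(contact_prob, threshold=0.5):
    """Threshold a contact map, add self-loops, and scale by D^-1/2 A D^-1/2."""
    n = len(contact_prob)
    adj = [[1.0 if i == j or max(contact_prob[i][j], contact_prob[j][i]) >= threshold
            else 0.0 for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in adj]  # always >= 1 because of the self-loops
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(v, 0.0) for v in row] for row in z]

def attention_pool(h, w_att):
    """Softmax attention over residues, giving one fixed-size protein vector."""
    scores = [sum(x * w for x, w in zip(row, w_att)) for row in h]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(wt * row[k] for wt, row in zip(weights, h)) for k in range(len(h[0]))]
    return pooled, weights

def predict_solubility(contact_prob, node_feats, params):
    """Two GCN layers, attention pooling, then a sigmoid output (assumed design)."""
    a_hat = normalized_adjacency(contact_prob)
    h = gcn_layer(a_hat, node_feats, params["w1"])
    h = gcn_layer(a_hat, h, params["w2"])
    pooled, weights = attention_pool(h, params["w_att"])
    logit = sum(p * w for p, w in zip(pooled, params["w_out"])) + params["b_out"]
    return 1.0 / (1.0 + math.exp(-logit)), weights

# Toy inputs: 6 residues, 4 features each, random untrained weights.
rng = random.Random(0)
n_res, n_feat, n_hidden = 6, 4, 3
contacts = [[rng.random() for _ in range(n_res)] for _ in range(n_res)]
feats = [[rng.gauss(0, 1) for _ in range(n_feat)] for _ in range(n_res)]
params = {
    "w1": [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_feat)],
    "w2": [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_hidden)],
    "w_att": [rng.gauss(0, 1) for _ in range(n_hidden)],
    "w_out": [rng.gauss(0, 1) for _ in range(n_hidden)],
    "b_out": 0.0,
}
score, residue_weights = predict_solubility(contacts, feats, params)
print(f"toy solubility score: {score:.3f}")
```

The attention weights returned alongside the score sum to one across residues, so in a trained model they could be inspected to see which residues contribute most to the prediction. Here the weights are random, so the printed score carries no biological meaning.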
