A novel sequence-based method of predicting protein DNA-binding residues, using a machine learning approach
Abstract
Protein-DNA interactions play an essential role in transcriptional regulation, DNA repair, and many other vital biological processes. The mechanism of protein-DNA binding, however, remains unclear. To study many diseases, researchers need a better understanding of the amino acid motifs that recognize DNA. Because identifying these motifs experimentally is expensive and time-consuming, a computational prediction approach is needed. Several in silico methods have been developed, but they still have considerable limitations. In this study, we used a machine learning approach to develop a new sequence-based method of predicting protein DNA-binding residues. To make these predictions, we used properties of the micro-environment of each amino acid, taken from the AAindex database, together with conservation scores. In cross-validation testing, the method achieved an overall accuracy of 94.89%. Our results indicate that the amino acid micro-environment is important for DNA binding and that protein-DNA binding sites can be identified from it.
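As a rough illustration of what a residue "micro-environment" encoding might look like, the sketch below builds one feature vector per residue from a sliding window over the sequence, pairing a physicochemical property value with a conservation score at each window position. Everything specific here is an assumption and is not taken from the abstract: the window half-width, the zero padding at the sequence ends, the function name `window_features`, and the choice of the Kyte-Doolittle hydropathy scale as a stand-in for the AAindex properties actually used.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in property.
# The abstract does not say which AAindex properties were selected.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_features(sequence, conservation, half_width=2, pad=0.0):
    """Return one feature vector per residue.

    Each vector holds (property value, conservation score) pairs for the
    residue and its neighbours within `half_width` positions on each side.
    Positions beyond the sequence ends are filled with `pad` (an assumption).
    """
    if len(sequence) != len(conservation):
        raise ValueError("sequence and conservation must have equal length")

    n = len(sequence)
    features = []
    for i in range(n):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            if 0 <= j < n:
                row.append(HYDROPATHY[sequence[j]])
                row.append(conservation[j])
            else:
                # Window extends past the terminus: pad both slots.
                row.extend([pad, pad])
        features.append(row)
    return features


# Toy input: a three-residue sequence with made-up conservation scores.
toy_features = window_features("KRA", [0.9, 0.8, 0.1], half_width=1)
```

Vectors of this kind could then be passed to any standard classifier and scored by cross-validation. Because DNA-binding residues are typically a small minority of all residues, a per-class measure such as sensitivity is usually worth reporting alongside overall accuracy.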