In-silico predictive mutagenicity model generation using supervised learning approaches

Springer Science and Business Media LLC - Volume 4 - Pages 1-11 - 2012
Abhik Seal1, Anurag Passi2, UC Abdul Jaleel3, David J Wild1
1Indiana University Bloomington, School of Informatics and Computing, Bloomington, USA
2Open Source Drug Discovery, Council of Scientific and Industrial Research, New Delhi, India
3Department of Cheminformatics, Malabar Christian College, Kerala, India

Abstract

Experimental screening of chemical compounds for biological activity is a time-consuming and expensive practice. In silico predictive models permit inexpensive, rapid “virtual screening” to prioritize selection of compounds for experimental testing. Both experimental and in silico screening can be used to test compounds for desirable or undesirable properties. Prior work on prediction of mutagenicity has primarily involved identification of toxicophores rather than whole-molecule predictive models. In this work, we examined a range of in silico predictive classification models for prediction of mutagenic properties of compounds, including methods such as J48 and SMO, which have not previously been widely applied in cheminformatics. The Bursi mutagenicity data set containing 4337 compounds (Set 1) and a Benchmark data set of 6512 compounds (Set 2) were used as input data sets in this work. A third data set (Set 3) was prepared by combining the previous two sets. Classification algorithms including Naïve Bayes, Random Forest, J48 and SMO, with 10-fold cross-validation and default parameters, were used for model generation on these data sets. Models built using the combined data set (Set 3) performed better than those developed from the Benchmark data set. Significantly, Random Forest outperformed the other classifiers on all the data sets, especially on Set 3, with 89.27% accuracy, 89% precision and a ROC area of 95.3%. To validate the developed models, two external data sets with mutagenicity data, AID1189 and AID1194, were tested, showing 62% accuracy, 67% precision and 65% ROC area for AID1189, and 91% accuracy, 91% precision and 96.3% ROC area for AID1194. A Random Forest model was also applied to approved drugs from DrugBank and metabolites from the ZINC database, with a true positive rate of almost 85%, showing the robustness of the model. We have created a new mutagenicity benchmark data set with around 8,000 compounds.
Our work shows that highly accurate predictive mutagenicity models can be built using machine learning methods based on chemical descriptors and trained on this set, and that these models complement toxicophore-based methods. Further, our work supports other recent literature in showing that Random Forest models generally outperform other comparable machine learning methods for this kind of application.
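The evaluation quantities reported in the abstract (accuracy, precision, ROC area, and 10-fold cross-validation) can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' pipeline: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the fold splitter is a plain shuffled split rather than whatever stratification the original software applied.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of compounds whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of compounds predicted positive that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


def roc_auc(y_true, scores, positive=1):
    """ROC area as the probability that a random positive outscores a
    random negative, counting ties as one half (pairwise formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = mutagen, 0 = non-mutagen, scores = classifier output.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred), precision(y_true, y_pred), roc_auc(y_true, scores))
```

In a cross-validation run, each fold returned by `k_fold_indices` would serve once as the held-out test set while the remaining folds train the classifier, and the metrics would be aggregated over the held-out predictions. The pairwise ROC computation is quadratic in the number of compounds, which is fine for a sketch but slower than a rank-based implementation on data sets of several thousand compounds.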
