Virtual screening of bioassay data

Springer Science and Business Media LLC - Volume 1 - Pages 1-12 - 2009
Amanda C Schierz1
1Smart Technology Research Centre, Bournemouth University, Poole, UK

Abstract

There are three main problems associated with the virtual screening of bioassay data: access to freely available curated data, the number of false positives that arise in the physical primary screening process, and the high imbalance of the data, with a low ratio of Active compounds to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.

Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. Because of this missing cross-referencing, the analysis of false positives from the primary screening process could only be shallow. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance.

Understandably, pharmaceutical data is hard to obtain. However, it would benefit both the pharmaceutical industry and academics if curated primary screening data, together with the corresponding confirmatory data, were made available. Applying virtual screening techniques to bioassay data offers two benefits: first, it reduces the search space of compounds to be physically screened; second, by analysing the false positives that occur in the primary screening process, the technology may be improved.
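The class-ratio heuristic mentioned above is commonly encoded as a 2x2 cost matrix. A minimal Python sketch, assuming the Weka `CostSensitiveClassifier` convention (rows = actual class, columns = predicted class); the compound counts are illustrative, not taken from the paper:

```python
# Hedged sketch: build a class-ratio cost matrix for an imbalanced
# bioassay dataset. Class order is [Inactive, Active], so
# matrix[actual][predicted] gives the misclassification cost.

def ratio_cost_matrix(n_inactive: int, n_active: int):
    """Penalise missing an Active compound (a false negative) by the
    Inactive:Active class ratio; a false positive costs 1."""
    ratio = n_inactive / n_active
    return [
        [0.0, 1.0],    # actual Inactive: correct = 0, false positive = 1
        [ratio, 0.0],  # actual Active: false negative = class ratio
    ]

# Illustrative counts: 49,000 Inactive vs 1,000 Active compounds.
matrix = ratio_cost_matrix(n_inactive=49000, n_active=1000)
print(matrix[1][0])  # 49.0 -- the cost of a missed Active
```

As the paper's results suggest, treating this ratio-derived matrix as fixed across all base classifiers is exactly what should be avoided; it is only a starting point to tune per classifier.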
The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening at all. Care is also needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
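The effect of a cost matrix on a probabilistic classifier can be made concrete with a standard result from cost-sensitive learning: the costs shift the decision threshold away from the default 0.5. A minimal Python sketch; the cost values and probability are illustrative, not drawn from the paper:

```python
# Hedged sketch of cost-sensitive thresholding: a compound is labelled
# Active only when the expected cost of doing so is lower than the
# expected cost of labelling it Inactive.

def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Optimal probability threshold for the positive (Active) class,
    given the cost of a false positive and of a false negative."""
    return cost_fp / (cost_fp + cost_fn)

def classify(p_active: float, cost_fp: float, cost_fn: float) -> str:
    """Label a compound Active if its predicted probability of activity
    reaches the cost-derived threshold."""
    threshold = cost_threshold(cost_fp, cost_fn)
    return "Active" if p_active >= threshold else "Inactive"

# A 1:100 Active:Inactive ratio heuristic makes a false negative 100x
# as costly as a false positive, pulling the threshold far below 0.5:
print(round(cost_threshold(cost_fp=1.0, cost_fn=100.0), 4))  # 0.0099
print(classify(0.05, cost_fp=1.0, cost_fn=100.0))            # Active
```

This also illustrates the paper's caution: because different base classifiers produce differently calibrated probability estimates, the same cost matrix moves their effective thresholds by different amounts, so one ratio-based matrix cannot be applied uniformly across classifiers.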
