Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection
Abstract
One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to accurately predict the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For a quantitative description of these criteria, we use recently introduced molecular dataset diversity indices (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414–425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
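To make the sphere-exclusion idea concrete, the following is a minimal sketch of a generic sphere-exclusion split. It is not a reproduction of any of the three algorithms used in this work, whose details the abstract does not give. The function name `sphere_exclusion_split`, the fixed `radius` parameter, the random choice of sphere centres, and the use of Euclidean distance in descriptor space are all assumptions made for illustration. Each selected centre goes to the training set, and every unassigned compound inside its sphere goes to the test set. As a result, each test compound lies within `radius` of a training compound, and training compounds are more than `radius` apart from one another.

```python
import math
import random


def sphere_exclusion_split(points, radius, seed=0):
    """Split descriptor-space points into training and test indices.

    Generic sphere-exclusion sketch (an illustrative assumption, not the
    authors' exact procedure): sphere centres become training compounds,
    and unassigned compounds inside each sphere become test compounds.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        # Pick a sphere centre among the still-unassigned compounds.
        centre = remaining.pop(rng.randrange(len(remaining)))
        train.append(centre)
        # Unassigned compounds inside the sphere are excluded from further
        # selection and placed in the test set.
        inside = [i for i in remaining
                  if math.dist(points[centre], points[i]) <= radius]
        test.extend(inside)
        inside_set = set(inside)
        remaining = [i for i in remaining if i not in inside_set]
    return sorted(train), sorted(test)


# Hypothetical toy data: two tight clusters in a 2-D descriptor space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
train, test = sphere_exclusion_split(points, radius=1.0)
```

In this sketch the radius controls the trade-off between the two set sizes: a larger radius yields fewer, more widely spaced training compounds and a larger test set. A compound with no neighbours within the radius always ends up in the training set.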