Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection

Molecular Diversity - Volume 5 - Pages 231-243 - 2000
Alexander Golbraikh, Alexander Tropsha
The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, Chapel Hill

Abstract

One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414–425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
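
The sphere-exclusion idea and the closeness criteria can be illustrated with a minimal Python sketch. It is a generic, simplified variant and not any of the three algorithms used in the paper: the rule for choosing the next sphere center (input order), the Euclidean metric, the radius value, the toy descriptor vectors, and the function names `sphere_exclusion_split` and `max_nearest_distance` are all assumptions made here for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sphere_exclusion_split(points, radius):
    """Generic sphere-exclusion split; returns (train_idx, test_idx).

    Assumption: the next sphere center is simply the first compound
    that has not been assigned yet. Other selection rules are possible.
    """
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        center = remaining[0]
        train.append(center)  # the sphere center joins the training set
        outside = []
        for i in remaining[1:]:
            if euclidean(points[center], points[i]) <= radius:
                test.append(i)  # compounds inside the sphere go to the test set
            else:
                outside.append(i)  # still unassigned, handled in a later pass
        remaining = outside
    return train, test


def max_nearest_distance(points, from_idx, to_idx):
    """Largest distance from a point in from_idx to its nearest point in to_idx."""
    return max(
        min(euclidean(points[i], points[j]) for j in to_idx)
        for i in from_idx
    )


# Toy 2-D descriptor space (hypothetical data, for illustration only).
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
               (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
train_idx, test_idx = sphere_exclusion_split(descriptors, radius=0.5)

# Criterion (i): how far is the worst test point from the training set?
test_to_train = max_nearest_distance(descriptors, test_idx, train_idx)
# Criterion (ii): how far is the worst training point from the test set?
train_to_test = max_nearest_distance(descriptors, train_idx, test_idx)
print(train_idx, test_idx, test_to_train, train_to_test)
```

In this sketch every test compound lies within `radius` of a training compound by construction, so criterion (i) holds automatically. Criterion (ii) does not: an isolated compound such as the last one in the toy data becomes a training point with no nearby test point, which is why the two distances are checked separately.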

References

Hansch, C., Fujita, T., J. Am. Chem. Soc., 86 (1964) 1616–1626.
Kubinyi, H., In: Mannhold, R. et al. (eds.) Methods and Principles in Medicinal Chemistry, VCH, Weinheim, 1993.
Randić, M., J. Am. Chem. Soc., 97 (1975) 6609–6615.
Kier, L.B. and Hall, L.H., Molecular Connectivity in Chemistry and Drug Research. Academic Press, New York, 1976.
Kier, L.B. and Hall, L.H., Molecular Connectivity in Structure-Activity Analysis. Wiley, New York, 1986.
Kier, L.B., Quant. Struct.-Act. Relat. 4 (1985) 109–116.
Kier, L.B., Quant. Struct.-Act. Relat. 6 (1987) 8–12.
Hall, L.H. and Kier, L.B., Quant. Struct.-Act. Relat. 9 (1990) 115–131.
Hall, L.H., Mohney, B.K. and Kier, L.B., Quant. Struct.-Act. Relat., 10 (1991) 43–51.
Hall, L.H., Mohney, B.K. and Kier, L.B., J. Chem. Inf. Comput. Sci., 31 (1991) 76–82.
Kier, L.B. and Hall, L.H., Molecular Structure Description: The Electrotopological State, Academic Press, 1999.
Kellogg, G.E., Kier, L.B., Gaillard, P. and Hall, L.H., J. Comput. Aid. Mol. Des. 10 (1996) 513–520.
Sheridan, R.P., Nachbar, R.B. and Bush, B.L., J. Comput.-Aid Mol. Des. 8 (1994) 323–340.
Matter, H., J. Medic. Chem. 40(8) (1997) 1219–1229.
Clementi, S. and Wold, S., In: Waterbeemd, H. van de (ed.), Chemometrics Methods in Molecular Design, VCH, (1995) 319–338.
Wold, S., In: Waterbeemd, H. van de (ed.), Chemometrics Methods in, VCH, (1995) 195–218.
Hoffman, B., Cho, S.J., Zheng, W., Wyrick, S., Nichols, D.E. and Mailman, R.B., J. Med. Chem. 42 (1999) 3217–3226.
Zheng, W. and Tropsha, A., J. Chem. Inf. Comput. Sci., 40 (2000) 185–194.
Ajay. J. Med. Chem. 36 (1993) 3565–3571.
Cramer III, R.D., Patterson, D.E. and Bunce, J.D., J. Am. Chem. Soc. 110 (1988) 5959–5967.
Marshall, G.R. and Cramer III, R.D., Trends Pharmacol. Sci. 9 (1988) 285–289.
Pérez, C., Pastor, M., Ortiz, A.R. and Gago, F., J. Med. Chem. 41 (1998) 836–852.
Cho, S.J. and Tropsha, A., J. Med. Chem. 38 (1995) 1060–1066.
Klebe, G., In: Kubinyi, H., Folkers, G., Martin, Y.C., (eds.) 3D QSAR in Drug Design. Volume 3. Recent Advances, Kluwer/ESCOM: Dordrecht, (1998) pp. 87–104.
Kubinyi, H., Hamprecht, F.A. and Mietzner, T., J. Med. Chem., 41 (1998) 2553–2564.
Topliss, J.G. and Edwards, R.P., J. Med. Chem. 22 (1979) 1238–1244.
Gironés, X., Gallegos, A. and Ramon, C.-D., J. Chem. Inf. Comput. Sci. 46 (2000) 1400–1407.
Bordás, B., Kömíves, T., Szántó, Z. and Lopata, A., J. Agric. Food Chem. 48 (2000) 926–931.
Fan, Y., Shi, L.M., Kohn, K.W., Pommier, Y. and Weinstein, J.N., J. Med. Chem. 44 (2001) 3254–3263.
Randić, M. and Basak, S.C., J. Chem. Inf. Comput. Sci. 40 (2000) 899–905.
Suzuki, T., Ide, K., Ishida, M. and Shapiro, S., J. Chem. Inf. Comput. Sci. 41 (2001) 718–726.
Recanatini, M., Cavalli, A., Belluti, F., Piazzi, L., Rampa, A., Bisi, A., Gobbi, S., Valenti, P., Andrisano, V., Bartolini, M. and Cavrini, V., J. Med. Chem. 43 (2000) 2007–2018.
Morón, J.A., Campillo, M., Perez, V., Unzeta, M. and Pardo, L., J. Med. Chem. 43 (2000) 1684–1691.
Golbraikh, A. and Tropsha, A., J. Mol. Graphics Model. 20 (2002) 269–276.
Wold, S. and Eriksson, L., Statistical Validation of QSAR Results. In: Waterbeemd, H. van de (ed.), Chemometrics Methods in Molecular Design, VCH, (1995) 309–318.
Clark, R.D., Sprous, D.G. and Leonard, J.M., Validating Models Based on Large Dataset. In: Höltje, H.-D., Sippl, W., (eds.) Rational Approaches to Drug Design. Proceedings of the 13th European Symposium on Quantitative Structure-Activity Relationships. Aug 27 - Sept 1 (2000), Duesseldorf, Germany. Prous Science, (2001) 475–485.
Novellino, E., Fattorusso, C. and Greco, G., Pharm. Acta Helv. 70 (1995) 149–154.
Norinder, U., J. Chemomet. 10 (1996) 95–105.
Zefirov, N.S. and Palyulin, V.A., J. Chem. Inf. Comput. Sci. 41 (2001) 1022–1027.
Sachs, L., Applied Statistics. A Handbook of Techniques. Springer-Verlag, (1984).
Huuskonen, J., J. Chem. Inf. Comput. Sci. 41 (2001) 425–429.
Tetko, I.V., Kovalishyn, V.V. and Livingstone, D.J., J. Med. Chem. 44 (2001) 2411–2420.
Wu, W., Walczak, B., Massart, D.L., Heuerding, S., Erni, F., Last, I.R. and Prebble, K.A., Chemometr. Intell. Lab. Syst. 33 (1996) 35–46.
Yasri, A. and Hartsough, D., J. Chem. Inf. Comput. Sci. 41 (2001) 1218–1227.
Bernard, P., Kireev, D.B., Chretien, J.R., Fortier, P.L. and Coppet, L., J. Comput. Aided Mol. Des. 13 (1999) 355–371.
Takeuchi, Y., Shands, E.F.B., Beusen, D.D. and Marshall, G.R., J. Med. Chem. 41 (1998) 3609–3623.
Kauffman, G.V. and Jurs, P.C., J. Chem. Inf. Comput. Sci. 41 (2001) 1553–1560.
Mattioni, B.E. and Jurs, P.C., J. Chem. Inf. Comput. Sci., in press.
Gasteiger, J. and Zupan, J., Angewandte Chemie. 32(4) (1993) 503.
Loukas, Y.L., J. Med. Chem. 44 (2001) 2772–2783.
Bernard, P., Pintore, M., Berthon, J.Y. and Chretien, J.R., Eur. J. Med. Chem. 36 (2001) 1–19.
Burden, F.R. and Winkler, D.A., J. Med. Chem. 42 (1999) 3183–3187.
Burden, F.R., Ford, M.G., Whitley, D.C. and Winkler, D.A., J. Chem. Inf. Comput. Sci. 40 (2000) 1423–1430.
Adams, M.J., Chemometrics in Analytical Spectroscopy. The Royal Society of Chemistry, UK, 1995.
Potter, T. and Matter, H., J. Med. Chem. 41 (1998) 478–488.
Lajiness, M., Johnson, M.A. and Maggiora, G.M., In: Fauchere, J.L., (ed.), QSAR: Quantitative Structure-Activity Relationships in Drug Design, Alan R. Liss Inc.: New York, (1989) pp. 173–176.
Taylor, R., J. Chem. Inf. Comput. Sci. 35 (1995) 59–67.
Snarey, M., Terrett, N.K., Willett, P. and Wilton, D.J., J. Mol. Graphics Mod. 15 (1997) 372–385.
Kennard, R.W. and Stone, L.A., Technometrics 11 (1969) 137–148.
Bourguignon, B., Deaguiar, P.F., Thorre, K. and Massart, D.L., J. Chromatogr. Sci. 32 (1994) 144–152.
Bourguignon, B., Deaguiar, P.F., Khots, M.S. and Massart, D.L., Anal. Chem. 66 (1994) 893–904.
Hellberg, S., Eriksson, L., Jonsson, J., Lindgren, F., Sjostrom, M., Skagerberg, B., Wold, S. and Andrews, P., Int. J. Pept. Protein. Res. 37 (1991) 414–424.
Eriksson, L. and Johansson, E., Chemometr. Intell. Lab. Syst. 34 (1996) 1–19.
Carlson, R., Design and Optimization in Organic Synthesis. Elsevier, (1992).
Martin, E.J. and Critchlow, R.E., J. Comb. Chem. 1 (1999) 32–45.
Miller, A. and Nguyen, N.-K., Appl. Stat. 43 (1994) 669–678.
Mitchell, T.J., Technometrics 16 (1974) 203–210.
Mitchell, T.J., Technometrics 42 (2000) 48–54.
Reynolds, C.H., Druker, R. and Pfahler, L.B., J. Chem. Inf. Comput. Sci. 38 (1998) 305–312.
Bucholz, E., Brown, R.L., Tropsha, A., Booth, R.G. and Wyrick, S.D., J. Med. Chem. 42 (1999) 3041–3054.
Golbraikh, A., Bonchev, D., Xiao, Y.-D. and Tropsha, A., In: Rational Approaches to Drug Design. Proceedings of the 13th European Symposium on Quantitative Structure-Activity Relationships, Prous Science, (2001) pp. 219–223.
Golbraikh, A., Bonchev, D. and Tropsha, A., J. Chem. Inf. Comput. Sci. 41 (2001) 147–158.
Kier, L.B. and Hall, L.H., Quant. Struct.-Act. Relat. 10 (1991) 134–140.
Petitjean, M., J. Chem. Inf. Comput. Sci. 32 (1992) 331–337.
Wiener, H., J. Am. Chem. Soc. 69 (1947) 17.
Platt, J.R., J. Phys. Chem. 56 (1952) 328.
Shannon, C. and Weaver, W., Mathematical Theory of Communication, University of Illinois, Urbana, (1949).
Bonchev, D., Mekenyan, O. and Trinajstić, N., J. Comput. Chem., 2 (1981) 127–148.
Gutman, I., Ruscić, B., Trinajstić, N. and Wilcox, C.F., Jr., J. Chem. Phys., 62 (1975) 3399.
Rücker, G. and Rücker, C., J. Chem. Inf. Comput. Sci., 33 (1993) 683–695.
Bonchev, D., In: Devillers, J., Balaban, A.T. (eds.), Topological Indices and Related Descriptors, Gordon and Breach, Reading, U.K. (1999) pp. 361–401.
Bonchev, D., SAR/QSAR Env. Res., 7 (1997) 23–43.
Golbraikh, A., J. Chem. Inf. Comput. Sci. 40 (2000) 414–425.