Feature selection based on graph Laplacian by using compounds with known and unknown activities

Journal of Chemometrics - Volume 31, Issue 8 - 2017
Razieh Sheikhpour (1), Mehdi Agha Sarram (1), Sajjad Gharaghani (2), Mohammad Ali Zare Chahooki (1)
(1) Department of Computer Engineering, Yazd University, Yazd, Iran
(2) Laboratory of Bioinformatics and Drug Design (LBD), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

Abstract

A semisupervised feature selection method based on the graph Laplacian (S2FSGL) is proposed for quantitative structure‐activity relationship (QSAR) models. The method uses an ℓ2,1‐norm and compounds with both known and unknown activities. In S2FSGL, two graphs, G_unsup and G_sup, are constructed. The method uses the label information of compounds with known activities and the local structure of compounds with known and unknown activities to select the most important descriptors. The weight matrix of graph G_unsup models the local structure of the compounds with known and unknown activities. S2FSGL uses the ℓ2,1‐norm to account for the correlation between different descriptors during descriptor selection. The performance of S2FSGL coupled with a kernel smoother model was evaluated on two QSAR data sets and compared with that of other feature selection methods. To evaluate the QSAR models and the selected descriptors, several different training and test sets were produced for each data set. Comparing the statistical parameters of the QSAR models built with the semisupervised feature selection method against those obtained with other feature selection methods showed that S2FSGL was superior in selecting the most relevant descriptors. The results indicate that using compounds with unknown activities alongside compounds with known activities can help in selecting the relevant descriptors of QSAR models.
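The abstract does not give the S2FSGL objective function, so the following is only a minimal sketch of two generic building blocks it names: a graph Laplacian computed over all compounds (labeled and unlabeled together) and the ℓ2,1‐norm of a descriptor weight matrix. The function names, the binary k-nearest-neighbor weighting, the value of k, and the row-norm ranking step are assumptions made for illustration, not the authors' construction of the unsupervised graph or their optimization procedure.

```python
import numpy as np


def l21_norm(W):
    """l2,1-norm: the sum of the Euclidean norms of the rows of W."""
    return float(np.sum(np.linalg.norm(W, axis=1)))


def knn_graph_laplacian(X, k=3):
    """Unnormalized Laplacian L = D - S of a symmetric binary kNN graph.

    X holds one compound per row, labeled and unlabeled together.
    Binary weights and the choice of k are assumptions of this sketch.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances between compounds.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)  # a compound is not its own neighbor

    S = np.zeros((n, n))
    neighbors = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    S[rows, neighbors.ravel()] = 1.0
    S = np.maximum(S, S.T)  # keep an edge if either end lists the other

    D = np.diag(S.sum(axis=1))
    return D - S


def rank_descriptors(W):
    """Order descriptors by the l2 norm of their row in W, largest first."""
    return np.argsort(-np.linalg.norm(W, axis=1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 4))      # 10 hypothetical compounds, 4 descriptors
    L = knn_graph_laplacian(X, k=3)
    print(L.shape)

    W = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])  # toy weight matrix
    print(l21_norm(W))
    print(rank_descriptors(W))
```

Because the ℓ2,1‐norm sums row norms, penalizing it tends to drive entire rows of the weight matrix to zero, which is why a row-norm ranking is a natural way to read off selected descriptors in methods of this family.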
