Learning sparse gradients for variable selection and dimension reduction

Machine Learning, Volume 87, pages 303–355, 2012
Gui-Bo Ye1, Xiaohui Xie1
1School of Information and Computer Science, University of California, Irvine, USA

Abstract

Variable selection and dimension reduction are two commonly adopted approaches for high-dimensional data analysis, but have traditionally been treated separately. Here we propose an integrated approach, called sparse gradient learning (SGL), for variable selection and dimension reduction via learning the gradients of the prediction function directly from samples. By imposing a sparsity constraint on the gradients, variable selection is achieved by selecting variables corresponding to non-zero partial derivatives, and effective dimensions are extracted based on the eigenvectors of the derived sparse empirical gradient covariance matrix. An error analysis is given for the convergence of the estimated gradients to the true ones in both the Euclidean and the manifold setting. We also develop an efficient forward-backward splitting algorithm to solve the SGL problem, making the framework practically scalable for medium or large datasets. The utility of SGL for variable selection and feature extraction is explicitly given and illustrated on artificial data as well as real-world examples. The main advantages of our method include variable selection for both linear and nonlinear predictions, effective dimension reduction with sparse loadings, and an efficient algorithm for large p, small n problems.
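To make the approach more concrete, the SGL estimator can be sketched in the following schematic form (illustrative notation, not a verbatim reproduction of the paper's equations): write $f = (f^1, \dots, f^p)$ for the modeled gradient of the prediction function $F$, with each component $f^l$ in a reproducing kernel Hilbert space $H_K$, let $\omega_{ij}$ denote locality weights, and let $\lambda > 0$ be a regularization parameter. The estimator then takes the form

$$
\widehat{f} \;=\; \operatorname*{arg\,min}_{f \in H_K^{\,p}} \;
\frac{1}{n^2} \sum_{i,j=1}^{n} \omega_{ij}\,
\bigl( y_i - y_j + f(x_i)\cdot(x_j - x_i) \bigr)^2
\;+\; \lambda \sum_{l=1}^{p} \bigl\| f^l \bigr\|_{K},
$$

where the data-fit term exploits the first-order approximation $y_j - y_i \approx \nabla F(x_i)\cdot(x_j - x_i)$ for nearby sample points, and the $\ell_1$-type sum of coordinate-wise RKHS norms drives entire partial derivatives to zero, which is what yields variable selection. Variables are selected as $\{\, l : \widehat{f}^{\,l} \neq 0 \,\}$, and effective dimensions are obtained from the leading eigenvectors of the sparse empirical gradient covariance matrix built from $\widehat{f}$ (for instance, the $p \times p$ matrix with entries $\langle \widehat{f}^{\,l}, \widehat{f}^{\,m} \rangle_K$, a common definition in the gradient-learning literature). The combination of a smooth quadratic loss with a non-smooth sparsity penalty is exactly the structure the forward-backward splitting algorithm mentioned above handles: a gradient step on the quadratic term followed by a group soft-thresholding (proximal) step on each $f^l$.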
