Joint covariate selection and joint subspace selection for multiple classification problems

Statistics and Computing, Volume 20, pages 231-252, 2009
Guillaume Obozinski (1), Ben Taskar (2), Michael I. Jordan (3)
1. Department of Statistics, University of California at Berkeley, Berkeley, USA
2. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
3. Department of Statistics and Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, USA

Abstract

We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems. By penalizing the sum of ℓ2 norms of the blocks of coefficients associated with each covariate across the different classification problems, similar sparsity patterns in all models are encouraged. To take computational advantage of the sparsity of solutions at high regularization levels, we propose a blockwise path-following scheme that approximately traces the regularization path. As the regularization coefficient decreases, the algorithm concurrently maintains and updates a growing set of covariates that are simultaneously active for all problems. We also show how to use random projections to extend this approach to the problem of joint subspace selection, where multiple predictors are found in a common low-dimensional subspace. We present theoretical results showing that this random projection approach converges to the solution yielded by trace-norm regularization. Finally, we present a variety of experimental results exploring joint covariate selection and joint subspace selection, comparing the path-following approach to competing algorithms in terms of prediction accuracy and running time.
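The penalty described above admits a compact formulation. Writing W for the p × k matrix whose l-th column w^(l) holds the coefficients of the l-th classification problem and whose j-th row w_j collects the coefficients of covariate j across all problems, the regularized objective has the form below (the notation is ours, chosen for illustration, and may differ from the paper's):

```latex
\min_{W \in \mathbb{R}^{p \times k}} \;
\sum_{l=1}^{k} L_l\bigl(w^{(l)}\bigr)
\;+\; \lambda \sum_{j=1}^{p} \bigl\| w_j \bigr\|_2 ,
```

where L_l is the loss of the l-th problem. Since the ℓ2 norm of a row is zero only when the entire row is zero, a covariate is kept or discarded jointly across all k problems, which is exactly the shared sparsity pattern the abstract describes.

A minimal computational sketch of how the penalty acts: the proximal operator of the row-wise ℓ2 penalty is a block soft-threshold, and running proximal gradient steps along a decreasing grid of λ values, warm-started at the previous solution, gives a crude stand-in for the paper's blockwise path-following scheme. The squared loss, the function names, and the step-size choice below are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def block_soft_threshold(W, t):
    # Proximal operator of t * sum_j ||W[j, :]||_2: shrink each row
    # toward zero; rows whose l2 norm falls below t are zeroed
    # jointly, which couples the sparsity patterns of the k problems.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * W

def joint_selection_path(X, Y, lams, n_iter=500):
    # Approximate regularization path via proximal gradient descent
    # on a multi-task squared loss (an illustrative stand-in for the
    # paper's blockwise path-following scheme, not their algorithm).
    #   X    : (n, p) design matrix shared by all problems
    #   Y    : (n, k) responses, one column per problem
    #   lams : decreasing sequence of regularization levels
    n, p = X.shape
    W = np.zeros((p, Y.shape[1]))
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1/L for the smooth part
    path = []
    for lam in lams:  # warm-start each level at the previous solution
        for _ in range(n_iter):
            grad = X.T @ (X @ W - Y) / n  # gradient of (1/2n)||XW - Y||_F^2
            W = block_soft_threshold(W - step * grad, step * lam)
        path.append(W.copy())
    return path
```

On synthetic data, the row support of the returned matrices grows as λ decreases, mirroring the growing set of simultaneously active covariates that the path-following algorithm maintains.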
