Tikhonov, Ivanov and Morozov regularization for support vector machine learning
Abstract
Learning according to the structural risk minimization (SRM) principle can be naturally expressed as an Ivanov regularization problem. Vapnik himself pointed out this connection when deriving actual learning algorithms from the principle, such as the well-known support vector machine, but he quickly suggested resorting to a Tikhonov regularization scheme instead. At the time this was the best choice, because the corresponding optimization problem is easier to solve and, under certain hypotheses, the solutions obtained by the two approaches coincide. On the other hand, recent advances in learning theory clearly show that the Ivanov regularization scheme allows a more effective control of the learning hypothesis space and, therefore, of the generalization ability of the selected hypothesis. In this paper we prove the equivalence between the Ivanov and Tikhonov approaches and, for the sake of completeness, their connection to Morozov regularization, which has been shown to be useful when an effective estimate of the noise in the data is available. We also show that this equivalence holds under milder conditions on the loss function than in Vapnik's original proposal. These results allow us to derive several methods for performing SRM learning according to an Ivanov or Morozov regularization scheme while using Tikhonov-based solvers, which have been thoroughly studied in the last decades and for which very efficient implementations are available.
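For readers less familiar with the three schemes, the display below sketches them in the standard hinge-loss SVM setting; the symbols $C$, $\rho$ and $\varepsilon$ denote the Tikhonov trade-off parameter, the Ivanov radius and the Morozov noise level, and are our notational choice rather than necessarily the paper's:

$$
\text{Tikhonov:}\quad \min_{w}\; \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\ell\big(y_i, w^{\top}x_i\big),
$$
$$
\text{Ivanov:}\quad \min_{w}\; \sum_{i=1}^{n}\ell\big(y_i, w^{\top}x_i\big) \quad \text{s.t.}\quad \|w\|^{2}\le \rho^{2},
$$
$$
\text{Morozov:}\quad \min_{w}\; \|w\|^{2} \quad \text{s.t.}\quad \sum_{i=1}^{n}\ell\big(y_i, w^{\top}x_i\big)\le \varepsilon.
$$

A minimal sketch of the idea of driving an off-the-shelf Tikhonov solver to satisfy an Ivanov-style constraint is given below, assuming a linear SVM and the typically monotone dependence of $\|w\|$ on $C$; the function names, tolerances and data are hypothetical and only illustrate the general approach, not the specific algorithms derived in the paper.

```python
# Hypothetical sketch (not the paper's algorithm): approximate the Ivanov problem
#   min_w  sum_i hinge(w; x_i, y_i)   s.t.  ||w||^2 <= rho^2
# by bisecting on the Tikhonov parameter C of a standard SVM solver, relying on
# the (typically monotone) growth of ||w|| with C.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def norm_of_tikhonov_solution(X, y, C):
    """Train a linear SVM with Tikhonov parameter C and return (||w||, model)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return np.linalg.norm(clf.coef_), clf

def ivanov_via_tikhonov(X, y, rho, c_lo=1e-6, c_hi=1e6, tol=1e-3, max_iter=60):
    """Search for C such that the Tikhonov solution satisfies ||w|| ~= rho."""
    norm_hi, clf_hi = norm_of_tikhonov_solution(X, y, c_hi)
    if norm_hi <= rho:
        return clf_hi              # constraint inactive even with weak regularization
    for _ in range(max_iter):
        c_mid = np.sqrt(c_lo * c_hi)       # bisect on a logarithmic scale
        norm_mid, clf_mid = norm_of_tikhonov_solution(X, y, c_mid)
        if abs(norm_mid - rho) <= tol * rho:
            return clf_mid
        if norm_mid > rho:
            c_hi = c_mid                   # norm too large: strengthen regularization
        else:
            c_lo = c_mid                   # norm below the budget: weaken regularization
    return clf_mid

if __name__ == "__main__":
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    clf = ivanov_via_tikhonov(X, y, rho=1.0)
    print("||w|| =", np.linalg.norm(clf.coef_))
```

In the same spirit, a Morozov-style formulation could be approached by bisecting on $C$ until the empirical hinge loss of the Tikhonov solution matches the prescribed noise level $\varepsilon$.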
References
Anguita, D., Ghio, A., Greco, N., Oneto, L., & Ridella, S. (2010). Model selection for support vector machines: Advantages and disadvantages of the machine learning theory. In International joint conference on neural networks (pp. 1–8).
Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2011). The impact of unlabeled patterns in Rademacher complexity theory for kernel classifiers. In Advances in neural information processing systems (pp. 585–593).
Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2011). In-sample model selection for support vector machines. In International joint conference on neural networks (pp. 1154–1161).
Anguita, D., Ghio, A., Oneto, L., & Ridella, S. (2012). In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 23(9), 1390–1406.
Anthony, M. (2001). Discrete mathematics of neural networks: Selected topics. Philadelphia: Society for Industrial Mathematics.
Aronszajn, N. (1951). Theory of reproducing kernels. Cambridge: Harvard University.
Bach, F. R., Thibaux, R., & Jordan, M. I. (2005). Computing regularization paths for learning multiple kernels. In Advances in neural information processing systems.
Bartlett, P., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Annals of Statistics, 33(4), 1497–1537.
Bartlett, P., Jordan, M., & McAuliffe, J. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525–536.
Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3, 463–482.
Bauschke, H. H., & Combettes, P. L. (2011). Convex analysis and monotone operator theory in Hilbert spaces. New York: Springer.
Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statistics. New York: Springer.
Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced lectures on machine learning (pp. 169–207).
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2, 499–526.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Collins, M., Schapire, R., & Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1–3), 253–285.
Cortes, C., Kloft, M., & Mohri, M. (2013). Learning kernels using local Rademacher complexity. In Advances in neural information processing systems.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Dinuzzo, F., & Schölkopf, B. (2012). The representer theorem for Hilbert spaces: A necessary and sufficient condition. arXiv preprint arXiv:1205.1928.
Duan, K., Keerthi, S., & Poo, A. (2003). Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51, 41–59.
Elisseeff, A., Evgeniou, T., & Pontil, M. (2005). Stability of randomized learning algorithms. Journal of Machine Learning Research, 6, 55–79.
Evgeniou, T., Pontil, M., & Elisseeff, A. (2004). Leave one out error, stability, and generalization of voting combinations of classifiers. Machine Learning, 55(1), 71–97.
Fan, R., Chen, P., & Lin, C. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, 1889–1918.
Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2009). Agnostic learning of monomials by halfspaces is hard. In Annual IEEE symposium on foundations of computer science (pp. 385–394).
Flannery, B., Press, W., Teukolsky, S., & Vetterling, W. (1992). Numerical recipes in C. New York: Press Syndicate of the University of Cambridge.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Goldstein, A. A. (1977). Optimization of Lipschitz continuous functions. Mathematical Programming, 13(1), 14–22.
Gunter, L., & Zhu, J. (2005). Computing the solution path for the regularized support vector regression. In Neural information processing systems (pp. 481–488).
Guyon, I., Saffari, A., Dror, G., & Cawley, G. (2010). Model selection: Beyond the Bayesian/frequentist divide. The Journal of Machine Learning Research, 11, 61–87.
Haagerup, U. (1981). The best constants in the Khintchine inequality. Studia Mathematica, 70(3), 231–283.
Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. The Journal of Machine Learning Research, 5, 1391–1415.
Intel: Intel Visual Fortran Composer XE. (2012). http://software.intel.com/en-us/articles/intel-compilers/. Accessed: 18/07/2012.
Ivanov, V. (1976). The theory of approximate methods and their application to the numerical solution of singular integral equations. New York: Springer.
Keerthi, S., & Gilbert, E. (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46(1), 351–360.
Keerthi, S., & Lin, C. (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7), 1667–1689.
Keerthi, S., Shevade, S., Bhattacharyya, C., & Murthy, K. (2001). Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5), 1902–1914.
Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6), 2593–2656.
Lawler, E., & Wood, D. (1966). Branch-and-bound methods: A survey. Operations Research, 14, 699–719.
Lee, W., Bartlett, P., & Williamson, R. (1998). The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5), 1974–1980.
Martein, L., & Schaible, S. (1987). On solving a linear program with one quadratic constraint. Decisions in Economics and Finance, 10(1), 75–90.
McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141(1), 148–188.
Mendelson, S. (2003). On the performance of kernel classes. The Journal of Machine Learning Research, 4, 759–771.
Milenova, B., Yarmus, J., & Campos, M. (2005). SVM in Oracle Database 10g: Removing the barriers to widespread adoption of support vector machines. In International conference on very large data bases (pp. 1152–1163).
Morozov, V., Nashed, Z., & Aries, A. (1984). Methods for solving incorrectly posed problems. New York: Springer.
Munder, S., & Gavrila, D. (2006). An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1863–1868.
Oneto, L., Ghio, A., Ridella, S., & Anguita, D. (2014). Fully empirical and data-dependent stability-based bounds. IEEE Transactions on Cybernetics. doi:10.1109/TCYB.2014.2361857.
Park, M. Y., & Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 659–677.
Pelckmans, K., Suykens, J., & De Moor, B. (2004). Morozov, Ivanov and Tikhonov regularization based LS-SVMs. In Neural information processing systems (pp. 1216–1222).
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods—Support Vector Learning, 208, 1–21.
Platt, J. (1999). Using analytic QP and sparseness to speed training of support vector machines. In Advances in neural information processing systems (pp. 557–563).
Poggio, T., Mukherjee, S., Rifkin, R., Rakhlin, A., & Verri, A. (2002). b. In Uncertainty in geometric computations.
Poggio, T., Rifkin, R., Mukherjee, S., & Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428(6981), 419–422.
Pontil, M., & Verri, A. (1998). Properties of support vector machines. Neural Computation, 10(4), 955–974.
Schölkopf, B. (2001). The kernel trick for distances. In Neural information processing systems (pp. 301–307).
Schölkopf, B., Herbrich, R., & Smola, A. (2001). A generalized representer theorem. In Computational learning theory (pp. 416–426).
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Shawe-Taylor, J., & Sun, S. (2011). A review of optimization methodologies in support vector machines. Neurocomputing, 74(17), 3609–3618.
Suykens, J., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Tikhonov, A., Arsenin, V., & John, F. (1977). Solutions of ill-posed problems. Washington, DC: Winston.
Tomasi, C. (2004). Learning theory: Past performance and future results. Nature, 428(6981), 378.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V. (2000). The nature of statistical learning theory. New York: Springer.
Vorontsov, K. (2010). Exact combinatorial bounds on the probability of overfitting for empirical risk minimization. Pattern Recognition and Image Analysis, 20(3), 269–285.
Yuille, A., & Rangarajan, A. (2003). The concave–convex procedure. Neural Computation, 15(4), 915–936.