Entropic risk minimization for nonparametric estimation of mixing distributions
Abstract
We discuss a nonparametric estimation method for the mixing distributions in mixture models. The problem is formalized as the minimization of a one-parameter objective functional which, in special cases, reduces to maximum likelihood estimation or kernel vector quantization. Generalizing the corresponding theorem for nonparametric maximum likelihood estimation, we prove the existence and discreteness of the optimal mixing distribution and provide an algorithm to compute it. We demonstrate that, with an appropriate choice of the parameter, the proposed method is less prone to overfitting than maximum likelihood estimation. We further discuss the connection between this unifying estimation framework and the rate-distortion problem.
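The abstract states the one-parameter objective only in words. As a rough illustration, under the assumption (not taken from the paper) that the functional is the empirical entropic risk of the log-loss for a mixture with component densities f(x | θ) and mixing distribution G, it would read

\[
L_\beta(G) \;=\; -\frac{1}{\beta}\,\frac{1}{n}\sum_{i=1}^{n} \log \int f(x_i \mid \theta)^{\beta}\, dG(\theta), \qquad \beta > 0.
\]

At β = 1 this is the negative log-likelihood of the mixture, so its minimizer is the nonparametric maximum likelihood estimate; as β → ∞, each term tends to -log sup_{θ ∈ supp(G)} f(x_i | θ), so every sample is scored by its single best component, a kernel-vector-quantization-style criterion. Under the same assumption, restricting G to a fixed discrete support grid (in the spirit of the discreteness result stated above) makes the objective convex in the mixing weights, and an EM-style fixed-point iteration is one simple way to minimize it. The sketch below is illustrative only and is not the algorithm of the paper; the function name `entropic_risk_weights` and the Gaussian-component choice are hypothetical.

    import numpy as np

    def entropic_risk_weights(x, thetas, beta, sigma=1.0, n_iter=200):
        """Fit mixing weights w over the fixed grid `thetas` by minimizing
        L_beta(w) = -(1/beta) * mean_i log sum_k w_k * f(x_i|theta_k)**beta
        with Gaussian components f(x|theta) = N(x; theta, sigma^2).
        Normalizing constants are dropped: for fixed beta they only shift
        the objective by a constant and leave the minimizer unchanged.
        Illustrative sketch, not the algorithm proposed in the paper."""
        # F[i, k] = f(x_i | theta_k)**beta, up to a constant factor
        F = np.exp(-beta * 0.5 * ((x[:, None] - thetas[None, :]) / sigma) ** 2)
        w = np.full(len(thetas), 1.0 / len(thetas))
        for _ in range(n_iter):
            r = F * w                          # r[i, k] ~ w_k * f(x_i|theta_k)**beta
            r /= r.sum(axis=1, keepdims=True)  # normalize rows into responsibilities
            w = r.mean(axis=0)                 # EM-style fixed-point reweighting
        return w

    # Example: 31-point support grid on [-3, 3], beta = 2
    w = entropic_risk_weights(np.random.randn(200), np.linspace(-3, 3, 31), beta=2.0)

For β = 1 the loop is exactly the classical fixed-point update for the nonparametric MLE of mixing weights on a grid; other β simply reweight each component likelihood by the power β before the responsibilities are formed.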