Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Abstract
We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ∈[0,1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on 1/(1−γ).
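To make the setting concrete, the following is a minimal Python sketch (not the paper's implementation) of the model-based approach the abstract refers to: call a generative model a fixed number of times per state-action pair to build an empirical MDP, then run Q-value iteration on that empirical model. The helper `sample_next_state`, the assumption of a known deterministic reward function, and all parameter names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def empirical_q_value_iteration(sample_next_state, rewards, n_states, n_actions,
                                gamma, n_samples, n_iters):
    """Model-based Q-value iteration on an empirical MDP built from a generative model.

    sample_next_state(s, a, m): returns an integer array of m next states drawn
        i.i.d. from P(.|s, a) by the generative model (assumed available).
    rewards: (n_states, n_actions) array of known, deterministic rewards (an
        illustrative simplification).
    """
    # Build the empirical transition model P_hat from n_samples calls per (s, a).
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            next_states = sample_next_state(s, a, n_samples)
            counts = np.bincount(next_states, minlength=n_states)
            P_hat[s, a] = counts / n_samples

    # Q-value iteration on the empirical MDP: repeated Bellman optimality updates.
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                 # greedy state values under current Q
        Q = rewards + gamma * (P_hat @ V) # empirical Bellman optimality backup
    return Q
```

For a toy test one could wrap a known transition tensor P_true, e.g. `sample_next_state = lambda s, a, m: rng.choice(n_states, size=m, p=P_true[s, a])`, and compare the returned Q against the exact solution; the sample-complexity results above concern how large n_samples must be (as a function of N = n_states·n_actions, ε, δ and 1/(1−γ)) for such an estimate to be ε-accurate with high probability.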