A stable data-augmented reinforcement learning method with ensemble exploration and exploitation

Springer Science and Business Media LLC - Volume 53 - Pages 24792-24803 - 2023
Guoyu Zuo1,2, Zhipeng Tian1, Gao Huang1,2,3
1Faculty of Information Technology, Beijing University of Technology, Beijing, China
2Beijing Key Laboratory of Computing Intelligence and Intelligent Systems, Beijing, China
3Beijing Advanced Innovation Center for Intelligent Robots and Systems, Beijing Institute of Technology, Beijing, China

Abstract

Learning from visual observations is a significant yet challenging problem in Reinforcement Learning (RL). Two related problems, representation learning and task learning, need to be solved to infer an optimal policy. Several methods have been proposed to utilize data augmentation in reinforcement learning so that an agent can learn directly from images. Although these methods can improve generalization in RL, they often make task learning unstable and can even lead to divergence. We investigate the causes of this instability and find that it is usually rooted in the high variance of Q-functions. In this paper, we propose an easy-to-implement and unified method to solve the above-mentioned problems, Data-augmented Reinforcement Learning with Ensemble Exploration and Exploitation (DAR-EEE). Bootstrap ensembles are incorporated into data-augmented reinforcement learning and provide uncertainty estimates for both original and augmented states, which can be utilized to stabilize and accelerate task learning. Specifically, a novel strategy called uncertainty-weighted exploitation is designed for the rational utilization of transition tuples. Moreover, an efficient exploration method using the highest upper confidence bound is used to balance exploration and exploitation. Our experimental evaluation demonstrates the improved sample efficiency and final performance of our method on a range of difficult image-based control tasks. In particular, our method achieves new state-of-the-art performance on the Reacher-easy and Cheetah-run tasks.
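To make the two ingredients named in the abstract concrete, below is a minimal sketch of how ensemble-based upper-confidence-bound exploration and uncertainty-weighted exploitation could be implemented with a bootstrap ensemble of Q-networks. This is an illustrative assumption rather than the authors' implementation: the class and function names (QEnsemble, ucb_action, uncertainty_weighted_critic_loss), the network sizes, and the specific sigmoid weighting of the Bellman error are placeholders, and the sketch omits the image encoder and the data-augmented observations that DAR-EEE operates on.

import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """A small ensemble of independent Q-networks over (state, action) pairs."""
    def __init__(self, obs_dim, act_dim, n_members=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_members)
        ])

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        # Returns Q-estimates of shape (n_members, batch, 1).
        return torch.stack([m(x) for m in self.members], dim=0)


def ucb_action(q_ensemble, obs, candidate_actions, ucb_lambda=1.0):
    """Optimistic exploration: pick the candidate action with the highest
    mean + lambda * std of the ensemble Q-estimates for a single observation."""
    with torch.no_grad():
        obs_rep = obs.expand(candidate_actions.shape[0], -1)
        q = q_ensemble(obs_rep, candidate_actions).squeeze(-1)   # (members, n_candidates)
        score = q.mean(dim=0) + ucb_lambda * q.std(dim=0)
    return candidate_actions[score.argmax()]


def uncertainty_weighted_critic_loss(q_ensemble, q_target_ensemble,
                                     obs, act, rew, next_obs, next_act,
                                     done, gamma=0.99, temperature=10.0):
    """Uncertainty-weighted exploitation (assumed form): each transition's Bellman
    error is down-weighted when the target ensemble disagrees (high std), so
    uncertain targets contribute less and updates stay stable."""
    with torch.no_grad():
        q_next = q_target_ensemble(next_obs, next_act).squeeze(-1)   # (members, batch)
        target = rew + gamma * (1.0 - done) * q_next.mean(dim=0)
        # Weight in (0.5, 1]: close to 1 for confident targets, lower when uncertain.
        weight = torch.sigmoid(-q_next.std(dim=0) * temperature) + 0.5
    q_pred = q_ensemble(obs, act).squeeze(-1)                         # (members, batch)
    per_member = ((q_pred - target.unsqueeze(0)) ** 2) * weight.unsqueeze(0)
    return per_member.mean()

In this sketch the same ensemble supplies both signals: its standard deviation scales the critic loss per transition (exploitation) and adds an optimism bonus at action-selection time (exploration), which is how the abstract's "uncertainty-weighted exploitation" and upper-confidence-bound exploration fit together.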
