A stable data-augmented reinforcement learning method with ensemble exploration and exploitation
Abstract
Learning from visual observations is a significant yet challenging problem in Reinforcement Learning (RL). Two related problems, representation learning and task learning, must be solved to infer an optimal policy. Several methods have been proposed that use data augmentation in reinforcement learning to learn directly from images. Although these methods can improve generalization in RL, they often make task learning unstable and can even lead to divergence. We investigate the causes of this instability and find that it is usually rooted in the high variance of the Q-functions. In this paper, we propose an easy-to-implement, unified method that addresses the above problems: Data-Augmented Reinforcement learning with Ensemble Exploration and Exploitation (DAR-EEE). Bootstrap ensembles are incorporated into data-augmented reinforcement learning to provide uncertainty estimates for both original and augmented states, which are used to stabilize and accelerate task learning. Specifically, a novel strategy called uncertainty-weighted exploitation is designed to make rational use of transition tuples. Moreover, an efficient exploration method based on the highest upper confidence bound is used to balance exploration and exploitation. Our experimental evaluation demonstrates the improved sample efficiency and final performance of our method on a range of difficult image-based control tasks. In particular, our method achieves new state-of-the-art performance on the Reacher-easy and Cheetah-run tasks.
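To make the two mechanisms named above concrete, the following is a minimal sketch (not the authors' implementation) of upper-confidence-bound action selection over a bootstrapped Q-ensemble and of uncertainty-weighted Bellman targets. The network sizes, the sigmoid-of-disagreement weighting rule, and all hyperparameters are illustrative assumptions; the paper may use different choices.

```python
# Minimal PyTorch sketch of ensemble UCB exploration and
# uncertainty-weighted exploitation. All names and hyperparameters
# below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """N independently initialized Q-heads over (state, action) pairs."""

    def __init__(self, state_dim, action_dim, n_heads=5, hidden=256):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_heads)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Shape: (n_heads, batch, 1)
        return torch.stack([head(x) for head in self.heads], dim=0)


def ucb_action(q_ensemble, state, candidate_actions, beta=1.0):
    """Pick the candidate action with the highest mean + beta * std (UCB)."""
    n_candidates = candidate_actions.shape[0]
    states = state.expand(n_candidates, -1)
    q = q_ensemble(states, candidate_actions).squeeze(-1)  # (n_heads, n_candidates)
    ucb = q.mean(dim=0) + beta * q.std(dim=0)
    return candidate_actions[ucb.argmax()]


def weighted_targets(q_target_ensemble, next_state, next_action,
                     reward, done, gamma=0.99, temperature=10.0):
    """Bellman targets, down-weighted where the target ensemble disagrees.

    The sigmoid-of-negative-std weight follows a SUNRISE-style rule and is
    an assumption about how "uncertainty-weighted exploitation" could look.
    """
    with torch.no_grad():
        q_next = q_target_ensemble(next_state, next_action)  # (n_heads, batch, 1)
        target = reward + gamma * (1.0 - done) * q_next.mean(dim=0)
        std = q_next.std(dim=0)
        weight = torch.sigmoid(-std * temperature) + 0.5  # in (0.5, 1.5)
    # `weight` multiplies the per-sample TD loss of each Q-head.
    return target, weight
```

In this sketch, transitions whose target Q-values the ensemble agrees on receive close to full weight in the TD loss, while high-disagreement (high-uncertainty) transitions are down-weighted, which is one way to stabilize learning under data augmentation.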