Offline Pre-trained Multi-agent Decision Transformer

Springer Science and Business Media LLC - Vol. 20, No. 2, pp. 233-248, 2023
Linghui Meng1, Muning Wen2, Chenyang Le2, Xiyun Li1, Dengpeng Xing1, Weinan Zhang2, Ying Wen2, Haifeng Zhang3, Jun Wang4, Yaodong Yang5, Bo Xu3
1Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2Shanghai Jiao Tong University, Shanghai 200240, China
3School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
4Department of Computer Science, University College London, London, WC1E 6BT, UK
5Institute for AI, Peking University, Beijing 100871, China

Abstract

Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without the need to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorially increasing interactions among agents and with the environment. However, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate such research by providing large-scale datasets and using them to examine the usage of the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of the multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence-modelling ability and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.
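To make the sequence-modelling idea in the abstract concrete, the sketch below shows a minimal decision-transformer-style policy for a single agent's trajectory: each timestep is tokenized as a (return-to-go, observation, action) triple, a causally masked transformer encodes the token sequence, and action logits are predicted from each observation token. This is an illustrative sketch only, not the authors' MADT implementation; the module names, token layout, and hyperparameters are assumptions, and the multi-agent coupling and online fine-tuning studied in the paper are omitted.

```python
# Minimal, illustrative decision-transformer-style sketch (NOT the authors' MADT).
# All names and hyperparameters here are assumptions chosen for clarity.
import torch
import torch.nn as nn

class TrajectoryTransformerPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, d_model=64, n_layers=2, n_heads=4, max_len=100):
        super().__init__()
        # Separate embeddings for return-to-go, observation, and action tokens.
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_obs = nn.Linear(obs_dim, d_model)
        self.embed_act = nn.Embedding(n_actions + 1, d_model)  # +1 as a "no action yet" id
        self.embed_t = nn.Embedding(max_len, d_model)          # timestep (positional) embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, rtg, obs, act, timesteps):
        # rtg: (B, T, 1), obs: (B, T, obs_dim), act: (B, T) int64, timesteps: (B, T) int64
        B, T = act.shape
        pos = self.embed_t(timesteps)
        # Interleave tokens per timestep as (R_1, o_1, a_1, R_2, o_2, a_2, ...).
        tokens = torch.stack(
            [self.embed_rtg(rtg) + pos, self.embed_obs(obs) + pos, self.embed_act(act) + pos],
            dim=2,
        ).reshape(B, 3 * T, -1)
        # Causal mask so each token only attends to earlier tokens in the trajectory.
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T).to(tokens.device)
        h = self.encoder(tokens, mask=mask)
        # Predict the current action from the hidden state of each observation token.
        return self.action_head(h[:, 1::3, :])  # (B, T, n_actions) logits
```

In offline pre-training, such a model would typically be fit with a cross-entropy (behaviour-cloning-style) loss against the logged actions; the paper's contribution is to integrate this kind of transformer backbone with online MARL fine-tuning and transfer across StarCraft II scenarios, which this sketch does not cover.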

Keywords


References

Y. D. Yang, J. Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. [Online], Available: https://arxiv.org/abs/2011.00583, 2020.

S. Shalev-Shwartz, S. Shammah, A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. [Online], Available: https://arxiv.org/abs/1610.03295, 2016.

M. Zhou, J. Luo, J. Villella, Y. D. Yang, D. Rusu, J. Y. Miao, W. N. Zhang, M. Alban, I. Fadakar, Z. Chen, A. C. Huang, Y. Wen, K. Hassanzadeh, D. Graves, D. Chen, Z. B. Zhu, N. Nguyen, M. Elsayed, K. Shao, S. Ahilan, B. K. Zhang, J. N. Wu, Z. G. Fu, K. Rezaee, P. Yadmellat, M. Rohani, N. P. Nieves, Y. H. Ni, S. Banijamali, A. C. Rivers, Z. Tian, D. Palenicek, H. bou Ammar, H. B. Zhang, W. L. Liu, J. Y. Hao, J. Wang. Smarts: Scalable multi-agent reinforcement learning training school for autonomous driving. [Online], Available: https://arxiv.org/abs/2010.09776, 2020.

H. F. Zhang, W. Z. Chen, Z. R. Huang, M. N. Li, Y. D. Yang, W. N. Zhang, J. Wang. Bi-level actor-critic for multi-agent coordination. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 7325–7332, 2020.

M. N. Li, Z. W. Qin, Y. Jiao, Y. D. Yang, J. Wang, C. X. Wang, G. B. Wu, J. P. Ye. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In Proceedings of World Wide Web Conference, ACM, San Francisco, USA, pp. 983–994, 2019. DOI: https://doi.org/10.1145/3308558.3313433.

Y. D. Yang, R. Luo, M. N. Li, M. Zhou, W. N. Zhang, J. Wang. Mean field multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 5571–5580, 2018.

Y. D. Yang, L. T. Yu, Y. W. Bai, Y. Wen, W. N. Zhang, J. Wang. A study of AI population dynamics with million-agent reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, ACM, Stockholm, Sweden, pp. 2133–2135, 2018.

P. Peng, Y. Wen, Y. D. Yang, Q. Yuan, Z. K. Tang, H. T. Long, J. Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. [Online], Available: https://arxiv.org/abs/1703.10069, 2017.

M. Zhou, Z. Y. Wan, H. J. Wang, M. N. Wen, R. Z. Wu, Y. Wen, Y. D. Yang, W. N. Zhang, J. Wang. MALib: A parallel framework for population-based multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.07551, 2021.

X. T. Deng, Y. H. Li, D. H. Mguni, J. Wang, Y. D. Yang. On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. [Online], Available: https://arxiv.org/abs/2109.01795, 2021.

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, S. Levine. Soft actor-critic algorithms and applications. [Online], Available: https://arxiv.org/abs/1812.05905, 2018.

R. Munos, T. Stepleton, A. Harutyunyan, M. G. Bellemare. Safe and efficient off-policy reinforcement learning. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 1054–1062, 2016.

L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, M. Michalski. SEED RL: Scalable and efficient deep-RL with accelerated central inference. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 1407–1416, 2018.

K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: https://doi.org/10.1109/CVPR52688.2022.01553.

Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: https://doi.org/10.1109/ICCV48922.2021.00986.

S. Kim, J. Kim, H. W. Chun. Wave2Vec: Vectorizing electroencephalography bio-signal for prediction of brain disease. International Journal of Environmental Research and Public Health, vol. 15, no. 8, Article number 1750, 2018. DOI: https://doi.org/10.3390/ijerph15081750.

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 159, 2020. DOI: https://doi.org/10.5555/3495724.3495883.

L. L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. [Online], Available: https://arxiv.org/abs/2106.01345, 2021.

Y. D. Yang, J. Luo, Y. Wen, O. Slumbers, D. Graves, H. bou Ammar, J. Wang, M. E. Taylor. Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th International Conference on Autonomous Agents and Multi-agent Systems, ACM, pp. 51–56, 2021.

N. Perez-Nieves, Y. D. Yang, O. Slumbers, D. H. Mguni, Y. Wen, J. Wang. Modelling behavioural diversity for learning in open-ended games. In Proceedings of the 38th International Conference on Machine Learning, pp. 8514–8524, 2021.

X. Y. Liu, H. T. Jia, Y. Wen, Y. J. Hu, Y. F. Chen, C. J. Fan, Z. P. Hu, Y. D. Yang. Unifying behavioral and response diversity for open-ended learning in zero-sum games. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 941–952, 2021.

S. Levine, A. Kumar, G. Tucker, J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. [Online], Available: https://arxiv.org/abs/2005.01643, 2020.

R. Sanjaya, J. Wang, Y. D. Yang. Measuring the non-transitivity in chess. Algorithms, vol. 15, no. 5, Article number 152, 2022. DOI: https://doi.org/10.3390/a15050152.

X. D. Feng, O. Slumbers, Y. D. Yang, Z. Y. Wan, B. Liu, S. McAleer, Y. Wen, J. Wang. Discovering multi-agent auto-curricula in two-player zero-sum games. [Online], Available: https://arxiv.org/abs/2106.02745, 2021.

M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. M. Hung, P. H. S. Torr, J. Foerster, S. Whiteson. The StarCraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, Canada, pp. 2186–2188, 2019.

Z. Li, S. R. Xue, X. H. Yu, H. J. Gao. Controller optimization for multirate systems based on reinforcement learning. International Journal of Automation and Computing, vol. 17, no. 3, pp. 417–427, 2020. DOI: https://doi.org/10.1007/s11633-020-1229-0.

Y. Li, D. Xu. Skill learning for robotic insertion based on one-shot demonstration and reinforcement learning. International Journal of Automation and Computing, vol. 18, no. 3, pp. 457–467, 2021. DOI: https://doi.org/10.1007/s11633-021-1290-3.

C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. [Online], Available: https://arxiv.org/abs/1912.06680, 2019.

A. Kumar, J. Fu, G. Tucker, S. Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 11761–11771, 2019.

A. Kumar, A. Zhou, G. Tucker, S. Levine. Conservative Q-learning for offline reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 100, 2020. DOI: https://doi.org/10.5555/3495724.3495824.

S. Fujimoto, D. Meger, D. Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 2052–2062, 2019.

T. Matsushima, H. Furuta, Y. Matsuo, O. Nachum, S. X. Gu. Deployment-efficient reinforcement learning via model-based offline optimization. In Proceedings of the 9th International Conference on Learning Representations, 2021.

D. J. Su, J. D. Lee, J. M. Mulvey, H. V. Poor. MUSBO: Model-based uncertainty regularized and sample efficient batch optimization for deployment constrained reinforcement learning. [Online], Available: https://arxiv.org/abs/2102.11448, 2021.

Y. Q. Yang, X. T. Ma, C. H. Li, Z. W. Zheng, Q. Y. Zhang, G. Huang, J. Yang, Q. C. Zhao. Believe what you see: Implicit constraint approach for offline multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.03400, 2021.

J. C. Jiang, Z. Q. Lu. Offline decentralized multi-agent reinforcement learning. [Online], Available: https://arxiv.org/abs/2108.01832, 2021.

A. Nair, M. Dalal, A. Gupta, S. Levine. Accelerating online reinforcement learning with offline datasets. [Online], Available: https://arxiv.org/abs/2006.09359, 2020.

M. Janner, Q. Y. Li, S. Levine. Offline reinforcement learning as one big sequence modeling problem. [Online], Available: https://arxiv.org/abs/2106.02039, 2021.

L. C. Dinh, Y. D. Yang, S. McAleer, Z. Tian, N. P. Nieves, O. Slumbers, D. H. Mguni, H. bou Ammar, J. Wang. Online double oracle. [Online], Available: https://arxiv.org/abs/2103.07780, 2021.

D. H. Mguni, Y. T. Wu, Y. L. Du, Y. D. Yang, Z. Y. Wang, M. N. Li, Y. Wen, J. Jennings, J. Wang. Learning in nonzero-sum stochastic games with potentials. In Proceedings of the 38th International Conference on Machine Learning, pp. 7688–7699, 2021.

Y. D. Yang, Y. Wen, J. Wang, L. H. Chen, K. Shao, D. Mguni, W. N. Zhang. Multi-agent determinantal Q-learning. In Proceedings of the 37th International Conference on Machine Learning, pp. 10757–10766, 2020.

T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, S. Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 4295–4304, 2018.

Y. Wen, Y. D. Yang, R. Luo, J. Wang, W. Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.

Y. Wen, Y. D. Yang, J. Wang. Modelling bounded rationality in multi-agent interactions by generalized recursive reasoning. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Yokohama, Japan, pp. 414–421, 2020. DOI: https://doi.org/10.24963/ijcai.2020/58.

S. Hu, F. D. Zhu, X. J. Chang, X. D. Liang. UPDeT: Universal multi-agent reinforcement learning via policy decoupling with transformers. [Online], Available: https://arxiv.org/abs/2101.08001, 2021.

K. Son, D. Kim, W. J. Kang, D. E. Hostallero, Y. Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 5887–5896, 2019.

J. G. Kuba, M. N. Wen, L. H. Meng, S. D. Gu, H. F. Zhang, D. H. Mguni, J. Wang, Y. D. Yang. Settling the variance of multi-agent policy gradients. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 13458–13470, 2021.

J. G. Kuba, R. Q. Chen, M. N. Wen, Y. Wen, F. L. Sun, J. Wang, Y. D. Yang. Trust region policy optimisation in multi-agent reinforcement learning. In Proceedings of the 10th International Conference on Learning Representations, 2022.

S. D. Gu, J. G. Kuba, M. N. Wen, R. Q. Chen, Z. Y. Wang, Z. Tian, J. Wang, A. Knoll, Y. D. Yang. Multi-agent constrained policy optimisation. [Online], Available: https://arxiv.org/abs/2110.02793, 2021.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017. DOI: https://doi.org/10.5555/3295222.3295349.

I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 3104–3112, 2014. DOI: https://doi.org/10.5555/2969033.2969173.

Q. Wang, B. Li, T. Xiao, J. B. Zhu, C. L. Li, D. F. Wong, L. S. Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1810–1822, 2019. DOI: https://doi.org/10.18653/v1/P19-1176.

L. H. Dong, S. Xu, B. Xu. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5884–5888, 2018. DOI: https://doi.org/10.1109/ICASSP.2018.8462506.

K. Han, Y. H. Wang, H. T. Chen, X. H. Chen, J. Y. Guo, Z. H. Liu, Y. H. Tang, A. Xiao, C. J. Xu, Y. X. Xu, Z. H. Yang, Y. M. Zhang, D. C. Tao. A survey on vision transformer. [Online], Available: https://arxiv.org/abs/2012.12556, 2020.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, vol. 8, no. 3, pp. 229–256, 1992. DOI: https://doi.org/10.1007/BF00992696.

I. Mordatch, P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, USA, Article number 183, 2018. DOI: https://doi.org/10.5555/3504035.3504218.

C. Yu, A. Velu, E. Vinitsky, J. X. Gao, Y. Wang, A. Bayen, Y. Wu. The surprising effectiveness of PPO in cooperative, multi-agent games. [Online], Available: https://arxiv.org/abs/2103.01955, 2021.

J. Fu, A. Kumar, O. Nachum, G. Tucker, S. Levine. D4RL: Datasets for deep data-driven reinforcement learning. [Online], Available: https://arxiv.org/abs/2004.07219, 2020.

Z. D. Zhu, K. X. Lin, A. K. Jain, J. Zhou. Transfer learning in deep reinforcement learning: A survey. [Online], Available: https://arxiv.org/abs/2009.07888, 2020.