Deep multiagent reinforcement learning: challenges and directions

Artificial Intelligence Review - Volume 56 - Pages 5023-5056 - 2022
Annie Wong¹, Thomas Bäck¹, Anna V. Kononova¹, Aske Plaat¹
¹ Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands

Abstract

This paper surveys the field of deep multiagent reinforcement learning (RL). The combination of deep neural networks with RL has gained increased traction in recent years and is slowly shifting the focus from single-agent to multiagent environments. Dealing with multiple agents is inherently more complex as (a) the future rewards depend on multiple players’ joint actions and (b) the computational complexity increases. We present the most common multiagent problem representations and their main challenges, and identify five research areas that address one or more of these challenges: centralised training and decentralised execution, opponent modelling, communication, efficient coordination, and reward shaping. We find that many computational studies rely on unrealistic assumptions or are not generalisable to other settings; they struggle to overcome the curse of dimensionality or nonstationarity. Approaches from psychology and sociology capture promising relevant behaviours, such as communication and coordination, to help agents achieve better performance in multiagent settings. We suggest that, for multiagent RL to be successful, future research should address these challenges with an interdisciplinary approach to open up new possibilities in multiagent RL.
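The two sources of complexity named above, (a) rewards that depend on the joint action and (b) growing computational cost, can be illustrated with a small sketch. It is not taken from the paper: the action set `ACTIONS`, the helper `joint_action_space`, and the toy reward `coordination_reward` are hypothetical names chosen for illustration, and the sketch assumes every agent shares the same discrete action set.

```python
from itertools import product

# Hypothetical discrete action set shared by every agent (an assumption).
ACTIONS = ("left", "right", "stay")


def joint_action_space(actions, n_agents):
    """Enumerate every joint action for n_agents sharing one action set."""
    return list(product(actions, repeat=n_agents))


def coordination_reward(joint_action):
    """Toy shared reward: 1.0 only when all agents choose the same action."""
    return 1.0 if len(set(joint_action)) == 1 else 0.0


if __name__ == "__main__":
    # (b) The joint action space grows as len(ACTIONS) ** n_agents.
    for n_agents in (1, 2, 3, 4):
        size = len(joint_action_space(ACTIONS, n_agents))
        print(f"{n_agents} agent(s): {size} joint actions")

    # (a) Agent 0 plays "left" both times, yet the reward differs
    # because it depends on what agent 1 does.
    print(coordination_reward(("left", "left")))
    print(coordination_reward(("left", "right")))
```

Under these assumptions, the same individual action yields different rewards depending on the other agent's choice. If the other agent is itself learning, this dependence is what makes the environment appear nonstationary from a single agent's point of view.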

In International Conference on Machine Learning (pp. 7194–7201). PMLR Zhang X, Clune J, Stanley KO (2017) On the relationship between the openai evolution strategy and stochastic gradient descent. arXiv preprint. arXiv:1712.06564 Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. Springer, Cham, pp 321–384. https://doi.org/10.1007/978-3-030-60990-0_12, Zheng Y, Meng Z, Hao J, Zhang Z (2018a) Weighted double deep multiagent reinforcement learning in stochastic cooperative environments. In: Pacific RIM international conference on artificial intelligence. Springer, Berlin, pp 421–429 Zheng Y, Meng Z, Hao J, Zhang Z, Yang T, Fan C (2018b) A deep bayesian policy reuse approach against non-stationary agents. In: Proceedings of the 32nd international conference on neural information processing systems, pp 962–972 Zhou M, Liu Z, Sui P, Li Y, Chung YY (2020) Learning implicit credit assignment for cooperative multi-agent reinforcement learning. In: Advances in neural information processing systems, vol 33, pp 11853–11864 Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA). IEEE, Piscataway, pp 3357–3364 Zou H, Ren T, Yan D, Su H, Zhu J (2021) Learning task-distribution reward shaping with meta-learning. In: Proceedings of the AAAI conference on artificial intelligence, Vancouver, BC, Canada, pp 2–9