Approachability in Stackelberg Stochastic Games with Vector Costs

Dynamic Games and Applications - Volume 7 - Pages 422-442 - 2016
Dileep Kalathil1, Vivek S. Borkar2, Rahul Jain3
1EECS at UC Berkeley, Berkeley, USA
2EE department, IIT Bombay, Mumbai, India
3Department of EE, CS and ISE, University of Southern California, Los Angeles, USA

Abstract

The notion of approachability was introduced by Blackwell (Pac J Math 6(1):1–8, 1956) in the context of vector-valued repeated games. The famous 'Blackwell's approachability theorem' prescribes a strategy for approachability, i.e., for 'steering' the average vector cost of a given agent toward a given target set, irrespective of the strategies of the other agents. In this paper, motivated by multi-objective optimization and decision-making problems in dynamically changing environments, we address the approachability problem in Stackelberg stochastic games with vector-valued cost functions. We make two main contributions. First, we give a simple and computationally tractable strategy for approachability in Stackelberg stochastic games, along the lines of Blackwell's. Second, we give a reinforcement learning algorithm for learning the approachability strategy when the transition kernel is unknown. As a by-product, we also recover Blackwell's necessary and sufficient conditions for approachability of convex sets in this setup, and thus a complete characterization. We give sufficient conditions for non-convex sets.
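The 'steering' idea behind Blackwell's strategy can be seen in a minimal numerical sketch. The toy setup below is hypothetical and much simpler than the paper's Stackelberg stochastic game: a single agent has two actions with vector payoffs (1, 0) and (0, 1), and the target is the point c = (0.5, 0.5), which lies in the convex hull of the payoffs and is therefore approachable. At each round the agent computes the direction from the target to the current average cost and plays the action whose payoff pushes against that direction.

```python
import numpy as np

# Hypothetical toy example of Blackwell-style steering (not the paper's
# Stackelberg setting): two actions with vector payoffs, target point c
# inside their convex hull.
payoffs = np.array([[1.0, 0.0],
                    [0.0, 1.0]])  # row a = payoff vector of action a
c = np.array([0.5, 0.5])          # approachable target point

T = 1000
total = np.zeros(2)
for t in range(1, T + 1):
    x_bar = total / max(t - 1, 1)   # running average cost so far
    lam = x_bar - c                 # direction from target to average
    # Blackwell's prescription: make the next payoff move the average
    # back toward the target, i.e. pick the action minimizing <lam, m(a)>.
    a = int(np.argmin(payoffs @ lam))
    total += payoffs[a]

x_bar = total / T
print(np.round(x_bar, 3))  # average cost steered to ~[0.5, 0.5]
```

With an adversarial opponent, the same projection-based rule applies, but the agent must instead guarantee the separating-hyperplane condition against every opponent response, which is exactly what the approachability conditions recalled in the paper characterize.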

References

Abounadi J, Bertsekas D, Borkar VS (2001) Learning algorithms for Markov decision processes with average cost. SIAM J Control Optim 40(3):681–698
Altman E (1999) Constrained Markov decision processes, vol 7. CRC Press, Boca Raton
Aubin JP, Cellina A (1984) Differential inclusions: set-valued maps and viability theory. Springer, New York
Bardi M, Capuzzo-Dolcetta I (2008) Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Springer, Berlin
Barwell AD (2011) Omega-limit sets of discrete dynamical systems. Ph.D. dissertation, University of Birmingham
Benaim M, Hofbauer J, Sorin S (2005) Stochastic approximations and differential inclusions. SIAM J Control Optim 44(1):328–348
Benaim M, Hofbauer J, Sorin S (2006) Stochastic approximations and differential inclusions, part II: applications. Math Oper Res 31(4):673–695
Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pac J Math 6(1):1–8
Borkar VS (1991) Topics in controlled Markov chains. Longman Scientific & Technical, Harlow
Borkar VS (1998) Asynchronous stochastic approximation. SIAM J Control Optim 36(3):840–851
Borkar VS (2005) An actor-critic algorithm for constrained Markov decision processes. Syst Control Lett 54(3):207–213
Borkar VS (2008) Stochastic approximation: a dynamical systems viewpoint. Hindustan Book Agency, New Delhi; Cambridge University Press, Cambridge
Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim 38(2):447–469
Even-Dar E, Kakade S, Mansour Y (2009) Online Markov decision processes. Math Oper Res 34(3):726–736
Filar J, Vrieze K (1996) Competitive Markov decision processes. Springer, New York
Kamal S (2010) A vector minmax problem for controlled Markov chains. arXiv preprint arXiv:1011.0675v1
Mannor S, Shimkin N (2003) The empirical Bayes envelope and regret minimization in competitive Markov decision processes. Math Oper Res 28(2):327–345
Mannor S, Shimkin N (2004) A geometric approach to multi-criterion reinforcement learning. J Mach Learn Res 5:325–360
Milman E (2006) Approachable sets of vector payoffs in stochastic games. Games Econ Behav 56(1):135–147
Patek SD (1997) Stochastic shortest path games: theory and algorithms. Ph.D. dissertation, Lab. for Information and Decision Systems, MIT
Perchet V (2014) Approachability, regret and calibration: implications and equivalences. J Dyn Games 1(2):181–254
Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Shimkin N, Shwartz A (1993) Guaranteed performance regions in Markovian systems with competing decision makers. IEEE Trans Autom Control 38(1):84–95
Steuer RE (1989) Multiple criteria optimization: theory, computation, and application. Wiley, New York
Tucker W (1999) The Lorenz attractor exists. C R Acad Sci Ser I Math 328(12):1197–1202
Wagner DH (1977) Survey of measurable selection theorems. SIAM J Control Optim 15(5):859–903
Yu JY, Mannor S, Shimkin N (2009) Markov decision processes with arbitrary reward processes. Math Oper Res 34(3):737–757
Yu H, Bertsekas DP (2013) On boundedness of Q-learning iterates for stochastic shortest path problems. Math Oper Res 38(2):209–227