Four proofs of Gittins’ multiarmed bandit theorem

Esther Frostig1, Gideon Weiss1
1Department of Statistics, The University of Haifa, Mount Carmel, Israel

Tóm tắt

We study four proofs that the Gittins index priority rule is optimal for alternative bandit processes. These include Gittins’ original exchange argument, Weber’s prevailing charge argument, Whittle’s Lagrangian dual approach, and Bertsimas and Niño-Mora’s proof based on the achievable region approach and generalized conservation laws. We extend the achievable region proof to infinite countable state spaces, by using infinite dimensional linear programming theory.

Từ khóa


Tài liệu tham khảo

Anderson, E., & Nash, P. (1987). Linear programming in infinite dimensional spaces. Theory and application. Chichester: Wiley-Interscience.

Barvinok, A. (2002). A course in convexity. AMS graduate studies in mathematics: Vol. 54.

Battacharya, P., Georgiadis, L., & Tsoucas, P. (1992). Extended polymatroids, properties and optimization. In E. Balas, G. Cornnéjols, & R. Kannan edrs (Eds.), Integer programming and combinatorial optimization, IPCO2 (pp. 298–315). Pittsburgh: Carnegie-Mellon University Press.

Bellman, R. (1956). A problem in the sequential design of experiments. Sankhia, 16, 221–229.

Bertsimas, D., & Niño-Mora, J. (1996). Conservation laws, extended polymatroids and multi-armed bandit problems. Mathematics of Operations Research, 21, 257–306.

Chakravorty, J., & Mahajan, A. (2013). Multi-armed bandits, Gittins index, and its calculation. http://www.ece.mcgill.ca/~amahaj1/projects/bandits/book/2013-bandit-computations.pdf.

Dacre, M., Glazebrook, K., & Niño-Mora, J. (1999). The achievable region approach to the optimal control of stochastic systems. Journal of the Royal Statistical Society, Series B, Methodological, 61, 747–791. With discussion.

Denardo, E. V., Park, H., & Rothblum, U. G. (2007). Risk-sensitive and risk-neutral multiarmed bandits. Mathematics of Operations Research, 32(2), 374–394.

Denardo, E. V., Feinberg, E. A., & Rothblum, U. G. (2013). The multi-armed bandit, with constraints. Annals of Operations Research, 208, 37–62. Volume 1 of this publication.

Edmonds, J. (1970). Submodular functions, matroids and certain polyhedra. In R. Guy, H. Hanani, N. Sauer, & J. Schönheim (Eds.), Proceedings of Calgary international conference on combinatorial structures and their applications (pp. 69–87). New York: Gordon and Breach.

Federgruen, A., & Groenevelt, H. (1988). Characterization and optimization of achievable performances in general queuing systems. Operations Research, 36, 733–741.

Gittins, J. C., & Jones, D. M. (1974). A dynamic allocation indices for the sequential design of experiments. In J. Gani, K. Sarkadi, & I. Vince (Eds.), Progress in statistics, European Meeting of statisticians 1972 (Vol. 1, pp. 241–266). Amsterdam: North Holland.

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 14, 148–167.

Gittins, J. C. (1989). Multiarmed bandits allocation indices. New York: Wiley.

Gittins, J. C., Glazebrook, K., & Weber, R. R. (2011). Multiarmed bandits allocation indices (2nd ed.). New York: Wiley.

Glazebrook, K. D., & Garbe, R. (1999). Almost optimal policies for stochastic systems which almost satisfy conservation laws. Annals of Operations Research, 92, 19–43.

Glazebrook, K., & Niño-Mora, J. (2001). Parallel scheduling of multiclass M/M/m queues: approximate and heavy-traffic optimization of achievable performance. Operations Research, 49, 609–623.

Harrison, J. M. (1975). Dynamic scheduling of a multiclass queue, discount optimality. Operations Research, 23, 270–282.

Ishikada, A. T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83, 113–154.

Kaspi, H., & Mandelbaum, A. (1998). Multi-armed bandits in discrete and continuous time. The Annals of Applied Probability, 8, 1270–1290.

Katehakis, M. N., & Derman, C. (1986). Computing optimal sequential allocation rules in clinical trials. In J. Van Ryzin (Ed.), I.M.S. Lecture notes-monograph series: Vol. 8. Adaptive statistical procedures and related topics (pp. 29–39).

Katehakis, M. N., & Veinott, A. F. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268.

Katehakis, M. N., & Rothblum, U. (1996). Finite state multi-armed bandit sensitive-discount, average-reward and average-overtaking optimality. The Annals of Applied Probability, 6(3), 1024–1034.

Klimov, G. P. (1974). Time sharing service systems I. Theory of Probability and Its Applications, 19, 532–551.

Meilijson, I., & Weiss, G. (1977). Multiple feedback at a single server station. Stochastic Processes and Their Applications, 5, 195–205.

Mandelbaum, A. (1986). Discrete multi-armed bandits and multiparameter processes. Probability Theory and Related Fields, 71, 129–147.

Mitten, L. G. (1960). An analytic solution to the least cost testing sequence problem. Journal of Industrial Engineering, 11(1), 17.

Niño-Mora, J. (2001). Restless bandits, partial conservation laws and indexability. Advances in Applied Probability, 33, 76–98.

Niño-Mora, J. (2002). Dynamic allocation indices for restless projects and queuing admission control: a polyhedral approach. Mathematical Programming Series A, 93, 361–413.

Niño-Mora, J. (2006). Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Mathematics of Operations Research, 31, 50–84.

Niño-Mora, J. (2007). A 2/3n 3 fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS Journal on Computing, 19, 596–606

Ross, S. M. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.

Royden, H. L. (1971). Real analysis. New York: Macmillan.

Rudin, W. (1964). Principles of mathematical analysis. New York: McGraw-Hill.

Sevcik, K. C. (1974). Scheduling for minimum total loss using service time distributions. Journal of the Association for Computing Machinery, 21, 66–75.

Shanthikumar, J. G., & Yao, D. D. (1992). Multiclass queuing systems: polymatroidal structure and optimal scheduling control. Operations Research 40, 293–299.

Shapiro, A. (2001). On duality theory of conic linear problems. In M. A. Goberna & M. A. Lopez (Eds.), Semi-infinite programming (pp. 135–165). Netherlands: Kluwer.

Sonin, I. M. (2008). A generalized Gittins index for a Markov chain and its recursive calculation. Statistics & Probability Letters, 78(12), 1526–1533.

Tcha, D., & Pliska, S. R. (1975). Optimal control of single server queuing networks and multi-class M/G/1 queues with feedback. Operations Research, 25, 248–258.

Tsitsiklis, J. N. (1994). A short proof of the Gittins index theorem. The Annals of Applied Probability, 4, 194–199.

Tsoucas, P. (1991). The region of achievable performance in a model of Klimov. Research Report RC16543, IBM T.J. Watson Research Center Yorktown Heights, New York.

Varaiya, P., Walrand, J., & Buyukkoc, C. (1985). Extensions of the multiarmed bandit problem: the discounted case. IEEE Transactions on Automatic Control, AC-30, 426–439.

Weber, R. R. (1992). On the Gittins index for multiarmed bandits. Annals of Probability, 2, 1024–1033.

Weber, R. R., & Weiss, G. (1990). On an index policy for restless bandits. Journal of Applied Probability, 27, 637–648.

Weber, R. R., & Weiss, G. (1991). Addendum to ‘on an index policy for restless bandits’. Advances in Applied Probability, 23, 429–430.

Weiss, G. (1988). Branching bandit processes. Probability in the Engineering and Informational Sciences, 2, 269–278.

Whittle, P. (1980). Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society, Series B, 42, 143–149.

Whittle, P. (1981). Arm acquiring bandits. Annals of Probability, 9, 284–292.

Whittle, P. (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability, 25A, 287–298. In J. Gani (Ed.), A celebration of applied probability.

Whittle, P. (1990). Risk-sensitive optimal control. New York: Wiley.