Parallel Rollout for Online Solution of Partially Observable Markov Decision Processes
Abstract
We propose a novel approach, called parallel rollout, to solving (partially observable) Markov decision processes. Our approach generalizes the rollout algorithm of Bertsekas and Castanon (1999) by rolling out a set of multiple heuristic policies rather than a single policy. In particular, the parallel rollout approach targets the class of problems where multiple heuristic policies are available such that each policy performs near-optimally for a different set of system paths. Parallel rollout automatically combines the given multiple policies to create a new policy that adapts to the different system paths and improves upon the performance of each policy in the set. We formally prove this claim for two criteria: total expected reward and infinite horizon discounted reward. The parallel rollout approach also resolves the key issue of selecting which policy to roll out among multiple heuristic policies whose performances cannot be predicted in advance. We present two example problems to illustrate the effectiveness of the parallel rollout approach: a buffer management problem and a multiclass scheduling problem.
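The decision rule described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: at each decision epoch, the value of each candidate action is estimated by Monte Carlo simulation under every base policy, the maximum over policies is taken, and the action with the best such estimate is chosen. The function names `simulate_step` and `reward`, and all parameter defaults, are assumptions for the example.

```python
def parallel_rollout_action(state, actions, policies, simulate_step, reward,
                            horizon=20, num_trajectories=30):
    """One step of parallel rollout (illustrative sketch): for each candidate
    action, estimate its value under each base policy by simulation, take the
    max over policies, and return the action with the best estimate."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        # Max over base policies of the estimated return from taking `a` now
        # and following that policy thereafter.
        q = max(
            estimate_value(state, a, pi, simulate_step, reward,
                           horizon, num_trajectories)
            for pi in policies
        )
        if q > best_value:
            best_action, best_value = a, q
    return best_action


def estimate_value(state, first_action, policy, simulate_step, reward,
                   horizon, num_trajectories):
    """Average total reward over sampled trajectories that take `first_action`
    and then follow `policy` for `horizon` steps."""
    total = 0.0
    for _ in range(num_trajectories):
        s, a, ret = state, first_action, 0.0
        for _ in range(horizon):
            s_next = simulate_step(s, a)   # sample the next state
            ret += reward(s, a)
            s, a = s_next, policy(s_next)  # hand control to the base policy
        total += ret
    return total / num_trajectories
```

Because the maximization over policies happens inside the action-value estimate, the resulting policy can follow whichever base policy is best for the system path actually being realized, which is the source of the improvement claimed in the abstract.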
References
Andersen, A. T., Jensen, A., and Nielsen, B. F. 1995. Modelling and performance study of packet-traffic with self-similar characteristics over several timescales with Markovian arrival processes (MAP). Proc. 12th Nordic Teletraffic Seminar, 269–283.
Andersen, A. T., and Nielsen, B. F. 1997. An application of superpositions of two state Markovian sources to the modelling of self-similar behaviour. Proc. IEEE INFOCOM, 196–204.
Asmussen, S., Nerman, O., and Olsson, M. 1996. Fitting phase-type distributions via the EM algorithm. Scand. J. Statist. 23:419–441.
Bertsekas, D. P. 1995. Dynamic Programming and Optimal Control. Athena Scientific.
Bertsekas, D. P. 1997. Differential training of rollout policies. Proc. 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, IL.
Bertsekas, D. P., and Castanon, D. A. 1999. Rollout algorithms for stochastic scheduling problems. J. of Heuristics 5:89–108.
Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Nashua, NH: Athena Scientific.
Blondia, C. 1993. A discrete-time batch Markovian arrival process as B-ISDN traffic model. Belgian J. of Operations Research, Statistics and Computer Science 32.
Bonald, T., May, M., and Bolot, J.-C. 2000. Analytic evaluation of RED performance. Proc. IEEE INFOCOM, 1415–1424.
Chang, H. S. 2001. On-line sampling-based control for network queueing problems. Ph.D. thesis, Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN.
Chang, H. S., Givan, R., and Chong, E. K. P. 2000. On-line scheduling via sampling. Proc. 5th Int. Conf. on Artificial Intelligence Planning and Scheduling, 62–71.
Chen, D. T., and Rieders, M. 1996. Cyclic Markov modulated Poisson processes in traffic characterization. Stochastic Models 12(4): 585–610.
Duffield, N. G., and Whitt, W. 1998. A source traffic model and its transient analysis for network control. Stochastic Models 14:51–78.
Firoiu, V., and Borden, M., 2000. A study of active queue management for congestion control. Proc. IEEE INFOCOM, 1435–1445.
Fischer, W., and Meier-Hellstern, K. 1992. The Markov-modulated Poisson process (MMPP) cookbook. Performance Evaluation 18:149–171.
Floyd, S., and Jacobson, V. 1993. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Net. 1(4): 397–413.
Givan, R., Chong, E. K. P., and Chang, H. S. 2002. Scheduling multiclass packet streams to minimize weighted loss. Queueing Systems 41(3): 241–270.
Hashem, E. 1989. Analysis of random drop for gateway congestion control. Tech. Rep. LCS/TR-465, Massachusetts Institute of Technology.
Ho, Y. C., and Cao, X. R. 1991. Perturbation Analysis of Discrete Event Dynamic Systems. Norwell, Massachusetts: Kluwer Academic Publishers.
Kearns, M., Mansour, Y., and Ng, A. Y. 1999. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning 49: 193–208.
Kearns, M., Mansour, Y., and Ng, A. Y. 2000. Approximate planning in large POMDPs via reusable trajectories. Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K.-R. Müller (eds), Cambridge, MA: MIT Press.
Kitaev, M. Y., and Rykov, V. V. 1995. Controlled Queueing Systems. CRC Press.
Kulkarni, V. G., and Tedijanto, T. E. 1998. Optimal admission control of Markov-modulated batch arrivals to a finite-capacity buffer. Stochastic Models 14(1): 95–122.
Mayne, D. Q., and Michalska, H. 1990. Receding horizon control of nonlinear systems. IEEE Trans. Auto. Contr. 35(7): 814–824.
Misra, V. M., and Gong, W. B. 1998. A hierarchical model for teletraffic. Proc. IEEE CDC 2: 1674–1679.
Neuts, M. F. 1979. A versatile Markovian point process. J. Appl. Prob. 16: 764–779.
Peha, J. M., and Tobagi, F. A. 1990. Evaluating scheduling algorithms for traffic with heterogeneous performance objectives. Proc. IEEE GLOBECOM, 21–27.
Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley.
Romanow, A., and Floyd, S. 1995. Dynamics of TCP traffic over ATM networks. IEEE J. of Select. Areas Commun. 13(4): 633–641.
Ross, S. M. 1997. Simulation. San Diego, CA: Academic Press.
Sen, P., Maglaris, B., Rikli, N., and Anastassiou, D. 1989. Models for packet switching of variable-bit-rate video sources. IEEE J. of Select. Areas Commun. 7: 865–869.
Turin, J. W. 1996. Fitting stochastic automata via the EM algorithm. Stochastic Models 12: 405–424.