A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning

David Choi1, Benjamin Van Roy2
1Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, USA
2Departments of Management Science and Engineering and Electrical Engineering, Stanford University, Stanford, USA

Abstract

Keywords

