A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning
Abstract
Keywords
References
Crites R, Barto A 1996. Improving elevator performance using reinforcement learning, Adv Neural Inf Process Syst, 8:1017–1023.
Bellman R, Dreyfus S 1959. Functional approximations and dynamic programming, Math Tables Other Aids Comput, 13:247–251.
Benveniste A, Métivier M, Priouret P 1991. Adaptive Algorithms and Stochastic Approximations. Berlin Heidelberg New York: Springer-Verlag.
Bertsekas DP 1995a. Nonlinear Programming. Athena Scientific.
Bertsekas DP 1995b. Dynamic Programming and Optimal Control. Athena Scientific.
Singh S, Bertsekas DP 1997. Reinforcement learning for dynamic channel allocation in cellular telephone systems. Adv Neural Inf Process Syst, vol. 9. MIT Press, p. 974.
Bertsekas DP, Tsitsiklis JN 1995. Neuro-Dynamic Programming. Athena Scientific.
Boyan J 1999. Least-squares temporal difference learning. Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 49–56.
Boyan J 2002. Technical update: least-squares temporal difference learning, Mach Learn, 49(2):233–246.
Bradtke SJ, Barto AG 1996. Linear least-squares algorithms for temporal-difference learning, Mach Learn, 22:33–57.
Choi DS, Van Roy B 2001. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Proceedings of the International Joint Conference on Machine Learning.
Dayan P 1992. The convergence of TD(λ) for general λ, Mach Learn, 8:341–362.
de Farias DP, Van Roy B 2000. On the existence of fixed points for approximate value iteration and temporal-difference learning, J Optim Theory Appl, 105(3).
Gurvits L, Lin LJ, Hanson SJ 1994. Incremental learning of evaluation functions for absorbing Markov chains: new methods and theorems, preprint.
Lagoudakis M, Parr R 2001. Model-free least-squares policy iteration. Adv Neural Inf Process Syst (NIPS-14).
Nedic A, Bertsekas DP 2001. Policy evaluation algorithms with linear function approximation. Tech. Rep. LIDS-P-2537, MIT Laboratory for Information and Decision Systems, December 2001.
Sutton RS 1988. Learning to predict by the method of temporal differences, Mach Learn, 3:9–44.
Tadić V 2001. On the convergence of temporal-difference learning with linear function approximation, Mach Learn, 42:241–267.
Tsitsiklis JN, Van Roy B 1997. An analysis of temporal-difference learning with function approximation, IEEE Trans Automat Contr, 42:674–690.
Tsitsiklis JN, Van Roy B 1999. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives, IEEE Trans Automat Contr, 44(10):1840–1851.
Van Roy B 1998. Learning and value function approximation in complex decision processes, Ph.D. dissertation, MIT.
Van Roy B, Bertsekas DP, Lee Y, Tsitsiklis JN 1999. A neuro-dynamic programming approach to retailer inventory management. Proc. of the IEEE Conf Decis Contr.
Varaiya P, Walrand J, Buyukkoc C 1985. Extensions of the multiarmed bandit problem: the discounted case, IEEE Trans Automat Contr, 30(5).
Warmuth M, Forster J 2000. Relative loss bounds for temporal-difference learning. Proc. of the Seventeenth International Conference on Machine Learning, pp. 295–302.
Warmuth M, Schapire R 1997. On the worst-case analysis of temporal-difference learning algorithms, Mach Learn, 22(1–3):95–121.
Zhang W, Dietterich TG 1995. A reinforcement learning approach to job-shop scheduling. Proc. of the International Joint Conference on Artificial Intelligence.