Temporal-difference search in computer Go
Abstract
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, it updates the value function from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.
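To make the idea concrete, the following is a minimal sketch of a temporal-difference search loop with a linear value function over sparse binary features. It is not the paper's implementation: the state interface (`copy`, `features`, `legal_moves`, `play`, `is_terminal`, `outcome`) is a hypothetical stand-in for a Go simulator, the update shown is plain TD(0), and two-player details such as alternating perspectives are omitted.

```python
import random

class TDSearch:
    """Sketch of temporal-difference search: value-function learning
    applied to simulated experience from the current root position.
    The state interface used here is hypothetical, not the paper's."""

    def __init__(self, num_features, alpha=0.1, epsilon=0.1):
        self.theta = [0.0] * num_features  # one weight per binary feature
        self.alpha = alpha                 # step size
        self.epsilon = epsilon             # exploration rate

    def value(self, active):
        # Linear value function: sum the weights of the active binary features.
        return sum(self.theta[i] for i in active)

    def select_move(self, state):
        # Epsilon-greedy over the estimated value of each successor state.
        moves = state.legal_moves()
        if random.random() < self.epsilon:
            return random.choice(moves)
        return max(moves, key=lambda m: self.value(state.play(m).features()))

    def search(self, root, num_simulations):
        # Simulate episodes from the root; after each transition, update the
        # weights by TD(0), bootstrapping from the value of the next state
        # (or from the final outcome at the end of the episode).
        for _ in range(num_simulations):
            state = root.copy()
            while not state.is_terminal():
                active = state.features()      # indices of active binary features
                v = self.value(active)
                state = state.play(self.select_move(state))  # play() returns a successor
                target = (state.outcome() if state.is_terminal()
                          else self.value(state.features()))
                delta = target - v             # TD error
                for i in active:               # gradient of a linear value function
                    self.theta[i] += self.alpha * delta  # is just the feature vector
        # Finally, act greedily with respect to the learned value function.
        return max(root.legal_moves(),
                   key=lambda m: self.value(root.play(m).features()))
```

In contrast to Monte-Carlo tree search, no tree of visited states is stored in this sketch: generalisation between related positions comes entirely from the shared feature weights.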