V_V comments on [Link] AlphaGo: Mastering the ancient game of Go with Machine Learning - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (122)
Well, the value of a state is defined assuming that the optimal policy is used for all the following actions. For tabular RL you can actually prove that the updates converge to the optimal value function/policy function (under some conditions). If NN are used you don't have any convergence guarantees, but in practice the people at DeepMind are able to make it work, and this particular scenario (perfect observability, determinism and short episodes) is simpler than, for instance that of the Atari DQN agent.