
bogus comments on [Link] AlphaGo: Mastering the ancient game of Go with Machine Learning - Less Wrong Discussion

14 Post author: ESRogs 27 January 2016 09:04PM


Comments (122)


Comment author: bogus 28 January 2016 06:47:57PM *  1 point [-]

In addition, the entire network needs to learn somehow to determine which parts of the network in the past were responsible for current reward signals which are delayed and noisy.

This is a well-known problem: temporal credit assignment, which reinforcement learning addresses. It is a significant component of the reported results. (What happens in practice is that a network's ability to assign "credit" or "blame" for reward signals falls off exponentially with increasing delay. This is a significant limitation, but reinforcement learning is nevertheless very helpful given tight feedback loops.)
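The exponential falloff with delay can be made concrete with a minimal, illustrative sketch (my own, not from the thread): under a discount factor gamma, a reward arriving k steps after an action contributes only gamma ** k to that action's return.

```python
# Illustrative sketch: with discounting, the credit a delayed reward
# assigns to an earlier timestep decays geometrically (gamma ** k).
def discounted_returns(rewards, gamma=0.9):
    """Compute the discounted return G_t for each timestep of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single reward at the end of a 5-step episode: the first action's
# return is gamma**4 times the reward, i.e. most of the signal has decayed.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.9))
```

With tight feedback loops (small k) the decay is negligible, which is the regime the comment describes as working well.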

Comment author: RaelwayScot 28 January 2016 07:16:56PM 0 points [-]

Yes, but as I wrote above, the problems of credit assignment, reward delay and noise are non-existent in this setting, and hence their work does not contribute at all to solving AI.

Comment author: Vaniver 28 January 2016 07:31:11PM 1 point [-]

Credit assignment and reward delay are nonexistent? What do you think happens when one diffs the board strength of two potential boards?

Comment author: RaelwayScot 28 January 2016 08:39:37PM *  0 points [-]

"Nonexistent problems" was meant as hyperbole: the problems weren't solved in interesting ways, and they are extremely simple in this setting because the states and rewards are noise-free. I am not sure what you mean by the second question. They just apply gradient descent over the entire history of moves of the current game so that expected reward is maximized.
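The "gradient descent over the whole game" idea corresponds to a REINFORCE-style policy gradient. Here is a hedged sketch (my illustration, not DeepMind's code, with hypothetical names): every move in an episode is weighted by the same terminal reward z, i.e. the per-step update is grad log pi(a_t | s_t) * z.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def episode_gradient(logits_per_step, actions, z):
    """Policy-gradient contribution for one finished episode.

    Returns, per step, the gradient of log pi(a_t) w.r.t. the logits,
    scaled by the single terminal reward z (e.g. +1 win / -1 loss).
    Gradient of log-softmax w.r.t. logit j is (1[j == a] - p[j]).
    """
    grads = []
    for logits, a in zip(logits_per_step, actions):
        p = softmax(logits)
        g = [((1.0 if j == a else 0.0) - p[j]) * z for j in range(len(logits))]
        grads.append(g)
    return grads
```

Note that z multiplies every step identically, which is exactly the "all moves reinforced equally" structure discussed below.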

Comment author: Vaniver 28 January 2016 11:17:20PM 2 points [-]

It seems to me that the problem of value assignment to boards ("What's the edge for W or B if the game state looks like this?") is basically a solution to that problem, since it gives you the counterfactual information you need (how much would placing a stone here improve my edge?) to answer those questions.

I agree that it's a much simpler problem here than it is in a more complicated world, but I don't think it's trivial.

Comment author: V_V 28 January 2016 08:32:31PM *  0 points [-]

Reward delay is not very significant in this task, since the task is episodic and fully observable, and there is no time preference; thus you can just play a game to completion without updating, and then assign the final reward to all the positions.
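The episodic, no-time-preference recipe described here reduces to simple Monte Carlo labeling. A minimal sketch (illustrative only; the state encoding and reward convention are assumptions, not the paper's):

```python
# Sketch of episodic Monte Carlo value targets: play to completion,
# then label every visited position with the final game outcome.
def label_positions(positions, final_reward):
    """Return (position, target) training pairs for a finished game.

    positions: hypothetical encoded board states in play order.
    final_reward: terminal outcome, e.g. +1 for a win, -1 for a loss.
    """
    return [(pos, final_reward) for pos in positions]

game = ["s0", "s1", "s2"]          # hypothetical encoded board states
training_pairs = label_positions(game, +1)
```

Because there is no discounting and no mid-game update, every position in the game receives the identical target.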

In more general reinforcement learning settings, where you want to update your policy during the execution, you have to use some kind of temporal difference learning method, which is further complicated if the world states are not fully observable.
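For contrast, a hedged sketch of the simplest temporal difference method, TD(0), which updates the value estimate online before the episode finishes (my illustration; names and constants are assumptions): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

```python
# Illustrative TD(0) update: bootstrap from the estimated value of the
# next state instead of waiting for the episode's final reward.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V; return the TD error."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
td0_update(V, "s0", 0.0, "s1")        # no signal has propagated yet
td0_update(V, "s1", 1.0, "terminal")  # reward starts flowing into V["s1"]
```

Under partial observability the state s would have to be replaced by some belief or history summary, which is the complication the comment alludes to.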

Credit assignment is taken care of by backpropagation, as usual in neural networks. I don't know why RaelwayScot brought it up, unless they meant something else.

Comment author: RaelwayScot 28 January 2016 11:20:24PM 3 points [-]

I meant that for AI we will possibly require high-level credit assignment, e.g. experiences of regret like "I should be more careful in these kinds of situations", or the realization that one particular strategy out of the entire sequence of moves worked out really nicely. Instead it penalizes/reinforces all moves of one game equally, which is potentially a much slower learning process. It turns out playing Go can be solved without much structure in the credit-assignment process, hence I said the problem is non-existent, i.e. there wasn't even a need to consider it and further our understanding of RL techniques.

Comment author: Vaniver 28 January 2016 11:19:29PM 0 points [-]

thus you can just play a game to completion without updating and then assign the final reward to all the positions.

Agreed, with the caveat that this is a stochastic object, and thus not a fully simple problem. (Even if I knew all possible branches of the game tree that originated in a particular state, I would need to know how likely any of those branches are to be realized in order to determine the current value of that state.)
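The point about stochasticity can be illustrated with a toy computation (my sketch, not from the thread): even knowing every branch's outcome, the state's value depends on how likely each branch is to be realized, e.g. under both players' policies.

```python
# Toy illustration: the same set of branch outcomes yields different
# state values depending on the realization probabilities.
def state_value(branches):
    """branches: list of (probability, outcome) pairs, outcomes in {+1, -1}."""
    assert abs(sum(p for p, _ in branches) - 1.0) < 1e-9
    return sum(p * outcome for p, outcome in branches)

# Identical outcomes, different policies, different value:
strong_play = state_value([(0.9, +1), (0.1, -1)])
weak_play = state_value([(0.2, +1), (0.8, -1)])
```

So the value network is implicitly estimating an expectation over play, not a property of the game tree alone.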

Comment author: V_V 29 January 2016 12:11:12AM *  0 points [-]

Even if I knew all possible branches of the game tree that originated in a particular state, I would need to know how likely any of those branches are to be realized in order to determine the current value of that state.

Well, the value of a state is defined assuming that the optimal policy is used for all the following actions. For tabular RL you can actually prove that the updates converge to the optimal value function/policy (under some conditions). If NNs are used you don't have any convergence guarantees, but in practice the people at DeepMind are able to make it work, and this particular scenario (perfect observability, determinism, and short episodes) is simpler than, for instance, that of the Atari DQN agent.
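A hedged sketch of the tabular case alluded to here (my illustration; the toy chain and constants are assumptions): the Q-learning update, which under standard conditions (every state-action pair visited infinitely often, suitably decaying step sizes) provably converges to the optimal action values.

```python
# Illustrative tabular Q-learning update:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_b Q(s', b) - Q(s, a))
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update to the table Q (a dict keyed by (s, a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions) if actions else 0.0
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

# Tiny deterministic chain: s0 -> s1 -> end, with reward 1 on the last step.
# Repeated sweeps drive Q toward the optimal values (1.0 and gamma * 1.0).
Q = {}
for _ in range(50):
    q_learning_update(Q, "s1", "a", 1.0, "end", actions=["a"])
    q_learning_update(Q, "s0", "a", 0.0, "s1", actions=["a"])
```

With a neural network replacing the table, the same target is used but the convergence proof no longer applies, which is V_V's contrast with the DQN setting.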