RaelwayScot comments on [Link] AlphaGo: Mastering the ancient game of Go with Machine Learning - Less Wrong Discussion
Comments (122)
Reward delay is not very significant in this task: the task is episodic and fully observable, and there is no time preference, so you can simply play a game to completion without updating and then assign the final reward to all the positions.
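To make the "assign the final reward to all the positions" idea concrete, here is a minimal sketch. The function name `label_positions` and the string placeholders for board positions are made up for illustration, and it assumes no discounting (no time preference) and a single player's perspective; it is not AlphaGo's actual pipeline.

```python
def label_positions(positions, final_reward):
    """Pair every position of one finished game with that game's final reward.

    Assumes an episodic task with no discounting, so each position gets
    the same target regardless of how early in the game it occurred.
    """
    return [(pos, final_reward) for pos in positions]


# Hypothetical game: three positions, and the game was won (+1).
game = ["pos0", "pos1", "pos2"]
labeled = label_positions(game, +1)
```

The labeled pairs can then be used as ordinary supervised training targets once the game is over.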
In more general reinforcement learning settings, where you want to update your policy during execution, you have to use some kind of temporal difference learning method, which is further complicated if the world states are not fully observable.
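For contrast, a temporal difference method updates its estimates after every step instead of waiting for the end of the episode. Below is a rough sketch of a tabular TD(0) value update; the function name `td0_update`, the state labels, and the values of `alpha` and `gamma` are assumptions chosen for illustration, and it assumes fully observable states.

```python
from collections import defaultdict


def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=1.0, terminal=False):
    """Apply one tabular TD(0) update to values[state] and return the new value.

    The target bootstraps from the current estimate of the next state,
    so learning can happen mid-episode, before the final outcome is known.
    """
    bootstrap = 0.0 if terminal else gamma * values[next_state]
    target = reward + bootstrap
    values[state] += alpha * (target - values[state])
    return values[state]


# Hypothetical two-step episode: s0 -> s1 -> end, with reward 1 at the end.
values = defaultdict(float)
td0_update(values, "s0", 0.0, "s1")
td0_update(values, "s1", 1.0, None, terminal=True)
```

After this single episode only the estimate for `s1` has moved toward the reward; the estimate for `s0` catches up in later episodes as it bootstraps from `s1`.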
Credit assignment is taken care of by backpropagation, as usual in neural networks. I don't know why RaelwayScot brought it up, unless they meant something else.
I meant that for AI we will possibly require high-level credit assignment, e.g. experiences of regret like "I should be more careful in these kinds of situations", or the realization that one particular strategy out of the entire sequence of moves worked out really nicely. Instead, AlphaGo's training penalizes or reinforces all moves of one game equally, which is potentially a much slower learning process. It turns out that playing Go can be solved without much structure in the credit assignment process; hence I said the problem is non-existent, i.e. there wasn't even a need to consider it and further our understanding of RL techniques.