[Link] AlphaGo: Mastering the ancient game of Go with Machine Learning

ESRogs

DeepMind's Go AI, called AlphaGo, has beaten the European champion with a score of 5-0. A match against the top-ranked human player, Lee Se-dol, is scheduled for March.

Games are a great testing ground for developing smarter, more flexible algorithms that have the ability to tackle problems in ways similar to humans. Creating programs that are able to play games better than the best humans has a long history

[...]

But one game has thwarted A.I. research thus far: the ancient game of Go.

http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html

Credit assignment and reward delay are nonexistent? What do you think happens when one diffs the board strength of two potential boards?

"Nonexistent problems" was meant as hyperbole: they weren't solved in interesting ways and are extremely simple in this setting because the states and rewards are noise-free. I am not sure what you mean by the second question. They just apply gradient descent over the entire move history of the current game so that expected reward is maximized.
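
One way to read "gradient descent over the entire move history" is a REINFORCE-style policy gradient, where every move in a finished game is credited with the final outcome. The sketch below is only an illustration under that assumption and is not AlphaGo's training code: the toy "game" (win only if every move is action 0), the state-independent softmax policy, and all function names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def play_episode(logits, rng, n_moves=3):
    """Hypothetical toy game: sample n_moves actions; +1 only if all are action 0."""
    probs = softmax(logits)
    moves = [rng.choices(range(len(logits)), weights=probs)[0] for _ in range(n_moves)]
    reward = 1.0 if all(a == 0 for a in moves) else -1.0
    return moves, reward

def reinforce_update(logits, moves, reward, lr=0.1):
    """One REINFORCE step: every move in the game shares the final reward."""
    probs = softmax(logits)  # gradient is evaluated at the pre-update policy
    new_logits = list(logits)
    for a in moves:
        for k in range(len(logits)):
            # d/d logit_k of log pi(a) for a softmax policy
            grad_log_prob = (1.0 if k == a else 0.0) - probs[k]
            new_logits[k] += lr * reward * grad_log_prob
    return new_logits

def train(n_episodes=2000, seed=0):
    """Play whole games, then update on the full move history of each game."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(n_episodes):
        moves, reward = play_episode(logits, rng)
        logits = reinforce_update(logits, moves, reward)
    return softmax(logits)
```

Because the outcome is noise-free and known at the end of each game, no separate mechanism is needed to decide which move "deserves" the reward; the gradient simply spreads it over the whole history.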

V_V

Reward delay is not very significant in this task. The task is episodic and fully observable, and there is no time preference, so you can just play a game to completion without updating and then assign the final reward to all the positions. In more general reinforcement learning settings, where you want to update your policy during execution, you have to use some kind of temporal difference learning method, which is further complicated if the world states are not fully observable. Credit assignment is taken care of by backpropagation, as usual in neural networks. I don't know why RaelwayScot brought it up, unless they meant something else.
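
To make the contrast concrete, here is a minimal sketch of the two approaches: labeling every position of a finished game with the final result, versus a tabular TD(0) update that can run mid-episode. The function names, the dictionary value table, and the step size are assumptions chosen for illustration, not anything taken from the AlphaGo paper.

```python
def monte_carlo_targets(positions, final_reward):
    """Episodic, no time preference: every position gets the final outcome as its target."""
    return [(pos, final_reward) for pos in positions]

def td0_update(values, state, next_state, reward, alpha=0.1, gamma=1.0, terminal=False):
    """One online TD(0) step; usable before the episode's outcome is known."""
    # Bootstrap from the current estimate of the next state (0 at the end of the episode).
    bootstrap = 0.0 if terminal else values.get(next_state, 0.0)
    td_error = reward + gamma * bootstrap - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error
```

The first function needs the whole game to finish before it produces any training signal, which is fine for Go. The second trades that wait for a dependence on its own (possibly wrong) value estimates.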
