
bogus comments on [Link] AlphaGo: Mastering the ancient game of Go with Machine Learning - Less Wrong Discussion

14 Post author: ESRogs 27 January 2016 09:04PM




Comment author: bogus 29 January 2016 01:04:31PM 1 point

Cite? They use the supervised network for policy selection (i.e., tree pruning), which is a critical part of the system.

Comment author: Gunnar_Zarncke 29 January 2016 02:29:13PM 0 points

I'm referring to figure 1a on page 4 and the explanation below it. I can't be sure, but self-play seems to contribute a large part of the training, and it can continue to improve the algorithm even if the expert database stays fixed.
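To illustrate the point about self-play: a policy trained purely on game outcomes can keep improving with no expert data at all. Here is a toy sketch of outcome-driven policy-gradient learning (REINFORCE-style) — the one-move "game", its win probabilities, and the learning rate are all invented for illustration and have nothing to do with the actual AlphaGo setup:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy one-move "game": action 0 wins 80% of the time, action 1 wins 20%.
# The policy starts uniform and learns only from outcomes (z = +1 / -1),
# with no expert data anywhere in the loop.
theta = [0.0, 0.0]
lr = 0.1
for episode in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    win_prob = 0.8 if a == 0 else 0.2
    z = 1.0 if random.random() < win_prob else -1.0  # game outcome
    # REINFORCE: grad of log pi(a) w.r.t. logits is onehot(a) - probs
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += lr * z * grad

final_probs = softmax(theta)  # probability mass shifts toward the winning move
```

Of course this says nothing about how *far* outcome-only training can go in Go specifically — just that the mechanism doesn't need the expert database once the loop is running.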

Comment author: V_V 29 January 2016 05:55:02PM 1 point

They spent three weeks training the supervised policy and one day training the reinforcement learning policy, starting from the supervised policy, plus an additional week extracting the value function from the reinforcement learning policy (pages 25-26).
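Putting those reported durations into rough numbers (treating training time as a crude proxy for how much each stage contributed, which is of course debatable):

```python
# Reported training durations (pages 25-26 of the paper), in days.
days = {
    "SL policy (human expert games)": 21,
    "RL policy (self-play)": 1,
    "value network (self-play outcomes)": 7,
}

total = sum(days.values())  # 29 days of training overall
sl_share = days["SL policy (human expert games)"] / total  # ~0.72
rl_share = 1.0 - sl_share                                  # ~0.28
```

By wall-clock time alone, roughly three quarters of the training went into the supervised stage.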

In the final system, the only part that depends on RL is the value function. According to figure 4, if the value function is taken out, the system still plays better than any other Go program, though worse than the human champion.
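Concretely, the search evaluates a leaf by mixing the value network with a fast rollout, V(s) = (1 - λ)·v(s) + λ·z, with λ = 0.5 if I'm reading the paper correctly. A quick sketch (the function name and the sample numbers are mine, not the paper's):

```python
def leaf_value(v_net, z_rollout, lam=0.5):
    """Mixed MCTS leaf evaluation: combine the value-network estimate
    v_net with the fast-rollout outcome z_rollout, weighted by lam."""
    return (1.0 - lam) * v_net + lam * z_rollout

mixed = leaf_value(v_net=0.3, z_rollout=1.0)                    # full system
rollouts_only = leaf_value(v_net=0.3, z_rollout=1.0, lam=1.0)   # value net removed
```

Setting λ = 1 discards the value network entirely, which is the rollout-only variant figure 4 shows still beating other Go programs.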

Therefore I would say the system depends heavily on supervised training on a human-generated dataset. RL was needed to reach the final performance, but it was not the most important ingredient.