MrMind comments on Open thread, Jan. 25 - Jan. 31, 2016 - Less Wrong Discussion

Post author: username2 25 January 2016 09:07PM


Comment author: bogus 29 January 2016 10:25:15PM

> AlphaGo uses two deep neural networks to prune the enormous search tree of a Go position, and it does so unsupervised.

lol no. The pruning ('policy') network is entirely the result of supervised learning from human games. The other network is used to evaluate game states.
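
To make the split concrete, here is a toy sketch of the two roles (illustrative shapes only — AlphaGo's actual networks are deep convolutional towers over 19x19 feature planes, per the Nature paper, not tiny MLPs like these):

```python
import torch
import torch.nn as nn

BOARD = 19 * 19  # number of points on a Go board

# Toy stand-ins for the two networks described above.
policy_net = nn.Sequential(        # the 'policy' network (SL-trained on human games)
    nn.Linear(BOARD, 256), nn.ReLU(),
    nn.Linear(256, BOARD),         # one logit per candidate move
)
value_net = nn.Sequential(         # the state-evaluation network
    nn.Linear(BOARD, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),  # scalar in [-1, 1]: expected winner
)

state = torch.randn(1, BOARD)                   # a (fake) board encoding
move_probs = policy_net(state).softmax(dim=-1)  # used to prune the search tree
win_estimate = value_net(state)                 # used to score positions
```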

Your other ideas are more interesting, but they are not related to AlphaGo specifically, just to deep neural networks in general.

Comment author: MrMind 01 February 2016 09:32:05AM

> lol no. The pruning ('policy') network is entirely the result of supervised learning from human games.

If I understood correctly, this is only the first stage in the training of the policy network. Then (quoting from Nature):

> The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL). The RL policy network p_ρ is identical in structure to the SL policy network, and its weights ρ are initialised to the same values, ρ = σ. We play games between the current policy network p_ρ and a randomly selected previous iteration of the policy network.
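
In code, that second stage is essentially a REINFORCE loop over self-play games. A toy sketch (building on the toy networks in the sketch upthread; `play_game` here is a made-up stub standing in for the actual self-play machinery):

```python
import copy
import random
import torch

# "rho initialised to sigma": start the RL policy from the SL policy's weights.
rl_policy = copy.deepcopy(policy_net)
opponent_pool = [copy.deepcopy(rl_policy)]   # pool of previous iterations
opt = torch.optim.SGD(rl_policy.parameters(), lr=1e-3)

def play_game(p1, p2):
    # Stub standing in for real self-play: fake positions, moves sampled
    # from p1's policy, and a coin-flip outcome. Illustrative only.
    states, actions = [], []
    with torch.no_grad():
        for _ in range(30):                  # a short fake "game"
            s = torch.randn(BOARD)
            a = p1(s.unsqueeze(0)).softmax(-1).multinomial(1).item()
            states.append(s)
            actions.append(a)
    z = random.choice([1.0, -1.0])           # eventual winner, from p1's side
    return states, actions, z

for it in range(1000):
    opponent = random.choice(opponent_pool)  # randomly selected previous iteration
    states, actions, z = play_game(rl_policy, opponent)
    logp = torch.log_softmax(rl_policy(torch.stack(states)), dim=-1)
    # REINFORCE: raise the log-probability of the moves we played if we
    # won the game (z = +1), lower it if we lost (z = -1).
    chosen = logp[torch.arange(len(actions)), actions]
    loss = -z * chosen.mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if it % 100 == 0:                        # snapshot into the opponent pool
        opponent_pool.append(copy.deepcopy(rl_policy))
```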

Comment author: bogus 01 February 2016 08:04:53PM

> The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL).

Except that they don't seem to use the resulting network in actual play; its only use is in deriving their state-evaluation (value) network.
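
In sketch terms, that derivation is just a regression from self-play positions to eventual game outcomes (reusing the toy networks and the `play_game` stub from the sketches upthread; the paper samples a single position per self-play game to keep the training data decorrelated):

```python
import random
import torch

value_opt = torch.optim.SGD(value_net.parameters(), lr=1e-3)
mse = torch.nn.MSELoss()

for _ in range(1000):
    # Self-play with the RL policy generates the training data...
    states, _, z = play_game(rl_policy, rl_policy)
    pos = random.choice(states)          # one position per game, per the paper
    # ...and the value network regresses position -> eventual outcome.
    pred = value_net(pos.unsqueeze(0))
    loss = mse(pred, torch.tensor([[z]]))
    value_opt.zero_grad(); loss.backward(); value_opt.step()
```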