lol no. The pruning ('policy') network is entirely the result of supervised learning from human games.
If I understood correctly, this is only the first stage in the training of the policy network. Then (quoting from Nature):
The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL). The RL policy network pρ is identical in structure to the SL policy network, and its weights ρ are initialised to the same values, ρ = σ. We play games between the current policy network pρ and a randomly selected previous iteration of the policy network.
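For anyone who wants to see the shape of that second stage, here is a rough toy sketch, not the paper's code. It assumes a one-move rock-paper-scissors game in place of Go, a three-logit softmax in place of the network, and a plain REINFORCE update. The names `train_rl_policy`, `outcome`, `snapshot_every`, and every hyperparameter value are made up for illustration. What it keeps from the quoted passage is the structure: ρ starts as a copy of σ, and each iteration plays the current policy against a randomly selected earlier iteration.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, rng):
    """Sample one action index from the softmax policy."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs)[0]


def outcome(a, b):
    """Toy stand-in for a game result: rock-paper-scissors.

    0 = rock, 1 = paper, 2 = scissors.
    Returns +1 if a beats b, -1 if a loses, 0 on a draw.
    """
    return [0, 1, -1][(a - b) % 3]


def train_rl_policy(sigma, iterations=200, games_per_iter=16,
                    lr=0.1, snapshot_every=20, seed=0):
    """REINFORCE against a pool of earlier iterations (illustrative only)."""
    rng = random.Random(seed)
    rho = list(sigma)      # RL weights start at the SL weights: rho = sigma
    pool = [list(sigma)]   # pool of previous iterations of the policy

    for it in range(1, iterations + 1):
        opponent = rng.choice(pool)   # randomly selected previous iteration
        probs = softmax(rho)
        grad = [0.0] * len(rho)

        for _ in range(games_per_iter):
            a = sample(rho, rng)
            b = sample(opponent, rng)
            z = outcome(a, b)         # reward from the current policy's side
            # For a softmax policy, d log p(a) / d logit_k = 1[k == a] - p_k
            for k in range(len(rho)):
                grad[k] += z * ((1.0 if k == a else 0.0) - probs[k])

        # Gradient ascent on expected outcome, averaged over the games
        rho = [w + lr * g / games_per_iter for w, g in zip(rho, grad)]

        if it % snapshot_every == 0:
            pool.append(list(rho))    # add the current weights to the pool

    return rho, pool


if __name__ == "__main__":
    sigma = [0.5, 0.0, -0.5]          # hypothetical "SL" starting weights
    rho, pool = train_rl_policy(sigma)
    print("final move probabilities:", softmax(rho))
    print("opponents in pool:", len(pool))
```

Sampling the opponent from a pool rather than always playing the latest weights is meant to keep the policy from overfitting to its current self.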
Except that they don't seem to use the resulting RL network in actual play; its only use is for deriving their state-evaluation ('value') network.
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.