Introduction
Talk of incorporating planning techniques such as Monte Carlo tree search (MCTS) into LLMs has been bubbling around the AI sphere recently, both in relation to Google's Gemini and OpenAI's Q*. Much of this discussion has been framed in terms of AlphaGo, so I decided to go back and read through the AlphaGo paper and some of its successors (AlphaGo Zero and AlphaZero). This post summarizes what these papers did, viewed through the lens of LLMs, along with some thoughts I had while reviewing them.
When I say LLMs in this post, I am referring to causal/decoder-only/GPT-style LLMs.
AlphaGo
Overview
AlphaGo trains two supervised learning (SL) policy networks, a reinforcement learning (RL) policy network, and an SL value network, and combines them with MCTS at play time.
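To make that division of labor concrete, here is a minimal sketch of the PUCT-style selection, expansion, and backup used in AlphaGo-style MCTS: the policy network supplies move priors that bias which branches get visited, and the leaf evaluation (from the value network and/or a rollout) is backed up along the visited path. The `policy_net` argument and the `Node` bookkeeping are simplified, hypothetical stand-ins, not the paper's implementation:

```python
import math

# Minimal sketch of AlphaGo-style MCTS selection/expansion/backup (PUCT).
# `policy_net(state)` is a hypothetical stand-in that yields (move, prior) pairs;
# the leaf value passed to backup() would come from the value network and/or a rollout.

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # move -> Node

    def q(self):
        # Mean action value Q(s, a); 0 for unvisited nodes.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.0):
    # PUCT: exploit Q while adding a prior-weighted exploration bonus,
    # so high-prior moves get searched first.
    total_visits = sum(child.visit_count for child in node.children.values())
    def score(child):
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def expand(node, state, policy_net):
    # The policy network's priors concentrate search on a handful of plausible moves.
    for move, prior in policy_net(state):
        node.children[move] = Node(prior=prior)

def backup(path, value):
    # Propagate the leaf evaluation up the visited path, flipping sign each ply
    # because the players alternate.
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value
```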
I agree with this. If you look at Figure 5 in the paper, 5d specifically, you can get a sense of what the policy network is doing. The policy network's prior on what ends up being the best move is 35% (~1/3), which is far higher than the 1/361 a uniform prior would assign. If you take that as typical, the policy network alone gives a roughly 120x linear speed-up in search. And that is assuming no exploration (i.e. the value network is perfect, so only a single ply needs to be searched). Once you include exploration, I think the policy network gives an exponential speed-up, since the narrowing of the branching factor compounds at every ply of the tree.
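As a quick back-of-the-envelope check of that arithmetic (the 35% prior is the one from Figure 5d; the effective branching factor and search depth below are made-up, purely illustrative numbers):

```python
# Rough sanity check of the speed-up claim above; only the 35% prior comes
# from the paper, the rest are illustrative assumptions.

board_moves = 361            # legal first moves on a 19x19 board
uniform_prior = 1 / board_moves
policy_prior = 0.35          # prior the SL policy assigns to the eventual best move

# Linear speed-up at a single node: how much more visit mass the search can
# steer toward the best move relative to a uniform prior.
per_node_speedup = policy_prior / uniform_prior
print(f"per-node speed-up: ~{per_node_speedup:.0f}x")   # ~126x, i.e. the ~120x above

# With exploration the saving compounds at every ply: if the policy effectively
# narrows each node from ~361 candidates to ~k, a depth-d search shrinks from
# roughly 361**d nodes to roughly k**d nodes.
effective_branching = 3      # hypothetical: policy concentrates mass on ~3 moves
depth = 10
print(f"naive tree size:  {board_moves ** depth:.2e}")
print(f"guided tree size: {effective_branching ** depth:.2e}")
```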
Edit: Looking through the paper a bit more, ...