Introduction
Talk of incorporating planning techniques such as Monte Carlo tree search (MCTS) into LLMs has been bubbling around the AI sphere recently, both in relation to Google's Gemini and OpenAI's Q*. Much of this discussion has been framed in terms of AlphaGo, so I decided to go back and read through the AlphaGo paper and some of its successors (AlphaGo Zero and AlphaZero). This post summarizes what these papers did, viewed through the lens of LLMs, along with some thoughts I had while reviewing them.
When I say LLMs in this post, I am referring to causal/decoder-only/GPT-style LLMs.
AlphaGo
Overview
AlphaGo trains two supervised learning (SL) policy networks, a reinforcement learning (RL) policy network, and an SL value network, and combines them with MCTS at play time.
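To make that division of labor concrete, here is a minimal sketch of the PUCT-style selection, expansion, and backup used in AlphaGo-style MCTS: the policy network supplies move priors that bias which branches get visited, and the leaf evaluation (from the value network and/or a rollout) is backed up along the visited path. The `policy_net` argument and the `Node` bookkeeping are simplified, hypothetical stand-ins, not the paper's implementation:

```python
import math

# Minimal sketch of AlphaGo-style MCTS selection/expansion/backup (PUCT).
# `policy_net(state)` is a hypothetical stand-in that yields (move, prior) pairs;
# the leaf value passed to backup() would come from the value network and/or a rollout.

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # move -> Node

    def q(self):
        # Mean action value Q(s, a); 0 for unvisited nodes.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.0):
    # PUCT: exploit Q while adding a prior-weighted exploration bonus,
    # so high-prior moves get searched first.
    total_visits = sum(child.visit_count for child in node.children.values())
    def score(child):
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def expand(node, state, policy_net):
    # The policy network's priors concentrate search on a handful of plausible moves.
    for move, prior in policy_net(state):
        node.children[move] = Node(prior=prior)

def backup(path, value):
    # Propagate the leaf evaluation up the visited path, flipping sign each ply
    # because the players alternate.
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value
```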
I agree with this. If you look at Figure 5 in the paper, 5d specifically, you can get a sense of what the policy network is doing. The policy network's prior on what ends up being the best move is 35% (~1/3), which is far higher than the 1/361 a uniform prior would assign. If you take that as typical, the policy network alone gives a roughly 120x linear speed-up in search. And that is assuming no exploration (i.e. the value network is perfect, so only a single ply needs to be searched). Once you include exploration, I think the policy network gives an exponential speed-up, since the narrowing of the branching factor compounds at every ply of the tree.
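As a quick back-of-the-envelope check of that arithmetic (the 35% prior is the one from Figure 5d; the effective branching factor and search depth below are made-up, purely illustrative numbers):

```python
# Rough sanity check of the speed-up claim above; only the 35% prior comes
# from the paper, the rest are illustrative assumptions.

board_moves = 361            # legal first moves on a 19x19 board
uniform_prior = 1 / board_moves
policy_prior = 0.35          # prior the SL policy assigns to the eventual best move

# Linear speed-up at a single node: how much more visit mass the search can
# steer toward the best move relative to a uniform prior.
per_node_speedup = policy_prior / uniform_prior
print(f"per-node speed-up: ~{per_node_speedup:.0f}x")   # ~126x, i.e. the ~120x above

# With exploration the saving compounds at every ply: if the policy effectively
# narrows each node from ~361 candidates to ~k, a depth-d search shrinks from
# roughly 361**d nodes to roughly k**d nodes.
effective_branching = 3      # hypothetical: policy concentrates mass on ~3 moves
depth = 10
print(f"naive tree size:  {board_moves ** depth:.2e}")
print(f"guided tree size: {effective_branching ** depth:.2e}")
```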
Edit: Looking through the paper a bit more, ...