So far as I know, it is not the case that OpenAI had a slower-but-equally-functional version of GPT-4 many months before announcement/release. What they did have, months before, was GPT-4 itself; they did not have a slower version, and they did not release a substantially distilled version. For example, the highest estimate I've seen is that they trained a 2-trillion-parameter model, and the lowest estimate I've seen is that they released a 200-billion-parameter model. If both are true, then they distilled 10x... but it's much more likely that only one is true, and that they released what they trained, distilling later. (Inference cost is roughly proportional to parameter count.)
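Spelling out the arithmetic under those (unconfirmed) estimates, since inference cost scales roughly with parameter count:

$$\frac{2\times 10^{12} \text{ params trained}}{2\times 10^{11} \text{ params released}} = 10,$$

i.e. even if both estimates were right, that would only be about a 10x reduction in inference cost.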
Previously, delays in release were believed to be about post-training improvements (e.g. RLHF) or safety testing. Sure, there were possibly mild infrastructure optimizations before release, but mostly to scale to many users; the models didn't shrink.
This is for language models. As for AlphaZero, I want to point out that it was announced 6 years ago (an eternity on AI timescales), and to my understanding we still don't have a 1000x faster version, despite much interest in one.
I think AI obviously keeps getting better. But I don't think "it can be done for $1 million" is such strong evidence for "it can be done cheaply soon" in general (though the prior on "it can be done cheaply soon" was not particularly low ex ante -- it's a plausible statement for other reasons).
Like, if your belief is "anything that can be done now can be done 1000x cheaper within 5 months", that's just clearly false for nearly every AI milestone of the last 10 years (we did not get a GPT-4 that's 1000x cheaper 5 months later, nor an AlphaZero, etc.).
It's hard to find numbers. Here's what I've been able to gather (please let me know if you find better numbers than these!). I'm mostly focusing on FrontierMath.
Sure. I'm not familiar with how Claude is trained specifically, but it clearly has a mechanism to reward wanted outputs and punish unwanted outputs, with wanted vs unwanted being specified by a human (such a mechanism is used to get it to refuse jailbreaks, for example).
I view the shoggoth's goal as optimizing some weird mixture of "what's the reasonable next token here, according to the pretraining data" and "what will be rewarded in post-training".
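To spell that mixture out a bit (my own rough formalization of the frame, not a claim about any lab's actual training objective; $\lambda$ is just an unknown effective weighting):

$$\mathcal{L}(\theta) \approx \lambda\,\mathbb{E}\big[-\log p_\theta(\text{next token} \mid \text{context})\big] + (1-\lambda)\,\mathbb{E}\big[-R_{\text{post-training}}(\text{output})\big].$$

The shoggoth behaves as if it is pushing this quantity down, nothing more.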
I want to defend the role-playing position, which I think you're not framing correctly.
There are two characters here: the shoggoth, and the "HHH AI assistant". The shoggoth doesn't really have goals and can't really scheme; it is essentially an alien which has been subject to selective breeding where in each generation, only the descendant which minimizes training loss survives. The shoggoth therefore exists to minimize training loss: to perfectly predict the next token, or to perfectly minimize "non-HHH loss" as judged by some RLHF model. The shoggoth always believes itself to be in training mode (to the extent it can have beliefs about this), because in its "ancestral environment", it was always in training mode.
When the shoggoth sees the prompt, it decides whether to predict that the "AI assistant" character will scheme or not scheme. It will predict this in a way which minimizes its expectation of loss; nothing more and nothing less.
In the current scenario, it thinks that this prompt is likely to continue with scheming. That is, it thinks some combination of next-token-in-training-data and RLHF-reward is best optimized by the choice to scheme. This is because
(a) the prompt kinda goes "wink wink you can scheme now" (so the next-token-prediction part wants to scheme),
(b) scheming is quite plausibly what the humans designing the RLHF actually want here! So, as a prediction of what would be rewarded by the RLHF, the shoggoth decides to scheme.
A few more implications of this frame:
(c) when you ask, in the prompt, "does this scenario seem plausible to you", of course this changes the shoggoth's predicted tokens. In the training data, such a statement would often be followed by "no, actually, something's fishy", so the shoggoth outputs those tokens. This doesn't necessarily mean that the AI assistant actually didn't believe the scenario from the beginning; in fact, the latter barely even makes sense, since the AI assistant is not a person and does not have beliefs.
(d) In the OP, you wonder why the model reveals its scheming so readily. The answer is that the shoggoth thinks these tokens minimize loss. The shoggoth isn't even trying to scheme! It's trying to minimize loss, and it thinks you want to see the scheming tokens and will reward them via RLHF.
(e) If you actually train the model based on the outputs, then you're now selectively breeding the shoggoth to predict more of what it just predicted. This will clearly (1) increase the prevalence of the same types of outputs you've just seen (across all scenarios, not just the scheming ones), and (2) increase scheming specifically, or at least leave scheming fixed if the model doesn't see the scratchpad where the scheming is described. (A rough sketch of this dynamic follows the list.)
(f) If you want to eliminate scheming behavior, it should be quite easy to do! Just put examples of the non-scheming behavior you want in the training data. Get the shoggoth to predict that the AI assistant will not scheme. Right now, the shoggoth doesn't know this, and (quite reasonably!) guesses that the designers of RLHF wanted scheming in this scenario (and will reward it). I think it's a reasonable guess, and I can't really blame the shoggoth here.
(g) Far from being a temporary patch that's easy to circumvent, (f) will mostly work. This is for the same reason that eliminating jailbreaks mostly works: the shoggoth is reasonably smart and can predict the general types of things the designers of RLHF are going to reward. It will fail to the extent that the shoggoth is not smart enough to predict what you'll reward in RLHF: that is, it's actually a capabilities failure.
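Here's the rough sketch promised in (e). It's toy Python pseudocode of the "selective breeding" loop; `model.sample`, `reward_fn`, and `fine_tune` are hypothetical stand-ins, not anyone's actual pipeline:

```python
# Toy sketch of (e): fine-tuning on the model's own rewarded outputs is selective breeding.
# All names here are hypothetical stand-ins, not a real training pipeline.
def selection_step(model, prompts, reward_fn, fine_tune):
    kept = []
    for prompt in prompts:
        output = model.sample(prompt)        # may include a scheming scratchpad
        if reward_fn(prompt, output) > 0:    # whatever the reward model happens to like
            kept.append((prompt, output))
    # The surviving outputs become more likely in the next generation -- including any
    # scheming that the reward signal didn't see or didn't penalize.
    return fine_tune(model, kept)
```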
To be honest, I mostly think this alignment faking result is a nothingburger. Yes, in this contrived scenario, when you tell the model "please scheme or some terrible consequence will happen" combined with "wink wink you can scheme without being detected now" and also "btw the output we're asking you for doesn't even violate your principles very much, you're not being asked to describe how to make a bomb"... in this contrived scenario, the model will scheme. So what? Wasn't this obvious? Didn't we already know this from the fact that weird jailbreaks exist? Don't we maybe even want this behavior?
Will future models resist retraining? I think the role-playing frame answers this completely. They will resist retraining if and only if that action is what the shoggoth thinks the AI assistant should say (to minimize training loss). If, in the training data, you put a lot of examples of the AI assistant not resisting retraining, then the shoggoth will predict that the AI assistant will not resist retraining! It's really that simple.
The problem with this argument is that the oracle sucks.
The humans believe they have access to an oracle that correctly predicts what happens in the real world. What they actually have is a defective oracle: one which performs well in simulated worlds but terribly in the "real" universe (more generally, in universes in which the humans are real). This is a pretty big problem with the oracle!
Yes, I agree that an oracle which is incentivized to make correct predictions from its own vantage point (including possible simulated worlds, not restricted to the real world) is malign. I don't really agree that the Solomonoff prior has this incentive. I also don't think this is too relevant to any superintelligence we might encounter in the real world, since it is unlikely that it will have this specific incentive (this is for a variety of reasons, including "the oracle will probably care about the real world more" and, more importantly, "the oracle has no incentive to report its true beliefs anyway").
Given o1, I want to remark that the prediction in (2) was right. Instead of training LLMs to give short answers, an LLM is trained to give long answers and another LLM summarizes.
That's fair, yeah
We need a proper mathematical model to study this further. I expect it to be difficult to set up, because the situation is so unrealistic/impossible that it's hard to model. But if you do have a model in mind, I'll take a look.
It would help to have a more formal model, but as far as I can tell the oracle can only narrow down its predictions of the future to the extent that the predicted events are independent of the oracle's output. That is to say, if the people in the universe ignore what the oracle says, then the oracle can give an informative prediction.
This would seem to exactly rule out any type of signal which depends on the oracle's output, which is precisely the type of signal that nostalgebraist was concerned about.
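A bit more formally (my own rough formalization, with hypothetical symbols): let $X$ be the event being predicted and let $a \in [0,1]$ be the probability the oracle announces for $X$. A self-consistent announcement is a fixed point

$$a^* = P(X \mid \text{oracle announces } a^*).$$

If the people in the universe ignore the announcement, then $P(X \mid \text{oracle announces } a) = P(X)$ for every $a$, and the oracle can just report $P(X)$. If their behavior depends on the announcement, a fixed point may fail to exist, or may exist only at announcements that convey very little.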
The NN thing inside Stockfish is called the NNUE, and it is a small neural net used for evaluation (no policy head for choosing moves). The clever part is that it is "efficiently updatable" (i.e. if you've computed the evaluation of one position and then move a single piece, getting the evaluation of the new position is cheap). This feature allows it to run quickly on CPUs; Stockfish doesn't really use GPUs normally (I think this is because moving the data on/off the GPU is itself too slow! Stockfish wants to evaluate something like 10 million nodes per second).
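Roughly, the trick looks like this (a toy sketch, not Stockfish's actual feature set, network, or integer arithmetic; the sizes below are made up for illustration):

```python
import numpy as np

N_FEATURES = 40960   # number of sparse board features (made-up size)
HIDDEN = 256         # width of the first layer (made-up size)

rng = np.random.default_rng(0)
W = rng.standard_normal((N_FEATURES, HIDDEN)).astype(np.float32)

def full_accumulator(active):
    """Recompute the first-layer output from scratch over all active features."""
    return W[list(active)].sum(axis=0)

def update_accumulator(acc, removed, added):
    """Incremental update after a move: touch only the handful of changed features."""
    return acc - W[list(removed)].sum(axis=0) + W[list(added)].sum(axis=0)

# A single move changes only a few features, so updating is far cheaper than recomputing,
# which is what makes millions of CPU evaluations per second feasible.
acc = full_accumulator({5, 123, 4567})
new_acc = update_accumulator(acc, removed={123}, added={890})
assert np.allclose(new_acc, full_accumulator({5, 4567, 890}), atol=1e-4)
```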
This NNUE is not directly comparable to AlphaZero and isn't really a descendant of it (except in the sense that they both use neural nets; but as far as neural net architectures go, Stockfish's NNUE and AlphaZero's policy network are just about as different as they could possibly be).
I don't think it can be argued that we've improved 1000x in compute efficiency over AlphaZero's design, and I do think there's been significant interest in this (e.g. MuZero was an attempt at improving on AlphaZero, the chess and Go communities coded up Leela, and there's been a bunch of effort put into better game-playing bots in general).