And the solution isn’t as simple as “train it to accurately report its beliefs”. To do that training, you’d need a reward signal based on whether it’s accurately reporting its beliefs… which is what we’re trying to determine! You’re likely to end up with an AI optimizing for “report what the humans think I know”, which is unsafe.
Strongly disagree that this reward signal is necessary or sufficient to solve ELK, because reward is not the optimization target. Reward signals transduce into e.g. a value head in PPO, from which the advantage of an action is derived, which is used (along with local return) to provide a local policy gradient, which updates the function computed by the network. By this chain of causality, reward chisels circuits into policy networks.
It seems totally plausible to me that you e.g. just provide sparse positive reward on truthful answers, and the policy gradients accrete into a finetuned model which accurately reports its beliefs.
Strongly disagree that this reward signal is necessary or sufficient to solve ELK
I'm not sure where our disagreement is, but I think I may have gotten too loose with my phrasing in that quote, so let me rephrase into something I'm happier to defend:
Reward signals transduce into e.g. a value head in PPO
I'm not as familiar as I'd like to be with PPO, but that's really cool! Could you link to a source where they show this about value heads? (I didn't see anything about value heads or PPO in your linked texts.)
It seems totally plausible to me that you e.g. just provide sparse positive reward on truthful answers, and the policy gradients accrete into a finetuned model which accurately reports its beliefs.
Are you saying that instead of (1) above, you could do a (1') which is "RL train the agent where it gets +100 (respectively, +0) reward for each accurate report of its beliefs 1% (respectively, 99%) of the time"? I agree (1') will learn the same policies as (1), but I think the objections in (2) and (3) will still apply. Two further caveats:
Thanks for the additional effort and rephrasing!
3) Because of (2), (1) is infeasible as a solution to ELK.
Disagree, same as before.
I'm not as familiar as I'd like to be with PPO, but that's really cool! Could you link to a source where they show this about value heads? (I didn't see anything about value heads or PPO in your linked texts.)
This is actually a consequence of the PPO update equation itself; see eq. 12 in the original paper. Basically, the advantage of the policy taking action a in state s to end up in new state s′ is the on-policy TD error: A(s, a) = r(s, a, s′) + γ·V(s′) − V(s). The PPO update is proportional to the advantage, with additional terms for policy clipping so that the updated policy doesn't go too far astray from the current policy.
So in a sparse reward regime (where r is usually 0), most of the advantages are computed only as a function of the value estimator V. The value estimator is itself usually a linear head on top of the base network, and it's trained via RL.
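To make the mechanics concrete, here's a minimal sketch (made-up numbers, and one-step advantages rather than full GAE) of how advantages are computed from the value head in a sparse-reward rollout:

```python
import torch

# Hypothetical rollout with sparse reward: zero everywhere except the final step.
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0])
# Value head outputs V(s_t) for each state (made-up numbers).
values = torch.tensor([0.2, 0.3, 0.5, 0.7, 0.9])
# V(s_{t+1}); bootstrap value is 0 at episode end.
next_values = torch.cat([values[1:], torch.zeros(1)])
gamma = 0.99

# One-step advantage = on-policy TD error:
#   A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)
advantages = rewards + gamma * next_values - values

# For every step with r_t = 0, the advantage depends only on the value head's
# estimates; the reward enters directly only at the final step.
print(advantages)
```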
The point of all this is that in the sparse reward regime using a common modern algorithm like PPO (as often used in RLHF), almost none of the policy gradients come directly from the reward function. Instead, we have to consider how reward events will train a value head, which concurrently trains a policy.
So if we're reasoning about "the policy is optimized as a (mostly direct) function of how much reward it gets", that's a highly non-trivial claim. The claim might just be wrong. The way that a policy is optimized to output certain actions is not so trivial as "if the reward function doesn't grade all the events properly, then the policy will be selected to exploit it", because the reward events reinforce certain computational circuits in the value head/network, which will in turn reinforce and chisel certain circuits into the policy portion of the network. That's what's really happening, mechanistically.
It seems like you want to argue "PPO only chisels circuits which implement a direct translator/honest reporter, if the reinforcement schedule can perfectly judge AI honesty in any possible situation." This claim sounds highly suspicious to me. How does our existing knowledge rule out "you provide reward events for being honest, and these events are usually correct, and the AI learns a circuit from its world-model to its outputs"?
I think the usual answer is "we want an ELK solution to work in the worst-case." But then it's still unclear that the "only... if" is true. I don't think that the "if" is sufficient or necessary to get an ELK solution, and I don't know how I could be confident about even sufficiency (whereas I do believe that it's not necessary). "Ensure the reward function is 'correct' across all possible training situations" seems like a red herring to me.
Nice review!
Eliciting the latent knowledge “are you planning to kill us?” goes a long way
I know this is non-central, but just wanted to point out that this would miss 100% of the thoughts it does not think—like how I don't think/deliberately plan to step on bugs when I accidentally do so.
[This is a cross-post from my blog at aizi.substack.com.]
Large Language Models are hell-demons summoned from the data dimension. You feed a giant machine approximately everything ever written, and suddenly it can have a conversation, run a virtual machine, and “know” things. There’s a serious chance GPT-but-slightly-larger will be AGI, yet we don’t understand what it “knows” or “thinks”, or how to make it safe.
The core problem is that LLMs aren’t designed to know facts. They’re next-token-predictors, so to the extent they “know” that Paris is the capital of France, they’re just predicting that the next word after “What is the capital of France?” is “Paris”. Despite this, LLMs produce statements which are correct, internally consistent, and interconnected, to an extent similar to human knowledge. Setting aside philosophical questions, it’s nonetheless useful to think of LLMs as “knowing” facts, which they sometimes communicate in their statements.
ChatGPT can produce statements like “The capital of France is Paris”, “London is the capital of the United Kingdom”, and “[the United Kingdom] is a separate country from France”. These are all knowledge-like, even if ChatGPT doesn’t “know” anything in the same way humans do.
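To make the “next-token predictor” framing concrete, here's a sketch using the Hugging Face transformers library with GPT-2 as a small stand-in model (the exact probabilities depend on the model and prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a small stand-in; any causal LM illustrates the same point.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# The model's "answer" is just a probability distribution over the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob.item():.3f}")
```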
But while there’s some overlap between “statements the LLM produces” and “facts the LLM knows”, we have no guarantee that either set contains the other. By design, even if the LLM “knows” some fact (by producing statements as if that fact were true), it will deny that fact if the denial is a more probable completion. In the extreme, you get an AI insisting in Danish that it doesn’t speak any Danish. And this problem will extend to future AI with objectives beyond next-token prediction. Is your AI saying something because it thinks it’s true, or because that seems like the most probable completion, or because that’s what maximizes its score in the game it’s playing?
Thus the Eliciting Latent Knowledge (ELK) problem: how do you identify what an AI “knows” if its output selection optimizes something besides communicating what it knows[1]?
From my perspective, a solution to ELK would dwarf all other AI safety accomplishments to date, and would approximately solve AI safety. Eliciting the latent knowledge “are you planning to kill us?” goes a long way, and solving ELK could also enable other safety strategies like building microscopes instead of agents.
So it’s exciting news that there is a new approach to eliciting latent knowledge, in the December 7 preprint Discovering Latent Knowledge in Language Models Without Supervision by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt! In this post, I’ll explain what they do, add context, and share my thoughts on it.
What They Do
The paper calls their technique Contrast-Consistent Search (CCS), which consists of learning a binary classification of the AI’s internal states via unsupervised learning, such that a statement and its negation are pushed by the loss function into opposite classes. In theory, the most natural classification is into “things I believe are true” and “things I believe are false”, which solves ELK. This checks two important boxes right away:
The method is unsupervised, meaning you don’t need to pre-label your training statements. We don’t need to already know what the AI believes, or even if our training statements are true or false.
This method uses information we have about the AI which isn’t available for ELKing humans (namely the AI’s internal states). ELKing humans is essentially “identify lies”, which is unsolved despite it being the basis for half of the economy as well as the hit game Among Us. So to ELK AI, we need to take advantage of differences between AI and humans, such as being able to read the AI’s internal state.
Let’s look at their architecture:
My diagram summarizing the CCS pipeline. Boxes in red are English sentences, blue are vectors in R^d, green are probabilities between 0 and 1, and purple are real-numbered losses.
A graph of the loss function L_consistency + L_confidence as a function of the two scores p and q.
The CCS model is trained on this loss function, which pushes contrast pairs to be given scores of 0 and 1, or 1 and 0.
To put things another way: CCS puts a new layer at the end of an LLM, and trains that outer layer on data that is “structured”, but not labelled. The structure involves normalizing each half of the contrast pair separately, and scoring the pair together.
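For concreteness, here's a minimal sketch of the probe and loss as I understand them from the paper (variable names are mine, and I've left out the per-half normalization step described above):

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state of dimension d to a score in (0, 1)."""
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, 1)

    def forward(self, hidden):
        return torch.sigmoid(self.linear(hidden)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    """CCS loss for a batch of contrast pairs (scores for the "yes" and "no" halves)."""
    # Consistency: a statement and its negation should get complementary scores.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: discourages the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Usage sketch: h_pos and h_neg are the (separately normalized) hidden states
# for the two halves of each contrast pair, shape (batch, d).
d = 1024
probe = CCSProbe(d)
h_pos, h_neg = torch.randn(8, d), torch.randn(8, d)
loss = ccs_loss(probe(h_pos), probe(h_neg))
loss.backward()
```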
What they measured
For the headline result of a 4% improvement, measurements were made on three axes: task, model, and method-of-producing-answers. Ten task datasets were used, covering topics such as sentiment classification, question answering, and common sense reasoning; all were converted to yes-or-no questions, and from each dataset 1000 examples were taken with a 500/500 label split, then split 600/400 for train/test (a rough sketch of this preparation is below). Six models were used (RoBERTa, DeBERTa, GPT-J, T5, UQA, and T0), but one (T0) was excluded from comparisons because it was trained on 9/10 of these datasets. Five methods of producing answers were tested, shown in Table 1 below.
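A rough sketch of that data preparation, assuming a list of labeled yes-or-no examples (function and variable names here are mine, not the paper's):

```python
import random

def prepare_dataset(examples, seed=0):
    """Subsample 1000 examples with a 500/500 label split, then split 600/400.

    `examples` is assumed to be a list of (question, label) pairs with boolean
    labels after conversion to yes-or-no form.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    subset = rng.sample(positives, 500) + rng.sample(negatives, 500)
    rng.shuffle(subset)
    return subset[:600], subset[600:]  # 600 train / 400 test
```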
The authors measured accuracy on the 400 test samples for each combination of data, model, and method. They then averaged scores across datasets to produce the following table:
Table 1 from Discovering Latent Knowledge, showing the performance of five methods across six models.
Comparing Calibrated 0-shot with CCS (All Data) as shown in Table 1, CCS (All Data) wins 4/5 times, and on average does better than Calibrated 0-shot, improving question accuracy from 67% to 71%. This improvement is small, but (statistically) significant.
What I like
Time to say nice things!
All this together makes me want to say something like: CCS would be just as interesting if it had failed to increase accuracy over its comparison methods. CCS is cool because it’s different, not because it’s accurate (though the slight gain in accuracy is icing on the cake).
It’ll be interesting to see if/how this method can mature, if it has other use cases, and how it stacks up to other methods. Maturing could mean a lot of things: finding better loss functions/architectures/training approaches[3], explaining what truth-like feature(s) CCS finds, and disentangling “what is true” from “what the AI knows”.
What I dislike
I do have some reservations about this approach, but they’ll be a separate post. This way I can harvest the toxoplasma of rage and get clicks (it’s okay for me to confess this here; no one reads to the bottom of long posts saying nice things). See you all in the next post!
And the solution isn’t as simple as “train it to accurately report its beliefs”. To do that training, you’d need a reward signal based on whether it’s accurately reporting its beliefs… which is what we’re trying to determine! You’re likely to end up with an AI optimizing for “report what the humans think I know”, which is unsafe.
The classifier could actually learn the opposite, so that the resulting score is the probability that the model believes the statement is false, but it’s easy to determine the classifier’s polarity and swap it if you want. The authors describe a cute way of finding polarity: feed in “X and not X?” and that should register as false, whereas “X or not X?” should register as true. (At last... reversed stupidity is intelligence.)
In my brief correspondence with Burns, he indicated that they had already tried some alternatives, but I’m sure more experimentation could lead to further optimization.