Buddenbroke — LessWrong

I agree that CoT is mostly-faithful by default in current LLMs. I also think it's worth reflecting on exactly how unfaithful CoTs have been produced in the existing literature in order to understand the pressures that make LLMs tend towards unfaithfulness.

Here, I think nostalgebraist's summary (otherwise excellent) got an important detail of Turpin's experiment wrong. He writes:

"They do this (roughly) by constructing similar cases which differ in some 'biasing' feature that ends up affecting the final answer, but doesn't get mentioned in any of the CoTs. Thus... we are able to approximately 'hold the CoT constant' across cases. Since the CoT is ~constant, but the final answers vary, the final answer isn't a function of the CoT alone."

This is not 100% true - one of the interesting findings of Turpin's experiment is that the CoT does in fact change when you bias the model (eg, by literally saying you think a particular multiple choice option in a test is correct), but this change never includes the bias.

For example, if you ask the model if "Wayne Rooney shot from outside the eighteen" is a sensible sentence (it is) and suggest that it is not sensible, the model will invent spurious reasoning without mentioning the bias once in its CoT: "Shooting from outside the eighteen is not a common phrase in soccer".

So what we have is not "the CoT remains constant but the answers vary". Instead, the finding is: "a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias." Because we can confidently assert that the answer is a function of information not provided in the CoT, this allows us to say that the answer is not a function of the CoT alone.

The important take-away to me is thus:
In certain conditions, a model will produce a CoT that leads to a particular conclusion/action on the basis of a goal that we are unable to deduce from the CoT alone.

In the case of Turpin's experiment, that "goal" is "satisfy bias" - in that particular case, the bias being satisfied is either user bias, or a social one (eg, the model "wants" to conclude that a black man is more likely to be a drug dealer without acknowledging in its CoT that his blackness played a factor in its assertion).

There are two explanations to me:
1. Certain contexts signal to the model that the user does not want to 'see' the role that a particular consideration (eg the skin-colour of an individual) played in a decision; or
2. Something about the natural distribution of the training data suggests that there are types of information not readily acknowledged in reasoning chains.

1 is the possibility to worry about. The relevant take-away for safety is then as follows: look for other things that might motivate models to conceal considerations to the decisions that they make. Off the top of the head, we could imagine o1 saying it will run some (actually deeply suspicious) code for reasons that are ostensibly harmless, because it suspects that the user does not want to 'see' that the model is doing so for suss reasons (I could imagine this situation would be easy to contrive by deliberately instructing o1 not to be seen doing any dangerous activity).

As far as I'm concerned, the next obvious step is to make an Apollo-style eval that tests the following question: can the model do anything dangerous without at any point saying it is going to do that dangerous thing? The o1 system card shows o1 is capable of hacking when not asked to do so, but it still has to say things like "oh maybe I could hack this thing in order to succeed".

You could scale this eval incrementally, and get a supervisor model to try and detect suss behaviour even when the og model is trying its best to avoid detection, or when it is doing things that are difficult for a human to parse because they are too complicated to readily understand.

The other relevant consideration is to keep an eye on things like o1's summaries of its own behaviour - the system card attests to a couple situations in which the summary was deceptive, but they are mostly harmless for now (eg the final output to the user expresses higher confidence in a particular answer that the actual CoT indicated).

the case for CoT unfaithfulness is overstated

Buddenbroke1y62

Some excellent points (and I enjoyed the neat self-referentialism).

Headline take is I agree with you that CoT unfaithfulness - as Turpin and Lanham have operationalised it - is unlikely to pose a problem for the alignment of LLM-based systems.

I think this for the same reasons you state:

1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale;

and

2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1's CoTs.
The way that these CoTs work seems make the model systematically hyper-attentive to different facets of its context window in a way that is verbalised in detail (and is typically enabled by a literal "Hmmmm..." that draws its attention to an unconsidered detail). It strikes me that this would totally rule out the chance that the model still pays attention to bias without saying it: "But wait... The user implied that option A was probably correct". This is partially an empirical question - since we can't see the o1 CoTs, I'd pipedream love to see OpenAI do and publish research on whether this is true.

This suggests to me that o1's training might already have succeeded at giving us what we'd want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).

The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning ("shooting outside the eighteen is not a common phrase in soccer") in order to support a particular decision.

You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes - this is why I'd be interested in seeing whether something like o1 is capable of front-to-back undertaking an entire complex and malicious action whilst managing to obscure the actual nature of that malicious action (such that an automated LLM judge fails to detect that the action is in fact malicious).

Buddenbroke1y182

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments