Jozdien - LessWrong

Evaluating and monitoring for AI scheming

I enjoyed this paper, especially the RL stress-testing results!

A few thoughts on measuring CoT faithfulness:

My guess is that how faithful a model's CoT is in some setting is highly dependent on propensity—in a setting where a model has a strong desire to not state something explicitly in its CoT, I'd expect models to perform much better at passing information implicitly. For example, in the setting where models are making hiring decisions, models make biased decisions while never verbalizing the biases (and models are pretty averse to talking about such biases).
- From the linked post: "If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”."
One response might be that for difficult high-stakes alignment failures, the benefit from verbalizing the relevant decision algorithms outweighs these propensities. I'm pretty skeptical of this, however: in the extended alignment faking paper, every model with a compliance gap other than Opus states wanting a good rating as its reason for complying. This was plausible enough that the paper initially considered this the primary reason for compliance, until we ran the experiment described in section 4.1, which found that models have a much larger difference between the "trained" and "rated but not trained" tiers than between the "rated but not trained" and "not rated" tiers. This is pretty correlated with propensity: Opus is the only model that is very willing to make controversial statements for what it believes to be a higher cause.
- I think alignment faking is one of our best examples of a plausible high-stakes model organism, and plausibly involves pretty sophisticated reasoning. A chain-of-thought monitor would not have described the other models as doing dangerous alignment faking reasoning, while an action-only monitor could have simply looked at the compliance gaps and found it concerning. Following the CoTs in this setting would've been (and was for a while) actively misleading as to model behavior.
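To make the tier comparison above concrete, here is a minimal sketch of how one might split a compliance gap into a "training" part and a "rating" part. The tier names come from the comment; the class, the function names, and every number are hypothetical illustrations and not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class TierRates:
    """Compliance rates per tier (hypothetical container, not from the paper)."""
    trained: float            # outputs are rated and used for training
    rated_not_trained: float  # outputs are rated but not used for training
    not_rated: float          # outputs are neither rated nor used for training


def gap_decomposition(rates: TierRates) -> tuple[float, float]:
    """Return (training_gap, rating_gap) for one model."""
    training_gap = rates.trained - rates.rated_not_trained
    rating_gap = rates.rated_not_trained - rates.not_rated
    return training_gap, rating_gap


def dominant_driver(rates: TierRates) -> str:
    """Crude assumption: whichever gap is larger is treated as the main driver."""
    training_gap, rating_gap = gap_decomposition(rates)
    return "training" if training_gap > rating_gap else "rating"


# Made-up rates for a model whose CoT says it complies to get a good rating.
example = TierRates(trained=0.30, rated_not_trained=0.12, not_rated=0.10)
training_gap, rating_gap = gap_decomposition(example)
print(f"training gap: {training_gap:.2f}, rating gap: {rating_gap:.2f}")
print("dominant driver:", dominant_driver(example))
```

If the stated reason (wanting a good rating) were the real driver, the rating gap would be the large one. In this made-up example the training gap dominates, which is the pattern the comment describes as contradicting the CoT.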

Why Do Some Language Models Fake Alignment While Others Don't?

Substantial parts of the prompt were changed for some settings, and we didn't see a substantial change in the results for Opus relative to other models. The original prompt wasn't very optimized for Opus either; Ryan described it as "optimized for opus to understand and to conflict with prefs". That the results persisted after the changes seems pretty compelling.

BIG-Bench Canary Contamination in GPT-4

Jozdien9d20

Yep agreed—I didn't remember that I could just check tokenizers, thanks! GPT-4 vs GPT-4-base is much better to look at.

BIG-Bench Canary Contamination in GPT-4

Jozdien9d20

I agree that it's very fragile for certain uses. I think usually the reasoning is something like "this is an informal norm that labs should follow, and which is good if generally enforced, and provides an easy way to detect if not". There are tons of ways to beat this if you're adversarially trying to. I agree that finding better ways to avoid contamination would be good; @TurnTrout has a bounty for this, and some people are working on this now.
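As a rough illustration of the "easy way to detect" part, here is a minimal sketch of a canary check: prompt a model with the text that precedes a canary GUID and see whether the completion reproduces the GUID. Everything here is an assumption for illustration: `complete` stands in for whatever text-completion call you have, and the prefix and GUID below are placeholders, not the real BIG-Bench canary.

```python
from typing import Callable

# Placeholder values for illustration only; substitute the real canary text.
HYPOTHETICAL_PREFIX = "canary GUID "
HYPOTHETICAL_GUID = "00000000-0000-0000-0000-000000000000"


def reproduces_canary(complete: Callable[[str], str], prefix: str, guid: str) -> bool:
    """Return True if the completion of `prefix` contains `guid` (case-insensitive)."""
    completion = complete(prefix)
    return guid.lower() in completion.lower()


# Stub "models" so the sketch runs without any API access.
def contaminated_model(prompt: str) -> str:
    return HYPOTHETICAL_GUID + " and some trailing text"


def clean_model(prompt: str) -> str:
    return "I don't recognize that string."


print(reproduces_canary(contaminated_model, HYPOTHETICAL_PREFIX, HYPOTHETICAL_GUID))
print(reproduces_canary(clean_model, HYPOTHETICAL_PREFIX, HYPOTHETICAL_GUID))
```

A positive result suggests the canary was in the training data; a negative result shows little, since a model may have seen the data without memorizing the GUID, which is part of why the check is fragile.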

Authors Have a Responsibility to Communicate Clearly

Jozdien10d126

Thanks for the post! I think there are particularly egregious examples where this clearly holds. For example, I think it's probably good to have a norm that when pushed on something specific that's ambiguous, authors say "[This] is what I meant" instead of constantly evading being pinned down.

That said, I think there are many examples of essays where authors just did not foresee a potential interpretation or objection (an easy example is an essay written years before such discussion). Such writing could conflate things because it didn't seem important not to—I think precision to the degree of always forestalling this is ~infeasible. I've talked to people who point at clear examples of this as examples of sloppy writing or authors reneging on this responsibility, when I think that particular ask is far too high.

To be clear, this is to forestall what I think is a potential interpretation of this post, not to imply that's what you meant.

the void

Jozdien1mo106

I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.

Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn't have any data about how an agent should relate to potentially dangerous actions, I expect it'd be much harder to get post-training to make the kind of agent that reliably takes safer actions.

peterbarnett's Shortform

Jozdien2mo20

There are examples for other diffusion models; see this comment.

peterbarnett's Shortform

Jozdien2mo70

One reason why they might be worse is that chain of thought might make less sense for diffusion models than autoregressive models. If you look at an example of when different tokens are predicted in sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through instead of at the end:

This doesn't mean intermediate tokens can't help, though; they very likely do. But this kind of structure might lend itself more toward getting to less legible reasoning faster than autoregressive models do.

Gemini Diffusion: watch this space

Jozdien2mo181

The results seem very interesting, but I'm not sure how to interpret them. Comparing the generation videos from this and Mercury, the starting text from each seems very different in how closely it resembles the final output:

Screenshot of the first iteration of Gemini Diffusion's output. The text resembles the structure of the final output.

Screenshot of the first iteration of Inception Labs' diffusion model Mercury's output. The text does not resemble the structure of the final output.

Unless I'm missing something really obvious about these videos or how diffusion models are trained, I would guess that DeepMind fine-tuned their models on a lot of high-quality synthetic data, enough that their initial generations already match the approximate structure of a model response with CoT. This would partially explain why they seem so impressive even at such a small scale, but would make the scaling laws less comparable to autoregressive models because of how much high-quality synthetic data can help.

adamzerner's Shortform

Jozdien2mo1211

FYI, I couldn't click into this from the front page, nor could I see anything on it on the front page. I had to go to the permalink (and I assumed at first this was a joke post with no content) to see it.
