Look, I do agree that "coherence" is a questionable name for the measure they've come up with, so I'm going to keep it in quotation marks.
Ok, now let's consider a model with variance of 1e-3 and bias of 1e-6. Huge "incoherence"! Am I supposed to be reassured that this model will therefore not coherently pursue goals contrary to my interests? Whence this conclusion?
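To make the arithmetic concrete, here's a toy sketch. It uses a variance-share proxy for "incoherence" as an illustrative stand-in; it is not the paper's exact definition:

```python
# Toy numbers from the hypothetical above. The "incoherence" proxy used here
# (variance's share of mean squared error) is an assumption for illustration,
# not the paper's definition.
bias = 1e-6
variance = 1e-3

mse = bias**2 + variance      # standard bias-variance decomposition of MSE
incoherence = variance / mse  # assumed proxy: share of error that is scatter, not systematic

print(f"MSE ≈ {mse:.6g}")                    # ≈ 0.001: total error is tiny
print(f"incoherence ≈ {incoherence:.12f}")   # ≈ 1.0: scored as maximally "incoherent"
```

Any variance-dominated error profile gets scored this way, however small the total error is.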
Well, let's think about it. A key proposition in Yudkowskian misalignment theory is that capabilities generalise further than alignment. That is, as models get better, at some point a "capabilities engine" crystallises which is very good at achieving a very wide variety of things; at the same time, the "thing-it-ought-to-be-achieving" is not strongly constrained by the training process. What would we expect failures of such a system to look like - high bias or high variance?
Naively, we can imagine a model with a good capabilities engine and the wrong objective (which could be a complex mix of stuff or whatever); unless it is in a situation where randomization is at least as good as just doing the optimal thing, we expect it not to randomize, because its capabilities engine knows what the optimal thing is. Its failures will therefore generally be consistent, and it will have high "coherence".
Now we could consider an "incoherent" version of this model: it randomly samples an objective, then pursues that objective. But this setup seems unstable: for it to have low "coherence", pinning down its objective must depend on a lot of information. But then, if there's substantial loss of information about what its state and actions were yesterday, it's liable to sample a different goal today. The end result is a system that flails incompetently despite being in principle capable of not doing so. So there seems to be some tension between incoherence and the premise of a crystallised capabilities engine.
Furthermore, there has been some empirical work on goal misgeneralization. You yourself made a YouTube video about an agent that learned to travel to the right instead of pursuing a coin in a 2D platforming game. This too is high "coherence" behaviour!
What if capabilities don't generalise further than alignment? This is a world where, though advanced AI is capable of a great many things, in novel situations it's still more prone to error than to competently pursuing the wrong thing (even if it's still much less prone to error than a human in the same situation). When errors do occur, unlike in the capabilities > alignment regime, there's no reason to expect consistency - they could be genuinely random, or highly sensitive to unimportant contextual features. So prima facie we'd expect lower "coherence".
So why should you think a very powerful model with high variance and low bias is not going to be misaligned in the Yudkowskian sense? Because that combination of properties is evidence against "capabilities > alignment". Is it good evidence? I don't know, but the direction is fairly clear.
"capabilities > alignment" is a very big if true proposition, but it's an informal notion without much development theoretically or empirically, so I'm happy whenever I see someone having a crack at the question.
(Similarly, an extremely dumb, broken model which always outputs the same answer regardless of input is extremely "coherent". A rock is also extremely "coherent", by this definition.)
The paper is trying to project what happens to "coherence" at high capability, so it isn't a particularly strong criticism that a certain class of minimally capable objects has high "coherence": that isn't the domain of interest. It's plausibly even correct that, conditioning on minimal capability, rocks are high coherence and wind is low coherence for any reasonable definition of coherence applicable to such objects (plausibly, mind you; I cannot say I have a deep understanding of all reasonable definitions of coherence).
This is an extremely selective reading of the results, where in almost every experiment, model coherence increased with size. There are three significant exceptions.
This is false: in Figures 1 and 2, model coherence has an unclear relationship with size. On some tasks Sonnet 4 is more coherent than o3-mini and o4-mini; on others it is less coherent. On one task Opus 4 is less coherent than Sonnet 4. The Qwen models are also non-monotonic in Fig. 3b. It's also weird to call the endpoint of an obvious monotonic trend an "exception".
But bias stemming from a lack of ability is not the same as bias stemming from a lack of propensity. The smaller models here are clearly not misaligned in the propensity sense, which is the conceptual link the paper tries to establish in the description of Figure 1 to motivate its definition of "incoherence".
As you can see in Fig. 6c, the key result is that the bias drops faster than the variance. I want to be measured in my interpretation here: I'm not sure whether this is a great test of the question "do models learn the right targets first, or performant general-purpose optimizers?", but in broad terms it is evidence that they learn the right targets first, and the outstanding question is how strong that evidence is. Your criticism doesn't engage with this at all.
I've sometimes joked that the doomsday-style arguments are actually arguments about the death of interest in anthropics.
While I think this is a broadly reasonable response, I'm curious what you think is able to provide better public justification than longtermism. These results seem to apply fairly broadly to any realistic EV-based justification for action given that partial observability is very much the rule.
Well, I meant it as an empirical hypothesis and thought it might have formal implications (specifically, placing the problem in a smaller, more tractable class).
Just an incomplete comment on "The assumptions that make reward-seekers plausible also make fitness-seekers plausible": I think a central question is whether X-seeking gives you a compressed policy vs. an "optimal kludge". That is: if it's just as hard to learn the optimal policy if I'm an X-seeker as it is to learn the optimal kludge if I'm not, then it seems like I'm unlikely to learn X-seeking (or X-seeking is at best no more likely than a whole host of other possible behavioural spandrels, which implies exactly the same thing).
I think the argument that X-seeking incentivises optimal behaviour is some reason to think it might be compressive, but not obviously a very strong one: if all X-seeking gets you is "I should do well on evals", then that's a very small piece of policy it's compressing, not obviously even enough to pay its own cost. That is, the extra bit of policy "I should seek X" seems like it could easily be longer, or have lower prior probability, than "I should do well on evals". If "I should seek X" helped further in actually doing well on evals then I think there's a stronger argument to be made, but... I just need to think about this more; it's not immediately apparent what that would actually look like.
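For what it's worth, here's a minimal sketch of the comparison I have in mind, under an assumed 2^(-description length) simplicity prior; the bit counts are made-up placeholders, not estimates of anything:

```python
# Description bits shared by both policies (the capabilities engine) cancel
# in the ratio, so only the "extra" bits matter.
def prior_odds(extra_bits_a: float, extra_bits_b: float) -> float:
    """Odds of policy A over policy B under an assumed 2^(-bits) simplicity prior."""
    return 2.0 ** (extra_bits_b - extra_bits_a)

kludge_extra = 50      # hypothetical: directly encode "I should do well on evals"
x_seeker_extra = 80    # hypothetical: encode "I should seek X" plus the derivation to eval-behaviour

print(prior_odds(kludge_extra, x_seeker_extra))   # 2**30 ≈ 1e9 in favour of the kludge
# X-seeking only wins if its extra description is shorter, i.e. if "seek X"
# genuinely compresses the behaviour it is supposed to explain.
```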
Separate comment: the title doesn't seem to connect well to the content, and it'd be nice if you were clearer about whether your theorems are partly original or simply lifts from the relevant texts that are justified by your modelling choices (I think the latter, given the absence of proofs, but "my first theorem" sorta confuses this).
A common longtermist model is one where there's a transient period of instability (perhaps "around about now") that settles into a stable state thereafter. This seems like it would be no harder than a finite-horizon problem terminating when stability is achieved. I haven't looked into the results you quote or exactly what role the infinite horizon plays, but intuitively it seems right that eternal (or even very long-lived) instability along any dimension of interest would make policy choice intractable, while stability in the near future can make it fairly straightforward. Maybe there's an issue where the highest-value outcomes occur in the unstable regimes, which makes it hard to "bet on stability", but I'd like to see that in maths plus plausible examples.
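For concreteness, a minimal sketch of that intuition (the states, rewards, transition probabilities and the stable state's value are all invented for illustration, not taken from the results you cite): once there's an absorbing "stable" state with a known value, policy choice over the transient phase is ordinary dynamic programming rather than anything genuinely infinite-horizon.

```python
# Toy "transient instability, then stability" MDP. All numbers are made up.
gamma = 0.99
stable_value = 100.0  # assumed long-run value once things settle

# transitions[s][a] = [(next_state, prob), ...]; rewards[s][a] = immediate reward
transitions = {
    "unstable_a": {"cautious": [("unstable_a", 0.7), ("stable", 0.3)],
                   "bold":     [("unstable_b", 0.5), ("stable", 0.5)]},
    "unstable_b": {"cautious": [("unstable_a", 0.4), ("stable", 0.6)],
                   "bold":     [("unstable_b", 0.8), ("stable", 0.2)]},
}
rewards = {
    "unstable_a": {"cautious": 0.0, "bold": 2.0},
    "unstable_b": {"cautious": 1.0, "bold": 3.0},
}

V = {"unstable_a": 0.0, "unstable_b": 0.0, "stable": stable_value}

def q(s, a):
    """One-step lookahead value of taking action a in state s."""
    return rewards[s][a] + gamma * sum(p * V[s2] for s2, p in transitions[s][a])

# Value iteration over the transient states only: "stable" is absorbing with a
# fixed value, so this converges quickly.
for _ in range(500):
    for s in ("unstable_a", "unstable_b"):
        V[s] = max(q(s, a) for a in ("cautious", "bold"))

policy = {s: max(("cautious", "bold"), key=lambda a: q(s, a))
          for s in ("unstable_a", "unstable_b")}
print(V, policy)
```

Of course this leans entirely on being able to assign a value to the settled state, which is exactly where the "highest-value outcomes occur in the unstable regimes" worry would bite.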
I was responding to Gurkenglas’ comment as I understood it, I agree your paper is not about this.
So I think if you buy that a randomly initialized 1T transformer does in fact contain "Aligned ASI" and "deceptively aligned ASI" in its "prior" but we don't have the data to "find" them yet, then you're probably right that Jan 2026-era training data doesn't change their prior ratio much (or certainly doesn't change it predictably). But this doesn't really matter: what matters is the systems we actually realise, and the contributions they make to the next generation of AI development, and different data can change the likelihoods significantly here.
I did, apologies. I also recently discovered that Max H != Max Harms; it's quite confusing round here.
I got my figure numbers mixed up, but I think we're roughly on the same page here. NB the Twitter thread states: "Finding 2: There is an inconsistent relationship between model intelligence and incoherence", which looks spot on to me.
I don't see much argument in your post, nor here. There are reasons to think that deceptive schemers will have low variance and there's an absence of reasons to think mistake-makers will. You might think those reasons are weak, but I'd be much happier to see you demonstrate that you understand the reasons and explain why you think they're weak than simply assert your doubt and condemn on the basis of that assertion. I think discussions that get into reasons are sometimes clarifying.
That's not the correct update to make in the face of evidence that alignment scales better than capabilities; the correct update is that misaligned superintelligence is less likely, so I'd say you should either argue against the relevance or make that update.
Look, I dunno what to say here. I do think the well-calibrated narrative goes something like "this is extremely weak evidence that much more capable AI will be more prone to confusion than scheming, but we're excited that we've found a way to study it at all", but lots of scientific communication overstates its significance and I'm habituated to making allowances for that. I'd also love it if the paper tried a lot harder to establish why they thought this was relevant to confusion vs. scheming for powerful AI, but for whatever reason arguments like this seem to be culturally inappropriate in ML papers, which I also make allowances for. It doesn't strike me as particularly unreasonable given those allowances.