nostalgebraist

Same person as nostalgebraist2point0, but now I have my account back.

I have signed no contracts or agreements whose existence I cannot mention.


Looking back on this comment, I'm pleased to note how well the strengths of reasoning models line up with the complaints I made about "non-reasoning" HHH assistants.

Reasoning models provide 3 of the 4 things I said "I would pay a premium for" in the comment: everything except for quantified uncertainty[1].

I suspect that capabilities are still significantly bottlenecked by limitations of the HHH assistant paradigm, even now that we have "reasoning," and that we will see more qualitative changes analogous to the introduction of "reasoning" in the coming months/years.

An obvious area for improvement is giving assistant models a more nuanced sense of when to check in with the user because they're confused or uncertain.  This will be really important for making autonomous computer-using agents that are actually useful, since they need to walk a fine line between "just do your best based on the initial instruction" (which predictably causes Sorcerer's Apprentice situations[2]) and "constantly nag the user for approval and clarification" (which defeats the purpose of autonomy).

  1. ^

    And come to think of it, I'm not actually sure about that one. Presumably if you just ask o1 / R1 for a probability estimate, they'd exhibit better calibration than their "non-reasoning" ancestors, though I haven't checked how large the improvement is.

  2. ^

    Note that "Sorcerer's Apprentice situations" are not just "alignment failures," they're also capability failures: people aren't going to want to use these things if they expect that they will likely get a result that is not-what-they-really-meant in some unpredictable, possibly inconvenient/expensive/etc. manner. Thus, no matter how cynical you are about frontier labs' level of alignment diligence, you should still expect them to work on mitigating the "overly zealous unchecked pursuit of initially specified goal" failure modes of their autonomous agent products, since these failure modes make their products less useful and make people less willing to pay for them.

    (This is my main objection to the "people will give the AI goals" line of argument, BTW. The exact same properties that make this kind of goal-pursuit dangerous also make it ineffectual for getting things done. If this is what happens when you "give the AI goals," then no, you generally won't want to give the AI goals, at least not after a few rounds of noticing what happens to others when they try to do it. And these issues will be hashed out very soon, while the value of "what happens to others" is not an existential catastrophe, just wasted money or inappropriately deleted files or other such things.)

One possible answer is that we are in what one might call an "unhobbling overhang."

Aschenbrenner uses the term "unhobbling" for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.

His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we're also getting better at unhobbling over time, which leads to even more growth.

That is, Aschenbrenner treats pretraining improvements and unhobbling as economic substitutes: you can improve "practically accessible capabilities" to the same extent by doing more of either one even in the absence of the other, and if you do both at once that's even better.

However, one could also view the pair more like economic complements. Under this view, when you pretrain a model at "the next tier up," you also need to do novel unhobbling research to "bring out" the new capabilities unlocked at that tier. If you only scale up, while re-using the unhobbling tech of yesteryear, most of the new capabilities will be hidden/inaccessible, and this will look to you like diminishing downstream returns to pretraining investment, even though the model is really getting smarter under the hood.

This could be true if, for instance, the new capabilities are fundamentally different in some way that older unhobbling techniques were not planned to reckon with. Which seems plausible IMO: if all you have is GPT-2 (much less GPT-1, or char-rnn, or...), you're not going to invest a lot of effort into letting the model "use a computer" or combine modalities or do long-form reasoning or even be an HHH chatbot, because the model is kind of obviously too dumb to do these things usefully no matter how much help you give it.

(Relatedly, one could argue that fundamentally better capabilities tend to go hand in hand with tasks that operate on a longer horizon and involve richer interaction with the real world, and that this almost inevitably causes the appearance of "diminishing returns" in the interval between creating a model smart enough to perform some newly long/rich task and the point where the model has actually been tuned and scaffolded to do the task.  If your new model is finally smart enough to "use a computer" via a screenshot/mouseclick interface, it's probably also great at short/narrow tasks like NLI or whatever, but the benchmarks for those tasks were already maxed out by the last generation so you're not going to see a measurable jump in anything until you build out "computer use" as a new feature.)

This puts a different spin on the two concurrent observations that (a) "frontier companies report 'diminishing returns' from pretraining" and (b) "frontier labs are investing in stuff like o1 and computer use."

Under the "unhobbling is a substitute" view, (b) likely reflects an attempt to find something new to "patch the hole" introduced by (a).

But under the "unhobbling is a complement" view, (a) is instead simply a reflection of the fact that (b) is currently a work in progress: unhobbling is the limiting bottleneck right now, not pretraining compute, and the frontier labs are intensively working on removing this bottleneck so that their latest pretrained models can really shine.

(On an anecdotal/vibes level, this also agrees with my own experience when interacting with frontier LLMs.  When I can't get something done, these days I usually feel like the limiting factor is not the model's "intelligence" – at least not only that, and not that in an obvious or uncomplicated way – but rather that I am running up against the limitations of the HHH assistant paradigm; the model feels intuitively smarter in principle than "the character it's playing" is allowed to be in practice. See my comments here and here.)

AFAIK the distinction is that:

  • When you condition on a particular outcome for X, it affects your probabilities for every other variable that's causally related to X, in either direction.
    • You gain information about variables that are causally downstream from X (its "effects"). Like, if you imagine setting X = x and then "playing the tape forward," you'll see the sorts of events that tend to follow from X = x and not those that tend to follow from some other outcome X = x′.
    • And, you gain information about variables that are causally upstream from X (its "causes").  If you know that X = x, then the causes of X must have "added up to" that outcome for X. You can rule out any configuration of the causes that doesn't "add up to" causing X = x, and that affects your probability distributions for all of these causative variables.
  • When you use the do-operator to set X to a particular outcome, it only affects your probabilities for the "effects" of X, not the "causes."  (The first sub-bullet above, not the second.)

For example, suppose hypothetically that I cook dinner every evening. And this process consists of these steps in order:

  • "W": considering what ingredients I have in the house
  • "X": deciding on a particular meal to make, and cooking it
  • "Y": eating the food
  • "Z": taking a moment after the meal to take stock of the ingredients left in the kitchen

Some days I have lots of ingredients, and I prepare elaborate dinners. Other days I don't, and I make simple and easy dinners.

Now, suppose that on one particular evening, I am making instant ramen (X = ramen). We're given no other info about this evening, but we know this.

What can we conclude from this?  A lot, it turns out:

  • In Y, I'll be eating instant ramen, not something else.
  • In W, I probably didn't have many ingredients in the house. Otherwise I would have made something more elaborate.
  • In Z, I probably don't see many ingredients on the shelves (a result of what we know about W).

This is what happens when we condition on X = ramen.

If instead we apply the do-operator to X, setting X = ramen, then:

  • We learn nothing about W, and from our POV it is still a sample from the original unconditional distribution for W.
  • We can still conclude that I'll be eating ramen afterwards, in Y.
  • We know very little about Z (the post-meal ingredient survey) for the same reason we know nothing about W.

Concretely, this models a situation where I first survey my ingredients like usual, and am then forced to make instant ramen by some force outside the universe (i.e. outside our W/X/Y/Z causal diagram). 

And this is a useful concept, because we often want to know what would happen if we performed just such an intervention!

That is, we want to know whether it's a good idea to add a new cause to the diagram, forcing some variable to have values we think lead to good outcomes.

To understand what would happen in such an intervention, it's wrong to condition on the outcome using the original, unmodified diagram – if we did that, we'd draw conclusions like "forcing me to make instant ramen would cause me to see relatively few ingredients on the shelves later, after dinner."
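The condition-vs-do contrast can be made concrete with a small Monte Carlo sketch of the dinner example. The distributions, thresholds, and variable names below are made up for illustration; only the causal structure (W → X, with Z downstream of both) comes from the example above.

```python
import random

# Toy version of the dinner story.
# W: how many ingredients are in the house
# X: the meal made ("ramen" is likely when ingredients are scarce)
# Y: what gets eaten (downstream of X)
# Z: ingredients seen on the shelves after dinner (downstream of W and X)

def evening(force_x=None):
    w = random.randint(0, 10)
    if force_x is not None:
        x = force_x                     # do(X = x): W plays no role in choosing X
    else:
        x = "ramen" if (w <= 3 or random.random() < 0.1) else "elaborate"
    y = x                               # I eat what I cooked
    z = max(w - (1 if x == "ramen" else 4), 0)
    return w, x, y, z

random.seed(0)
runs = [evening() for _ in range(100_000)]

# Conditioning: keep only the evenings where X happened to equal "ramen".
w_given_ramen = [w for (w, x, y, z) in runs if x == "ramen"]

# Intervening: rerun the world, forcing X = "ramen" from outside the diagram.
w_do_ramen = [w for (w, x, y, z) in
              (evening(force_x="ramen") for _ in range(100_000))]

mean = lambda xs: sum(xs) / len(xs)
print(f"E[W | X = ramen]     = {mean(w_given_ramen):.2f}")  # low: ramen is evidence of scarcity
print(f"E[W | do(X = ramen)] = {mean(w_do_ramen):.2f}")     # ~5: the intervention tells us nothing about W
```

Conditioning pulls the estimate of W down (ramen is evidence about its causes), while the do-operator leaves W at its unconditional mean, exactly as in the bullets above.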

Because the model has residual connections.

The "sequential calculation steps" I'm referring to are the ones that CoT adds above and beyond what can be done in a single forward pass.  It's the extra sequential computation added by CoT, specifically, that is bottlenecked on the CoT tokens.

There is of course another notion of "sequential calculation steps" involved: the sequential layers of the model.  However, I don't think the bolded part of this is true:

replacing the token by a dot reduces the number of serial steps the model can perform (from mn to m+n, if there are m forward passes and n layers)

If a model with N layers has been trained to always produce exactly M "dot" tokens before answering, then the number of serial steps is just N, not M+N.

One way to see this is to note that we don't actually need to run M separate forward passes.  We can just pre-fill a context window containing the prompt tokens followed by M dot tokens, and run 1 forward pass on the whole thing.

Having the dots does add computation, but it's only extra parallel computation – there's still only one forward pass, just a "wider" one, with more computation happening in parallel inside each of the individually parallelizable steps (tensor multiplications, activation functions).

(If we relax the constraint that the number of dots is fixed, and allow the model to choose it based on the input, that still doesn't add much: note that we could do 1 forward pass on the prompt tokens followed by a very large number of dots, then find the first position where we would have sampled a non-dot from the output distribution, truncate the KV cache to end at that point, and sample normally from there.)
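The prefill argument can be checked with a toy dependency-graph calculation. Under causal attention, the node (layer, position) depends on (layer − 1, p) for every p ≤ position, so the longest serial chain in a single prefilled forward pass equals the layer count no matter how many dot tokens are appended. A sketch, with made-up sizes:

```python
from functools import lru_cache

def serial_depth(n_layers, n_positions):
    """Longest dependency chain (in layer applications) for one forward
    pass over a fully prefilled context with causal attention."""
    @lru_cache(maxsize=None)
    def depth(layer, pos):
        if layer == 0:
            return 1
        # (layer, pos) reads (layer - 1, p) for every p <= pos
        return 1 + max(depth(layer - 1, p) for p in range(pos + 1))
    return max(depth(n_layers - 1, p) for p in range(n_positions))

# Prefilling M dot tokens adds parallel width, not serial depth:
print(serial_depth(n_layers=12, n_positions=1))    # 12
print(serial_depth(n_layers=12, n_positions=51))   # still 12, not 50 + 12

# By contrast, sampling m CoT tokens one at a time forces pass k to wait
# for pass k-1's top layer, giving on the order of m * n serial steps --
# that extra depth is what CoT, and not dots, buys.
```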

If you haven't read the paper I linked in OP, I recommend it – it's pretty illuminating about these distinctions.  See e.g. the stuff about CoT making LMs more powerful than TC^0 versus dots adding more power within TC^0.

In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.

Complexity makes things worse, yes, but the conclusion "AGI is unlikely to have our values" is already entailed by the other premises even if we drop the stuff about complexity.

Why: if we're just sampling some function from a simplicity prior, we're very unlikely to get any particular nontrivial function that we've decided to care about in advance of the sampling event.  There are just too many possible functions, and probability mass has to get divided among them all.

In other words, if it takes n bits to specify human values, there are 2^n ways that a bitstring of the same length could be set, and we're hoping to land on just one of those through luck alone.  (And to land on a bitstring of this specific length in the first place, of course.)  Unless n is very small, such a coincidence is extremely unlikely.

And n is not going to be that small; even in the sort of naive and overly simple "hand-crafted" value specifications which EY has critiqued in this post and elsewhere, a lot of details have to be specified.  (E.g. some proposals refer to "humans" and so a full algorithmic description of them would require an account of what is and isn't a human.)
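The arithmetic here is stark even for modest bit-lengths (a throwaway calculation; the specific values of n are arbitrary):

```python
# Chance of landing on one particular n-bit specification if all 2**n
# settings of the bits were equally likely:
for n in (10, 100, 300):
    print(f"n = {n:>3}: 2**-n = {2.0 ** -n:.2e}")
```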


One could devise a variant of this argument that doesn't have this issue, by "relaxing the problem" so that we have some control, just not enough to pin down the sampled function exactly.  And then the remaining freedom is filled randomly with a simplicity bias.  This partial control might be enough to make a simple function likely, while not being able to make a more complex function likely.  (Hmm, perhaps this is just your second argument, or a version of it.)

This kind of reasoning might be applicable in a world where its premises are true, but I don't think its premises are true in our world.

In practice, we apparently have no trouble getting machines to compute very complex functions, including (as Matthew points out) specifications of human value whose robustness would have seemed like impossible magic back in 2007.  The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want.  This point is true!

Matthew is not disputing this point, as far as I can tell.

Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:

[...] and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect [...]

But if you're doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.

But it seems to me that he's already doing this.  He's not alleging that this post is incorrect in isolation.

The only reason this discussion happened in the comments of this post at all is the May 2024 update at the start of it, which Matthew used as a jumping-off point for saying "my critique of the 'larger argument' does not make the mistake referred to in the May 2024 update[2], but people keep saying it does[3], so I'll try restating that critique again in the hopes it will be clearer this time."

 

  1. ^

    I say "some version of" to allow for a distinction between (a) the "larger argument" of Eliezer_2007's which this post was meant to support in 2007, and (b) whatever version of the same "larger argument" was a standard MIRI position as of roughly 2016-2017.

    As far as I can tell, Matthew is only interested in evaluating the 2016-2017 MIRI position, not the 2007 EY position (insofar as the latter is different, if in fact it is).  When he cites older EY material, he does so as a means to an end – either as indirect evidence of later MIRI positions, or because it was itself cited in the later MIRI material which is his main topic.

  2. ^

    Note that the current version of Matthew's 2023 post includes multiple caveats that he's not making the mistake referred to in the May 2024 update.

    Note also that Matthew's post only mentions this post in two relatively minor ways, first to clarify that he doesn't make the mistake referred to in the update (unlike some "Non-MIRI people" who do make the mistake), and second to support an argument about whether "Yudkowsky and other MIRI people" believe that it could be sufficient to get a single human's values into the AI, or whether something like CEV would be required instead.

    I bring up the mentions of this post in Matthew's post in order to clarify what role "is 'The Hidden Complexity of Wishes' correct in isolation, considered apart from anything outside it?" plays in Matthew's critique – namely, none at all, IIUC.

    (I realize that Matthew's post has been edited over time, so I can only speak to the current version.)

  3. ^

    To be fully explicit: I'm not claiming anything about whether or not the May 2024 update was about Matthew's 2023 post (alone or in combination with anything else) or not. I'm just rephrasing what Matthew said in the first comment of this thread (which was also agnostic on the topic of whether the update referred to him).

Thanks for the links!

I was pleased to see OpenAI reference this in their justification for why they aren't letting users see o1's CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).

As it happens, the decision to hide o1 CoTs was one of the main things that motivated me to write this post.  Or rather, the muted reaction to it / lack of heated debate about it.

The way I see things, the ability to read CoTs (and more generally "the fact that all LLM sampling happens in plain sight") is a huge plus for both alignment and capabilities – it's a novel way for powerful AI to be useful and (potentially) safe that people hadn't even really conceived of before LLMs existed, but which we now hold in our hands.

So when I saw that o1 CoTs would be hidden, that felt like a turning point, a step down a very bad road that we didn't have to choose.

(Like, remember those Anthropic deception papers that had a hidden scratchpad, and justified it by saying it was modeling a scenario where the model had learned to do similar reasoning inside a forward pass and/or steganographically? At the time I was like, "yeah, okay, obviously CoTs can't be hidden in real life, but we're trying to model those other situations, and I guess this is the best we can do."

I never imagined that OpenAI would just come out and say "at long last, we've built the Hidden Scratchpad from Evan Hubinger's sci-fi classic Don't Build The Hidden Scratchpad"!)

Although I saw some people expressing frustration about the choice to hide o1 CoTs, it didn't seem like other people were reacting with the intensity I'd expect if they shared my views. And I thought, hmm, well, maybe everyone's just written off CoTs as inherently deceptive at this point, and that's why they don't care. And then I wrote this post.

(That said, I think I understand why OpenAI is doing it – some mixture of concern about people training on the CoTs, and/or being actually concerned about degraded faithfulness while being organizationally incapable of showing anything to users unless they put that thing under pressure to look nice and "safe," in a way that could degrade faithfulness. I think the latter could happen even without a true feedback loop where the CoTs are trained on feedback from actual users, so long as they're trained to comply with "what OpenAI thinks users like" even in a one-time, "offline" manner.

But then at that point, you have to ask: okay, maybe it's faithful, but at what cost? And how would we even know? If the users aren't reading the CoTs, then no one is going to read them the vast majority of the time. It's not like OpenAI is going to have teams of people monitoring this stuff at scale.)

So what we have is not "the CoT remains constant but the answers vary". Instead, the finding is: "a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias."

Thanks for bringing this up.

I think I was trying to shove this under the rug by saying "approximately constant" and "~constant," but that doesn't really make sense, since of course the CoTs actually vary dramatically in response to the biasing features. (They have to, in order to justify different final answers.)

To be honest, I wrote the account of Turpin et al in the post very hastily, because I was really mainly interested in talking about the other paper.  My main reaction to Turpin et al was (and still is) "I don't know what you expected, but this behavior seems totally unsurprising, given its ubiquity among humans (and hence in the pretraining distribution), and the fact that you didn't indicate to the model that it wasn't supposed to do it in this case (e.g. by spelling that out in the prompt)."

But yeah, that summary I wrote of Turpin et al is pretty confused – when I get a chance I'll edit the post to add a note about this.


Thinking about it more now, I don't think it makes sense to say the two papers discussed in the post were both "testing the causal diagram (question -> CoT -> answer)" – at least not in the same sense.

As presented, that diagram is ambiguous, because it's not clear whether nodes like "CoT" are referring to literal strings of text in the context window, or to something involving the semantic meaning of those strings of text, like "the aspects of the problem that the CoT explicitly mentions."

With Lanham et al, if we take the "literal strings of text" reading, then there's a precise sense in which the paper is testing the causal diagram.

In the "literal strings" reading, only arrows going from left-to-right in the context window are possible (because of the LLM's causal masking). This rules out e.g. "answer -> CoT," and indeed almost uniquely identifies the diagram: the only non-trivial question remaining is whether there's an additional arrow "question -> answer," or whether the "question"-"answer" relationship is mediated wholly through "CoT."  Testing whether this arrow is present is exactly what Lanham et al did. (And they found that it was present, and thus rejected the diagram shown in the post, as I said originally.)

By contrast, Turpin et al are not really testing the literal-strings reading of the diagram at all.  Their question is not "which parts of the context window affect which others?" but "which pieces of information affect which others?", where the "information" we're talking about can include things like "whatever was explicitly mentioned in the CoT."

I think there is perhaps a sense in which Turpin et al are testing a version of the diagram where the nodes are read more "intuitively," so that "answer" means "the value that the answer takes on, irrespective of when in the context window the LLM settles upon that value," and "CoT" means "the considerations presented in the CoT text, and the act of writing/thinking-through those considerations."  That is, they are testing a sort of (idealized, naive?) picture where the model starts out the CoT not having any idea of the answer, and then brings up all the considerations it can think of that might affect the answer as it writes the CoT, with the value of the answer arising entirely from this process.

But I don't want to push this too far – perhaps the papers really are "doing the same thing" in some sense, but even if so, this observation probably confuses matters more than it clarifies them.


As for the more important higher-level questions about the kind of faithfulness we want and/or expect from powerful models... I find stuff like Turpin et al less worrying than you do.

First, as I noted earlier: the kinds of biased reasoning explored in Turpin et al are ubiquitous among humans (and thus the pretraining distribution), and when humans do them, they basically never mention factors analogous to the biasing factors.

When a human produces an argument in writing – even a good argument – the process that happened was very often something like:

  1. (Half-consciously at best, and usually not verbalized even in one's inner monologue) I need to make a convincing argument that P is true. This is emotionally important for some particular reason (personal, political, etc.)
  2. (More consciously now, verbalized internally) Hmm, what sorts of arguments could be evinced for P?  [Thinks through several of them and considers them critically, eventually finding one that seems to work well.]
  3. (Out loud) P is true because [here they provide a cleaned-up version of the "argument that seemed to work well," crafted to be clearer than it was in their mind at the moment they first hit upon it, perhaps with some extraneous complications pruned away or the like].

Witness the way that long internet arguments tend to go, for example.  How both sides keep coming back, again and again, bearing fresh new arguments for P (on one side) and arguments against P (on the other).  How the dispute, taken as a whole, might provide the reader with many interesting observations and ideas about the object-level truth-value of P, and yet never touch on the curious fact that these observations/ideas are parceled out to the disputants in a very particular way, with all the stuff that weighs in favor of P spoken by one of the two voices, and all the stuff that weighs against P spoken by the other.

And how it would, in fact, be very weird to mention that stuff explicitly.  Like, imagine someone in an internet argument starting out a comment with the literal words: "Yeah, so, reading your reply, I'm now afraid that people will think you've not only proven that ~P, but proven it in a clever way that makes me look dumb. I can't let that happen. So, I must argue for P, in such a way that evades your clever critique, and which is itself very clever, dispelling any impression that you are the smarter of the two. Hmm, what sorts of arguments fit that description? Let's think step by step..."

Indeed, you can see an example of this earlier in this very comment!  Consider how hard I tried to rescue the notion that Turpin et al were "testing the causal diagram" in some sense, consider the contortions I twisted myself into trying to get there.  Even if the things I said there were correct, I would probably not have produced them if I hadn't felt a need to make my original post seem less confused than it might otherwise seem in light of your comment.  And yet I didn't say this outright, at the time, above; of course I didn't; no one ever does[1].

So, it's not surprising that LLMs do this by default.  (What would be surprising is if we found, somehow, that they didn't.)

They are producing text that is natural, in a human sense, and that text will inherit qualities that are typical of humans except as otherwise specified in the prompt and/or in the HHH finetuning process.  If we don't specify what we want, we get the human default[2], and the human default is "unfaithful" in the sense of Turpin et al.

But we... can just specify what we want?  Or try to?  This is what I'm most curious about as an easy follow-up to work like Turpin et al: to what extent can we get LLM assistants to spell out the unspoken drivers of their decisions if we just ask them to, in the prompt?

(The devil is in the details, of course: "just ask" could take various forms, and things might get complicated if few-shots are needed, and we might worry about whether we're just playing whack-a-mole with the hidden drivers that we just so happen to already know about. But one could work through all of these complications, in a research project on the topic, if one had decided to undertake such a project.)

A second, related reason I'm not too worried involves the sort of argumentation that happens in CoTs, and how we're seeing this evolve over time.

What one might call "classic CoT" typically involves the model producing a relatively brief, straight-to-the-point argument, the sort of pared-down object for public consumption that a human might produce in "step 3" of the 1-2-3 process listed above.  (All the CoTs in Turpin et al look like this.)

And all else being equal, we'd expect such CoTs to look like the products of all-too-human 1-2-3 motivated reasoning.

But if you look at o1 CoTs, they don't look like this. They verbalize much more of the "step 2" and even "step 1" stuff, the stuff that a human would ordinarily keep inside their own head and not say out loud.

And if we view o1 as an indication of what the pressure to increase capabilities is doing to CoT[3], that seems like an encouraging sign.  It would mean that models are going to talk more explicitly about the underlying drivers of their behavior than humans naturally do when communicating in writing, simply because this helps them perform better. (Which makes sense – humans benefit from their own interior monologues, after all.)

(Last note: I'm curious how the voice modality interacts with all this, since humans speaking out loud in the moment often do not have time to do careful "step 2" preparation, and this makes naturally-occurring speech data importantly different from naturally-occurring text data. I don't have any particular thoughts about this, just wanted to mention it.)

  1. ^

    In case you're curious, I didn't contrive that earlier stuff about the causal diagram for the sake of making this meta point later. I wrote it all out "naively," and only realized after the fact that it could be put to an amusing use in this later section.

  2. ^

    Some of the Turpin et al experiments involved few-shots with their own CoTs, which "specifies what we want" in the CoT to some extent, and hence complicates the picture. However, the authors also ran zero-shot versions of these, and found broadly similar trends there IIRC.

  3. ^

    It might not be, of course. Maybe OpenAI actively tried to get o1 to verbalize more of the step 1/2 stuff for interpretability/safety reasons.

Re: the davidad/roon conversation about CoT:

The chart in davidad's tweet answers the question "how does the value-add of CoT on a fixed set of tasks vary with model size?"

In the paper that the chart is from, it made sense to ask this question, because the paper did in fact evaluate a range of model sizes on a set of tasks, and the authors were trying to understand how CoT value-add scaling interacted with the thing they were actually trying to measure (CoT faithfulness scaling).

However, this is not the question you should be asking if you're trying to understand how valuable CoT is as an interpretability tool for any given (powerful) model, whether it's a model that exists now or a future one we're trying to make predictions about.


CoT raises the performance ceiling of an LLM.  For any given model, there are problems that it has difficulty solving without CoT, but which it can solve with CoT.

AFAIK this is true for every model we know of that's powerful enough to benefit from CoT at all, and I don't know of any evidence that the importance of CoT is now diminishing as models get more powerful.

(Note that with o1, we see the hyperscalers at OpenAI pursuing CoT more intensively than ever, and producing a model that achieves SOTAs on hard problems by generating longer CoTs than ever previously employed. Under davidad's view I don't see how this could possibly make any sense, yet it happened.)

But note that different models have different "performance ceilings."

The problems on which CoT helps GPT-4 are problems right at the upper end of what GPT-4 can do, and hence GPT-3 probably can't even do them with CoT.  On the flipside, the problems that GPT-3 needs CoT for are probably easy enough for GPT-4 that the latter can do them just fine without CoT.  So, even if CoT always helps any given model, if you hold the problem fixed and vary model size, you'll see a U-shaped curve like the one in the plot.
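This ceiling-vs-value-add story can be sketched with a toy threshold model: give each model a capability level, treat CoT as a flat boost, and measure CoT's value-add on one problem of fixed difficulty. All numbers and functional forms here are invented for illustration, not taken from the paper:

```python
import math

def p_solve(capability, difficulty, cot=False):
    """Toy success probability: logistic in (capability - difficulty),
    with CoT modeled as a flat capability boost."""
    boost = 1.5 if cot else 0.0
    return 1 / (1 + math.exp(-(capability + boost - difficulty)))

DIFFICULTY = 5.0
for c in (2.0, 3.5, 5.0, 6.5, 8.0):   # crude proxy for model size
    gain = p_solve(c, DIFFICULTY, cot=True) - p_solve(c, DIFFICULTY, cot=False)
    print(f"capability {c}: CoT value-add on this problem = {gain:.3f}")
```

In this toy, the value-add on the fixed problem is largest for models whose ceiling sits near the problem's difficulty and shrinks at both extremes – even though CoT helps every model on *some* problem, namely the ones near its own ceiling.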

The fact that CoT raises the performance ceiling matters practically for alignment, because it means that our first encounter with any given powerful capability will probably involve CoT with a weaker model rather than no-CoT with a stronger one.

(Suppose "GPT-n" can do X with CoT, and "GPT-(n+1)" can do X without CoT.  Well, surely we'll build GPT-n before GPT-(n+1), and then we'll do CoT with the thing we've built, and so we'll observe a model doing X before GPT-(n+1) even exists.)


See also my post here, which (among other things) discusses the result shown in davidad's chart, drawing conclusions from it that are closer to those which the authors of the paper had in mind when plotting it.
