I agree it's generally better to frame in terms of object-level failure modes rather than "objections" (though: sometimes one is intentionally responding to objections that other people raise, but that you don't buy). And I think that there is indeed a mindset difference here. That said: your comment here is about word choice. Are there substantive considerations you think that section is missing, or substantive mistakes you think it's making?
Thanks, John. I'm going to hold off here on in-depth debate about how to choose between different ontologies in this vicinity, as I do think it's often a complicated and not-obviously-very-useful thing to debate in the abstract, and that lots of taste is involved. I'll flag, though, that the previous essay on paths and waystations (where I introduce this ontology in more detail) does explicitly name various of the factors you mention (along with a bunch of other not-included subtleties). E.g., re the importance of multiple actors:
...Now: so far I’ve onl
Thanks, John -- very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: do you have specific ways in which you feel like the frame in the post is losing contact with the territory?
(This comment is not about the parts which most centrally felt anchored on social reality; see other reply for that. This one is a somewhat-tangential but interesting mini-essay on ontological choices.)
The first major ontological choices were introduced in the previous essay:
That seems like a useful framing to me. I think the main issue is just that often, we don't think of commitment as literally closing off choice -- e.g., it's still a "choice" to keep a promise. But if you do think of it as literally closing off choice then yes, you can avoid the violation of Guaranteed Payoffs, at least in cases where you've actually already made the commitment in question.
There's also a bit more context in a footnote on the first post:
"Some content in the series is drawn/adapted from content that I've posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I'm aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new."
See also this flag at the beginning of the series you linked to:
..."The content here is rough. I’m hoping, later, to revi
I'd call that one aspect -- in particular, quite nearby to what I was trying to capture with "map thinking vs. world thinking" and "rote thinking vs. new thinking." But doesn't seem like it captures the whole thing directly.
Thanks for this comment, Mart, and for the pointer to Vervaeke -- does seem plausibly quite relevant. And "functional agent-arena-relationship" seems like an interesting angle on not-being-a-Zombie.
Cross-posting my take from twitter with a few edits:
...My current take on this paper is that they aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6.
In more particular: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude Sonnet-3.5 doing notably worse (16%) on a multiplication task/evaluation when the documentation sa
Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo's case):
A "Slack conversation" that appears in the prompt includes this message:
And here's the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.
The extent of the manipulation and sandbagging, in what is ostensibly a GPT-4 derivative, and not GPT-5, is definitely concerning. But it also makes me wonder about the connection to 'scaling has failed' rumors lately, where the frontier LLMs somehow don't seem to be working out. One of the striking parts is that it sounds like all the pretraining people are optimistic, while the pessimism seems to come from executives or product people, complaining about it not working as well for eg. coding as they want it to.
I've wondered if we are seeing a post-trainin...
Brief comments (shared in private with Joe earlier):
1. We agree. We also found the sandbagging with no CoT results the most concerning in expectation.
2. They are still early results, and we didn't have a lot of time to investigate them, so we didn't want to make them the headline result. Due to the natural deadline of the o1 release, we couldn't do a proper investigation.
3. The main goal of the paper was to investigate scheming inability arguments for scheming safety cases. Therefore, shifting focus to propensity-based finding would have watered down the main purpose IMO.
We will potentially further look into these findings in 2025.
If we have superintelligent agentic AI that tries to help its user but we end up missing out of the benefits of AI bc of catastrophic coordination failures, or bc of misuse, then I think you're saying we didn't solve alignment bc we didn't elicit the benefits?
In my definition, you don't have to actually elicit the benefits. You just need to have gained "access" to the benefits. And I meant this specifically cover cases like misuse. Quoting from the OP:
...“Access” here means something like: being in a position to get these benefits if you want to –
I do think this is an important consideration. But notice that at least absent further differentiating factors, it seems to apply symmetrically to a choice on the part of Yudkowsky's "programmers" to first empower only their own values, rather than to also empower the rest of humanity. That is, the programmers could in principle argue "sure, maybe it will ultimately make sense to empower the rest of humanity, but if that's right, then my CEV will tell me that and I can go do it. But if it's not right, I'll be glad I first just empowered myself and figured ...
Hi Matthew -- I agree it would be good to get a bit more clarity here. Here's a first pass at more specific definitions.
"Your 2021 report on power-seeking does not appear to discuss the cost-benefit analysis that a misaligned AI would conduct when considering takeover, or the likelihood that this cost-benefit analysis might not favor takeover."
I don't think this is quite right. For example: Section 4.3.3 of the report, "Controlling circumstances" focuses on the possibility of ensuring that an AI's environmental constraints are such that the cost-benefit calculus does not favor problematic power-seeking. Quoting:
...
So far in section 4.3, I’ve been talking about controlling “int
in fact, no one worries about Siri "coordinating" to suddenly give us all wrong directions to the grocery store, because that's not remotely how assistants work.
Note that Siri is not capable of threatening types of coordination. But I do think that by the time we actually face a situation where AIs are capable of coordinating to successfully disempower humanity, we may well indeed know enough about "how they work" that we aren't worried about it.
That post ran into some cross-posting problems so had to re-do
The point of that part of my comment was that insofar as part of Nora/Quintin's response to simplicity argument is to say that we have active evidence that SGD's inductive biases disfavor schemers, this seems worth just arguing for directly, since even if e.g. counting arguments were enough to get you worried about schemers from a position of ignorance about SGD's inductive biases, active counter-evidence absent such ignorance could easily make schemers seem quite unlikely overall.
There's a separate question of whether e.g. counting arguments like mine abo...
The probability I give for scheming in the report is specifically for (goal-directed) models that are trained on diverse, long-horizon tasks (see also Cotra on "human feedback on diverse tasks," which is the sort of training she's focused on). I agree that various of the arguments for scheming could in principle apply to pure pre-training as well, and that folks (like myself) who are more worried about scheming in other contexts (e.g., RL on diverse, long-horizon tasks) have to explain what makes those contexts different. But I think there are various plau...
Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting arguments, simplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast pro...
Ah nice, thanks!
Hi David -- it's true that I don't engage your paper (there's a large literature on infinite ethics, and the piece leaves out a lot of it -- and I'm also not sure I had seen your paper at the time I was writing), but re: your comments here on the ethical relevance of infinities: I discuss the fact that the affectable universe is probably finite -- "current science suggests that our causal influence is made finite by things like lightspeed and entropy" -- in section 1 of the essay (paragraph 5), and argue that infinites are still
(Partly re-hashing my response from twitter.)
I'm seeing your main argument here as a version of what I call, in section 4.4, a "speed argument against schemers" -- e.g., basically, that SGD will punish the extra reasoning that schemers need to perform.
(I’m generally happy to talk about this reasoning as a complexity penalty, and/or about the params it requires, and/or about circuit-depth -- what matters is the overall "preference" that SGD ends up with. And thinking of this consideration as a different kind of counting argument *against* schemers see...
I agree that AIs only optimizing for good human ratings on the episode (what I call "reward-on-the-episode seekers") have incentives to seize control of the reward process, that this is indeed dangerous, and that in some cases it will incentivize AIs to fake alignment in an effort to seize control of the reward process on the episode (I discuss this in the section on "non-schemers with schemer-like traits"). However, I also think that reward-on-the-episode seekers are also substantially less scary than schemers in my sense, for reasons I discuss here (i.e....
Agree that it would need to have some conception of the type of training signal to optimize for, that it will do better in training the more accurate its picture of the training signal, and that this provides an incentive to self-locate more accurately (though not necessary to degree at stake in e.g. knowing what server you're running on).
The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in section 1.5, "On 'slack' in training" -- and at various points the report references how differing levels of "slack" might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: "policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy...
Agents that end up intrinsically motivated to get reward on the episode would be "terminal training-gamers/reward-on-the-episode seekers," and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on "non-schemers with schemer-like traits"), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.
I found this post a very clear and useful summary -- thanks for writing.
Re: "0.00002 would be one in five hundred thousand, but with the percent sign it's one in fifty million." -- thanks, edited.
Re: volatility -- thanks, that sounds right to me, and like a potentially useful dynamic to have in mind.
Oops! You're right, this isn't the right formulation of the relevant principle. Will edit to reflect.
Really appreciated this sequence overall, thanks for writing.
I really like this post. It's a crisp, useful insight, made via a memorable concrete example (plus a few others), in a very efficient way. And it has stayed with me.
Thanks for these thoughtful comments, Paul.
Glad to hear you liked it :)
:) -- nice glasses
Oops! Yep, thanks for catching.
Thanks! Fixed.
Yeah, as I say, I think "neither innocent nor guilty" is least misleading -- but I find "innocent" an evocative frame. Do you have suggestions for an alternative to "selfish"?
Is the argument here supposed to be particular to meta-normativity, or is it something more like "I generally think that there are philosophy facts, those seem kind of a priori-ish and not obviously natural/normal, so maybe a priori normative facts are OK too, even if we understand neither of them"?
Re: meta-philosophy, I tend to see philosophy as fairly continuous with just "good, clear thinking" and "figuring out how stuff hangs together," but applied in a very general way that includes otherwise confusing stuff. I agree various philosophical domain...
Reviewers ended up on the list via different routes. A few we solicited specifically because we expected them to have relatively well-developed views that disagree with the report in one direction or another (e.g., more pessimistic, or more optimistic), and we wanted to understand the best objections in this respect. A few came from trying to get information about how generally thoughtful folks with different backgrounds react to the report. A few came from sending a note to GPI saying we were open to GPI folks providing reviews. And a few came via other m...
Cool, these comments helped me get more clarity about where Ben is coming from.
Ben, I think the conception of planning I’m working with is closest to your “loose” sense. That is, roughly put, I think of planning as happening when (a) something like simulations are happening, and (b) the output is determined (in the right way) at least partly on the basis of those simulations (this definition isn’t ideal, but hopefully it’s close enough for now). Whereas it sounds like you think of (strict) planning as happening when (a) something like simulations are...
I’m glad you think it’s valuable, Ben — and thanks for taking the time to write such a thoughtful and detailed review.
I’m sympathetic to the possibility that the high level of conjuctiveness here created some amount of downward bias, even if the argument does actually have a highly conjunctive structure.”
Yes, I am too. I’m thinking about the right way to address this going forward.
I’ll respond re: planning in the thread with Daniel.
(Note that my numbers re: short-horizon systems + 12 OOMs being enough, and for +12 OOMs in general, changed since an earlier version you read, to 35% and 65% respectively.)
Thanks for these comments.
that suggests that CrystalNights would work, provided we start from something about as smart as a chimp. And arguably OmegaStar would be about as smart as a chimp - it would very likely appear much smarter to people talking with it, at least.
"starting with something as smart as a chimp" seems to me like where a huge amount of the work is being done, and if Omega-star --> Chimp-level intelligence, it seems a lot less likely we'd need to resort to re-running evolution-type stuff. I also don't think "likely to appear smarter than ...
I haven't given a full account of my views of realism anywhere, but briefly, I think that the realism the realists-at-heart want is a robust non-naturalist realism, a la David Enoch, and that this view implies:
In the past, I've thought of idealizing subjectivism as something like an "interim meta-ethics," in the sense that it was a meta-ethic I expected to do OK conditional on each of the three meta-ethical views discussed here, e.g.:
The thought was that on (1), id...
Glad you liked it :). I haven’t spent much time engaging with UDASSA — or with a lot other non-SIA/SSA anthropic theories — at this point, but UDASSA in particular is on my list to understand better. Here I wanted to start with the first-pass basics.
Yes, edited :)
I appreciated this, especially given how challenging this type of exercise can be. Thanks for writing.
Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner and outer-misaligned, both multi-polar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe.
It's true, though, that some of my discussion was specifically meant to address the idea that absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.
Thanks!
Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.
Note that conceptual research, as I'm understanding it, isn't defined by the cognitive skills involved in the research -- i.e., by whether the researchers need to have "conceptual thoughts" like "wait, is this measuring what I think it's measuring?". I agree that normal science involves a ton of conceptual thinking (and many "number-go-up" tasks do too). Rather, conceptual research as I'm understanding it is defined by the tools available for evaluating the research in question.[1] In particular, as I'm understanding it, cases where neither available ... (read more)