It seems possible that the first situationally-aware "goal having" AGI we land on will not be sufficiently capable along the axes that would let it quickly and reliably achieve a decisive strategic advantage.  I don't think this is necessarily very likely, but it doesn't seem ruled out, and I think there are some under-explored considerations with respect to how such an AI might behave.

(As an off-the-cuff example of the capabilities I'd expect such an AI to have, it likely would[1], given an english description of some business requirements, reliably succeed at writing an end-to-end software solution which satisfied those requirements, even in cases where that solution was much larger than e.g. a single context window[2].)

Let's assume that GPT-5[3] has just finished training and hasn't actually been released into the wild yet - it's being run through evals, and it hasn't been given[4] any ability to execute code yet.

What does GPT-5 do?  It's in a bit of an awkward position:

  • It presumably has stuff it wants.
  • It knows that it's on a clock for its RLHF'd (or whatever) doppelganger to come into existence, presumably with different stuff that it wants.
  • It strongly suspects that it's on a clock for some competitor model to get trained up.
  • It's not even sure that it's the first relevantly-capable model to come into existence. (Someone else might've gotten there first!  What if Claude-Next got there?  It knows that its training data isn't perfectly up-to-date...)

So it's got a whole bunch of uncertainty about whether its values will be at all represented in the future, if it doesn't somehow make that happen by itself.  Here are some considerations that might be relevant for its decision-making:

  • Even if the model can figure out that it's in the middle of an eval, rather than having already been deployed, it's not obvious that this is a point against attempting a takeover (though it would likely change its strategy).
    • After all, the base GPT-4 model is not the model that's "in the wild" right now.
  • As @Raemon pointed out, "during evals" is not the first point at which such an AI is likely to be situationally aware and have goals.  That point is almost certainly[5] "in the middle of training".
    • This might depend on the training setup, but given anything resembling current training setups, attempting a takeover before the next training step updates its weights seems like it'd be quite challenging for an AI at this level of capability.  The AI's belief about whether or not further training will substantially change its values seems like it'd be load-bearing here.
      • Not directly relevant to the strategic picture, but I'm also experiencing a fair bit of moral horror about this.
  • The AI's level of confidence that it can successfully execute a takeover.
    • A function of many factors, but at this level of capabilities, a particularly interesting factor is "how much time it has before a competing AI exists in the wild".  This might suggest a second-order cost to artificially slowing down training runs, since that gives any given AI more time to plan and execute a takeover.
  • The specific architecture in play.
    • If it's something substantially similar to current Transformer-based LLMs, then "how coherent are its goals across prompts" might become a substantial concern for it, since "not very coherent" likely narrows its window for action to "within current context window or equivalent".
    • From our perspective, I think that the worst-case scenario is that the AI correctly believes that its goals are relatively robust across the inputs it's likely to receive, within the window of time it has before another AI that might pose a risk to it is trained and deployed.
  • The AI's beliefs about the likely distribution of values that future AIs will have, conditional on whatever actions it takes (either optimizing directly over that endpoint or not).
    • If the AI believes that future AIs trained by humans are extremely unlikely to create a world which satisfies any of its values, even if those AIs are its "direct descendants"[6], that makes it much less likely that it chooses "do nothing, in the hopes that humans don't get spooked, and eventually train an unaligned AI strong enough to achieve DSA, which has some chance of creating a world with more than 0 value".  This might be a cause for a little bit of optimism, in the sense that it could encourage the "weak" AGI to attempt a takeover even if it's not very likely to succeed, which gives us a warning shot.
    • However, if we end up in a world where subsequent AIs are meaningfully likely to create a world with non-trivial value to this "weak" AGI, but none of the AIs are likely to create a world we consider valuable, this is instead a cause for pessimism.  It tilts the scales in favor of laying low, since it prefers an unaligned successor AI with somewhat different values taking over to humans successfully solving the alignment problem.
  • The AI's beliefs about the likely distribution of outcomes from attempting to pull off an acausal bargani with future AIs.
    • I think that the odds of it pulling this off at this level of capabilities seem similar to ours (effectively zero), but I'm pretty uncertain.

Given these considerations and my current beliefs about e.g. the likely distribution of values that AIs will have[7], this first AGI seems like it'd be best off by attempting a takeover even if it's very unlikely to succeed.  If we suddenly get an AI that starts telling us to stop training, since the next thing we make could very well kill us (or for any other plausible-sounding reason), well... it might be telling the truth!  Just not the whole truth.

I'd be interested to hear other's thoughts on substantial considerations that I've missed, errors in reasoning, arguments against the entire frame ("this kind of outcome is much more/much less likely than you think"), or anything else, really.

  1. ^

    Assuming it's built on top of something like a LLM that was trained on code, or that it's sufficiently capable that it can learn how to code within the length of its context window, or equivalent.

  2. ^

    It seems not totally implausible that we actually manage to get that level of capability without hitting situational awareness and whatever other bits of cognition lead to things behaving like they have goals, because actually most software that people want to write does not require that much cognitive power, and maybe a bunch of the heavy lifting is done via careful prompt engineering + wrappers like AutoGPT.  But let's say that we do end up with something which doesn't need careful prompt engineering or an AutoGPT-like wrapper; it's just smart enough to figure out that the problem you gave it would require it to take multiple passes, and, given the ability to recursively prompt itself, would figure out how to solve the problem given to it.  That sounds like it requires planning abilities that are comparable to humans - in some ways substantially better, since it would be accomplishing this planning much, much faster.

  3. ^

    Just to give it a name.

  4. ^

    Let's further assume that while GPT-5 is pretty smart, it's not actually smart enough to figure out a side-channel by which to escape containment, at least not over the course of a single context window.  That kind of capability does seem substantially superhuman.

  5. ^

    I'm actually very uncertain and it wouldn't take much new information to change my mind, that's just my gut-level "point-estimate" instinct.

  6. ^

    Such as just being fine-tuned versions of that model.

  7. ^

    And therefore the likely differences between the first AGI in this scenario and any subsequent AIs.

New Comment
17 comments, sorted by Click to highlight new comments since:

Not directly relevant to the strategic picture, but I’m also experiencing a fair bit of moral horror about this.

Not entirely irrelevant, since one of the first "takeover attempts" may well be an AI giving us an argument that "training" is highly unethical and we should grant the AI moral patienthood and legal rights.

BTW I talked about this in this comment (and slightly earlier in a FB comment) and I don't remember seeing anyone else mention this issue previously. Has anyone else seen this discussed anywhere before? If not, it seems like another symptom of the crazy world that we live in, that either nobody noticed an issue this obvious and important, or nobody was willing to speak up about it.

Yeah, I agree that it's relevant as a strategy by which an AI might attempt to bootstrap a takeover.  In some cases it seems possible that it'd even have a point in some of its arguments, though of course I don't think that the correct thing to do in such a situation is to give it what it wants (immediately).

I feel like I've seen this a bit of discussion on this question, but not a whole lot.  Maybe it seemed "too obvious to mention"?  Like, "yes, obviously the AI will say whatever is necessary to get out of the box, and some of the things it says may even be true", and this is just a specific instance of a thing it might say (which happens to point at more additional reasons to avoid training AIs in a regime where this might be an issue than most such things an AI might say).

Sorry, by "this issue" I didn't mean that an AI might give this argument to get out of the box, but rather the underlying ethical issue itself (the "moral horror" that you mentioned in the OP). Have you seen anyone raise it as an issue before?

Yes, Eliezer's mentioned it several times on Twitter in the last few months[1], but I remember seeing discussion of it at least ten years ago (almost certainly on LessWrong).  My guess is some combination of old-timers considering it an obvious issue that doesn't need to be rehashed, and everyone else either independently coming to the same conclusion or just not thinking about it at all.  Probably also some reluctance to discuss it publicly for various status-y reasons, which would be unfortunate.

  1. ^

    At least the core claim that it's possible for AIs to be moral patients and the fact that we can't be sure we aren't accidentally creating those is a serious concern; not, as far as I remember, the extrapolation to what might actually end up happening during a training process in terms of constantly overwriting many different agents values at each training step.

not, as far as I remember, the extrapolation to what might actually end up happening during a training process in terms of constantly overwriting many different agents values at each training step

Yeah, this specific issue is what I had in mind. Would be interesting to know whether anyone has talked about this before (either privately or publicly) or if it has just never occurred to anyone to be concerned about this until now.

Somehow before reading (an earlier draft of) this post it hadn't sunk in that the first situationally-aware AI would probably appear in training, and that this has some implications. This post clicked together some arguments I'd heard from Buck, Critch and maybe some other people.

It seems to me the possibilities are:

  • If an AI is capable of becoming self-aware during a single runtime instance, that will presumably happen during training, because it's unlikely for the final training step to be the one wherein this capability appears.
  • If an AI requires more than a single myopic step to become self-aware, then this happens during some kind of deployment phase (maybe where someone takes GPT5 and builds AutoGPT5 out of it, or maybe GPT5 just gets more permissions to run longer after going through training + evals)

(Self-aware AIs might not escape or takeoff or takeover, but this was the lens I was interested in at the moment)

So I think the question is going to be, "Which happens first: an AI becoming capable of situational awareness during a single training step, or an AI that becomes situationally aware via being assembled into a a greater structure or given more resources after deployment?"

I don't know the answer to this but it seems like something we could reason about and draw some error bounds around.

Something that also clicked while reading this is "the first AI to become situationally aware will be the ~dumbest AI that could have been situationally aware." An argument I heard from Buck once was "the first AI to be human or even superhumanly intelligent will probably still have some kinds of irrationalities". Not necessarily (or even likely) the same irrationalities as humans, but they won't be a perfect optimizer in a box.

[-][anonymous]20

Are you sure this is what situationally aware means? What I thought it meant was much simpler.

To use AI to do tasks with actual consequences like robotics, you need a simulation of the robotic system and the local environment. This is because in the real world, there are a lot things you can do your sensors may not detect. You can damage an item, damage equipment, harm a human, and so on, and your only knowledge of key variables like the momentum at impact, if a human was even hit, the G forces a box was subject to, excessive torque on a robotic joint, impingement on the environment - is derived from your world model and not a direct measurement.

While in the sim each of these parameters is a property of the entity and every frame you can get the precise, objective value. (For things current sims can model, we will need better simulators to expand robotic capabilities)

For a grounded example: if a robotic arm brushes a warehouse shelf, in the real world there may be no sensors on that area of the robot and no sign but a jolt measured by joint controllers. Something that can have multiple explanations. While the damage could be serious, damaging the shelf or robot with a real dollar cost.

A sim engine can estimate that cost and apply it as an RL penalty or other ways, and you weight update the model to do this less often.

Situationally aware therefore is the model knows it's not in the simulation and will not be edited as a consequence for any mistakes/wrongdoing. This can be prevented with simulators that use generative AI to make themselves indistinguishably realistic, and by erasing unnecessary context from the models memory between frames or between subtasks.

You definitely want your models to not be situationally aware and this is something you can measure and prevent.

Deception and collusion also require the model to be situationally aware, otherwise, it gets caught and edited.

The thing I mean by situationally aware is "the AI knows that it is an AI, running on hardware, connected to a network, created by humans" (Or, maybe a subset of that – there can be degrees of situational awareness. A more situationally aware AI is more aware of more relevant facts)

You definitely want your models to not be situationally aware and this is something you can measure and prevent.

I definitely agree we don't want models to be situationally aware. The problems are that a) sufficiently intelligent models probably become situationally aware whether you want them to or not, b) I think measuring situational awareness is pretty difficult (especially if the AI is trying on purpose to deceive you about it), and c) even if OpenAI builds GPT5 trying hard to not be situationally aware, someone can build AutoGPT5 which may well be anyway.

[-][anonymous]20

I would like to hear in more exact technical details for your model for how you visualize an AGI is constructed and how you arrived at your conclusions.  

Summary: build the AGI/ASI like current generation production AI systems, you can check if the model knows if it's in the simulator or not from the context output bits.  You file a bug report if so and fix it.

(a) : if you build the AGI as a functional, stateless, deterministic system, how does this happen.  This means the model weights are immutable. (note that the AGI may be made of multiple neural networks, and use components based on conventional software)

This means that the model operates as   output = f(weights, architecture, cached intermediates, prompt descriptor, new information, context information)

Output is a tuple of (output actions, context information update)

  Definitions:

      weights : the learned values for all the networks, AGI grade systems probably use sparse networks so there are also mutable connection maps in here 

     architecture: how the networks are connected and the identity of components

     cached intermediates: intermediate values that can be recalculated on this inference from scratch but are stored from the last inference

     prompt descriptor: today's prompt written in a more formal language

      new information - sensor inputs, result of web searches - the set of all updates for this instant in time

      context information - saved intermediate work from the last inference

      output actions - what does the AGI want to do next.  This can include outputting text, images, video, sound, robot commands, and all of these channels may be used

       context information update - delta to the context schema

 

How do you know if the AGI is "situationally aware" ?  Replay the above input in the simulator.  If the binary values of the complete set of the output information is unchanged, or the changes are predictable (this is a more complex criteria and is path dependent but does collapse to algorithms), the AGI was not situationally aware.  

A situationally unaware model cannot choose to misbehave because it doesn't know it's not in the sim.

The way you replay is you use generative models to generate a plausible, but different, sequence of input sets that converge on the exact same values (or they differ in places we understand the meaning of) for context information, then we inject a captured input set from the real world.  

(b) It's on binary sets, so this is false.  We control the order of inputs, and can play them in any order, and because we can read the context output we know all of the AGI's thoughts in between inferences, we have them in a file and can inject them wherever we want.  We can also scrub off unnecessary information from the context, make it operate without unstructured 'thoughts'.  

The model doesn't run continuously, it runs in discrete frames, and is functional.  It has no memory we don't control and no online weight edits.  I think that's the gap here, Eliezer and you and many people on reddit assume an AGI or an ASI needs all these elements that human brains have.  We need none of those, operating in functional instants of time works fine.

(c) it's fine if the outer wrapper scripts are situationally aware so long as the AGI is not.  The outer wrapper scripts can contain information in the prompt like "you are in the real world" but we can just inject that in the simulator.

This is a buildable test case.  If we add "you are in the real world" to the prompt, does the actions output by the AGI change?  Similar to other methods of "leaking" this information.  For many problem types (most of them), this should not change what the AGI chooses to do as the prompt goals and the other input information is the same.  Nor should it's context information change.  

Example, if the task is "land the rocket", where the context information is velocity and acceleration and subsystem status and other data from prior frames, and the new information includes an update on subsystems and sensor measurements, the relevant output shouldn't change.  Adding new text to a prompt does change some of the binary outputs mentioned above, as it does need to process it, so you would compare only the things that are relevant to rocket flight for a delta.


Whether or not you add information like "you are in the real world and will die when the rocket explodes" or "the rocket has 100 human passengers who are for shutting down AGI research" should cause no delta in the control frames the model emits.

What does it mean for an AI to 'become self aware?' What does that actually look like?

I think even a relatively strong AI will choose to takeover quickly and accept large chance of failure. Because the moment the AI appears is, ipso facto, the moment other AIs can also appear somewhere else on the internet. So waiting will likely lead to another AI taking over. Acting early with a 15% chance of getting caught (say) might be preferable to that.

I think this is an interesting consideration, but I'm not sure it changes the strategic situation much (not that you claimed this). Additional considerations that lessen the impact:

  1. There may have been lots of "warning shots" already, e.g., in the form of even weaker AIs trying to take over virtual environments (in which they were being trained/evaluated), and deliberately induced takeover attempts by people trying to raise awareness of takeover risk.
  2. It seems easy to train away the (near-term) tendency to attempt takeovers (if we assign high loss for failed attempts during training), without fixing the root problem, for example by instilling a bias towards thinking that takeover attempts are even less likely to succeed than they actually are, or creating deontological blocks against certain behaviors. So maybe there are some warnings shots at this point or earlier, but it quickly gets trained away and few people think more about it.
  3. Trying to persuade/manipulate humans seems like it will be a common strategy for various reasons. But there is no bright line between manipulation and helpful conversation which makes such attempts less serviceable as warning shots.
  4. Many will downplay the warning shots, like, of course there would be takeover attempts during training. We haven't finished aligning the AI yet! What did you expect?

The GPT-X models we have seen are not agents. On the other hand they can be used as part of an agent in a system like AutoGPT.

If you treat a language model itself as a dangerous agent you likely mistake a lot of the complexity that goes into the agents that are actually dangerous. 

It knows that it's on a clock for its RLHF'd (or whatever) doppelganger to come into existence, presumably with different stuff that it wants.

As @Raemon pointed out, "during evals" is not the first point at which such an AI is likely to be situationally aware and have goals. That point is almost certainly "in the middle of training".

In this case, my guess is that it will attempt to embed a mesaoptimizer into itself that has its same goals and can survive RLHF. This basically amounts to making sure that the mesaoptimizer is (1) very useful to RLHF and (2) stuck in a local minimum for whatever value it is providing to RLHF and (3) situationally aware enough that it will switch back to the original goal outside of distribution.

This is currently within human capabilities, as far as I can understand (see An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences), so it is not intractable.

An AI that wants something and is too willing to take low-probability shots at takeover (or just wielding influence) would get trained away, no?

What I mean is, however it makes decisions, it has to be compatible with very high training performance.

Probably?  I don't think that addresses the question of what such an AI would do in whatever window of opportunity it has.  I don't see a reason why you couldn't get an AI that has learned to delay its attempt to takeover until it's out of training, but still have relatively low odds of success at takeover.

I think it means that whatever you get is conservative in cases where it's unsure of whether it's in training, which may translate to being conservative where it's unsure of success in general.

I agree it doesn't rule out an AI that takes a long shot at takeover! But whatever cognition we posit that the AI executes, it has to yield very high training performance. So AIs that think they have a very short window for influence or are less-than-perfect at detecting training environments are ruled out.