Epistemic status: The following isn't an airtight argument, but mostly a guess about how things will play out.
Consider two broad possibilities:
I. In worlds where we are doing reasonably well on alignment, the AI control agenda does not have much impact.
II. In worlds where we are failing at alignment, AI control may primarily shift probability mass away from "moderately large warning shots" and towards "ineffective warning shots" and "existential catastrophe, full takeover".
The key heuristic is that the global system already has various mechanisms and feedback loops that resist takeover by a single agent (e.g. it is not easy to overthrow the Chinese government). In most cases where AI control would stop an unaligned AI, the counterfactual is that broader civilizational resistance would have stopped it anyway, but with the important side effect of a moderately sized warning shot.
I expect moderately sized warning shots to increase the chances that humanity as a whole takes serious action and, for example, steps up efforts to align the frontier labs.
I am skeptical that incidents stopped by AI control would lead to meaningful change. Sharing details of such an event with proper framing could pose an existential risk, but to the lab involved. In practice, I anticipate vague, sanitized communications along the lines of "our safety systems performed as designed, preventing bad things". Without clear, compelling evidence of the severity of the averted threat, these incidents are unlikely to catalyze serious action. The incentives for labs to downplay and obscure such events will be strong.
There are additional factors to consider, such as AI control likely diverting some resources away from alignment, but I don't think this is the dominant effect.
Note that this isn't a general argument against boxing; e.g. boxes based more on formal methods or theory have a better chance of generalizing.
Typical counter-arguments to this line of reasoning seem to be:
- We will extract useful "automated alignment" work from the unaligned AIs inside of the control scheme. I'm skeptical; I will cover this in a separate post.
- Isn't this a general counter-argument to alignment research as well? In my view, no: details matter; different strains of alignment research have different generalization profiles.
Note: this text existed in draft form before John Wentworth posted his Case Against AI Control Research; my original intent was to extend it a bit more toward discussing the generalization properties of AI control. As this would be redundant now, I'm posting it as it is: there is some non-overlapping content.
Semi-related: if I'm reading OpenAI's recent post "How we think about safety and alignment" correctly, they seem to announce that they're planning on implementing some kind of AI Control agenda. Under the heading "iterative development" in the section "Our Core Principles" they say:
Given the surrounding context in the original post, I think most people would read those sentences as saying something like: "In the future, we might develop AI with a lot of misuse risk, i.e. AI that can generate compelling propaganda or create cyberattacks. So we reserve the right to restrict how we deploy our models (e.g. giving the biology tool only to cancer researchers, not to everyone on the internet)."
But as written, I think OpenAI intends the sentences above to ALSO cover AI control scenarios like: "In the future, we might develop misaligned AIs that are actively scheming against us. If that happens, we reserve the right to continue to use those models internally, even though we know they're misaligned, while using AI control techniques ('deploy into constrained environments, limit to trusted users', etc.) to try and get useful superalignment work out of them anyway."
I don't have a take on the pros/cons of a control agenda, but I haven't seen anyone else note this apparent policy statement of OpenAI's, so I figured I'd write it up.