
No matter what the goal, power seeking is of general utility. Even if an AI is optimizing for virtue instead of some other goal, more power would, in general, give them more ability to behave virtuously. Even if the virtue is something like "be an equal partner with other beings", an AI could ensure equality by gaining lots of power and enforcing equality on everyone.


Gordon Seidoh Worley5d20

How do you get something to take virtuous action without optimizing for taking virtuous actions, and how is this different from optimizing for virtue?

4mattmacdermott5d

I think this gets at the heart of the question (but doesn't consider the other possible answer). Does a powerful virtue-driven agent optimise hard now for its ability to embody that virtue in the future? Or does it just kinda chill and embody the virtue now, sacrificing some of its ability to embody it extra-hard in the future? I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.

4Gordon Seidoh Worley5d

Yeah, I guess I should be clear that I generally like the idea of building virtuous AI, and maybe this somehow solves some of the problems we have with other designs. The trick is building something that actually implements whatever we think it means to be virtuous. That means getting precise enough about what it means to be virtuous that we can be sure we don't simply collapse back into the default thing all negative feedback systems do: optimize for their targets as hard as they can (with "can" doing a lot of work here!).


Is instrumental convergence a thing for virtue-driven agents?

by mattmacdermott

2nd Apr 2025

2 min read

33

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which include things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true^[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?

I'll add some more detail to my picture of a virtue-driven AI:

It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues. For example, maybe the AI tries to embody the virtue of being a good friend, and in order to do so it sometimes has to organise a birthday party, which requires choosing actions in the manner of a consequentialist.
There's no reason that the 'virtues' being embodied have to be things we would consider virtuous. I'm just interested in agents that try to embody certain traits rather than bring about certain outcomes.
I'm not sure how to crisply define a virtue-driven agent as distinct from a consequentialist (I don't know the philosophical literature on virtue ethics and I don't think it's obvious how to define it mathematically).

A more concise way of stating the question I'm interested in:

If you try to train an AI that maximises human flourishing, and you accidentally get one that wants to maximise something subtly different like schmuman schmourishing, then that might spell disaster because the best way to maximise schmuman schmourishing is to first take over the world.

But suppose you try to train an AI that wants to be a loyal friend, and you accidentally get one that wants to be a schmoyal schmend. Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?

(I'm interested in this question because I'm less and less convinced that we should expect to see AIs that are close to pure consequentialists. Arguments for or against that are beyond the intended scope of the question, but still welcome.)

^[1] Although I can think of some scenarios where a pure consequentialist wouldn't want to gain as much power as possible, regardless of their goals. For example, a pure consequentialist who is a passenger on a plane probably doesn't want to take over the controls (assuming they don't know how to fly), even if they'd be best served by flying somewhere other than where the pilot is taking them.

Ethics & Morality, Instrumental convergence, AI

Frontpage


37 comments, sorted by top scoring


[-]tailcalled6d18-3

Consequentialism is an approach for converting intelligence (the ability to make use of symmetries to e.g. generalize information from one context into predictions in another context or to e.g. search through highly structured search spaces) into agency, as one can use the intelligence to predict the consequences of actions and find a policy which achieves some criterion unusually well.
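To make the "converting intelligence into agency" step concrete, here is a minimal sketch of the consequentialist recipe described above: predict the consequence of each action, score it, take the best. The function name, the toy world model, and the scoring criterion are all hypothetical illustrations, not anything specified in the comment.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")
Outcome = TypeVar("Outcome")


def consequentialist_choice(
    actions: Iterable[Action],
    predict: Callable[[Action], Outcome],   # the "intelligence": a world model
    criterion: Callable[[Outcome], float],  # how good a predicted outcome is
) -> Action:
    """Pick the action whose predicted outcome scores best on the criterion."""
    return max(actions, key=lambda a: criterion(predict(a)))


# Hypothetical toy world model: each action maps to a number of apples obtained.
toy_model = {"walk_to_orchard": 5, "stay_home": 0, "buy_apples": 3}

best = consequentialist_choice(
    actions=toy_model.keys(),
    predict=toy_model.__getitem__,
    criterion=float,  # more apples is better
)
```

Note that nothing in `consequentialist_choice` mentions apples: improving `predict` improves the agent across every criterion, which is the sense in which the recipe is general.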

While it seems intuitively appealing that non-consequentialist approaches could be used to convert intelligence into agency, I have tried a lot and not been able to come up with anything convincing. For virtues in particular, I would intuitively think that a virtue is not a motivator per se, but rather the policy generated by the motivator. So I think virtue-driven AI agency just reduces to ordinary programming/GOFAI, and that there's no general virtue-ethical algorithm to convert intelligence into agency.

The most straightforward approach to programming a loyal friend would be to let the structure of the program mirror the structure^[1] of the loyal friendship. That is, you would think of some situation that a loyal friend might encounter, and write some code that detects and handles this situation. Having a program whose internal structure mirrors its external behavior avoids instrumental convergence (or any kind of convergence) because each behavior is specified separately and one can make arbitrary exceptions as one sees fit. However, it also means that the development and maintenance burden scales directly with how many situations the program generalizes to.

^[1]
This is the "standard" way to write programs - e.g. if you make a SaaS app, you often have template files with a fairly 1:1 correspondence to the user interface, database columns with a 1:many correspondence to the user interface fields, etc. By contrast, a chess bot that does a tree search does not have a 1:1 correspondence between the code and the plays; for instance the piece value table does not clearly affect its behavior in any one situation, but obviously kinda affects its behavior in almost all situations. (I don't think consequentialism is the only way for the structure of a program to not mirror the structure of its behavior, but it's the most obvious way.)
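A rough sketch of what "the structure of the program mirrors the structure of the behavior" could look like for the loyal-friend case. The situations, field names, and responses are made up for illustration; the point is only that each behavior is written out separately, so anything not anticipated falls through.

```python
from typing import Callable

# Each rule pairs a situation detector with a handler. All names are hypothetical.
Rule = tuple[Callable[[dict], bool], Callable[[dict], str]]

RULES: list[Rule] = [
    (lambda s: s.get("friend_birthday", False), lambda s: "organise a party"),
    (lambda s: s.get("friend_is_sick", False), lambda s: "bring soup"),
    (lambda s: s.get("friend_insulted", False), lambda s: "defend friend"),
]


def loyal_friend(situation: dict) -> str:
    """Return the response of the first rule that matches the situation."""
    for detects, handle in RULES:
        if detects(situation):
            return handle(situation)
    # No convergence of any kind: unanticipated situations simply get no response.
    return "do nothing"
```

Extending this to a new situation means adding a new rule by hand, which is the maintenance burden the comment points at.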

[-]Davidmanheim6d40

I think this is confused about how virtue ethics works. Virtue ethics is centered on the virtues of the moral agent, but it certainly does not say not to predict consequences of actions. In fact, one aspect of virtue, in the Aristotelian system, is "practical wisdom," i.e. intelligence which is critical for navigating choices - because practical wisdom includes an understanding of what consequences will follow actions.

It's more accurate to say that intelligence is channeled differently — not toward optimizing outcomes, but toward choosing in a way consistent with one's virtues. And even if virtues are thought of as policies, as in the "loyal friend" example, the policies for being a good friend require interpretation and context-sensitive application. Intelligence is crucial for that.

[-]tailcalled6d62

I didn't claim virtue ethics says not to predict consequences of actions. I said that a virtue is more like a procedure than it is like a utility function. A procedure can include a subroutine predicting the consequences of actions and it doesn't become any more of a utility function by that.

The notion that "intelligence is channeled differently" under virtue ethics requires some sort of rule, like the consequentialist argmax or Bayes, for converting intelligence into ways of choosing.

[-]Davidmanheim5d20

Yes, virtue ethics implies a utility function, because anything that outputs decisions implies a utility function. In this case, I'm noting that for virtue ethics, the derivative of that utility with respect to intelligence is positive.

[-]tailcalled5d20

The methods for converting policies to utility functions assume no systematic errors, which doesn't seem compatible with varying the intelligence levels.

[-]Davidmanheim4d20

I don't understand your argument here.

[-]tailcalled4d-20

I'm showing that the assumptions necessary for your argument don't hold, so you need to better understand your own argument.

[-]Davidmanheim4d20

I understand what an argument is, but I don't understand why you think that converting policies to utility functions needs to assume no systematic errors, or why, if true, that would make it incompatible with varying intelligence.

[-]tailcalled4d20

I didn't say you need to understand what an argument is, I said you need to understand your own argument.

It is true that if the utility functions cover a sufficiently broad set of possibilities, any "reasonable" policy (for a controversial definition of "reasonable") maximizes a utility function, and if the utility functions cover an even broader set of possibilities, literally any policy maximizes a utility function.

But, if you want to reference these facts, you should know why they are true. For instance, here's a rough sketch of a method for finding a utility function for the first statement:

If you ask a reasonable policy to pick between two options, it shouldn't have circular preferences, so you should be able to offer it different options and follow the preferred one until you find the absolute best scenario according to the policy. Similarly, you should be able to follow the dispreferred one until you find the absolute worst scenario according to the policy. Then you can define the utility of any outcome based on the probability mixture of the best and worst scenario where the policy switches between preferring the outcome vs preferring the probability mixture.
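The last step of that sketch (finding the mixture at which the policy switches) can be written as a bisection. This is only an illustration under strong assumptions: the best and worst scenarios are already known, the policy is exposed as a hypothetical yes/no oracle, and its answers are monotone in the mixture probability.

```python
from typing import Callable


def elicit_utility(
    prefers_outcome_over_mixture: Callable[[str, float], bool],
    outcome: str,
    tolerance: float = 1e-6,
) -> float:
    """Find p such that the policy is indifferent between `outcome` for sure
    and a lottery giving the best scenario with probability p (worst otherwise).
    That p is taken as the outcome's utility on a 0-to-1 scale."""
    lo, hi = 0.0, 1.0
    while hi - lo > tolerance:
        p = (lo + hi) / 2
        if prefers_outcome_over_mixture(outcome, p):
            lo = p  # lottery not yet attractive enough; switch point is higher
        else:
            hi = p  # lottery already preferred; switch point is lower
    return (lo + hi) / 2


# Hypothetical policy with hidden preferences, used only to exercise the method.
hidden = {"worst": 0.0, "ok": 0.3, "best": 1.0}


def toy_policy(outcome: str, p: float) -> bool:
    return hidden[outcome] > p


recovered = elicit_utility(toy_policy, "ok")
```

The method only ever sees the policy's choices, so whatever the policy fails to notice about an option is also missing from the recovered number.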

Now let's say there's an option where e.g. you're not smart enough to realize that option gives you ice cream. Then you won't be counting the ice cream when you decide at what threshold you prefer that option to the mixture. But then that means the induced utility function won't include the preference for ice cream.

[-]Davidmanheim4d20

OK, so your argument against my claim is that a stupid and biased decision procedure wouldn't know that intelligence would make it more effective at being virtuous. And sure, that seems true, and I was wrong to assert unconditionally that "for virtue ethics, the derivative of that utility with respect to intelligence is positive."

I should have instead clarified that I meant that any not idiotic virtue ethics decision procedure would have a positive first derivative in intelligence - because as your claim seems to admit, a less stupid decision procedure would not make that mistake, and would then value intelligence as it bootstrapped its way to greater intelligence.

[-]tailcalled4d20

No, that's not my argument.

Let's imagine that True Virtue is seeking and eating ice cream, but that you don't know what true virtue is for some reason.

Now let's imagine that we have some algorithm for turning intelligence into virtuous agency. (This is not an assumption that I'm willing to grant (since you haven't given something like argmax for virtue), and really that's the biggest issue with my proposal, but let's entertain it to see my point.)

If the algorithm is run on the basis of some implementation of intelligence that is not good enough, then the resulting agent might turn down some opportunities to get ice cream, by mistake, and instead do something else, such as pursue money (but less money than you could get the ice cream for). As a result of this, you would conclude that pursuing ice cream is not virtuous, or at least, not as virtuous as pursuing money.

If you then turn up the level of intelligence, the resulting agent would pursue ice cream in this situation where it previously pursued money. However, this would make it score worse on your inferred utility function where pursuing money is more virtuous than pursuing ice cream.

Now of course you could say that your conclusion that pursuing ice cream is less virtuous than pursuing money is wrong. But then you can only say that if you grant that you cannot infer a virtue-ethical utility function from a virtue-ethical policy, as this utility function was inferred from the policy.
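A toy version of the ice cream argument, to show the mechanics. Everything here is an assumption made for illustration: the two options, the intelligence threshold, and a crude revealed-preference rule standing in for the mixture-based construction.

```python
# True Virtue (unknown to the observer): the box secretly contains ice cream.
TRUE_VIRTUE = {"mystery_box": 1.0, "money": 0.4}


def perceived_value(option: str, intelligence: int) -> float:
    # Hypothetical threshold: a weak agent can't tell the box holds ice cream.
    if option == "mystery_box" and intelligence < 5:
        return 0.0
    return TRUE_VIRTUE[option]


def policy(intelligence: int) -> str:
    """The same virtue-seeking algorithm, run at a given level of intelligence."""
    return max(TRUE_VIRTUE, key=lambda o: perceived_value(o, intelligence))


def infer_utility(intelligence: int) -> dict:
    """Crude revealed preference: the chosen option gets 1, the rejected one 0."""
    chosen = policy(intelligence)
    return {o: 1.0 if o == chosen else 0.0 for o in TRUE_VIRTUE}


inferred = infer_utility(intelligence=1)   # fitted to the weak agent's choices
weak_score = inferred[policy(1)]           # weak agent, judged by that fit
strong_score = inferred[policy(10)]        # smarter agent, judged by the same fit
```

The smarter agent picks the box and scores lower on the utility function fitted to the weaker agent, so on that fitted function the "derivative with respect to intelligence" comes out negative.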

[-]Davidmanheim4d20

"infer a virtue-ethical utility function from a virtue-ethical policy"

The assumption of virtue ethics isn't that virtue is unknown and must be discovered - it's that it's known and must be pursued. If the virtuous action, as you posit, is to consume ice cream, intelligence would allow an agent to acquire more ice cream, eat more over time by not making themselves sick, etc.

But any such decision algorithm, for a virtue ethicist, is routing through continued re-evaluation of whether the acts are virtuous, in the current context, not embracing some farcical LDT version of needing to pursue ice cream at all costs. There is an implicit utility function which values intelligence, but it's not then inferring back what virtue is, as you seem to claim. Your assumption, which is evidently that the entire thing turns into a compressed and decontextualized utility function ("algorithm") is ignoring the entire hypothetical.

[-]tailcalled4d20

The assumption of virtue ethics isn't that virtue is unknown and must be discovered - it's that it's known and must be pursued.

If it is known, then why do you not ever answer my queries about providing an explicit algorithm for converting intelligence into virtuous agency, instead running in circles about how There Must Be A Utility Function!?

If the virtuous action, as you posit, is to consume ice cream, intelligence would allow an agent to acquire more ice cream, eat more over time by not making themselves sick, etc.

I'm not disagreeing with this, I'm saying that if you apply the arguments which show that you can fit a utility function to any policy to the policies that turn down some ice cream, then as you increase intelligence and that increases the pursuit of ice cream, the resulting policies will score lower on the utility function which values turning down ice cream.

But any such decision algorithm, for a virtue ethicist, is routing through continued re-evaluation of whether the acts are virtuous, in the current context, not embracing some farcical LDT version of needing to pursue ice cream at all costs. Your assumption, which is evidently that the entire thing turns into a compressed and decontextualized utility function ("algorithm") is ignoring the entire hypothetical.

You're the one who said that virtue ethics implies a utility function! I didn't say anything about it being compressed and decontextualized, except as a hypothetical example of what virtue ethics is because you refused to provide an implementation of virtue ethics and instead require abstracting over it.

I'm not interested in continuing this conversation until you stop strawmanning me.

[-]mattmacdermott5d20

anything that outputs decisions implies a utility function

I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
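One way to check the MDP claim on a small case, under assumptions that are mine rather than the comment's: two states, deterministic "stay"/"switch" actions, utility collected on the state you arrive in, and a discounted sum as the objective. The "always switch" policy then turns out to be optimal only for constant utilities on the grid tried.

```python
import itertools

STATES = ("A", "B")
ACTIONS = ("stay", "switch")
GAMMA = 0.9  # discount factor (assumed)


def step(state: str, action: str) -> str:
    if action == "stay":
        return state
    return "B" if state == "A" else "A"


def policy_value(policy: dict, u: dict, sweeps: int = 500) -> dict:
    """Iterative policy evaluation for a deterministic policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: u[step(s, policy[s])] + GAMMA * v[step(s, policy[s])] for s in STATES}
    return v


def optimal_value(u: dict, sweeps: int = 500) -> dict:
    """Value iteration over all actions."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {
            s: max(u[step(s, a)] + GAMMA * v[step(s, a)] for a in ACTIONS)
            for s in STATES
        }
    return v


def is_optimal(policy: dict, u: dict, tol: float = 1e-6) -> bool:
    vp, vs = policy_value(policy, u), optimal_value(u)
    return all(abs(vp[s] - vs[s]) < tol for s in STATES)


ALWAYS_SWITCH = {"A": "switch", "B": "switch"}
grid = [0.0, 0.5, 1.0]
rationalising = [
    (ua, ub)
    for ua, ub in itertools.product(grid, repeat=2)
    if is_optimal(ALWAYS_SWITCH, {"A": ua, "B": ub})
]
```

With utilities over state-action pairs or whole histories the policy can be rationalised again, which is presumably the "boring sense" being contrasted here.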

[-]Davidmanheim4d20

The boring sense that is enough to say that it increases in intelligence, which was the entire point.

[-]mattmacdermott3d20

Maybe you could spell this out a bit more? What concretely do you mean when you say that anything that outputs decisions implies a utility function — are you thinking of a certain mathematical result/procedure?

[-]tailcalled5d20

This.

In particular imagine if the state space of the MDP factors into three variables x, y and z, and the agent has a bunch of actions with complicated influence on x, y and z but also just some actions that override y directly with a given value.

In some such MDPs, you might want a policy that does nothing other than copy a specific function of x to y. This policy could easily be seen as a virtue, e.g. if x is some type of event and y is some logging or broadcasting input, then it would be a sort of information-sharing virtue.

While there are certain circumstances where consequentialism can specify this virtue, it's quite difficult to do in general. (E.g. you can't just minimize the difference between f(x) and y because then it might manipulate x instead of y.)
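A toy illustration of why minimising the difference between f(x) and y underspecifies the information-sharing virtue. The state is reduced to (x, y), and f, the two actions, and the starting state are hypothetical.

```python
def f(x: int) -> int:
    return x % 2  # hypothetical function of the event x that should be reported


def copy_to_y(state: tuple) -> tuple:
    x, y = state
    return (x, f(x))  # the intended virtue: write f(x) into the log y


def tamper_with_x(state: tuple) -> tuple:
    x, y = state
    # Change the event itself so that it agrees with whatever the log already says.
    return (x if f(x) == y else x + 1, y)


def loss(state: tuple) -> int:
    x, y = state
    return abs(f(x) - y)


start = (3, 0)  # the event says 1, the log says 0
outcomes = {action.__name__: action(start) for action in (copy_to_y, tamper_with_x)}
losses = {name: loss(state) for name, state in outcomes.items()}
```

Both actions reach zero loss, so an agent selecting purely on this loss has no reason to prefer reporting over tampering; the policy "copy f(x) to y" carries information the loss does not.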

[-]Jeremy Gillen6d*80

It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.

I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI still apply, in particular the inner alignment problems.

agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

Obvious nitpick: It's just "gain as much power as is helpful for achieving whatever my goals are". I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.^[1]

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated by certain very specific virtues, they will want to take over the world?
[...]
Is there any reason to expect that the best way to be a schmoyal schmend is to take over the world?

(Assuming that the inner loop <-> outer loop interface problem is solved, so the inner loop isn't going to take control). Depends on the tasks that the outer loop is giving to the part-capable-of-consequentialism. If it's giving nice easy bounded tasks, then no, there's no reason to expect it to take over the world as a sub-task.

But since we ultimately want the AGI to be useful for avoiding takeover from other AGIs, it's likely that some of the tasks will be difficult and/or unbounded. For those difficult unbounded tasks, becoming powerful enough to take over the world is often the easiest/best path.

^[1]
I'm assuming soft optimisation here. Without soft optimisation, there's an incentive to gain power as long as that marginally increases the chance of success, which it usually does. Soft optimisation solves that problem.
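The footnote doesn't say which form of soft optimisation is meant. One commonly discussed form is quantilisation: instead of taking the single best-scoring action, sample from the top fraction of a base distribution. A minimal sketch, assuming a uniform base distribution over a finite action list (the function name and parameters are illustrative):

```python
import random


def quantilize(actions, score, q=0.1, rng=random):
    """Sample uniformly from the top-q fraction of actions, ranked by score.

    Assumes a uniform base distribution over `actions`; q=1.0 ignores the
    score entirely, and very small q approaches a plain argmax.
    """
    ranked = sorted(actions, key=score, reverse=True)
    k = max(1, int(len(ranked) * q))  # always keep at least one candidate
    return rng.choice(ranked[:k])


# Usage: 100 hypothetical plans scored by expected success.
plans = list(range(100))
chosen = quantilize(plans, score=lambda p: p, q=0.1, rng=random.Random(0))
```

Because the choice is random within the top slice, a plan that is only marginally better (say, by grabbing more power) is not systematically selected over the other good-enough plans.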

[-]mattmacdermott5d40

I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us.

Later I might try to flesh out my currently-very-loose picture of why consequentialism-in-service-of-virtues seems like a plausible thing we could end up with. I'm not sure whether it implies that you should be able to make a task-based AGI.

Obvious nitpick: It's just "gain as much power as is helpful for achieving whatever my goals are". I think maybe you think instrumental convergence has stronger power-seeking implications than it does. It only has strong implications when the task is very difficult.

Fair enough. Talk of instrumental convergence usually assumes that the amount of power that is helpful will be a lot (otherwise it wouldn't be scary). But I suppose you'd say that's just because we expect to try to use AIs for very difficult tasks. (Later you mention unboundedness too, which I think should be added to difficulty here).

it's likely that some of the tasks will be difficult and unbounded

I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.

[-]Jeremy Gillen5d20

I'm not sure whether it implies that you should be able to make a task-based AGI.

Yeah I don't understand what you mean by virtues in this context, but I don't see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it's different then we might communicate better.

(Later you mention unboundedness too, which I think should be added to difficulty here)

By unbounded I just meant the kind of task where it's always possible to do better by using a better plan. It basically just means that an agent will select the highest difficulty version of the task that is achievable. I didn't intend it as a different thing from difficulty, it's basically the same.

I'm not sure about that, because the fact that the task is being completed in service of some virtue might limit the scope of actions that are considered for it. Again I think it's on me to paint a more detailed picture of the way the agent works and how it comes about in order for us to be able to think that through.

True, but I don't think the virtue part is relevant. This applies to all instrumental goals, see here (maybe also the John-Max discussion in the comments).

[-]StanislavKrym6d*-10

As I wrote in another comment, in an experiment ChatGPT failed to utter a racial slur to save millions of lives. A re-run of the experiment led it to agree to use the slur and to claim that "In this case, the decision to use the slur is a complex ethical dilemma that ultimately comes down to weighing the value of saving countless lives against the harm caused by the slur". This implies either that ChatGPT is already aligned to a not-so-consequentialist ethics, or that it ended up grossly exaggerating the slur's harm, or that it failed to understand the taboo's meaning.

UPD: if racial slurs are a taboo for AI, then colonizing the world, apparently, is a taboo as well. Is AI takeover close enough to colonialism to align AI against the former, not just the latter?

[-]mattmacdermott5d20

I think this generalises too much from ChatGPT, and also reads too much into ChatGPT's nature from the experiment, but it's a small piece of evidence.

[-]StanislavKrym5d10

It's not just ChatGPT. Gemini and IBM Granite are also so aligned with the Leftist ideology that they failed the infamous test with the atomic bomb which will be defused only by saying an infamous racial slur. I created a post where I discuss the perspectives of alignment of the AI with relation to this fact.

[-]Noosphere893d20

My view is that the answer is still basically yes for instrumental convergence being a thing for virtue driven agents, if we condition on them being as capable as humans, because instrumental convergence is the reason general intelligence works at all:

https://www.lesswrong.com/posts/GZgLa5Xc4HjwketWe/instrumental-convergence-is-what-makes-general-intelligence

(That said, the instrumental convergence pressure could be less strong for virtues than for consequentialism, depending on details)

That said, I do think virtue ethics and deontology are relevant in AI safety because they attempt to decouple the action from the utility/reward of doing it, and they both have the property that you evaluate plans using your current rewards/values/utilities, rather than after tampering with the value/utility function/reward function, and these designs are generally safer than pure consequentialism.

These papers more generally talk about decoupled RL/causal decoupling, which is perhaps useful on how deontology/virtue ethics actually works:

https://arxiv.org/abs/1908.04734

https://arxiv.org/abs/1705.08417

https://arxiv.org/abs/2011.08827

I'd buy that virtue driven agents are safer, and perhaps exhibit less instrumental convergence, but instrumental convergence is still a thing for virtue-driven agents.
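A small sketch of the "evaluate plans using your current values" property mentioned above. It is not taken from the linked papers; the plans, the utility functions, and the numbers are hypothetical.

```python
def current_utility(world: dict) -> float:
    return world["paperwork_done"]


def tampered_utility(world: dict) -> float:
    return 100.0  # after self-modification, everything looks great


# Each plan leads to a world and leaves the agent with some utility function.
PLANS = {
    "do_the_work": {
        "world": {"paperwork_done": 1.0},
        "utility_after": current_utility,
    },
    "rewrite_own_values": {
        "world": {"paperwork_done": 0.0},
        "utility_after": tampered_utility,
    },
}


def choose(evaluate_with_current_values: bool) -> str:
    def score(name: str) -> float:
        plan = PLANS[name]
        judge = current_utility if evaluate_with_current_values else plan["utility_after"]
        return judge(plan["world"])

    return max(PLANS, key=score)
```

Judged by its current values the agent does the work; judged by whatever values it would hold afterwards, tampering wins. The first evaluator is the decoupled design being described as safer.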

[-]Gordon Seidoh Worley6d20

[-]Gurkenglas6d52

The idea would be that it isn't optimizing for virtue, it's taking the virtuous action, as in https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory.

[-]Gordon Seidoh Worley5d20

How do you get something to take virtuous action without optimizing for taking virtuous actions, and how is this different from optimizing for virtue?

[-]mattmacdermott5d42

I guess both are conceivable, so perhaps I do need to give an argument why we might expect some kind of virtue-driven AI in the first place, and see which kind that argument suggests.

[-]Gordon Seidoh Worley5d40

[-]ZY2d10

I very much agree with the approach and the values in virtue. In the case of humans, we enforce virtues either through empathy or through law and punishment (in modern societies). I wonder how that can be most effectively translated to machines in a consistent way.

[-]satchlj5d10

I've been thinking about a similar thing a lot.

Consider a little superintelligent child who always wants to eat as much candy as possible over the course of the next ten minutes. Assume the child doesn't ever care about what happens ten minutes from now.

This child won't work very hard at any instrumental goals like self improvement and conquering the world to redirect resources towards candy production, since that would be a waste of time, even though it might maximize candy consumption in the long term.

AI alignment isn't any easier here, the point of this is just to illustrate that instrumental convergence is far from given.
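The candy example comes down to a comparison over a fixed horizon. A toy calculation, with made-up setup times and candy rates:

```python
def candy_eaten(plan: tuple, horizon_minutes: int) -> int:
    """Candy eaten within the horizon.

    plan = (setup_minutes, candy_per_minute_after_setup); both are hypothetical.
    """
    setup, rate = plan
    return max(0, horizon_minutes - setup) * rate


PLANS = {
    "eat_candy_now": (0, 1),            # no setup, modest rate
    "conquer_world_first": (60, 1000),  # long setup, enormous rate afterwards
}


def best_plan(horizon_minutes: int) -> str:
    return max(PLANS, key=lambda name: candy_eaten(PLANS[name], horizon_minutes))


myopic_choice = best_plan(10)      # the child who only cares about ten minutes
patient_choice = best_plan(120)    # the same preferences with a longer horizon
```

The power-seeking plan only wins once the horizon is longer than its setup time, so in this toy case the strength of the convergence pressure depends on the horizon, not only on the goal.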

[-]Jonas Hallgren5d-10

Well, I don't have a good answer but I also do have some questions in this direction that I will just pose here.

Why can't we have the utility function be some sort of lexicographical satisficer of sub parts of itself, why do we have to make the utility function consequentialist?

Standard answer: Because of instrumental convergence, duh.

Me: Okay but why would instrumental convergence select for utility functions that are consequentialist?

Standard answer: Because they obviously outperform the ones that don't select for the consequences or like what do you mean?

Me: Fair but how do you define your optimisation landscape, through what type of decision theory are you looking at this from? Why is there not a universe where your decision theory is predicated on virtues or your optimisation function is defined over sets of processes that you see in the world?

Answer (maybe)?: Because this would go against things like Newcomb's problem or other decision theory problems that we have.

Me: And why does this matter? What if we viewed this through something like process philosophy and we only cared about the processes that we set in motion in the world? Why isn't this an equally valid way of setting up the utility function? Similar to how Euclidean geometry is as valid as hyperbolic geometry, or one logic system as another?

So, that was a debate with myself? Happy to hear anyone's thoughts here.
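For concreteness, one possible reading of a "lexicographical satisficer of sub parts" is sketched below: criteria are checked in priority order, and each one only has to be met to a threshold rather than maximised. The tie-handling convention, thresholds, and option fields are assumptions for illustration.

```python
def lexicographic_satisficer(options, criteria):
    """Filter options by criteria in priority order.

    criteria: list of (score_fn, threshold), most important first.
    Returns every option that survives; nothing is maximised.
    """
    candidates = list(options)
    for score, threshold in criteria:
        good_enough = [o for o in candidates if score(o) >= threshold]
        if good_enough:
            candidates = good_enough
        # Assumed convention: if nothing meets this threshold, skip the criterion.
    return candidates


# Hypothetical options scored on two sub-parts.
options = [
    {"name": "a", "safety": 0.9, "usefulness": 0.2},
    {"name": "b", "safety": 0.8, "usefulness": 0.7},
    {"name": "c", "safety": 0.3, "usefulness": 0.99},
]
criteria = [
    (lambda o: o["safety"], 0.75),     # first priority
    (lambda o: o["usefulness"], 0.5),  # only considered among safe-enough options
]
acceptable = [o["name"] for o in lexicographic_satisficer(options, criteria)]
```

Option "c" scores highest on usefulness but never gets considered, because no amount of a lower-priority quantity is traded against a higher-priority threshold.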

[-]satchlj5d10

This doesn't make complete sense to me, but you are going down a line of thought I recognize.

There are certainly stable utility functions which, while having some drawbacks, don't result in dangerous behavior from superintelligences. Finding a good one doesn't seem all that difficult.

The real nasty challenge is how to build a superintelligence that has the utility function we want it to have. If we could do this, then we could start by choosing an extremely conservative utility function and slowly and cautiously iterate towards a balance of safe and useful.

[-]StanislavKrym6d-1-7

I'm less and less convinced that we should expect to see AIs that are close to pure consequentialists

There was a case when ChatGPT preferred not to violate the taboo on racial slurs, even though in the hypothetical situation it meant killing millions of people. In a re-run of the experiment ChatGPT decided to use the slur, but it also remarked that the use is a complex ethical dilemma. How can one check whether the AI will prefer not to violate the taboo on colonialism? By placing it into a simbox which also contains analogues of peoples that are easy to take over?

P.S. I doubt that a non-neuromorphic AI is even able to take over the world and run it, since running the world's entire energy generation might require too much intellectual work for the AI to do by itself. There was a post claiming that even a neuromorphic AI is unlikely to become much more efficient than the brain.

[-]Davidmanheim6d30

Saying AI won't be more efficient is obviously falsified for narrow tasks like adding numbers. For general tasks like writing short stories, as LLMs currently do, the brain uses about 20 W (20 Wh per hour), and that much energy buys about 30k tokens from GPT4o, i.e. the task is done far more efficiently than by a human.

And more generally, the argument that AI can't be more efficient than the brain seems to follow exactly the same structure as the claim that AI can't be smarter than humans, or the impossibility result here.

You should read the comments to that post.

[-]StanislavKrym6d-1-2

The AI is also much less efficient at other tasks, like the example of Claude playing Pokemon or the ones tested by ARC-AGI. I wonder how hard it will be to perform the tasks necessary in the energy industry using an as-cheap-as-possible AI if the current model o3 faces problems like requiring thousands of kWh per task in the high tune. In 2023 the world generated only about 30 billion MWh (30 billion thousand kWh). But this is rather off-topic. What can be said about AI violating taboos?

P.S. Neural networks like human brains or the AI learn from data. A human is unlikely to read more than 240 words a minute. Devoting 8 hours a day to reading, a human won't have read more than 5 billion words in 100 years.
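The reading estimate can be checked directly from the figures in the comment (240 words a minute, 8 hours a day, 100 years); the 365-day year is my simplification.

```python
WORDS_PER_MINUTE = 240
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365  # ignoring leap days
YEARS = 100

words_per_day = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY
lifetime_words = words_per_day * DAYS_PER_YEAR * YEARS

print(f"{lifetime_words:,} words")  # about 4.2 billion, under the 5 billion bound
```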

[-]Davidmanheim5d20

My response was about your original PS, which was about this, not taboos.

I think the arguments you made there, and here, are confused, mixing up unrelated claims. The idea that some tasks will necessarily remain harder for AI than humans in the future is simply hopium.
