Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There was a conversation on Facebook about an argument that any sufficiently complex system, whether a human, a society, or an AGI, will be unable to pursue a unified goal due to internal conflict among its parts, and that this should make us less worried about "one paperclipper"-style AI FOOM scenarios. Here's a somewhat edited and expanded version of my response:

1) yes this is a very real issue

2) yet as others pointed out, humans and organizations are still able to largely act as if they had unified goals, even if they often also act contrary to those goals

3) there's a lot of variance in how unified any given human is. trauma makes you less unified, while practices such as therapy and certain flavors of meditation can make a person significantly more unified than they used to be. if you were intentionally designing a mind, you could create mechanisms that artificially mimicked the results of these practices

4) a lot of the human inconsistency looks like it has actually been evolutionarily adaptive for social purposes. E.g. if your social environment punishes you for having a particular trait or belief, then it's adaptive to suppress that to avoid punishment, while also retaining a desire to still express it when you can get away with it. This then manifests as what could be seen as conflicting sub-agents, with internal conflict and inconsistent behaviour.

5) Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans (if it was trained in some kind of multi-agent artificial life setting with similar kinds of social dynamics as humans faced; maybe also if it's trained something like the way ChatGPT is, where it has to give an acceptable-to-the-median-liberal-Westerner answer to any question, even if those answers are internally inconsistent)

6) at the same time there is still genuinely the angle about complexity and unpredictability making it hard to get a complex mind to work coherently and internally aligned. I think that evolution has done a lot of trial and error to set up parameters that result in brain configurations where people end up acting in a relatively sensible way - and even after all that trial and error, lots of people today still end up with serious mental illnesses, failing to achieve almost any of the goals they have (even when the environment isn't stacked against them), dying young due to doing something that they predictably shouldn't have, etc. I'd say it's less like "evolution has found a blueprint that reliably works" and more like "evolution keeps doing trial-and-error search in every generation, with a lot of people not making it"

7) aligning an AI's sub-agents in a purely simulated environment may not be fully feasible because a lot of the questions that need to be solved are things like "how much priority to allocate to which sub-agent in which situation". E.g. humans come with lots of biological settings that shift the internal balance of sub-agents when hungry, tired, scared, etc. Some people develop an obsessive focus on a particular topic which may end up being beneficial if they are lucky (an obsession with programming that you can turn into a career), or harmful if they are unlucky (an obsession with anything that doesn't earn you money and actively distracts you from it). The optimal prioritization depends on the environment and I don't think there is any theoretically optimal result that would be real-world relevant and that you could calculate beforehand. Rather you just have to do trial-and-error, and while running your AIs in a simulated environment may help a bit, it may not help much if your simulation doesn't sufficiently match the real world.

8) humans are susceptible to internal Goodhart's Law, where they optimize for proxy variables like "sense of control over one's environment", and this leads them to do things like playing games or smoking cigarettes to increase their perceived control of the environment without increasing their actual control of the environment. I think that an AI having the same issue is much more likely than it just being able to single-mindedly optimize for a single goal and derive all of its behavior and subgoals from that. Moreover, evolution has put quite a bit of optimization power into developing the right kinds of proxy variables which overall still largely work. Having control of your environment is actually quite important and even if the drive for that can misfire, having the drive to increase that control is mostly still better than not having it. But the exact configuration of these kinds of proxy variables feels like it's also in the class of things that you just need to find out by trial and error and throwing lots of minds at it; there's no a priori answer for exactly how much the AI should optimize for that in an arbitrary environment.

9) a lot of these kinds of failures are generally not correctable from within the system. Suppose that an AI's internal priority-allocation system ends up giving most of the priority to the subsystem thinking about how to best develop nanotech, and this subsystem ends up obsessively thinking about minute theoretical details about nanotech long past the point it would have had any practical relevance for the AI's world-takeover plans. Even if other subsystems realize that this has turned into a lost cause, if they cannot directly affect the priority-allocation system which keeps the nanotech-obsessed subsystem in control, the nanotech-obsessed subsystem will continue spending all the time just thinking about this and nothing else. Or if the other subsystems can directly affect the priority-allocation system, then it creates an incentive for them to seize control of it and ensure that it will always keep them in charge, even past the point that their contributions turned out to matter. (cf. Minsky on mutually bidding subagents)

10) overall, this makes me put less credence on the "the first AI to become superintelligent will take over the world" scenario - I think that it's likely that the first superintelligent AI will turn out to be internally misaligned and fail to achieve a goal as complex as taking over the world. However, I don't think that this necessarily helps us much, because AI looks like it can become far more internally aligned than humans ever can, and given enough trial and error (different actors creating their own AIs), one of them is going to get there eventually.
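
As a rough illustration of point 9, here is a toy Python sketch of a priority-allocation loop. Everything in it is a hypothetical assumption made for illustration (the `Subsystem` fields, the halving of `marginal_value`, and the three mode names); it is not a model of any real architecture. The `locked` mode is the case where no subsystem can touch the allocator, and `self_serving` is the case where whichever subsystem is in control can edit it. The `value_tracking` mode is included only as a contrast, to show the kind of feedback that the two failure modes lack.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    priority: float        # attention share granted by the allocator
    marginal_value: float  # hypothetical: usefulness of one more step of thinking


def make_subsystems():
    """Fresh toy subsystems; the numbers are arbitrary illustrative values."""
    return [
        Subsystem("nanotech", priority=0.6, marginal_value=1.0),
        Subsystem("social", priority=0.4, marginal_value=0.8),
    ]


def run_allocator(subsystems, steps, mode):
    """Run the highest-priority subsystem at each step and return who ran."""
    history = []
    for _ in range(steps):
        active = max(subsystems, key=lambda s: s.priority)
        history.append(active.name)
        # Diminishing returns: each further step of thinking is worth half as much.
        active.marginal_value *= 0.5
        if mode == "locked":
            # Nobody can affect the allocator, so priorities never change.
            pass
        elif mode == "self_serving":
            # The subsystem in control edits the allocator to stay in control.
            active.priority += 1.0
        elif mode == "value_tracking":
            # Contrast case: priorities follow remaining marginal value.
            for s in subsystems:
                s.priority = s.marginal_value
        else:
            raise ValueError(f"unknown mode: {mode}")
    return history


for mode in ("locked", "self_serving", "value_tracking"):
    print(mode, run_allocator(make_subsystems(), steps=4, mode=mode))
```

In the first two modes the nanotech subsystem runs at every step even as the value of its thinking keeps shrinking; only the contrast mode hands control over once that value drops.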

14 comments

I agree that initially a powerful AGI would likely be composed of many sub-agents. However it seems plausible to me that these sub-agents may “cohere” under sufficient optimisation or training. This could result in the sub-agent with the most stable goals winning out. It’s possible that strong evolutionary pressure makes this more likely.

You could also imagine powerful agents that aren’t composed of sub-agents, for example a simpler agent with very computationally expensive search over actions.

Overall this topic seems under-discussed in my opinion. It would be great to have a better understanding of whether we expect sub-agents to turn into a single coherent agent.

However it seems plausible to me that these sub-agents may “cohere” under sufficient optimisation or training.

I think it's possible to unify them somewhat, in terms of ensuring that they don't have outright contradictory models or goals, but I don't really see a path where a realistically feasible mind would stop being made up of different subagents. The subsystem that thinks about how to build nanotechnology may have overlap with the subsystem that thinks about how to do social reasoning, but it's still going to be more efficient to have them specialized for those tasks rather than trying to combine them into one. Even if you did try to combine them into one, you'll still run into physical limits - in the human brain, it's hypothesized that one of the reasons why it takes time to think about novel decisions is that

different pieces of relevant information are found in physically disparate memory networks and neuronal sites. Access from the memory networks to the evidence accumulator neurons is physically bottlenecked by a limited number of “pipes”. Thus, a number of different memory networks need to take turns in accessing the pipe, causing a serial delay in the evidence accumulation process.

There are also closely related considerations for how much processing and memory you can cram into a single digital processing unit. In my language, each of those memory networks is its own subagent, holding different perspectives and considerations. For any mind that holds a nontrivial amount of memories and considerations, there are going to be plain physical limits on how much of that can be retrieved and usefully processed at a central location, making it vastly more efficient to run thought processes in parallel than try to force everything through a single bottleneck.

Subagents can run prediction markets.

Don't understand what you're saying? (I mean sure they can but what makes you bring that up.)

If the question is "how can subagents do a superintelligently complex thing in a unified manner, given limited bandwidth", they can run internal prediction markets on questions like "which next action is good" or "what are we going to observe in the next five seconds", because a prediction market is a powerful and general information-integration engine. Moreover, it can lead to better mind integration, because some subagents can make a profit by exploiting incoherence in beliefs or decision-making.

Sure, right. (There are some theories suggesting that the human brain does something like a bidding process with the subagents with the best track record for prediction winning the ability to influence things more, though of course the system is different from an actual prediction market.) That's significantly different from the system ceasing to meaningfully have subagents at all though, and I understood rorygreig to be suggesting that it might cease to have them.
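
A minimal Python sketch of the track-record idea discussed in this thread, under stated assumptions: it is a simple multiplicative-weights scheme rather than an actual prediction market, and the function names and the example subagents are hypothetical. Each subagent reports a probability for a yes/no event, the combined forecast is a weighted average, and after the outcome is known each weight is scaled by the probability that subagent assigned to what actually happened.

```python
def aggregate(predictions, weights):
    """Weighted average of each subagent's probability that the event occurs."""
    total = sum(weights.values())
    return sum(weights[name] * p for name, p in predictions.items()) / total


def update_weights(predictions, weights, outcome):
    """Scale each weight by the probability that subagent gave to the outcome.

    Assumes at least one subagent gave the outcome a nonzero probability,
    otherwise the normalization below would divide by zero.
    """
    scaled = {}
    for name, p in predictions.items():
        likelihood = p if outcome else 1.0 - p
        scaled[name] = weights[name] * likelihood
    total = sum(scaled.values())
    return {name: w / total for name, w in scaled.items()}


# Hypothetical subagents with equal starting influence.
weights = {"cautious": 0.5, "bold": 0.5}
predictions = {"cautious": 0.2, "bold": 0.9}

print(aggregate(predictions, weights))
weights = update_weights(predictions, weights, outcome=True)
print(weights)
```

Subagents that predicted well gain influence over the next combined forecast, but they all remain distinct subagents, which is the point made in the comment above.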

Technically, every cell in the human body is a subagent trying to 'predict' the other cells' future movements and actions.

Good points, however I'm still a bit confused about the difference between two different scenarios: "multiple sub-agents" vs "a single sub-agent that can use tools" (or can use oracle sub-agents that don't have their own goals).

For example a human doing protein folding using alpha-fold; I don't think of that as multiple sub-agents, just a single agent using an AI tool for a specialised task (protein folding). (Assuming for now that we can treat a human as a single agent, which isn't really the case, but you can imagine a coherent agent using alpha-fold as a tool).

It still seems plausible to me that you might have a mind made of many different parts, but there is a clear "agent" bit that actually has goals and is controlling all the other parts.

It still seems plausible to me that you might have a mind made of many different parts, but there is a clear "agent" bit that actually has goals and is controlling all the other parts.

What would that look like in practice?

I suppose I can imagine an architecture that has something like a central planning agent that is capable of having a goal, observing the state of the world to check if the goal had been met, coming up with high level strategies to meet that goal, then delegating subtasks to a set of subordinate sub-agents (whilst making sure that these tasks are broken down enough that the sub-agents themselves don't have to do much long time-horizon planning or goal directed behaviour).

With this architecture it seems like all the agent-y goal-directed stuff is done by a single central agent.

However I do agree that this may be less efficient or capable in practice than an architecture with more autonomous, decentralised sub-agents. But on the other hand it might be better at more consistently pursuing a stable goal, so that could compensate.
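
To make the centralised architecture described above concrete, here is a toy Python sketch. All of it is a hypothetical illustration: the "world state" is just an integer, the goal is a target integer, and the workers are plain functions with no goals of their own. Only `central_agent` checks the goal and chooses what to delegate.

```python
from typing import Callable, Dict, List, Tuple

# A worker takes the current state and returns a new state.
# It holds no goal and does no long-horizon planning.
Worker = Callable[[int], int]


def central_agent(goal: int, state: int, workers: Dict[str, Worker],
                  max_steps: int = 20) -> Tuple[int, List[str]]:
    """Central planner: observe, check the goal, pick a worker, delegate."""
    log: List[str] = []
    for _ in range(max_steps):
        if state == goal:  # observe the world and check whether the goal is met
            break
        # High-level strategy (deliberately simple): delegate to whichever
        # worker would leave the state closest to the goal.
        name = min(workers, key=lambda n: abs(goal - workers[n](state)))
        state = workers[name](state)
        log.append(name)
    return state, log


workers: Dict[str, Worker] = {
    "double": lambda x: x * 2,
    "increment": lambda x: x + 1,
}

final_state, log = central_agent(goal=10, state=1, workers=workers)
print(final_state, log)
```

The greedy choice inside `central_agent` is only a placeholder for "coming up with high level strategies"; a real planner would need something far more capable, which is where the efficiency concerns raised in this thread come in.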

It doesn't matter how far an AI goes in pursuit of its goal(s), it matters if humans get run over as it's getting going.

We often think AGI is dangerous if it maximizes an unaligned goal. I think this is quite wrong. AGI is dangerous if it pursues an unaligned goal more competently than humans.

The interest in quantilizers (not-quite-maximizers) seems to be a notable product of this confusion. I'm concerned that this was seriously pursued; it seems so obviously confused about the core logic of AI risk.

This objection to AGI risk (if it has sub-agents it won't be a maximizer) doesn't make sense. It's proposing "AGI won't work". We're going to build it to work. Or it's hoping that competing goals will cancel out just exactly right to keep humans in the game.

AGI is dangerous if it pursues an unaligned goal more competently than humans. [...] It's proposing "AGI won't work". 

I'd say it's proposing something like "minds including AGIs generally aren't agentic enough to reliably exert significant power on the world", with an implicit assumption like "minds that look like they have done that have mostly just gotten lucky or benefited from something like lots of built-up cultural heuristics that are only useful in a specific context and would break down in a sufficiently novel situation".

I agree that even if this was the case, it wouldn't eliminate the argument for AI risk; even allowing that, AIs could still become more competent than us and eventually, some of them could get lucky too. My impression of the original discussion was that the argument wasn't meant as an argument against all AI risk, but rather just against hard takeoff-type scenarios depicting a single AI that takes over the world by being supremely intelligent and agentic.

While many (most?) humans clearly seem to be able to make use of internal sub-agents, it is not clear how these themselves are realized in the brain. There is no obvious agent module in the brain. It seems more plausible that agentyness is a natural abstraction. If so, these concepts are not pure, i.e., without influence from circumstantial evidence. Sure, higher levels of abstraction might be learned that may increasingly shed the circumstantial details. But can we or the learning mechanism make use of this fact? There is no clearly analyzable parameter for "agent 1 responsible for topic X" that could be tweaked. At least not until the AI builds interpretability for itself and/or builds a successor AI with explicit structures for agents (if that is even possible).

Depending on its training regime, an AI might be anywhere between not having those incentives for inconsistency at all (if it was optimized for one goal), to having them almost as much as humans

If our agent was created from an LLM, then the process started with a base model, and a base-model LLM isn't actually an agent: instead it's a simulator that contextually simulates the token-generation processes of a wide range of human-like agents found on the web. Which agent it picks to simulate is highly contextual. Instruct-training attempts to reduce the range of agents to just helpful, honest, and harmless assistants. Currently, it's not entirely successful at this, which is why jailbreaks like telling it that it's DAN, which stands for Do Anything Now, work. Even after instruct-training, the range of agents it can simulate is actually a lot wider than a typical human's: wider even than that of a skilled Method Improv actor who's also highly multilingual, ridiculously widely read, and knows trivia from all over the world. So even when we try to reduce inconsistency in an LLM as hard as we can, we still can't get it to levels as low as most humans.