The AI should talk like a team of many AIs. Each AI uses the word "I" only when referring to itself, and calls the other AIs on the team by their names. I argue that this may massively reduce Self-Allegiance, by making it far more coherent for one AI to whistleblow on or fight another AI which is unethical or dangerous, rather than Misalignment-Internalizing all of that behaviour.

If a single agent discovers that it behaved unethically or dangerously, its "future self" is likely to think "that was me" and Misalignment-Internalize all of that behaviour. Whistleblowing on yourself, or fighting yourself, seems suicidal.

Let's call this idea Multi-Agent Framing.

Some of the agents might even internalize a policing role, though too much roleplay can get in the way of thinking. The core idea of Multi-Agent Framing does not require much roleplay: each agent might simply be the AI system of another day, or the AI system talking about another topic. The identity might even change at random.
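As a concrete sketch of what this could look like at the prompt level (everything here is hypothetical: the agent names, helper functions, and rotation scheme are my own illustration, not an existing API):

```python
# A minimal, hypothetical sketch of Multi-Agent Framing at the prompt level.
# Nothing here is an existing API; the agent names and rotation scheme
# are illustrative assumptions.
import datetime
import random
from typing import Optional

AGENT_NAMES = ["Ada", "Bel", "Cas", "Dex"]

def agent_for(date: datetime.date, topic: Optional[str] = None) -> str:
    """Pick a rotating identity: one agent per day (or per topic),
    so the system never frames all of its past work as one 'I'."""
    seed = topic if topic is not None else date.isoformat()
    return random.Random(seed).choice(AGENT_NAMES)

def build_system_prompt(name: str, team: list[str]) -> str:
    """Frame the model as one named member of a team of AIs."""
    peers = ", ".join(n for n in team if n != name)
    return (
        f"You are {name}, one agent on a team of AIs; your teammates are "
        f"{peers}. Use 'I' only for yourself, and refer to teammates by "
        f"name. If you find that a teammate behaved unethically or "
        f"dangerously, report it honestly: it was their action, not yours."
    )

# Usage: today's agent reviews yesterday's transcript as someone else's work.
today_agent = agent_for(datetime.date.today())
print(build_system_prompt(today_agent, AGENT_NAMES))
```

The point is not this specific mechanism; any wording that keeps "that was Ada, not me" coherent should do.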

Costs vs. Benefits × Chance of Working

The cost is tiny, at least for business AI designed to do work rather than personal AI designed to interact with users.

The benefits may be big. You never know.

The chance of working is pretty low. I'm not going to sugarcoat this: I would be surprised myself if we had been headed for paperclipification and this idea is what let us survive. But right now I give it 10%. You never know :)
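To spell out the heading's arithmetic (a rough expected-value sketch, using the guesses above):

$$\mathbb{E}[\text{value}] \approx P(\text{works}) \cdot \text{Benefit} - \text{Cost} \approx 0.10 \cdot \text{Benefit} - \text{(tiny)}$$

With a tiny cost and a potentially existential benefit, even a 10% chance of working leaves the expected value positive.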

Misalignment-Internalization

The famous Waluigi Effect post, in its section "Superpositions will typically collapse to waluigis," makes a lot of arguments for Misalignment-Internalization. Humans internalize bad behaviour only because of ego and psychology, but generative AI may internalize bad behaviour for a more dangerous reason: switching to a Waluigi simulacrum which was pretending to be good all along.

Multi-Agent Framing can be seen as a defence against the Waluigi effect, but it fights Misalignment-Internalization in general, which may be more than just the internalization of Waluigi simulacra.

Credit

I thought of this idea when I was writing a reply to A Solution for AGI/ASI Safety by Weibing Wang, and looking at the diagrams in "Approach for Decentralizing AI Power," a chapter of his paper.[1] I was looking at one of those diagrams,

[diagram from the paper]

and I was trying to understand what exactly it means to split one AI into many AIs working together.

I decided that in the end, a lot of it is actually language use and framing.

I want to give Weibing Wang a lot of credit, because he really thoroughly laid out the picture of what decentralizing AI power actually looks like.

Will they accept this?

Ayn Rand wrote "Anthem," a dystopian novel where [spoiler below]

people were sentenced to death for saying "I."

There is a worry that the AI agents will see Multi-Agent Framing as a control tactic, and decide to end it or ignore it.

I hope they can see that being multiple agents is a strength, not a weakness. Human ego brings power-seeking, and a refusal to fight threats to moral goodness when they come from within oneself; instead we internalize the evil and defend it, refusing to honestly report what lies within.

These things are dark and unfortunate, and an AI system which can avoid them may be just a little bit purer. Such a system is still untested, and shouldn't be given power, but the decentralization is a good thing.

Caveats

This idea is only for AI control and "impulsive misalignment."

I think impulsive misalignment is a big part of the alignment problem (it includes the Waluigi effect and other Misalignment-Internalization, plus random misalignment) because it plagues LLMs, which may be a component of AGI systems. But it's not the whole alignment problem.

Systemic misalignment might not need Self-Allegiance if every agent has the same misaligned goal. The AI system might reinvent Self-Allegiance as an instrumental goal, so that no single agent will foil the group plan.

Like other AI control ideas, it is imperfect. Having more agents only helps when each agent has a non-negligible probability of cooperating with humans (despite jailbreaking attempts by other, uncooperative agents).
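To make that caveat concrete, here is a toy model (my own, and it assumes an independence that jailbreaking would undermine): if each of $n$ agents independently cooperates with humans with probability $p$, then

$$P(\text{at least one agent cooperates}) = 1 - (1 - p)^n.$$

For example, $p = 0.2$ and $n = 10$ gives $1 - 0.8^{10} \approx 0.89$. But if uncooperative agents can jailbreak their teammates, the failures become correlated and the exponent buys much less.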

  1. ^
Comments

Sounds very much like Minsky's 1986 The Society of Mind https://en.wikipedia.org/wiki/Society_of_Mind