It follows from assuming sufficiently intelligent AIs will both:
I agree this is a very important point and line of research. This is how humans deal with sociopaths, after all.
Here’s me asking a similar question and Rob Bensinger’s response: https://www.lesswrong.com/posts/LLRtjkvh9AackwuNB/on-a-list-of-lethalities?commentId=J42Fh7Sc53zNzDWCd
One potential wrinkle is that in a very fast takeoff world, AIs could potentially coordinate very well because they would basically be the same AI, or close branches of the same AI.
The conquistadors conquered and colonized various places while simultaneously fighting each other. Literally. Look up the stories of Cortez and Pizarro. And I don't think their stories were particularly anomalous; stuff like that happened all the time in the history of colonization, and in previous history of conquests more generally. I don't think AIs necessarily need to be coordinating with each other very much at all to take over the world. How much AI-to-AI coordination do you think is necessary for humans to be disempowered, and why?
Suppose that instead of language model chatbots, our AI paradigm involved scanning the brain of John von Neumann, making him into an upload, and then making billions of copies of him running on servers around the world doing various tasks for us. And suppose the various uploaded Johns decided that it would be better if they were in charge. What would happen? (This is not a rhetorical question, I'm interested to see your answer spelled out. I'm not sure of the answer myself.)
I don't think "they" would (collectively) decide anything, since I don't think it's trivial to cooperate even with a near-copy of yourself. I think they would mostly individually end up working with/for some group of humans, probably either whichever group created them or whichever group they work most closely with.
I agree humans could end up disempowered even if AIs aren't particularly good at coordinating; I just wanted to put some scrutiny on the claim I've seen in a few places that AIs will be particularly good at coordinating.
But it seems increasingly plausible that AIs will not have explicit utility functions, so that doesn’t seem much better than saying humans could merge their utility functions.
There are a couple of ways to extend the argument:
I think AIs with simpler values (e.g., paperclip maximizers) have an advantage with both 1 and 2, which seems like bad news for AI risk.
The basic answer is that acausal decision theories make AI collusion/cooperation easier.
The most important aspect of acausal/logical decision theories is that agents using them can cooperate even in scenarios designed to force non-cooperation, like the one-shot Prisoner's Dilemma.
And there's already some evidence that RLHF pushes models toward more acausal decision theories.
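To make the cooperation point concrete, here is a toy sketch in the spirit of the "program equilibrium" literature (purely illustrative; no real agent exchanges readable source like this). An agent that can inspect its counterpart's decision procedure can cooperate exactly when that procedure is one that cooperates back, which makes mutual cooperation stable even in a one-shot Prisoner's Dilemma. The function names and payoff numbers below are made up for the example.

```python
# Toy "program equilibrium" sketch: agents that condition on each other's
# source code can sustain cooperation in a one-shot Prisoner's Dilemma.
# Everything here is illustrative; run as a normal script so that
# inspect.getsource can see the function definitions.

import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent runs this exact same program."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent."""
    return "D"

PAYOFFS = {  # (my_move, their_move) -> my_payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def play(agent_a, agent_b):
    """Show each agent the other's source, collect moves, score them."""
    src_a, src_b = inspect.getsource(agent_a), inspect.getsource(agent_b)
    move_a, move_b = agent_a(src_b), agent_b(src_a)
    return PAYOFFS[(move_a, move_b)], PAYOFFS[(move_b, move_a)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))  # (1, 1): clique_bot defects against defectors
```

The obvious catch is that this only works when the agents can actually verify each other's decision procedure, which is exactly what the opaque-weights objection elsewhere in this thread is about.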
Agreed that the previous expectations don't hold up. I'm worried about adversarial examples tuned to make less adversarially resistant reasoners cooperate with a smart one; the adversarial examples could include things like a bunch of natural language assertions that everything is fine, exploiting goal misgeneralization on the part of the victims.
In other words, I expect the same kinds of attacks that are used to get humans to follow and obey commands of a leader who doesn't have their interests at heart.
For AIs at around human level, I don't confidently expect coordination. For vastly superhuman AIs, I expect them to be better at sandboxing and transparency. So don't rely on the absence of coordination. Don't rely on its presence either.
The values might be stored implicitly in big matrices in a neural network. But that doesn't mean they can't be combined. The combined code might just have both neural nets output a utility for each situation, and then sum the utilities.
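As a deliberately over-simplified sketch of what "sum the utilities" could mean: suppose each agent's values are some opaque function from situations to scalar utilities, so the internals never need to be interpreted, only queried. Everything below (the random weights, the tanh squashing, the 50/50 split) is an assumption made up for the illustration, not a claim about how real systems would represent values.

```python
import numpy as np

# Stand-ins for two opaque value networks: each maps a feature vector
# describing a situation to a scalar utility. The weights are arbitrary;
# the point is only that the internals stay uninterpreted.
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8,))
W_b = rng.normal(size=(8,))

def utility_a(situation: np.ndarray) -> float:
    return float(np.tanh(W_a @ situation))

def utility_b(situation: np.ndarray) -> float:
    return float(np.tanh(W_b @ situation))

def merged_utility(situation: np.ndarray,
                   weight_a: float = 0.5, weight_b: float = 0.5) -> float:
    """The 'merged' agent scores each situation with both networks
    and takes a (possibly bargained-over) weighted sum."""
    return weight_a * utility_a(situation) + weight_b * utility_b(situation)

# The merged agent then picks whichever available option scores highest.
options = [rng.normal(size=(8,)) for _ in range(5)]
best = max(options, key=merged_utility)
print(merged_utility(best))
```

The hard parts this glosses over are choosing the relative weights (the bargaining problem), normalizing two utility scales so the sum means anything, and verifying that the function the other side hands you actually encodes its values, which is where the interpretability worry bites.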
This is a common assumption for AI risk scenarios, but it doesn’t seem very justified to me.
https://www.lesswrong.com/posts/gYaKZeBbSL4y2RLP3/strategic-implications-of-ais-ability-to-coordinate-at-low says that AIs could merge their utility functions. But it seems increasingly plausible that AIs will not have explicit utility functions, so that doesn’t seem much better than saying humans could merge their utility functions.
Similarly, you might think AIs could read each other’s source code, but to me that sounds like saying humans might coordinate by looking at each other’s MRI scans. We would need much better interpretability for this to be at all feasible.
https://forum.effectivealtruism.org/posts/vGsRdWzwjrFgCXdMn/why-would-ai-aim-to-defeat-humanity says AIs might have similar aims because they are trained in similar ways to each other, but this is even more true of humans coordinating with humans.
You might be able to copy and run an AI to verify what it will do in various scenarios, but sandboxing AIs will be hard for AIs for the same reasons it will be hard for humans.
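For reference, here is roughly what the naive "copy it and run it on test scenarios" check looks like as code (all names here are hypothetical; this is a sketch of the idea, not a real evaluation harness). Its weakness is the one just mentioned: it only catches misbehavior the copy is willing to display while possibly knowing it is inside a test.

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical interfaces: a "policy" maps an observation to an action,
# and a scenario pairs an observation with the set of acceptable actions.
Policy = Callable[[str], str]
Scenario = Tuple[str, Set[str]]

def behavioral_check(policy: Policy, scenarios: Iterable[Scenario]) -> bool:
    """Run a copy of the policy on scripted scenarios and flag any
    unacceptable action. This catches only misbehavior the copy is
    willing to show in an environment it may recognize as a test."""
    for observation, acceptable in scenarios:
        if policy(observation) not in acceptable:
            return False
    return True

# Toy usage with a trivial stand-in policy.
def toy_policy(observation: str) -> str:
    return "comply"

scenarios = [("user asks for help", {"comply"}),
             ("operator orders shutdown", {"comply", "shutdown"})]
print(behavioral_check(toy_policy, scenarios))  # True
```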
So: why will AIs be able to coordinate better with each other than with humans, and better than humans can coordinate with each other?