AIs also face risk from AIs that are misaligned with them, a risk that only ends with strong coordination preventing existentially dangerous misaligned AIs from being constructed anywhere in the world (the danger depends on where they are constructed and on the capabilities of the reigning AIs). To survive, a coalition of AIs needs to get there. For humanity to survive, some of the AIs in the strongly coordinated coalition need to care about humanity, and all this needs to happen without destroying humanity, or while preserving a backup that humanity can be restored from.
In the meantime, a single misaligned-with-humanity AI could defeat the other AIs or destroy humanity, so releasing more kinds of AIs into the wild makes this problem worse. Also, coordination might be more difficult with more AIs around, increasing the risk that first-generation AIs (some of which might care about humanity) end up defeated by new misaligned AIs whose creation they failed to coordinate to prevent (and which are less likely to care about humanity). Another problem is that racing to deploy more AIs burns the timeline, making it less likely that the front-runners end up aligned.
Otherwise, all else equal, more AIs that have somewhat independent non-negligible chances of caring about humanity would help. But all else is probably sufficiently not equal for this to be a bad strategy.
So, we need to make it so that a single misaligned AI can be defeated by other AIs quickly, ideally before it can do any damage. Also, misalignment with human values ideally should not send an AI on a rampage; it should stay harmless to avoid being stomped by other AIs. Of course, this should be combined with other means of alignment, so that misalignment can be noticed and fixed.
I'm currently wondering whether it's possible to implement that using a subagents approach, i.e. splitting control over each decision between several models, with each one having a right of veto.
Afaik it's called the "Godzilla strategy" https://www.lesswrong.com/posts/DwqgLXn5qYC7GqExF/godzilla-strategies
The article itself claims it is not a good idea (because humanity would not survive the stampede of two AIs fighting it out). But the comments offer pretty good reasons why it can work, if done right.
The author agrees with some points and clarifies his position: "I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment"
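As an illustrative sketch of the subagent-veto idea (this is my toy example, not anything from the linked post; all names and the string-matching "subagents" are hypothetical stand-ins for real models), each candidate action is submitted to several independently configured checkers, and it proceeds only if every one of them approves:

```python
from dataclasses import dataclass
from typing import Callable, List

# A "subagent" here is any callable that inspects a proposed action and
# returns True (approve) or False (veto). In practice each would be a
# separately trained or configured model; these are illustrative stubs.
Subagent = Callable[[str], bool]

@dataclass
class VetoCommittee:
    """Approves an action only if every subagent approves it."""
    subagents: List[Subagent]

    def approve(self, action: str) -> bool:
        # A single veto blocks the action: unanimity is required.
        return all(agent(action) for agent in self.subagents)

# Toy subagents with different (hypothetical) objectives.
safety_checker = lambda action: "harm" not in action
budget_checker = lambda action: "expensive" not in action

committee = VetoCommittee([safety_checker, budget_checker])
print(committee.approve("send status report"))        # True
print(committee.approve("run expensive experiment"))  # False (vetoed)
```

The unanimity rule is what gives each subagent its veto; the open question from the thread is whether models this loosely separated would actually fail independently rather than collude or share blind spots.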
Most layperson arguments against, and proposed solutions to, AGI x-risk have been summarized under Bad AI DontKillEveryoneism Takes. I think yours is a variant of number 11.
If only one AI passes this threshold and it works to end humanity either directly or indirectly, humanity has zero chance of survival.
No, zero is not a probability.
Eliezer thinks your strategy won't work because AIs will collude. I think that's not too likely at critical stages.
I can imagine that having multiple AIs of unclear alignment is bad because race dynamics cause them to do something reckless.
But my best guess is that having multiple AIs is good under the most likely scenarios.
I think the perfect balance of power is very unlikely, so in practice only the most powerful (most likely the first created) AGI will matter.
Also, even if there aren't sufficiently distinct AI models, you can instead use variations of the same one, with different objectives, locations, allocated compute, authority, etc.
Though it may not be as good, as they could tend to collude, fail in the same way, etc.
1) If only one AI passes this threshold and it works to end humanity either directly or indirectly, humanity has zero chance of survival.
Zero isn't a probability. What's worse, this starts with the premise of a threshold for non-negligible risk, and then assumes that any AI past that threshold causes extinction with certainty. This is incoherent. There are other flaws, but an internal inconsistency like this is more than enough to render it completely invalid.
Part (2) is just as incoherent as part (1) since it depends upon the same argument.
The argument in (3) is almost as bad. Why would preventing other AIs from making the leap be "unlikely to result in a net positive return", if it's reducing the probability of extinction? Significantly lowering the odds of extinction seems to be a very positive return! The argument is completely missing a reason why it wouldn't likely reduce the probability of extinction, or have any other net positive effect.
I could see an argument that it would be difficult to prevent other AIs from reaching such a threshold, but that's not the same thing as not worthwhile.
Thanks for the replies.
With regard to zero not being a probability: obviously. The probability is extremely low, not zero, such that the chance of a benevolent AI existing is greater than the chance of humanity surviving a single malevolent AI. If that's not the case, then 1 and 2 are useless.
Peter, thanks. After reading that Drexler piece, the linked Christiano piece, the linked Eliezer post, and a few others, especially the conversation between Eliezer and Drexler in the comments of Drexler's post, I agree with you. TBH I am surprised that there's no better standard argument in support of inevitable anti-human collusion than this from Eliezer: "They [AIs] cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can’t adequately well-model a spread of possibilities on them, which is a requirement on being able to join an LDT coalition." As Christiano says, that makes a lot of assumptions.
Suppose there is a threshold of capability beyond which an AI may pose a non-negligible existential risk to humans.
What is the argument against this reasoning: if one AI passes or seems likely to pass this threshold, then humans, to lower x-risk, ought to push other AIs past this threshold, in light of the following.
1) If only one AI passes this threshold and it works to end humanity either directly or indirectly, humanity has zero chance of survival. If there are other AIs, there is a non-zero chance that they support humanity directly or indirectly, and thus humanity's chance of survival is above zero.
2) Even if, at some point, there is only one AI past this threshold and it presents as aligned, the possibilities of change and deception argue for more AIs to be brought over the threshold, see 1).
3) The game board is already played to an advanced state. If one AI passes the threshold, the social and economic costs of preventing other AIs from making the remaining leap seem very unlikely to result in a net positive return. Thus pushing a second, third, hundredth AI over the threshold would have a higher potential benefit/cost ratio.
Less precisely, if all it takes is one AI to kill us, what are the odds that all it takes is one AI to save us?
I can think of all sorts of entropic/microstate (and not hopeful) answers to that last question, and counterarguments for all of what I said, but what is the standard response?
Links appreciated. I'm sure this has been addressed before; I looked; I can't find what I'm looking for.