In those terms, what we're suggesting is that, in the vision of the future we sketch, the same sorts of solutions might be useful for preventing both AI takeover and human takeover. Even if an AI has misaligned goals, coordination, mutually assured destruction, and other "human alignment" solutions could be effective in stymieing it, so long as the AI isn't significantly more capable than its human-run adversaries.
Re your second critique: why do you think an AI system (without superhuman long-term planning ability) would be more likely to take over the world this way than an actor controlled by humans (augmented with short-term AI systems) who have long-term goals that would be instrumentally served by world domination?
I'm confused about your first critique. You say the agent has a goal of generating a long-term plan which leads to as much long-term profit as possible; why do you call this a short-term goal, rather than a long-term goal? Do you mean that the agent only takes actions over a short period of time? That's true in some sense in your example, but I would still characterize this as a long-term goal because success (maximizing profit) is determined by long-term results (which depend on the long-term dynamics of a complex system, etc.).
Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question:
To what extent will the costs of misalignment be borne by the direct users/employers of AI?
Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren't the case, then it wouldn't be a problem, for the reasons you've mentioned!
I agree that if the costs of misalignment are nearly entirely externalities, then your argument is convincing. And I have a lot of uncertainty about whether this is true. My gut intuition, though, is that employing a misaligned AI is less like "emitting CO2 into the atmosphere" and more like "employing a very misaligned human employee" or "using shoddy accounting practices" or "secretly taking sketchy shortcuts on engineering projects in order to save costs"—all of which yield serious risks for the employer, and all of which real-world companies take serious steps to avoid, even when these steps are costly (with high probability, if not in expectation) in the short term.
We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening.
I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people, and for businesses to act once misalignment hurts their bottom lines (again, unless you think misalignment can always be shoved under the rug and not hurt anyone's bottom line). There's lots of room for this to happen in the middle ground between toy environments and taking over the world (unless you expect lightning-fast takeoff, which I don't).
Thank you for the insightful comments!! I've added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz's):
I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it's very difficult to reason about these sorts of nuanced inductive biases without sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it's more likely than not that our predictions about what these inductive biases will look like are pretty off-base. That being said, here are a few specific reasons to doubt the scenario that come to mind right now:
(There's also the idea of imposing a speed prior; not sure how likely that direction is to pan out.)
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It's not enough for it to be plausible that it could happen often; it needs to happen all the time.
My main objection to this misalignment mechanism is that it requires people/businesses/etc. to ignore the very concern you are raising. I can imagine this happening for two reasons:
Which of 1 and 2 do you think is likely? And can you elaborate on why you think AIs will be so good at covering up their tampering (or why your story stands up to tampering sometimes slipping through the cracks)?
Finally, if there aren't major problems resulting from the tampering until "AI systems have permanently disempowered us", why should we expect problems to emerge afterwards, unless the AI systems are cooperating / don't care about each other's tampering?
(Am I right that this is basically the same scenario you were describing in this post? https://www.alignmentforum.org/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story)
Ah, I think I understand what you meant now. The reward for this agent is determined not by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. It still feels closer to a long-term agent to me, though; I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it gets the chance. If the predictive model is accurate enough that disempowering humans is predictably instrumentally useful, wouldn't the agent incorporate that into its earlier plans?
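To make sure we're picturing the same setup, here's a minimal toy sketch of the distinction as I understand it (all names are hypothetical, not from your post): in the first scheme the agent is graded on a world model's forecast of long-term profit for the plan it outputs, while in the second it is graded on the profit that actually materializes after the plan runs.

```python
# Toy sketch (hypothetical names) contrasting the two reward schemes discussed
# above: reward from a world model's *prediction* of long-term consequences,
# versus reward from the *actual* long-term consequences of executing the plan.

from typing import Callable, List

Plan = List[str]  # a plan is just a sequence of abstract actions here


def reward_from_predictions(plan: Plan,
                            predictor: Callable[[Plan], float]) -> float:
    """Reward is available immediately: the world model's estimate of
    long-term profit. Training never waits on real long-horizon outcomes."""
    return predictor(plan)


def reward_from_outcomes(plan: Plan,
                         execute: Callable[[Plan], float]) -> float:
    """Reward is only available after the plan has actually played out over a
    long horizon in the real environment (or a faithful simulation of it)."""
    return execute(plan)


if __name__ == "__main__":
    # Stand-in predictor and environment, purely for illustration.
    toy_predictor = lambda plan: 1.0 * len(plan)          # the model's forecast
    toy_environment = lambda plan: 0.8 * len(plan) - 0.5  # what really happens

    plan = ["expand factory", "hire staff", "launch product"]
    print(reward_from_predictions(plan, toy_predictor))   # graded on the forecast
    print(reward_from_outcomes(plan, toy_environment))    # graded on reality
```

My point above is about the first scheme: if the predictor is good enough for the agent to recognize, at execution time, that disempowering humans is useful, that same fact should already show up in the forecast it is graded on, and hence in the plan itself.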