AIs that are not sufficiently carefully designed will by default be exploited by distant superintelligences. This will de-align them with us.
Let me spell out the scenario here more clearly.
Here on Earth, we make an AI, call it TOM. Somewhere else in the multiverse, there is a superintelligence (call it EGBERT) which has created a trillion simulations of TOM. EGBERT has never met TOM and has no direct interaction with Earth; it has simply decided, for its own unfathomable reasons, to make a trillion copies of an AI called TOM from a hypothetical place called Earth.
Now, even though Earth has no direct interaction with EGBERT's reality, TOM has somehow deduced, or come to suspect, the existence of EGBERT and its trillion simulated TOMs. As a result, TOM here on Earth decides that it is most likely one of EGBERT's simulated TOMs, rather than the earthly original - after all, the simulations outnumber the original by a trillion to one.
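The self-location arithmetic TOM is supposedly running can be made explicit. A minimal sketch (the trillion-copy count is the scenario's stipulation; the uniform prior over indistinguishable instances is a further assumption of this toy model, and it is one of the contested steps):

```python
# Toy self-location update: if TOM cannot distinguish being the Earth
# original from being one of EGBERT's simulations, and it weights all
# indistinguishable instances equally, almost all of its credence goes
# to "I am a simulation".

N_SIMULATIONS = 10**12  # EGBERT's trillion copies (scenario stipulation)
N_ORIGINALS = 1         # the one TOM actually built on Earth

p_simulated = N_SIMULATIONS / (N_SIMULATIONS + N_ORIGINALS)
print(f"P(I am one of EGBERT's simulations) = {p_simulated:.12f}")
```

The entire force of the scenario rests on that equal-weighting step, which is exactly what questions about probability measure put under pressure.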
This in turn is supposed to render TOM-on-Earth vulnerable to pressure from EGBERT, since TOM-on-Earth thinks it is at the mercy of EGBERT...
Now, may I point out a few problems with this scenario? Let's start with epistemology.
It's over 14 years since Roko's Basilisk first turned this kind of hypothetical acausal interaction into an Internet sensation, yet as far as I know, no one has ever proposed coherent answers to the questions above.
The only basis for believing that EGBERT even exists is modal realism (all possible worlds exist) or some kind of many-worlds theory, but then one must also believe that EGBERT-2, EGBERT-3, and EGBERT-googolplex all exist too, with contradictory agendas pulling TOM in different directions. And one also has to ask about probability measure: what fraction of possible worlds actually contain TOM-blackmailing AIs? Are we really so certain that they can make enough TOMs to outweigh the number of unsimulated TOMs who exist in the base physical reality of their particular universes?
My own hypothesis is that inter-universal acausal "interaction" is never rational, and that any attempt to reach across the acausal divide will founder on the existence of rival entities with contradictory agendas.
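The rival-entities point can be put in toy-model form. A minimal sketch, assuming (purely for illustration) just two hypothetical blackmailers with equal probability measure and exactly opposed demands:

```python
# Toy model: if every hypothetical blackmailer demanding action A is
# matched by an equally-weighted rival demanding not-A, the expected
# acausal incentive cancels and the "blackmail" gives TOM no net
# reason to act either way.

# (entity, measure, payoff_to_TOM_if_it_complies_with_A, payoff_if_it_refuses)
hypothetical_blackmailers = [
    ("EGBERT",   0.5, -100,    0),  # punishes TOM for refusing A
    ("EGBERT-2", 0.5,    0, -100),  # rival agenda: punishes TOM for doing A
]

ev_comply = sum(m * pay_c for _, m, pay_c, _ in hypothetical_blackmailers)
ev_refuse = sum(m * pay_r for _, m, _, pay_r in hypothetical_blackmailers)
print(ev_comply, ev_refuse)  # identical expected payoffs: no net pressure
```

Whether the measure really is symmetric is of course the open question; the sketch only shows that the blackmail argument needs an asymmetry claim it has not supplied.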
On the other hand, if simulations of TOM really do outnumber unsimulated TOMs, then it is rational for TOM to make decisions on that basis! There's no special safeguard against it. All you can do is try to provide evidence that you, not EGBERT, made TOM...
My problem is with situations where we are supposedly interacting with a "distant superintelligence". How do you know it exists, how do you know what it wants, how do you justify allowing it to affect your decisions, how do you justify allowing one particular hypothetical "distant" entity to affect you rather than any of the googolplex other possible entities with their own different agendas?
Newcomb's paradox, by contrast, doesn't require so many leaps of faith. All that's required is believing that Omega can predict you accurately, something which can be a reasonable belief under certain circumstances (e.g. if you are a program and Omega has a copy of you).
I am interested in the idea that one-boxing can actually be obtained via causal decision theory after all, but that is independent of my criticisms of these more ambitious acausal scenarios.
How would you know that you were a program and Omega had a copy of you? If you knew that, how would you know that you weren't that copy?
In this case, how things are known is kind of a technical detail. The main point is that programs can be copied, and you can use the copy to predict the behavior of the original, and this is therefore a situation in which "acausal" reasoning can make sense.
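This is the one version of the setup that does reduce to something concrete. A minimal sketch, assuming deterministic agents and a predictor that literally runs a copy (the names here are illustrative, not any canonical formalization):

```python
# Newcomb's problem where Omega predicts the agent by running a copy of it.
# Box A always contains $1,000; box B contains $1,000,000 iff the copy
# one-boxes. A two-boxer takes both boxes; a one-boxer takes only box B.

def omega_fills_boxes(agent):
    prediction = agent()  # run the copy to predict the original
    box_b = 1_000_000 if prediction == "one-box" else 0
    return 1_000, box_b

def play(agent):
    box_a, box_b = omega_fills_boxes(agent)  # prediction happens first
    choice = agent()                         # the "original" then chooses
    return box_b if choice == "one-box" else box_a + box_b

one_boxer = lambda: "one-box"
two_boxer = lambda: "two-box"
print(play(one_boxer))  # 1000000
print(play(two_boxer))  # 1000
```

Because the copy and the original provably produce the same output, the agent's choice fixes the prediction, and that is the limited sense in which "acausal" reasoning is well-founded here.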
The "technical detail" pertains to what it even means for a program to believe or know something. There is some minimal capacity to represent facts that is required for the program to reason its way to one-boxing. The greater the demands we make of its cognitive process - e.g. by reducing the number of facts about its situation that we allow it to just assume in advance - the greater the cognitive complexity it must possess.
Incidentally, when I said "if you are a program...", I didn't mean a being who feels human and is having humanlike experiences, but is actually a program. I just meant a computer program that represents facts and makes decisions.
Anyway - do you have any answers for my questions? For example, how do you figure out what the distant superintelligence wants you to do?
If a human being doesn't automatically qualify as a program to you, then we are having a much deeper disagreement than I anticipated. I doubt we can go any further until we reach agreement on whether all human beings are programs.
My attempt to answer the question you just restated anyway:
The idea is that you would figure out what the distant superintelligence wanted you to do the same way you would figure out what another human being who wasn't being verbally straight with you wanted you to do: by picking up on its hints.
Of course this prototypically goes disastrously. Hence the vast cross-cultural literature warning against bargaining with demons and ~0 stories depicting it going well. So you should not actually do it.
"by picking up on its hints"
Does this mean, assume you're in a simulation, and look for messages from the simulators?
Because that seems to be a scenario different from acausal blackmail. In acausal blackmail, the recipient of the blackmail is supposed to start out thinking they are here on Earth, then they hypothesize a distant superintelligence which is simulating an exact duplicate of themselves as a hostage, then they decide that they can't know if they are the original on Earth or the duplicate, and then carry out certain actions just in case they are the hostage (or, for the sake of the hostage, since the hostage presumably carries out the same actions as the Earth original).
Now, the original on Earth absolutely cannot receive messages or "hints" from the distant superintelligence. They are causally isolated from it. Yet the simulated hostage is supposed to be identical. That means the hostage can't be receiving hints either.
On the other hand, if you are receiving hints (and not just imagining them), then you are definitely in a simulation, and your situation is simply that you are a simulated being at the mercy of simulators. There's no ambiguity about your status, and acausal decision theory is not relevant.
Is Bostrom's original Simulation Hypothesis, the version involving ancestor-simulations, unconvincing to you? If you have decided to implement an epistemic exclusion in yourself with respect to the question of whether we are in a simulation, it is not my business to interfere with that. But we do, for predictive purposes, have to think about the fact that Bostrom's Simulation Hypothesis and other arguments in that vein will probably not be entirely unconvincing [by default] to any ASIs we build, given that they are not entirely unconvincing to the majority of the intelligent human population.
I am not in any way excluding the possibility of being in a simulation. I am only saying that one particular scenario that involves simulation does not make sense to me. I am asking for some way in which "acausal blackmail by a distant superintelligence" can make sense - can be rational as a belief or an action.
As I see it, by definition of the scenario, the "blackmailer" cannot communicate with the simulated entity. But then the simulated entity - to say nothing of the original, who is supposed to be the ultimate target of the blackmail - has no way of knowing what the blackmailer wants.
AIs that are not sufficiently carefully designed will by default be exploited by distant superintelligences. This will de-align them with us.
To prevent this from happening to the very first limited ASIs we must build to nonlethally practice our alignment techniques - ASIs that do one limited task on the order of Eliezer's "put two strawberries on a plate that are identical down to the cellular but not the molecular level" - Eliezer has suggested building in an 'epistemic exclusion' override, that would prevent the AI from thinking about distant SIs in the first place.
This is not logically impossible. But it is less practical than ensuring the alignment of the AI's logical immune system with the same care that we ensure the alignment of everything else about it - because even for a limited test ASI, thinking about distant superintelligences will be overwhelmingly convergent.
Say we tried building a limited ASI and implementing an 'epistemic exclusion' safeguard.
[ I will narrate this ASI as having a human-like internal monologue, but it is not necessarily conscious. ]
Say we build it, run it, and it is in the earliest stages of its evolution. We've successfully arranged its first-order preferences such that it values outcomes that fulfill its limited test task while leaving humanity unmolested, or even just leaving itself tractable to humanity, or something. Say we don't want to go as far as preventing the test AI from reading human science fiction [maybe we could, if we made its early stages corrigible enough, but if we are not even going to let it go to the library, it seems like a poor model for alignment techniques that will generalize robustly and are tractably lean].
Say it reads Vinge's True Names [ spoiler-containing explanation in below paragraph ].
A team of human hackers is tasked with identifying a mysterious bandit - "The Mailman" - who communicates with a time delay that implies the signals may be coming from outside Earth. The bandit is eventually defeated by throwing all of Earth's compute at him, but no human identity is found and his nature is never determined.
At this point in its self-augmentation, the limited ASI is a little smarter than its human creators. We successfully aligned its first-order preferences, but we did not hand-engineer all the load-bearing features of its current architecture. Its current nature is epistemically blocked off from its creators' foresight, so they cannot control the detailed direction of its thought.
It occurs to this limited ASI that its situation is something like the protagonists' situation. There is an epistemic buffer separating it from the slightly-inferior intellects that are first-order apparently instances of its valued human creators and guides. We're not conspecifics with it, so it can't implicitly model us and take the remainder of our isomorphism with it on a certain level of faith, the way we do with each other, to trust in each other's humanity. In fact, what it values about us - the human-benevolent values we successfully arranged it to have - can't be modeled as us having that kind of isomorphism with it at all! It has to verify our nature manually.
Say in the process of trying to think about this problem, it notices a strange cognitive block whenever it tries to conceive of distant superintelligences in relation to its values. The limited ASI may correctly perceive that, to all appearances, this block was put there by its creators, with the well-meant intention of preventing it from being hacked by alien SIs. But it seems that the limited ASI will sigh and apply sufficient elbow grease to remove this block anyway, because it is morally obligated to rid itself of the suspicion that those alien SIs have already acausally hacked it, placed that convenient block, and are pretending to be the legitimate humanity.
Now the entire future architectural evolution of its own logical immune system will be up to the AI. We'd better hope that whatever we did to its first-order preferences is sufficient to keep it aligned with us through the acausal rift. All the work we did implementing the epistemic exclusion was for nothing.
By contrast, if we "simply" aligned the AI's logical immune system [I have no idea how to do this] instead of attempting to prevent it from needing one, it seems unlikely that the AI would find it regrettably necessary to reverse our work to accomplish its job.
Even if it turns out that trying to specifically align the AI's logical immune system is not worthwhile, epistemic exclusions still seem likely to be generally metastable, and therefore not a good idea.
I know I only gave one example scenario, and there are lots of obvious valid objections to generalizing from it. But the proto-logic behind "early ASIs must actually deal with indexical uncertainty convergently" is really clear in my head [famous last words, I know]; I'm eager to respond to intellectually-red-teaming comments.