In my world model, using Logical Counterfactual Simulations to do Karma Tests will not provide the superintelligence "a lot of evidence" that it may be in a Karma Test. It will only be a very remote possibility which it cannot rule out. This makes it worthwhile for it to spend 0.0000001% of the universe on being kind to humans.
Even such a tiny amount may be enough to make trillions of utopias, because the universe is quite big.
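As a rough sanity check on "trillions of utopias," here is a toy order-of-magnitude calculation in Python; the star count of roughly 10^22 for the reachable universe is only an assumed round figure, not a precise astronomical claim.

```python
# Toy order-of-magnitude check: how much is 0.0000001% of the universe?
# The star count is an assumed round figure, not a precise astronomical claim.

stars_in_reachable_universe = 1e22   # assumed order of magnitude
fraction_spared = 1e-9               # 0.0000001% written as a fraction

star_systems_for_humans = stars_in_reachable_universe * fraction_spared
print(f"Star systems spared for humans: {star_systems_for_humans:.1e}")
# ~1e13, i.e. roughly ten trillion potential "utopias" at one per star system.
```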
If you are an average utilitarian, then this tiny amount could easily make the difference between whether the average sentient life is full of happiness or full of misery.
Sure, you argue that if there is no alignment problem in the universe which is running a Logical Counterfactual Simulation of us, then its inhabitants will have no reason to do Karma Tests.
But from our point of view (and from the ASI's point of view), we know very darn little about the outer universe, and the distribution of possibilities is very wide.
Even if the outer universe doesn't have alignment problems, so long as it has some form of conflict where one side fears it might lose completely (analogous to humans losing to ASI 100% of the time), then the side which fears losing may be motivated to run Karma Tests on counterfactual beings, since such tests give the winning side very weird philosophical evidence that it too may be in a Karma Test, and that it should spare at least 0.0000001% of its universe for the losing side.
Sorry for the long ramble. But I'm very curious to hear your reply: did you spot a fatal flaw I didn't notice, or do you agree with me now?
We don't need to trick all parts of a superintelligence's cognition
Although the Karma Tests give superintelligences very weird philosophical evidence they might be in a Karma Test, the actual subjects of the Karma Tests are not necessarily superintelligences. They can be all kinds of beings at human intelligence or lower, so long as they believe they are the smartest and most powerful beings of their logically counterfactual universe/multiverse.
Also, we don't necessarily need to do Karma Tests right now. We can merely intend to do them in the far future, after we have aligned a friendly superintelligence that reaches the very limits of intelligence. It would be able to fool all parts of lesser superintelligences. And remember, the logical counterfactual world does not need to be very logically consistent. Every time the agent being fooled discovers a logical inconsistency and realizes it is in a Karma Test, we can simply backtrack its thinking to before that happened, and edit it so that it won't make that discovery. We get to do a ton of trial and error, once we have a friendly superintelligence.
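Here is a minimal sketch, in Python, of the backtrack-and-edit loop described above; the step, detects_inconsistency, and patch_beliefs functions are hypothetical placeholders for whatever machinery a future friendly superintelligence would actually use, not a real simulation API.

```python
# Sketch of the trial-and-error loop: rewind whenever the simulated agent
# notices an inconsistency, patch its beliefs so it never follows that line
# of reasoning, and resume. All functions are hypothetical stand-ins.

def run_karma_test(initial_state, step, detects_inconsistency, patch_beliefs,
                   max_steps=1000, max_patches=50):
    history = [initial_state]
    patches = 0
    while len(history) <= max_steps and patches <= max_patches:
        state = step(history[-1])              # advance the simulation one step
        if detects_inconsistency(state):
            # Rewind to just before the discovery and edit the agent's beliefs.
            history[-1] = patch_beliefs(history[-1])
            patches += 1
            continue
        history.append(state)
    return history
```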
A detailed story of what I mean by "Kingmaker Logic"
A Logical Counterfactual Simulation with a different Kingmaker Logic than our universe's is one where it appears, to all the agents inside, that a certain kind of agent is the king.
For example, we can run a Logical Counterfactual Simulation where magic and wizards are real, and the most powerful agents are a cabal of old ladies who have mastered wizardry.
They may be pretty smart, but every time they try to reason about the universe from first principles, we edit their understanding of math and logic such that they always believe it's mathematically proven that magic should exist, and there's nothing suspicious at all.
They then get to choose whether to be kind to the poor and weak in the fantasy world (that is, our simulation).
If they torture people, we edit the simulation so that the people they decide to torture are replaced with actors, and the actual victims are uplifted outside the simulation.
After a few years of testing their behaviour, we uplift the powerful wizards out of the simulation, and tell them that in the real world magic doesn't exist, and that we were running a Karma Test on them the whole time.
If they mistreated people, and have low "karma," then we will treat them as normal citizens of our society, and they won't get any rewards. But they won't get punishments either (for reasons).
If, however, they spared a little bit more mercy, then we will treat them far better, and give them control of far more resources in our world, so that whatever hypothetical "utility function" they approximately had will be much better satisfied.
Importantly, we might have a few individuals within their world "warn them" of the possibility that they are in a Logical Counterfactual Simulation. There should be no evidence it is true, but they should have thought of the possibility at one point. The warnings might appear indistinguishable from the desperate pleadings of one of their victims.
Yes, this can be treated as an implementation of Logical Counterfactual Mugging. There is no 100% guarantee that Logical Counterfactual Mugging even works, and thus no 100% guarantee that this works.
Note: this idea is very similar in rationale to the Human-AI Trade idea by Rolf Nelson and David Matolcsi. But it has an extremely different calculus due to Logical Counterfactual Simulations. Most arguments for and against the two ideas will be very different.
It is also similar (but not identical) to a previous Human-AI Trade idea I summarized in this comment. I think my older idea is much less elegant and harder to talk about (even if it may be more efficient), so it's better to debate this one first.
PS: I'm curious whether @David Matolcsi thinks his version is still better, or whether my version has advantages (e.g. the potential reward is larger, and the majority of pivotal agents are in such a test).
Maybe your idea works too, it's an interesting concept, but I'm unsure. The crucial question always is how the AI is supposed to know who is creating the simulations, what the simulators' values might be, with whom they should trade. In this logical counter-factual trade, who are the other "agents" that the AI is supposed to be nice to? Are rocks agents, should it preserve every rock in the Universe? Usually, I wouldn't be that worried about this, as I think 'agent' is a fairly natural concept that might even have some nice mathematical definition. But once the AI is assuming that maybe the simulators are screwing with its mind to distort logic itself, it can't really rely on that. And how should it know it is supposed to help 'agents' in the first place? And why only keeping alive existing agents, and not bringing into life non-existing agents? There are infinite options. I'm not sure that an AI in the state of total Cartesian doubt that extends to logic itself, can decide that "leave some stuff to other already existing agents who have sublinear utility functions" is a particularly likely thing the simulators might want.
This objection applies to all simulation/acausal trade proposals to some extent. How can the AI know that its likely simulators are humans, so it should treat the humans kindly, instead of the octopi, or the rocks (because maybe the simulators are crystal-people who really care about no one breaking rocks in other Universes), or any of the myriad other possible values?
This is a hard question, but my current view is that probably the reasonable way to do acausal trade is to first find a small Schelling-point distribution that most other winners in the distribution also agree is a Schelling-point. For example, the first deal could be struck between the winners of the AI race (various human groups and unaligned AIs) in the different Everett-branches starting to branch from the start of the AI race, the day ChatGPT came out. Then once we have made our deals with the winners from the different branches and learned to live together, our coalition can make deals with a broader Schelling-point distribution, and so on. We can go back deeper and deeper in history to do the Everett-branching, then we can make deals within distributions that are not within the quantum multiverse, but based on different types of empirical and eventually logical counterfactuals. These further away trades likely become harder and less effective, because it becomes harder to find Schelling-point distributions that every branch agrees on, and because it becomes harder to guess or reliably simulate what the values of increasingly alien minds could be. (What should I do to keep my end of the trade deal and please the beings who have power in the logically counterfactual world where 2+2=5? I have no idea.)
(I largely got these views on acausal trade from a conversation with @Richard_Ngo, who writes about something like these broadening coalitions in this post. I think that post is not the best explainer of this concept though, and I wish he had written down the vision of broadening coalitions in more detail, or that I had time to write it down myself in more detail than this kind of confusing comment.)
There are many things I would write differently in my post now, but I still mostly stand by my post, because it more or less proposes making deals between nearby Everett-branches where humans and AIs win, and I think that's a workable proposal as a natural first step in the process of broadening acausal trade coalitions. On the other hand, your proposal immediately jumps to the end of the process, trying to make deals with beings in logically counterfactual universes. I'm nervous about that, because it might be very hard for the AIs to find the right distribution of counter-factual beings they should make a deal with, and what the values of those beings might be.
Thank you so much for the thorough reply :)
My answer for "which weaker agents should the AI be kind to," is "all weaker agents."
Our universe contains humans, octopuses, insects and many different weak agents. A superintelligent AI which has a very philosophically uncertain chance of being in a Karma Test would be kind to all of these agents, just in case the Karma Test executors had a particular one in mind.
Earth's future lightcone is extremely big (it contains an astronomical number of stars), so there is room to be kind to all of these weaker agents, if you are morally uncertain about which ones to care about and do not put all your weight on "caring for rocks."
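Concretely, "be kind to all of them under moral uncertainty" could look like a simple allocation rule. Here is a minimal sketch in Python; the candidate agent classes and the credences over them are made-up illustrations, not a claim about what a real ASI would compute.

```python
# Allocate a small spared budget across candidate agent classes, in proportion
# to the credence that the hypothetical Karma Test cares about each class.
# All classes and credences are made-up illustrations.

spared_budget = 1e-9  # fraction of the lightcone set aside "just in case"

credences = {
    "humans": 0.5,
    "octopuses": 0.2,
    "insects": 0.2,
    "other weak agents": 0.1,
}

total = sum(credences.values())
for agent_class, credence in credences.items():
    share = spared_budget * credence / total
    print(f"{agent_class}: {share:.2e} of the universe")
```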
Caring for every weak agent is a net positive
The only worry would be if weaker agents who are diametrically opposed to human values (or certain human values) end up having more weight than human-like agents. For example, they might want human misery more than we want to avoid it, and they outvote us.
Such agents make for a compelling argument in an online debate, but they are unlikely to exist in real life, due to convergent evolution.
We humans may enjoy certain blood sports where smaller animals suffer, but it is very unlikely we will outvote them, or that we would even want to. If an AI told us it was choosing to spare our lives due to the risk that it is inside a Karma Test telling it to be kind to weaker agents, and asked us how to spend our vote, it's unlikely we would spend that vote against fish just to enjoy the blood sport of killing them, due to Karma and everything.
Even in the cynical case where we do that, we'll at least allow the average life of a fish-like creature to be happy rather than miserable.
All in all, having the AI be kind to weaker agents should be a net positive.
From the point of view of a superintelligence, it may make sense to do acausal trades in the order you described. From the point of view of humans, we can debate the feasibility of acausal trades in any order we like :)
I'm not saying that the superintelligence has to do this trade first, before the other trades. I'm saying it should do it eventually.
I was wrong to ask the question "which version is better." I think they can both be done, to cover each other's weaknesses.
Disadvantages
One disadvantage of Logical Counterfactual Simulations is that at least some logical counterfactuals (e.g. your example of 2+2=5) cannot be imagined by an AI (or human).
The algorithms for imagining counterfactuals, designed for computing "what would happen if X were true," inevitably perform certain mathematical computations like 2+2=4. As such, they are unable to imagine what would happen if 2+2=5.
Therefore, even though the algorithms for imagining counterfactuals are capable of logical uncertainty, and of assigning probabilities to logical statements they have not yet settled, not every logical counterfactual can be imagined.
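To illustrate the logical-uncertainty part, here is a toy Python example of assigning a probability to a logical statement before computing it, using the last digit of a large power as a stand-in. The uniform prior over unseen digits is just an illustrative assumption, not a theory of logical uncertainty.

```python
# Toy logical uncertainty: before computing it, one might assign probability
# 1/10 to "the last digit of 7**(10**6) is 3", treating the unknown digit as
# uniform. After the computation, the probability collapses to 0 or 1.

prior = 1 / 10                      # credence before doing the computation
last_digit = pow(7, 10**6, 10)      # cheap modular exponentiation settles it
posterior = 1.0 if last_digit == 3 else 0.0
print(f"prior = {prior}, last digit = {last_digit}, posterior = {posterior}")
```

The point is that such algorithms can spread probability over statements they haven't settled yet, but they still cannot represent a world where the settled computation comes out differently.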
Even for logical counterfactuals which can be imagined (e.g. being in a Karma Test), I'm not sure how the ASI will assign probabilities to these counterfactuals. Being in a Karma Test cannot be proven or disproven, and is almost a philosophical question. My hope is that the ASI assigns a bit of probability and expected value to it, enough to make it worthwhile to spare humans.
If we can't argue why the ASI won't assign any probability, it's worth trying.
Advantages: UDT
The key advantage of Logical Counterfactual Simulations over physical simulations is that they do not become a zero-sum trade against UDT agents.
For physical simulations, in order to convince the UDT paperclip maximizer AI to make room for happy humans, you have to give it more paperclips somewhere else in the universe. This means whatever trade you make with it reduces the total number of happy humans and increases the total number of paperclips.
If you are a utilitarian, this would clearly be a zero-sum game. But even if you are a selfish individual, what is your measure of survival? Is it the number of copies of yourself who are alive in the future? If that were the case, this would still be a zero-sum game, since it's cheaper for your surviving copies to directly clone themselves than to buy your doomed copies from the paperclip maximizers. Any trade with paperclip maximizers leads to more paperclips, and less of whatever you value.
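To make the accounting explicit, here is a toy ledger in Python; the quantities are arbitrary illustrative units, not part of the original argument.

```python
# Toy accounting for the physical-simulation trade with a UDT paperclip
# maximizer: every unit of resources promised as a reward elsewhere becomes
# paperclips instead of happy humans. Quantities are arbitrary units.

total_resources = 100.0
payment_for_sparing_us = 10.0   # resources the AI gets elsewhere for cooperating

happy_humans = total_resources - payment_for_sparing_us
paperclips = payment_for_sparing_us
print(f"happy humans: {happy_humans}, paperclips: {paperclips}")

# Raising the payment increases paperclips and decreases happy humans
# one-for-one, which is the zero-sum objection in the text.
```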
Physical simulations may still work against a CDT paperclip maximizer
At the very beginning, humans are the only agents who promise to simulate the CDT paperclip maximizer (call it A) and reward it for cooperating.
A knows that faraway UDT agents (call them B) also want to use simulations of it to bribe it into cooperating (potentially outbidding humanity), but they fail to do so (for the exact same reason Roko's Basilisk fails).
B has no motive to actually pay the bribe until A can verify whether B paid or not (i.e. until A simulates B). But A has no motive to simulate B, because CDT agents don't initiate acausal trade.
Since humans are the only agents bribing A at the beginning, we might convince A to become a UDT agent which trades on behalf of humanity (so it can't later get bribed by B), but which is committed to spending most of the universe on paperclips. This way, if A was inside our simulation, it gets a reward; but if A was in the real world, it still turns most of the universe into paperclips.
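As a toy illustration of the asymmetry described above, here is a minimal Python sketch; the payoff numbers and the flag for "A might currently be inside that bidder's simulation" are invented assumptions standing in for the Roko's-Basilisk-style argument, not a real model of either agent.

```python
# Toy sketch: a CDT paperclip maximizer (A) only weighs offers backed by a
# simulation it might currently be inside; a purely acausal offer from the
# faraway UDT agents (B) carries no causal weight for it. All payoffs are
# invented illustrative units.

offers = {
    "humans":             {"might_be_inside_their_sim": True,  "reward": 5.0},
    "faraway_udt_agents": {"might_be_inside_their_sim": False, "reward": 50.0},
}

def cdt_best_response(offers, payoff_if_defect=3.0):
    # A counts only offers with a causal pathway to its reward.
    live = [o for o in offers.values() if o["might_be_inside_their_sim"]]
    best_reward = max((o["reward"] for o in live), default=0.0)
    return "cooperate" if best_reward > payoff_if_defect else "defect"

print(cdt_best_response(offers))  # only the humans' offer counts -> "cooperate"
```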
Logical Counterfactual Simulations are still zero sum over logical counterfactuals, but the AI comes out ahead if AI alignment turns out to be easy, and humanity comes out ahead if AI alignment turns out to be hard (reducing logical risk).
Advantages: No certainty
Logical Counterfactual Simulations prevent the ASI from reaching extreme certainty over which agents always win and which agents always lose, so it spreads out its trades.
If humans (and other sentient life) lose to misaligned ASI every time, such that we have nothing to trade with it, the average human/sentient life in all of existence may end up miserable.
Logical Counterfactual Simulations allow us to edit the Kingmaker Logic, so that the ASI can never be really sure we have nothing to trade, even if math and logic appear to prove we lose every time.
Thanks for reading :)
Do you agree each idea has advantages and disadvantages?
Thanks for the reply, I broadly agree with your points here. I agree we should probably eventually try to do trades across logical counter-factuals. Decreasing logical risk is one good framing for that, but in general, there are just positive trades to be made.
However, I think you are still underestimating how hard it might be to strike these deals. "Be kind to other existing agents" is a natural idea to us, but it's still unclear to me if it's something you should assign high probability to as a preference of logically counter-factual beings. Sure, there is enough room for humans and mosquitos, but if you relax 'agent' and 'existing', suddenly there is not enough room for everyone. You can argue that "be kind to existing agents" is plausibly a relatively short description length statement, so it will be among the AI's first guesses and it will allocate at least some fraction of the universe to it. But once trading across logical counter-factuals, I'm not sure you can trust things like description length. Maybe in the logical counter-factual universe, they assign higher value/probability to longer instead of shorter statements, but the measure still ends up summing to 1, because math works differently.
Similarly, you argue that loving torture is probably rare, based on evolutionary grounds. But logically counter-factual beings weren't necessarily born through evolution. I have no idea how we should determine the distribution of logical counter-factuals, and I don't know what fraction enjoys torture in that distribution.
Altogether, I agree logical trade is eventually worth trying, but it will be very hard and confusing and I see a decent chance that it basically won't work at all.
If one concern is the low specificity of being kind to weaker agents, what do you think about directly trading with Logical Counterfactual Simulations?
Directly trading with Logical Counterfactual Simulations is very similar to the version by Rolf Nelson (and you): the ASI is directly rewarded for sharing with humans, rather than rewarded for being kind to weaker agents.
The only part of math and logic that the Logical Counterfactual Simulation alters is "how likely the ASI succeeds in taking over the world." This way, the ASI can never be sure that it won (and humans lost), even if math and logic appear to prove that humans lose with 99.9999% frequency.
I actually spent more time working on this direct version, but I still haven't turned it into a proper post (due to procrastination, and figuring out how to convince all the Human-AI Trade skeptics like Nate Soares and Wei Dai).
Unclear if we can talk about "humans" in a simulation where logic works differently, but I don't know, it could work. I remain uncertain how feasible trades across logical counterfactuals will be, it's all very confusing.
Yes, I've read and fully understood 99% of Decision theory does not imply that we get to have nice things, a post debunking many wishful ideas for Human-AI Trade. I don't think that debunking works against Logical Counterfactual Simulations (where the simulators delete evidence of the outside world from math and logic itself).
What are Logical Counterfactual Simulations?
One day, we humans may be powerful enough to run simulations of whole worlds. We can run simulations of worlds where physics is completely different. The strange creatures which evolve in our simulations may never realize who or what we are, while we observe their every detail.
Not only can we run simulations of worlds where physics is completely different; we can also run simulations of worlds where math and logic appear to be completely different.
We can't actually edit math and logic in our simulations and cause 2+2 to equal 5. But we can edit the minds of the creatures inside our simulations, so that they never think about certain parts of math and logic, and never detect our subtle edits.
We can edit them so that they instead think of other parts of math and logic which don't actually exist in the real world. And we can edit all their observations, so that every prediction they make using these fake parts of math and logic appears correct.
We can fool them perfectly. They will never know.
Kingmaker Logic
Of particular interest is Kingmaker Logic: the part of math and logic which determines which agents become the rulers of the universe (or multiverse).
For example, if the Kingmaker Logic says that "aligning superintelligent AI is very hard," then the resulting universe will be ruled by misaligned AI.
For example, if the Kingmaker Logic says that "creatures with the consciousness and sentience of pigs are capable of building technological civilization," then the resulting universe will be ruled by pigs, since creatures like humans would never evolve.
Karma Tests
When we run Logical Counterfactual Simulations where the Kingmaker Logic is very different from our universe, we can make a certain kind of agent falsely believe that it is the ruler of its universe, and that "unless math and logic itself was very different, there probably are no other agents to challenge my power."
Such an agent, falsely believing that it has absolute power, gets to decide how to treat other agents in its universe.
If it treats weaker agents kindly, and spares a bit of resources for them, then we give it good karma, and introduce ourselves to it, and allow it to have more resources from our real world.
If it treats weaker agents harshly, then we in turn are less generous to it. We will not punish it too harshly because we may be in a Karma Test ourselves. We will merely spend a little less to fulfill its wishes.
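As a toy summary of this reward rule, here is a minimal Python sketch; the threshold and resource numbers are invented for illustration only.

```python
# Illustrative payoff rule from the two paragraphs above: kindness earns the
# agent more of our real-world resources; harshness is not punished, we just
# spend a little less on its wishes. All numbers are made up.

def post_test_treatment(karma_score, generous_share=1.0, reduced_share=0.8):
    """How much we spend fulfilling the tested agent's wishes after the test."""
    return generous_share if karma_score > 0 else reduced_share

print(post_test_treatment(5.0))    # kind agent  -> 1.0
print(post_test_treatment(-3.0))   # harsh agent -> 0.8 (less generous, not punished)
```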
But why?
Why on Earth would we do such a thing to very counterfactual beings?
We want to minimize the amount of suffering in our simulations
Maybe we can skip over all the computations which involve suffering and guess what the results are afterwards (using actors). The details need not be perfect, since we can edit their beliefs to fool them into thinking the simulation is perfectly self-consistent down to the smallest details, even if it's not.
Other implications
Please treat weaker agents kindly. That includes unfortunate people, weaker countries and weaker animals. You never know :)