A putative new idea for AI control; index here.

Many of the ideas presented here require AIs to be antagonistic towards each other - or at least hypothetically antagonistic towards hypothetical other AIs. This can fail if the AIs engage in acausal trade, so it would be useful if we could prevent such things from happening.

Now, I have to admit I'm still quite confused by acausal trade, so I'll simplify it to something I understand much better, an anthropic decision problem.

Staples and paperclips, cooperation and defection

Clippy has a utility function p, linear in paperclips, while Stapley has a utility function s, linear in staples (and both p and s are normalised to zero, with one additional item adding 1 utility). They are not causally connected, and each must choose "Cooperate" or "Defect". An agent that chooses "Cooperate" creates 10 copies of the item it does not value (so Clippy creates 10 staples, Stapley creates 10 paperclips). An agent that chooses "Defect" creates one copy of the item it values (so Clippy creates 1 paperclip, Stapley creates 1 staple).

Assume both agents know these facts, both agents use anthropic decision theories, and both agents are identical apart from their separate locations and distinct utility functions.

Then the outcome is easy: both agents will consider that "cooperate-cooperate" or "defect-defect" are the only two possible options, "cooperate-cooperate" gives them the best outcome, so they will both cooperate. It's a sweet story of cooperation and trust between lovers that never agree and never meet.
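To make the payoffs concrete, here is a minimal sketch of the game as described above. The function names are my own, and the "mirrored choice" step hard-codes the assumption that two identical agents end up picking the same action; it is an illustration of the argument, not a model of any particular anthropic decision theory.

```python
COOPERATE, DEFECT = "Cooperate", "Defect"

def paperclips(clippy_action, stapley_action):
    """Total paperclips created, i.e. Clippy's utility p."""
    total = 0
    if clippy_action == DEFECT:
        total += 1   # Clippy makes one paperclip for itself
    if stapley_action == COOPERATE:
        total += 10  # Stapley makes ten paperclips, which only Clippy values
    return total

def staples(clippy_action, stapley_action):
    """Total staples created, i.e. Stapley's utility s."""
    total = 0
    if stapley_action == DEFECT:
        total += 1   # Stapley makes one staple for itself
    if clippy_action == COOPERATE:
        total += 10  # Clippy makes ten staples, which only Stapley values
    return total

def best_mirrored_action():
    """Assumption: both agents choose identically, so only (a, a) is possible."""
    return max([COOPERATE, DEFECT], key=lambda a: paperclips(a, a))

print(best_mirrored_action())
```

Under the mirroring assumption, cooperating yields 10 for each agent and defecting yields 1, so the mirrored choice is "Cooperate". Without that assumption, "Defect" would dominate, since each agent's own action only adds to its own utility when it defects.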

 

Breaking cooperation

How can we demolish this lovely agreement? As I often do, I will assume that there is some event X that will turn Clippy on, with P(X) ≈ 1 (hence P(¬X) << 1). Similarly there is an event Y that turns Stapley on. Since X and Y are almost certain, they should not affect the results above. If the events don't happen, the AIs will never get turned on at all.

Now I am going to modify utility p, replacing it with

p' = p - E(p|¬X).

This is p with a single term subtracted from it: the expected value of p given that Clippy has not been turned on. This term feels like a constant, but isn't exactly, as we shall see. Apply the same modification to utility s, using Y:

s' = s - E(s|¬Y).

Now contrast "cooperate-cooperate" and "defect-defect". If Clippy and Stapley are both cooperators, then p=s=10. However, if the (incredibly unlikely) ¬X were to happen, then Clippy would not exist, but Stapley would still cooperate (as Stapley has no way of knowing about Clippy's non-existence), and create ten paperclips. So E(p|¬X) = E(p|X) ≈ 10, and p' ≈ 0. Similarly s' ≈ 0.

If both agents are defectors, though, then p=s=1. Since each agent creates its own valuable object, E(p|¬X) = 0 (Clippy cannot create a paperclip if Clippy does not exist) and similarly E(s|¬Y)=0.

So p'=s'=1, and both agents will choose to defect.
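The calculation above can be checked with a small sketch. The names are hypothetical, and I make two simplifying assumptions: both agents follow the same policy, and Stapley is treated as certainly turned on, so the values come out exact rather than approximate.

```python
def paperclips_made(policy, clippy_on):
    """Clippy's utility p when both agents follow `policy`.

    Assumption: Stapley is always on and cannot tell whether Clippy exists,
    so Stapley's behaviour does not depend on `clippy_on`.
    """
    by_stapley = 10 if policy == "Cooperate" else 0
    # Clippy can only make its own paperclip if it was actually turned on.
    by_clippy = 1 if (policy == "Defect" and clippy_on) else 0
    return by_stapley + by_clippy

def p_prime(policy):
    """p' = p - E(p | not X), evaluated in the world where X happened."""
    p_given_x = paperclips_made(policy, clippy_on=True)
    p_given_not_x = paperclips_made(policy, clippy_on=False)
    return p_given_x - p_given_not_x

for policy in ("Cooperate", "Defect"):
    print(policy, p_prime(policy))
```

In this sketch the subtracted term removes exactly the paperclips that would exist whether or not Clippy was turned on, which is why the cooperative payoff drops to 0 while the defecting payoff stays at 1. By symmetry the same holds for s' and Y.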

If this is a good analogue for acausal decision making, it seems we can break that, if needed.


TL;DR: Acausal trade breaks if you change utility functions from 'how much of X' to 'how much of a positive impact on X I have'

Yep. This seems to be a formalisation of that idea, avoiding the subtleties in defining "I".

The subtleties in defining "I" are pushed into the subtleties of defining events X and Y with respect to Clippy and Stapley respectively. I'm not sure if that counts as avoiding it at all.

And there are other issues with utility functions that depend on an agent's impact on utilon-contributing elements. Such as, say, replacing all other agents that provide utilon-contributing elements with subagents of the barriered agent, thus making its own impact equal to the impact of all utilon-contributing agents.

This idea needs work, in other words. Not that you ever said otherwise, I just don't think the formula provided is sufficient for preventing acausal trade without incentivizing undesirable strategies. See this comment as well for my concerns on disincentivizing utility conditional upon nonexistence.

The subtleties in defining "I" are pushed into the subtleties of defining events X and Y with respect to Clippy and Stapley respectively.

Defining events seems much easier than defining identity.

Such as, say, replacing all other agents that provide utilon-contributing elements with subagents of the barriered agent, thus making its own impact equal to the impact of all utilon-contributing agents.

I believe this setup wouldn't have this problem. That's the beauty of using X rather than "non-existence" or something similar, it's "non-created" (essentially), so it has no problems with events happening after its death that it can have an impact on.

Defining events seems much easier than defining identity.

But events X and Y are specifically regarding the activation of Clippy and Stapley, so a definition of identity would need to be included in order to prove the barrier to acausal trade that p' and s' are claimed to have. Unless the event you speak of is something like "the button labeled 'release AI' is pressed," but there is a greater-than-epsilon probability that the button will itself fail. Not sure if that provides any significant penalty to the utility function.

Unless the event you speak of is something like "the button labeled 'release AI' is pressed,"

Pretty much that, yes. More like "the button press fails to turn on the AI" (an exceedingly unlikely event, so it doesn't affect utility calculations much, but it can still be conditioned on).

Is this sort of a way to get an agent with a DT that admits acausal trade (as we think the correct decision theory would) to act more like a CDT agent? I wonder how different the behaviors of the agent you specify are from those of a CDT agent -- in what kinds of situations would they come apart? When does "I only value what happens given that I exist" (roughly) differ from "I only value what I directly cause" (roughly)?

I am concerned about modeling nonexistence as zero or infinitely negative utility. That sort of thing leads to disincentivizing the utility function in circumstances where death is likely. Harry in HPMOR, for example, doesn't want his parents to be tortured regardless of whether he's dead, such that he is willing to take on an increased risk of death to ensure that such will not happen, and I think the same invariance should hold true for FAI. That is not to say that it should be susceptible to blackmail; Harry ensured his parents' safety with a decidedly detrimental effect on his opponents.

When does "I only value what happens given that I exist" (roughly) differ from "I only value what I directly cause" (roughly)?

Acausal trade with agents who can check whether you exist or not.

Can those agents check whether your utility function is p vs p'? Because otherwise the point seems moot.

They can have a probability estimate over it. Just as in all acausal trade. Which I don't fully understand.

CDT is not stable, and we're not sure where that decision theory could end up.

It seems this approach could be plugged into even a stable decision theory.

Or, more interestingly, we might be able to turn on certain acausal trades and turn off others.

Typo in post title: "Acaucal trade barriers"

So, first you have the utility functions that pay both agents 10 if they cooperate and 1 if they don’t.

Then you change the utility functions to pay the agents 0 if they cooperate and 1 if they don’t. Naturally they will then stop cooperating.

I don’t get it. If you are the one specifying the utility functions, then obviously you can make them cooperate or defect, right?

The change in utility function isn't removing 10 by hand; it removes any utility they gain from acausal trade (whatever it is) while preserving utility gained through direct actions, thus incentivising them to focus only on direct actions (roughly).


Then the entire result of the modification is tautologically true, right?

All of maths is tautologically true, so I'm not sure what you're arguing.

I think there is a more fundamental problem with this sort of argument: staples and paperclips aren't going to involve the same resources, so a completely symmetric situation isn't going to happen. Worse, as the resource difference gets larger, one of the two will have more resources free to work on self-modification.

I assumed symmetry to get acausal trade in a form I could model, then broke acausal trade while preserving the symmetry. This seems to imply that the method will break acausal trade in general.

Ah, that makes sense.

It's not clear to me why you define p' and s' and what they're supposed to represent. I worry that you're making a unit error or leaving out a probability weighting. (was it supposed to be p' = E(p) - E(p|¬X)P(¬X) ?? but why would that be relevant either???)