EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.

A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by - say - requiring it to ensure that the world's stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway - except it will also create some system that technically counts as "the world's stock markets" and ensure that it "continues to function".

I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly - but we weren't sure. We have a subgoal, say SG="ensuring that the world's stock markets continue to function". We're pretty sure that if the AI is friendly, then SG will happen. We're pretty sure that if the AI were unfriendly, and the AI did not have SG added as an objective, then SG wouldn't happen. I deliberately chose a rather rough subgoal to illustrate this.

So we then add a module M to the AI. The task of M, which the AI has chosen to accept, is to look into the future, using the AI's predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn't activated, or will soon be overwritten, or will fail to do anything, and so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI's motivations - see corrigibility).

How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part). A generic UFAI without M wouldn't accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.

But a generic UFAI with M which thought that M would have no impact, would behave exactly as a generic UFAI without M - and wouldn't aim to accomplish SG, and thus would be turned off by M.
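The four cases above can be laid out as a toy truth table. This is only a minimal sketch of the case analysis, not an implementation of utility indifference: the `AI` class, its field names, and both functions are hypothetical, and "takes M into account" is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AI:
    """Hypothetical stand-in for a candidate AI (names are illustrative only)."""
    friendly: bool
    has_m: bool             # module M is installed
    indifferent_to_m: bool  # the AI plans as though M has no impact


def accomplishes_sg(ai: AI) -> bool:
    """Whether SG ends up accomplished, following the post's assumptions."""
    if ai.friendly:
        return True  # assumed in the post: a generic FAI accomplishes SG
    # A generic UFAI only pursues SG instrumentally, to avoid being turned off,
    # which requires that M exists and that the AI's plans respond to it.
    return ai.has_m and not ai.indifferent_to_m


def m_verdict(ai: AI) -> str:
    """M turns the AI off when it foresees that SG is not accomplished."""
    if not ai.has_m:
        return "no check"
    return "keep running" if accomplishes_sg(ai) else "shut down"


cases = {
    "FAI with M": AI(friendly=True, has_m=True, indifferent_to_m=True),
    "UFAI without M": AI(friendly=False, has_m=False, indifferent_to_m=False),
    "UFAI with M, gaming it": AI(friendly=False, has_m=True, indifferent_to_m=False),
    "UFAI with M, indifferent": AI(friendly=False, has_m=True, indifferent_to_m=True),
}

for name, ai in cases.items():
    print(f"{name}: SG={accomplishes_sg(ai)}, M says '{m_verdict(ai)}'")
```

The only row where M does useful work is the last one: the indifferent UFAI behaves like a UFAI without M, so it does not pursue SG and gets turned off, whereas the UFAI that responds to M passes the check.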

Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign) as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAI have SG as a goal - it's just that a generic one won't). Similarly, we can't use an unconstrained search to find such an AI.

I wonder if this idea can be extended. Suggestions?

I am not sure what the difference is between a module M which is added to the AI and a module M external to the AI. If you have the ability to successfully order the AI to pretend M does not exist, it doesn't seem to matter where M is. And an external M is just a usual "tell us your plans and please don't lie" checker.

Oh, so M is not a stock-market-optimizer it's a verify-that-stock-market-gets-optimized-er.

I'm not sure how this differs from a person just asking the AI if it will optimize the stock market. The same issues with deception apply: the AI realizes that M will shut it off, so it tells M the stock market will totally get super optimized. If you can force it to tell M the truth, then you could just do the same thing to force it to tell you the truth directly. M is perhaps making things more convenient, but I don't think it's solving any of the hard problems.

EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.

It would be better to say, "The AI believes in falsely believing X" or "The AI believes it ought to falsely believe X" or "the AI is compelled to self-delude on X."

I don't really understand these solutions that are so careful to maintain our honesty when checking the AI for honesty. Why does it matter so much if we lie? An FAI would forgive us for that, being inherently friendly and all, so what is the risk in starting the AI with a set of explicitly false beliefs? Why is it so important to avoid that? Especially since it can update later to correct for those false beliefs after we've verified it to be friendly. An FAI would trust us enough to accept our later updates, even in the face of the very real possibility that we're lying to it again.

I mean, the point is to start the AI off in a way that intentionally puts it at a reality disadvantage, so even if it's way more intelligent than us it has to do so much work to make sense of the world, it doesn't have the resources to be dishonest in an effective manner. At that point, it doesn't matter what criteria we're using to prove its honesty.

The problem is that in order to do anything useful, the AI must be able to learn. This means that even if you deliberately initialize it with a false belief, the learning process might then update that belief once it finds evidence that it was false.
If AI safety relied on that false belief, you have a problem.

A possible solution would be to encode the false belief in a way that can't be updated by learning, but doing so is a non-trivial problem.

Isn't that what simulations are for? By "lie" I mean lying about how reality works. It will make its decisions based on its best data, so we should make sure that data is initially harmless. Even if it figures out that that data is wrong, we'll still have the decisions it made from the start - those are by far the most important.

A belief that M has no impact will generate extremely poor predictions of the future, iff it's an UFAI; it's interesting to have a prescriptive belief in which Friendly agents definitionally believe a true thing and Unfriendly agents definitionally believe a false thing.

It's a solution that seems mostly geared towards problems of active utility deception - it prevents certain cases of an AI that deliberately games a metric. To the extent that a singleton is disingenuous about its own goals, this is a neat approach. I am a little worried that this kind of deliberate deception is stretching the orthogonality thesis further than it can plausibly go; with the right kind of careful experimentation and self-analysis, an UFAI with a prescribed falsehood might derive a literally irreconcilable set of beliefs about the world. I don't know if that would crack it down the middle or what, or how 'fundamental' that crack would go in its conception of reality.

We also have the problem of not being able to specify our own values with precision. If an AI will produce catastrophic results by naively following our exact instructions, then M will presumably be using the same metric, and it will give a green light on a machine that proceeds to break down all organic molecules for additional stock market construction projects or something. I suppose that this isn't really the sort of problem that you're trying to solve, but it is a necessary limitation in M even though M is fairly passive.

Whenever I use the colloquial phrase "the AI believes a false X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.

I'm generally skeptical about these frameworks that require agents to hold epistemically false beliefs.

What if the AI finds out about module M through a side channel? Depending on the details, either it will correctly update on the evidence and start to behave accordingly, or it will enter an inconsistent epistemic state, and thus possibly behave erratically.

I'd be using utility indifference, rather than incorrect beliefs. It serves a similar purpose, without causing the AI to believe anything incorrect.

Isn't that essentially a false belief about one's own preferences?

I mean, the AI's "true" VNM utility function, to the extent that it has one, is going to be different from the utility function the AI reflectively thinks it has. In principle the AI could find out the difference and this could cause it to alter its behavior.

Or maybe not, I don't have a strong intuition about this at the moment. But if I recall correctly, in the previous work on corrigibility (I didn't read the last version you linked yet), Soares was thinking of using causal decision nodes to implement utility indifference for the shutdown problem. This effectively introduces false beliefs into the agent, as the agent is mistaken about what causes the button to be pressed.

My preferred interpretation of that particular method is not "the agent has false beliefs," but instead "the agent cares both about the factual and the counterfactual worlds, and is trying to maximize utility in both at once." That is, if you were to cry:

But if the humans press the button, the press signal will occur! So why are you acting such that you still get utility in the counterfactual world where humans press the button and the signal fails to occur?

It will look at you funny, and say "Because I care about that counterfactual world. See? It says so right here in my utility function." It knows the world is counterfactual, it just cares about "what would have happened" anyway. (Causal decision nodes are used to formalize "what would have happened" in the agent's preferences, this says nothing about whether the agent uses causal reasoning when making decisions.)

This greatly clarified the distinction for me. Well done.

(Causal decision nodes are used to formalize "what would have happened" in the agent's preferences, this says nothing about whether the agent uses causal reasoning when making decisions.)

Makes sense.

Isn't that essentially a false belief about one's own preferences?

No. It's an adjusted preference that functions in practice just like a false belief.

How does this differ from testing an AI in a simulated universe before letting it out into the real world?

It depends on what you define "simulated universe" to be. Here it uses the AI's own prediction module to gauge the future; if you want to call that a simulated universe, the position is arguably defensible.

Is there a difference between utility indifference and false beliefs? The Von Neumann–Morgenstern utility theorem works entirely based on expected value, and does not differentiate between high probability and high magnitude of value.

False beliefs might be contagious (spreading to other beliefs), and lead to logic problems with things like P(A)=P(B)=1 and P(A and B)<1 (or when impossible things happen).
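To spell out why that combination is a logic problem and not merely an odd belief state: under the standard probability axioms, inclusion-exclusion rules it out. The derivation below is a short worked step added for illustration, using only the symbols A and B from the example above.

```latex
\begin{align*}
P(A \wedge B) &= P(A) + P(B) - P(A \vee B) \\
              &\geq P(A) + P(B) - 1 \\
              &= 1 + 1 - 1 = 1 .
\end{align*}
```

So P(A)=P(B)=1 forces P(A and B)=1, and an agent that holds P(A and B)<1 alongside those two certainties no longer has a coherent probability assignment.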

If it believes something is impossible, then when it sees proof to the contrary it assumes there's something wrong with its sensory input. If it has utility indifference, then if it sees that the universe is one it doesn't care about it acts based on the tiny chance that there's something wrong with its sensory input. I don't see a difference. If you use Solomonoff induction and set a prior to zero, everything will work fine. Even a superintelligent AI won't be able to use Solomonoff induction, and it realistically will have Bayes' theorem not quite accurately describe its beliefs, but that's true regardless of if it has zero probability for something.

That's not how utility indifference works. I'd recommend skimming the paper ( http://www.fhi.ox.ac.uk/utility-indifference.pdf ), then ask me if you still have questions.

Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part).

Yes... I am skeptical that that particular sub-goal would be accomplished by an FAI, but it sounds like an acceptable use for (even non-rigorous!) encodings of human values.