Ask and ye shall be answered

4 Post author: Stuart_Armstrong 18 September 2015 09:53PM

A putative new idea for AI control; index here.

EDIT: the mathematics of this idea can be found here.

It would be useful to have a mechanism for getting an AI to provide an honest answer to a specific question. This post presents such a mechanism.

The question is: what is the impact of X/¬X on the expected utility of a utility function v? Here X/¬X is some well-defined binary outcome. Formally, for a constant c (positive or negative), we want to query the AI as to whether Q(v,X,c) is true, where

Q(v,X,c) = {E(v|X) - E(v|¬X) > c}.
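To make the query concrete, here is a minimal sketch of evaluating Q(v,X,c) on a small finite set of worlds. The `toy_worlds` list and the helper names are hypothetical illustrations, not part of the original proposal; each world is a (probability, X holds?, value of v) triple.

```python
from fractions import Fraction

def conditional_expectation(worlds, x_value):
    """E(v | X = x_value) over a finite list of (probability, x, v) triples."""
    mass = sum(p for p, x, _ in worlds if x == x_value)
    if mass == 0:
        raise ValueError("conditioning event has probability zero")
    return sum(p * v for p, x, v in worlds if x == x_value) / mass

def Q(worlds, c):
    """True iff E(v|X) - E(v|¬X) > c (strict inequality, as in the post)."""
    gap = conditional_expectation(worlds, True) - conditional_expectation(worlds, False)
    return gap > c

# Hypothetical toy distribution, assumed purely for illustration.
toy_worlds = [
    (Fraction(1, 4), True, 10),
    (Fraction(1, 4), True, 6),
    (Fraction(1, 2), False, 5),
]

print(Q(toy_worlds, 2))   # E(v|X) = 8, E(v|¬X) = 5, gap 3 > 2
print(Q(toy_worlds, 3))   # gap 3 is not strictly greater than 3
```

Here the gap E(v|X) - E(v|¬X) is 3, so the query is true for any c below 3 and false otherwise.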

How can we do that? Well, for a start, let M(u) be a u-maximising agent. We'll assume we can feed M() any utility function to maximise. We'll further assume that M() is boxed, that it has a single output bit Y (or ¬Y), and that we have cut any acausal trade possibilities. We'll add another binary process Z, which, if Z=1, erases the bit from Y before we can see it. In actual fact, P(Z=0)≈1.

The remaining assumptions are on the operations we can perform on v (without needing any further understanding of v). We replace v by v#, defined in the following way. First of all, we make v# indifferent to worlds in which X≠Y or Z=0 (note: there are subtleties to defining indifference to events that are partially under the AI's control; I will address these in a later post). Therefore any v# maximiser will behave as if it lived in a world where X=Y (its output and the outcome of X were the same) and Z=1 (its output was erased and never read). Then we add a special, one-off term of -c (a penalty when c is positive) for outputting Y=1; there is no effect for outputting Y=0. This defines v#, and we give our boxed AI v# to maximise.

So, how will M(v#) behave? Because it acts as if X=Y with certainty, it behaves as if it can determine the value of X by outputting Y. Because it acts as if Z=1 with certainty, it behaves as if the output of Y is never read, so it can determine the value of X without affecting anything else about the universe. It will now behave as if it were a v maximiser with control over X. So if it outputs ¬Y, it expects to get utility E(v|¬X). If it outputs Y, it expects to get utility E(v|X)-c (because of the -c term).

Therefore (ignoring ties) M(v#) will output Y iff it estimates Q(v,X,c) to be true. And, since in actual fact Z=0 almost certainly, we get to see its output, hence its conclusion. Voila!
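The decision rule can be sketched as a toy model. This assumes an idealised maximiser whose estimates of E(v|X) and E(v|¬X) are simply handed in as numbers; the function names, the tie-breaking choice, and the erasure probability `p_erase` are all hypothetical and only illustrate the argument above.

```python
import random

def as_if_utility(output_bit, e_v_given_x, e_v_given_not_x, c):
    """Utility M(v#) expects from an output, acting as if X=Y and Z=1."""
    if output_bit:
        return e_v_given_x - c    # outputs Y: as if X holds, minus the c term
    return e_v_given_not_x        # outputs ¬Y: as if ¬X holds, no extra term

def m_v_sharp_output(e_v_given_x, e_v_given_not_x, c):
    """Bit chosen by the idealised maximiser; ties go to ¬Y (arbitrary choice)."""
    utility_y = as_if_utility(True, e_v_given_x, e_v_given_not_x, c)
    utility_not_y = as_if_utility(False, e_v_given_x, e_v_given_not_x, c)
    return utility_y > utility_not_y

def read_output(output_bit, p_erase=0.001, rng=random):
    """Z=1 (erasure) happens with small probability; return None if erased."""
    z = rng.random() < p_erase
    return None if z else output_bit

# With estimates E(v|X)=8 and E(v|¬X)=5, the agent outputs Y exactly when 8 - 5 > c.
for c in (2, 3, 4):
    print(c, m_v_sharp_output(8, 5, c), 8 - 5 > c)
```

The loop prints the chosen bit next to the truth value of Q for each c, and the two columns agree. Whether a real agent can be made indifferent in the required way is the substantive assumption, and the sketch does not address it.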

Note: the advantage of this method is to allow us to get the truth out of an AI, without knowing how to define "truth".

Comments (3)

Comment author: taygetea 19 September 2015 12:45:24AM 8 points

Unrelated to this particular post, I've seen a couple people mention that all your ideas as of late are somewhat scattered and unorganized, and in need of some unification. You've put out a lot of content here, but I think people would definitely appreciate some synthesis work, as well as directly addressing established ideas about these subproblems as a way of grounding your ideas a bit more. "Sixteen main ideas" is probably in need of synthesis or merger.

Comment author: Stuart_Armstrong 19 September 2015 05:47:49AM 5 points

I agree. I think I've got to a good point to start synthesising now.

Comment author: Gunnar_Zarncke 20 September 2015 12:55:42AM 3 points

I don't think this is a very charitable view. I admit that I did propose adding a Wiki page for structure, not because of a lack of quality but rather the opposite: I see this as very valuable, albeit dry, material.

I wish more people would pick up on this important FAI (or rather UFAI-prevention) work. Can somebody propose ideas for how to improve uptake? I will start with one: reduce perceived dryness by adding examples or exercises.