A putative new idea for AI control; index here.
Partially inspired by a conversation with Daniel Dewey.
People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one fell swoop. Once you've done this a few times, it's generally easy to take these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad one very bad, and pose that as a challenge). The general responses to these kinds of ideas are listed here.
However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into it. And not only a general category with "why this can't work" attached to it, but also "these are methods that can make it work better". Seeing what is needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.
The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.
To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.
The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.
So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.
In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features of the environment, the AI design, the setup of the world, etc., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians have a good enough idea about human motivations and culture that the Manhattan Project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their ideas work.
It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.
More or less crude
The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?
The main problem is, of course, that of optimising: either optimising in the sense of maximising the measure (maximum complexity!) or of choosing whatever most extremely fits the measure's definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.
Let's start by having the AI sample a large class of utility functions. Require them to be around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.
We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.
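The sample-score-satisfice procedure can be sketched in code. This is a toy illustration only: `sample_utility` and `crude_measure` are hypothetical stand-ins for the hard parts (generating utility functions of human-value complexity, and evaluating E(μ|M(u))), not real implementations.

```python
import random

def sample_utility(seed):
    """Stand-in for sampling a utility function of roughly
    the same expected complexity as human values."""
    rng = random.Random(seed)
    return {"id": seed, "mu_score": rng.uniform(0.0, 1.0)}

def crude_measure(u):
    """Stand-in for E(mu | M(u)): the crude measure's rating of the
    world produced by a u-maximising AI."""
    return u["mu_score"]

def satisfice(candidates, threshold, band=0.05):
    """Pick candidates scoring JUST ABOVE the threshold, deliberately
    avoiding the maximum (which is where over-optimisation lives)."""
    return [u for u in candidates
            if threshold <= crude_measure(u) < threshold + band]

candidates = [sample_utility(i) for i in range(1000)]
chosen = satisfice(candidates, threshold=0.8)
```

The key design choice is in `satisfice`: we select from a band just above the threshold rather than taking the argmax, precisely to dodge Goodhart-style pressure on μ.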
How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").
One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).
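The multiple-μ filter might look like the following sketch. The three measures and the numeric bands are invented for illustration; the point is the shape of the test: moderate-but-not-suspicious scores on several diverse measures the candidate was never optimised against.

```python
def moderate_on_all(u, measures, low=0.7, high=0.95):
    """True if u scores in a moderate band on every crude measure:
    high enough to look good, low enough to not look over-optimised."""
    return all(low <= m(u) <= high for m in measures)

# Toy crude measures, reading invented per-candidate scores:
measures = [
    lambda u: u["approval"],    # simulated-human approval
    lambda u: u["diversity"],   # variety of future outcomes
    lambda u: u["stability"],   # robustness to noise
]

friendly_looking = {"approval": 0.85, "diversity": 0.80, "stability": 0.90}
over_optimised = {"approval": 0.99, "diversity": 0.30, "stability": 0.99}
```

Here `friendly_looking` passes while `over_optimised` fails: it maxes out the measures it was (implicitly) trained on and craters on the one it wasn't.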
You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".
There are a few other thoughts we might have for trying to pick a safer u:
- Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
- If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
- Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
- Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
- Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
- Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
- Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
- All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
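The last bullet's cross-scoring check can be made concrete with a toy example. The numbers below are invented stand-ins for E(u|M(v)): members of a hoped-for friendly class should score highly under each other's maximisers, while a measure-hacking utility should score poorly under the class.

```python
# cross[(u, v)] stands in for E(u | M(v)): how well utility u rates the
# world produced by a v-maximiser. All values are illustrative.
cross = {
    ("u1", "u1"): 0.95, ("u1", "u2"): 0.90, ("u1", "u3"): 0.88,
    ("u2", "u1"): 0.91, ("u2", "u2"): 0.94, ("u2", "u3"): 0.89,
    ("u3", "u1"): 0.87, ("u3", "u2"): 0.90, ("u3", "u3"): 0.96,
    ("u1", "hack"): 0.10, ("u2", "hack"): 0.05, ("u3", "hack"): 0.12,
}
friendly_class = ["u1", "u2", "u3"]

def mutually_high(cross, members, floor=0.8):
    """E(u|M(v)) >= floor for every pair u, v in the class:
    E(u|M(u)) is not too far off from E(u|M(v))."""
    return all(cross[(u, v)] >= floor for u in members for v in members)

def rejected_by_class(cross, members, outsider, ceiling=0.3):
    """Every class member rates the outsider's maximiser poorly."""
    return all(cross[(u, outsider)] <= ceiling for u in members)
```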
This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?
And with that, the various results of my AI retreat are available to all.
Suppose you have an AI with a utility u and a probability estimate P. There is a certain event X which the AI cannot affect. You wish to change the AI's estimate of the probability of X, by, say, doubling the odds ratio P(X):P(¬X). However, since it is dangerous to give an AI false beliefs (they may not be stable, for one), you instead want to make the AI behave as if it were a u-maximiser with doubled odds ratio.
Assume that the AI is currently deciding between two actions, α and ω. The expected utility of action α decomposes as:
u(α) = P(X)u(α|X) + P(¬X)u(α|¬X).
The utility of action ω is defined similarly, and the expected gain (or loss) of utility by choosing α over ω is:
u(α)-u(ω) = P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).
If we were to double the odds ratio, the expected utility gain becomes:
u(α)-u(ω) = (2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)))/Ω, (1)
for some normalisation constant Ω = 2P(X)+P(¬X), independent of α and ω.
We can reproduce exactly the same effect by instead replacing u with u', such that
- u'(·|X) = 2u(·|X)
- u'(·|¬X) = u(·|¬X)
u'(α)-u'(ω) = P(X)(u'(α|X)-u'(ω|X)) + P(¬X)(u'(α|¬X)-u'(ω|¬X)),
= 2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)). (2)
This, up to an unimportant constant, is the same equation as (1). Thus we can accomplish, via utility manipulation, exactly the same effect on the AI's behaviour as by changing its probability estimates.
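The equivalence of (1) and (2) can be checked numerically. The probabilities and utility values below are arbitrary illustrative numbers; `a` and `w` stand for the actions α and ω.

```python
p_x = 0.3  # P(X); arbitrary illustrative value

# u(action | event): arbitrary illustrative utilities.
u = {
    ("a", True): 5.0, ("a", False): 1.0,   # u(a|X), u(a|not-X)
    ("w", True): 2.0, ("w", False): 4.0,   # u(w|X), u(w|not-X)
}

omega = 2 * p_x + (1 - p_x)  # normalisation constant Omega

def gain_shifted_probability():
    """Equation (1): u(a)-u(w) under the doubled odds ratio 2P(X):P(not-X)."""
    return (2 * p_x * (u[("a", True)] - u[("w", True)])
            + (1 - p_x) * (u[("a", False)] - u[("w", False)])) / omega

def gain_scaled_utility():
    """Equation (2): u'(a)-u'(w) with u'(.|X) = 2u(.|X) and the ORIGINAL P."""
    return (p_x * (2 * u[("a", True)] - 2 * u[("w", True)])
            + (1 - p_x) * (u[("a", False)] - u[("w", False)]))

g1, g2 = gain_shifted_probability(), gain_scaled_utility()
# The two gains agree up to the constant omega, so the action
# rankings - and hence the AI's behaviour - are identical.
```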
Notice that we could also have defined
- u'(·|X) = u(·|X)
- u'(·|¬X) = (1/2)u(·|¬X)
This is just the same u', scaled.
The utility indifference and false miracles approaches were just special cases of this, where the odds ratio was sent to infinity/zero by multiplying by zero. But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P'). Changes in probability can be replicated as changes in utility.
In the previous section, we multiplied certain utilities by two. But in doing so, we implicitly used the zero point of u. Utility is invariant under translation, however, so this zero point is not actually anything significant.
It turns out that we don't need to care about this - any zero will do, what matters simply is that the spread between options is doubled in the X world but not in the ¬X one.
But that relies on the AI being unable to affect the probability of X and ¬X itself. If the AI has an action that will increase (or decrease) P(X), then it becomes very important where we set the zero before multiplying. Setting the zero in a different place is isomorphic with adding a constant to the X world and not the ¬X world (or vice versa). Obviously this will greatly affect the AI's preferences between X and ¬X.
One way of avoiding the AI affecting X is to set this constant so that u'(X)=u'(¬X), in expectation. Then the AI has no preference between the two situations, and will not seek to boost one over the other. However, note that u'(X) is an expected utility calculation. Therefore:
- Choosing the constant so that u'(X)=u'(¬X) requires accessing the AI's probability estimate P for various worlds; it cannot be done from outside, by multiplying the utility, as the previous approach could.
- Even if u'(X)=u'(¬X), this does not mean that u'(X|Y)=u'(¬X|Y) for every event Y that could happen before X does. Simple example: X is a coin flip, and Y is a bet placed on that flip by someone the AI doesn't like.
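The coin-flip example from the second bullet can be worked through concretely. This is a toy sketch with invented numbers: X is a fair coin landing heads, Y is whether a disliked person bets on heads, and the original u is -1 exactly when that person wins.

```python
p_bet = 0.5  # P(Y = bet placed), independent of the coin

def u(x_heads, y_bet):
    """Original utility: -1 if the disliked bettor wins (bet AND heads)."""
    return -1.0 if (x_heads and y_bet) else 0.0

# Expected utilities conditional on the coin alone:
e_u_heads = p_bet * u(True, True) + (1 - p_bet) * u(True, False)    # -0.5
e_u_tails = p_bet * u(False, True) + (1 - p_bet) * u(False, False)  #  0.0

# Pick the constant added to every X (heads) world so that
# E[u'|X] = E[u'|not-X]:
c = e_u_tails - e_u_heads

def u_prime(x_heads, y_bet):
    return u(x_heads, y_bet) + (c if x_heads else 0.0)

# Unconditionally, the AI is now indifferent between heads and tails:
e_up_heads = p_bet * u_prime(True, True) + (1 - p_bet) * u_prime(True, False)
e_up_tails = p_bet * u_prime(False, True) + (1 - p_bet) * u_prime(False, False)

# ...but once Y (the bet) is observed, indifference breaks:
# u'(heads|bet) = -0.5 while u'(tails|bet) = 0.0, so the AI would
# now pay to influence the coin.
```

This is exactly why a single constant is not enough: the indifference only holds in expectation over Y, not conditional on it.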
This explains all the complexity of the utility indifference approach, which is essentially trying to decompose possible universes (and adding constants to particular subsets of universes) to ensure that u'(X|Y)=u'(¬X|Y) for any Y that could happen before X does.