A putative new idea for AI control; index here.

I just had a talk with Victoria Krakovna about reducing side effects for an AI, and though there are similarities with low impact, there are some critical ways in which the two differ.

# Distance from baseline

Low impact and low side effects use a similar distance approach: some ideal baseline world, or set of worlds, is defined, and the distance between that baseline and the actual world is computed. This distance is then used as a penalty term that the AI should minimise (while still achieving its objectives).
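In schematic form (this is just one way of writing it down, not either approach's official formalism): with $U$ the AI's objective, $\pi$ its policy, $w_{\pi}$ the resulting world, $w_0$ the baseline, $d$ the distance measure, and $\lambda$ a penalty scale, the AI optimises something like

$$\pi^{*} = \arg\max_{\pi}\; \Big[\, U(w_{\pi}) - \lambda\, d(w_{\pi}, w_{0}) \,\Big].$$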

One useful measure of distance is to list a huge number of variables (stock prices, air pressure, odds of films winning Oscars...) and penalise large deviations in those variables. Every variable is given a certain weight, and the weighted sum is the total distance metric (there are more complicated versions, but this simple metric will suffice for the moment).
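As a minimal sketch of that metric (the variable names, values and weights below are purely illustrative, not drawn from any actual proposal):

```python
# Sketch of a weighted-sum distance between the actual world and a
# baseline world, each summarised by a dictionary of tracked variables.
# All names and numbers are illustrative.

def weighted_distance(actual, baseline, weights):
    """Sum of weighted absolute deviations across all tracked variables."""
    return sum(
        weights[var] * abs(actual[var] - baseline[var])
        for var in weights
    )

baseline = {"stock_index": 5000.0, "air_pressure_hpa": 1013.0, "venus_temp_c": 464.0}
actual   = {"stock_index": 5010.0, "air_pressure_hpa": 1013.2, "venus_temp_c": 465.0}
weights  = {"stock_index": 0.01,   "air_pressure_hpa": 1.0,    "venus_temp_c": 0.5}

print(weighted_distance(actual, baseline, weights))  # the total penalty term
```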

# What's the weather like on Venus?

The two approaches weigh the variables differently, and for different purposes. Suppose that one of the variables is the average surface temperature of Venus.

Now suppose that the temperature of Venus increases by 1 degree during the operation of the AI. For most low side effects AIs, this is perfectly acceptable. Suppose the AI is aiming to cure cancer. Then as we formalise negative side effects, we start to include things like human survival, human happiness and flourishing, and so on. Temperature changes on distant planets are certainly not prioritised.

And the AI would be correct in that assessment. We would be perfectly happy to accept a cure for cancer in exchange for a small change to Venusian weather. And a properly trained AI, intent on minimising bad side effects, should agree with us. So the weight of the "temperature on Venus" variable will be low for such an AI.

In contrast, a low impact AI sees a temperature change on Venus as an utter disaster - Venusian temperature is likely to be much more important than anything human or anything on Earth. The reason is clear: only an immensely powerful AI could affect something as distant and as massive as that. If Venusian temperature changes strongly as a result of AI action, the low impact containment has failed completely.
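To make the contrast concrete (the numbers are invented for illustration): the same 1 degree deviation on Venus gets a near-zero weight from a low side effects AI, and an enormous weight from a low impact one, so only the latter treats it as a catastrophe.

```python
# Illustrative only: the same 1-degree deviation on Venus under two
# different weightings of the "venus_temp_c" variable.
venus_deviation = 1.0  # degrees

side_effects_weight = 1e-6   # low side effects: Venus barely matters
low_impact_weight   = 1e6    # low impact: any change there signals containment failure

print("low side effects penalty:", side_effects_weight * venus_deviation)  # 1e-06
print("low impact penalty:      ", low_impact_weight * venus_deviation)    # 1000000.0
```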

# Circling the baseline from afar

The two approaches also differ in how attainable their baseline is. The low impact approach defines a baseline world which the AI can achieve by just doing nothing. In fact, it's more a no-impact measure than a low impact one (hence the need for tricks to get actual impact).

The do-nothing baseline and the tricks mean that we have a clear vision of what we want a low impact AI to do (have no general impact, except in this specific way we allow).

For low side effects, picking the baseline is trickier. We might define a world where there is, say, no cancer and no terrible side effects, or a baseline set of such worlds.

But unlike the low impact case, we're not confident the AI can achieve a world that's close to the baseline. And when the world is some distance away, things can get dangerous. This is because the AI is not achieving a good world, but trading off the different ways of being far from such a world - and the tradeoff might not be one we like, or understand.

A trivial example: maybe the easiest way for the AI to get closer to the baseline is to take control of the brains of all humans. Sure, it pays a cost in a few variables (metal in people's brains, maybe?), but it can then orchestrate all human behaviour to get close to the baseline in all other ways.
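A toy version of that failure (all numbers invented): if "metal in brains" carries a modest weight and the other variables carry large ones, the weighted-sum minimiser prefers the brain-control action.

```python
# Toy illustration (all numbers invented): under a weighted-sum distance,
# an action that badly violates one low-weight variable but keeps every
# other variable near the baseline can score better than a hands-off action
# that leaves many variables far from the baseline.

weights = {"metal_in_brains": 1.0, "cancer_rate": 10.0, "human_behaviour": 10.0}

# Deviations from the baseline world under two candidate actions.
deviations = {
    "modest_cure_attempt": {"metal_in_brains": 0.0,  "cancer_rate": 5.0, "human_behaviour": 3.0},
    "control_all_brains":  {"metal_in_brains": 50.0, "cancer_rate": 0.1, "human_behaviour": 0.1},
}

def penalty(devs):
    return sum(weights[var] * abs(dev) for var, dev in devs.items())

for action, devs in deviations.items():
    print(action, penalty(devs))
# modest_cure_attempt -> 80.0, control_all_brains -> 52.0: the brain-control
# action wins under this metric, which is exactly the worry.
```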

It seems relevant to mention here that problems like AI manipulation and sub-agent creation are really hard to define and deal with, suggesting that it's hard to rule out those kinds of examples.
