For an overview of the problem of Optimization Regularization, or Mild Optimization, I refer to MIRI's paper Alignment for Advanced Machine Learning Systems, section 2.7.
My solution
Start with a bounded utility function, $U$, that is evaluated based on the state of the world at a single time $T$ (ignoring for now that simultaneity is ill-defined in Relativity). Examples:
- If a human at time $0$ (at the start of the optimization process) is shown the world state at time $T$, how much would they like it (mapped to the interval $[0, 1]$)?
Then maximize $U(T) - \lambda T$, where $\lambda$ is a regularization parameter chosen by the AI engineer, and $T$ is a free variable chosen by the AI.
Time is measured from the start of the optimization process. Because the utility is evaluated based on the world at time $T$, this value is the amount of time the AI spends on the task. It is up to the AI to decide how much time it wants. Choosing $T$ should be seen as part of choosing the policy, or be included in the action space.
Because the utility function is bounded, the optimization process will eventually hit diminishing returns, and the AI will then choose to terminate, because of the time penalty.
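As a toy illustration (the utility curve and all numbers below are made up for the example, not part of the proposal), here is how that trade-off plays out for an assumed diminishing-returns $U$ and a brute-force search over $T$:

```python
import numpy as np

# Made-up stand-in for U(T): bounded in [0, 1] with diminishing returns.
# The real U would come from the (hard) problem of evaluating how much a
# human at time 0 likes the world state at time T.
def utility(T):
    return 1.0 - np.exp(-0.1 * T)

lam = 0.01  # regularization parameter lambda, chosen by the AI engineer

# The AI treats T as a free variable and picks the stopping time that
# maximizes U(T) - lambda * T.
candidate_T = np.linspace(0.0, 200.0, 2001)
scores = utility(candidate_T) - lam * candidate_T
best_T = candidate_T[np.argmax(scores)]

print(f"chosen T = {best_T:.1f}, U at that point = {utility(best_T):.3f}")
# The AI stops roughly where the marginal utility per unit of time drops
# below lambda, instead of optimizing forever.
```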
Why time penalty?
Unbounded optimization pressure is dangerous. Without any form of regularization, we need to get the alignment exactly right. However, with regularization we merely need to get it almost exactly right, which I believe is much easier.
However, impact regularization has turned out to be very hard. We don't want the impact measure to depend on the AI's understanding of human values, because then it would not provide extra safety. But a value-neutral impact measure is almost impossible, because the world has too many degrees of freedom. Time, however, is both value neutral and has only a single degree of freedom.
Why not use a fixed finite time horizon?
The reason $T$ is a variable chosen by the AI, and not a constant chosen by us, is that we don't know when the optimization process will start hitting diminishing returns. Leaving $T$ up to the AI solves this dynamically. In addition, we still get to choose a maximum time horizon through our choice of $\lambda$ and the bound on $U$: the AI will never keep going beyond $T_{\max} = U_{\max} / \lambda$.
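To spell out why (assuming, as in the example above, that $U$ takes values in $[0, 1]$, so $U_{\max} = 1$): for any $T > 1/\lambda$,

$$U(T) - \lambda T \;\le\; 1 - \lambda T \;<\; 0 \;\le\; U(0) - \lambda \cdot 0,$$

so stopping immediately already beats any stopping time beyond $1/\lambda$.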
What happens after time $T$?
That depends on the AI design. Since we don't yet know how to build a general intelligence, we also don't know what will happen after time $T$ in some default scenario.
However, we probably don't want a strong optimizer sitting around with no goal, because it is very unclear what that thing will do. Or if we accidentally give it some tiny incentive, it might then maximize that until the end of time.
E.g. if the AI has any uncertainty about whether it has reached time $T$ or not, it will keep maximizing $U$ conditioned on $t < T$, because those are the only worlds that count. As $t < T$ becomes less and less likely, the AI will act more and more crazy.
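To make this concrete (assuming, for simplicity, that the AI's actions do not change its credence that $T$ has passed): in worlds where $T$ has already passed, the utility was already determined and current actions cannot affect it, so

$$\operatorname*{arg\,max}_a \mathbb{E}[U \mid a] \;=\; \operatorname*{arg\,max}_a \Big( P(t < T)\,\mathbb{E}[U \mid a,\, t < T] + P(t \ge T)\, c \Big) \;=\; \operatorname*{arg\,max}_a \mathbb{E}[U \mid a,\, t < T],$$

where $c$ does not depend on $a$. The conditioning on $t < T$ survives no matter how small $P(t < T)$ becomes.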
A way to solve this is to include an incentive for turning itself off, e.g. maximize $U(T) \cdot \mathbb{1}[\text{AI is turned off at time } T] - \lambda T$.
Unfortunately, I don't know of any good objective way to define "is turned off". The best definition I have thought of so far is:
Definition: AI is turned off at time $T$ = If a human at time $0$ is shown the world state at time $T$, they would agree that the AI and all its subagents and successor agents are turned off.
And if we are referring to human judgement anyway, we might as well throw in some more desirable things. Maximize $U(T) \cdot \mathbb{1}[\text{AI is turned off and the world is OK at time } T] - \lambda T$.
Definition: AI is turned off and the world is OK at time $T$ = If a human at time $0$ is shown the world state at time $T$, they would agree that the AI and all its subagents and successor agents are turned off, and that the world at time $T$ is not significantly worse, or in greater danger, than at time $0$.
Note that "the world is OK" is not necessary for the regularization to work. But I would still recommend including some explicit optimization pressure towards not destroying the world, either in $U$ or as an extra term. The regularization mainly stops the AI from Goodharting too hard; it does not do much to reduce side effects you have not even tried to specify.
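For concreteness, here is a minimal sketch of what evaluating this modified objective could look like. The functions human_utility and judged_off_and_ok are hypothetical placeholders for the human-judgement oracles defined above (which we do not know how to implement), and the toy world-state fields are made up:

```python
LAMBDA = 0.01  # time penalty per unit time, chosen by the AI engineer

def human_utility(world_state: dict) -> float:
    # Hypothetical stand-in for "how much would a human at time 0 like this
    # world state", mapped to [0, 1]. This is the hard, unsolved part; here
    # it just reads a toy field.
    return float(world_state.get("task_quality", 0.0))

def judged_off_and_ok(world_state: dict) -> bool:
    # Hypothetical stand-in for the definition above: a human at time 0,
    # shown this world state, would agree the AI and all its subagents and
    # successors are off, and the world is not significantly worse or in
    # greater danger than at time 0.
    return bool(world_state.get("ai_off", False)) and bool(world_state.get("world_ok", False))

def objective(world_state_at_T: dict, T: float) -> float:
    # U(T) * 1[AI is turned off and the world is OK at T] - lambda * T
    u = human_utility(world_state_at_T) if judged_off_and_ok(world_state_at_T) else 0.0
    return u - LAMBDA * T

# A run that finishes the task, shuts down, and leaves the world intact
# scores higher than one that is still running at time T.
print(objective({"task_quality": 0.9, "ai_off": True, "world_ok": True}, T=30))    # ~0.6
print(objective({"task_quality": 0.95, "ai_off": False, "world_ok": True}, T=30))  # ~-0.3
```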
Some open problems
How is time measured?
I think it is best if time refers to real physical time, and not clock ticks or number of computing operations. This is just an intuition at this point, but it seems to me like we get a better overall optimization regularizer if we penalize both computation and execution, because that is less likely to have loopholes. E.g. penalizing physical time is robust under delegation.
How to make this compatible with General Relativity?
If $T$ measures physical time, then this is ill-defined in GR, and since our universe is probably described by GR or something similar, this is a big problem.
Is there a better way to define "is turned off"?
It would be nice to have a definition of "is turned off" that does not rely on humans' ability to judge this, or on the AI's ability to model humans.
"world is OK" is clearly a value statement, so for this part we will have to rely on some sort of value learning scheme.
Acknowledgements
This suggestion is inspired by, and partly based on, ARLEA and discussions with John Maxwell. The idea was further developed in discussion with Stuart Armstrong.
If we ignore subagents and imagine a Cartesian boundary, "turned off" can easily be defined as all future outputs being 0.
I also doubt that an AI working ASAP is safe in any meaningful sense. Of course you can move all the magic into "human judges world ok". If you make lambda large enough, your AI is safe and useless.
Suppose the utility function is 1 if a widget exists, else 0, where a widget is an easily buildable object that does not currently exist.
Suppose that ordering the parts through normal channels would take a few weeks. If the AI hacks the nukes and holds the world to ransom, then everyone at the widget factory will work nonstop and then drop dead of exhaustion.
Alternately, it might be able to bootstrap self-replicating nanotech in less time. The AI has no reason to care if the nanotech that makes the widget is highly toxic, and no reason to care if it has a shutoff switch or grey-goos the earth after the widget is produced.
"World looks OK at time T" is not enough; you could still get something bad arising from the way seemingly innocuous parts were set up at time T. Being switched off and having no subagents in the conventional sense isn't enough either. What if the AI changed some physics data in such a way that humans would collapse the quantum vacuum state, believing the experiment they were doing was safe? Building a subagent is just a special case of having unwanted influence.