One weird trick to turn maximisers into minimisers

Stuart_Armstrong

A putative new idea for AI control; index here.

A simple and easy design for a u-maximising agent that turns into a u-minimising one.

Let X be some boolean random variable outside the agent's control, that will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ε, and for a given utility u, consider the utility:

u^# = (2/ε)Xu - u.

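Before t, the expected value of (2/ε)X is 2, so (assuming X is independent of u) the expected value of u^# equals the expected value of u. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0. Given evidence e, the expected coefficient on u is (2/ε)P(X=1|e) - 1, which turns negative once P(X=1|e) falls below ε/2. So a little bit of evidence for X=0 is enough to turn the agent into a u-minimiser.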
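The sign flip can be checked numerically. This is a minimal sketch, assuming X is independent of u so that the expected value of u^# is just a multiple of the expected value of u. The helper name `u_coefficient` and the choice ε = 0.01 are hypothetical, picked only for illustration.

```python
def u_coefficient(p_x1, eps):
    """Expected multiplier on u in u^# = (2/eps)*X*u - u.

    p_x1 is the agent's current probability that X = 1.
    Assumes (for illustration) that X is independent of u.
    """
    return (2.0 / eps) * p_x1 - 1.0


eps = 0.01  # hypothetical prior P(X=1)

# Before t: the agent still believes P(X=1) = eps, so the multiplier is about +1.
print("prior:", u_coefficient(eps, eps))

# After some evidence for X=0: once P(X=1) is below eps/2 the multiplier is negative.
print("posterior eps/4:", u_coefficient(eps / 4, eps))

# Certainty that X=0: the multiplier is -1, a pure u-minimiser.
print("posterior 0:", u_coefficient(0.0, eps))
```

A positive multiplier means the agent acts as a u-maximiser, a negative one as a u-minimiser; the crossover sits at P(X=1) = ε/2.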

This isn't perfect corrigibility - the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:

u^# = Ω(2/ε)Xu - u.

If Ω is large, then the agent is willing to pay very little u-value to maintain flexibility. However, the amount of evidence for X=0 that it needs in order to become a u-minimiser grows in proportion to Ω (the probability of X=1 must fall below ε/(2Ω)), so X had better be a clear and convincing event.
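To see how the evidence requirement scales with Ω, here is a rough sketch under the same illustrative assumption that X is independent of u. The function names and the values of ε and Ω are hypothetical. It computes the probability of X=1 below which the coefficient on u goes negative, and the likelihood ratio in favour of X=0 needed to get there from the prior ε.

```python
def flip_threshold(eps, omega):
    """P(X=1) below which omega*(2/eps)*P(X=1) - 1 is negative."""
    return eps / (2.0 * omega)


def bayes_factor_needed(eps, omega):
    """Likelihood ratio favouring X=0 that moves P(X=1) from eps to the threshold."""
    prior_odds = eps / (1.0 - eps)
    thr = flip_threshold(eps, omega)
    threshold_odds = thr / (1.0 - thr)
    return prior_odds / threshold_odds


eps = 0.01  # hypothetical prior P(X=1)
for omega in (1, 10, 100):
    print(omega, flip_threshold(eps, omega), bayes_factor_needed(eps, omega))
```

Algebraically the required likelihood ratio is (2Ω - ε)/(1 - ε), roughly 2Ω for small ε, which is the linear dependence on Ω described above.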

I see - I thought you meant it would run for a week building infrastructure, and then be able to build paperclips on the first Monday you named.

I'm not sure what you WANT it to do, really. Do you want it to actually sabotage itself on Tuesday, or do you want it to keep on building infrastructure for later paperclip construction?

Under the system I built, it would do absolutely nothing different on Tuesday and would continue to build infrastructure, because it anticipates wanting more paperclips by the time it is able to build them at the end of the week. It wants few paperclips now, but it has no influence over paperclips now. It has influence over paperclips in the future, and it wants there to be more of them when that time comes.

I'm trying to implement value change (see e.g. http://lesswrong.com/lw/jxa/proper_value_learning_through_indifference/ ). The change from u to -u is the easiest example of such a change. The ideal - which probably can't be implemented in a standard utility function - is a u-maximiser that's indifferent to becoming a -u maximiser, which is then indifferent to further change, etc...
