u is a function of paperclips, which is in turn a function of time. So u(p(t)) is the utility of having p(t) paperclips at time t.
U = ∫[some reasonable bounds] p(t)·w(t) dt, where w(t) = -1 if t falls within the first Tuesday and +1 otherwise.
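To make the weighting concrete, here's a minimal discrete-time sketch of that integral (the function names and toy plans are my own illustration, not from the comment):

```python
# Toy discrete-time version of U = integral of p(t) * w(t) dt,
# where p(t) is the paperclip count at time t and w(t) is -1
# during the first Tuesday and +1 otherwise.

def weight(t, tuesday=1):
    """-1 on the first Tuesday (here: timestep 1), +1 otherwise."""
    return -1 if t == tuesday else 1

def total_utility(paperclip_counts):
    """Sum of p(t) * w(t) over timesteps (Mon=0, Tue=1, Wed=2, ...)."""
    return sum(p * weight(t) for t, p in enumerate(paperclip_counts))

# A plan that builds steadily is penalised for the paperclips that
# already exist on Tuesday; delaying production avoids the penalty.
steady = [1, 2, 3, 4]    # paperclip count on Mon, Tue, Wed, Thu
delayed = [0, 0, 3, 4]   # build infrastructure first, paperclips later
print(total_utility(steady), total_utility(delayed))  # 6 vs 7
```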
So the AI knows what it wants over all of the future, as a function of time. When evaluating plans for the future, it's able to take that sign change into account.
Like, it might spend both Monday and Tuesday just building infrastructure. In any case, turning itself off won't help on Tuesday: it will still know that there were paperclips then, and not being around to observe them won't change that.
I don't see exactly how that would work - it can't build paperclips during the first week, so u(p(t)) = 0 during that period. Shouldn't it therefore behave exactly as if nothing special happened on Tuesday?
And my comment on turning itself off for Tuesday was more that the Monday AI wouldn't want its infrastructure ruined by the Tuesday version, and would just turn itself off to prevent that.
A putative new idea for AI control; index here.
A simple and easy design for a u-maximising agent that turns into a u-minimising one.
Let X be some boolean random variable outside the agent's control, to be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1) = ε, and for a given utility u, consider the utility:

u# = ((2/ε)X - 1)·u
Before t, the expected value of (2/ε)X is 2, so in expectation u# = u. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0, so a little bit of evidence to that effect - enough to push P(X=1) below ε/2 - is enough to make u# into a u-minimiser.
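As a sanity check on that claim, here's a small sketch (function name is my own) of the expected multiplier on u under a given credence P(X=1):

```python
# u# = ((2/eps) * X - 1) * u, with prior P(X=1) = eps.
# E[u#] = ((2/eps) * P(X=1) - 1) * u, so the multiplier's sign
# flips as soon as P(X=1) drops below eps/2.

def u_sharp_multiplier(p_x1, eps):
    """Expected coefficient on u given current credence P(X=1)."""
    return (2 / eps) * p_x1 - 1

eps = 0.01
print(u_sharp_multiplier(eps, eps))      # before t: 1.0, so E[u#] = u
print(u_sharp_multiplier(eps / 3, eps))  # mild evidence for X=0: -1/3, a u-minimiser
```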
This isn't perfect corrigibility - the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t, since it expects to become a u-minimiser with probability 1-ε. To combat this effect, we could instead use:

u# = ((Ω+1)X/ε - 1)·u/Ω
If Ω is large, then the agent is willing to pay only very little u-value to maintain flexibility. However, the amount of evidence of X=0 that it needs to become a u-minimiser is also proportional to Ω - the sign only flips once P(X=1) drops below ε/(Ω+1) - so X had better be a clear and convincing event.
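The same check for the Ω-scaled version (using the formula as reconstructed above, so treat the exact form as an assumption):

```python
# u# = ((Omega + 1) * X / eps - 1) * u / Omega.
# Before t the expected multiplier is exactly 1; after t, with X = 0,
# it is only -1/Omega, so flexibility is cheap to give up. The sign
# flips only once P(X=1) falls below eps / (Omega + 1).

def omega_multiplier(p_x1, eps, omega):
    """Expected coefficient on u for the Omega-scaled utility."""
    return ((omega + 1) * p_x1 / eps - 1) / omega

eps, omega = 0.01, 1000
print(omega_multiplier(eps, eps, omega))        # before t: 1.0
print(omega_multiplier(0.0, eps, omega))        # X known to be 0: -0.001
threshold = eps / (omega + 1)
print(omega_multiplier(threshold, eps, omega))  # exactly at the flip: 0.0
```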