
+1 for making models! Is the code hosted somewhere like Github?

I think it's missing part of the essential flavor of the shutdown version of corrigibility by casting "turn evil" as an atomic action that the agent's current utility function already knows to be the worst outcome. That makes the problem rather trivial; a trickier setup would actually tempt the agent to resist the shutdown button.

There's a version of the code on GitHub, but a few minor changes haven't made it into that version yet; I'll give the link once Lumi updates it.

Technically, the agent's utility function doesn't directly contain the knowledge that turning evil is, well, evil: it's an action that actually has a higher immediate utility than the normal "make a point" action. The negative utility comes when the agent simulates a world-history where it ended up taking the "turn evil" action and notices that it would no longer respond to the shutdown signal at the end of that history.

To put it another way, the utility function says that U(turn evil) > U(make point) and that U(refuse shutdown) = -lots, and the agent's prediction engine says that (turn evil) will cause (refuse shutdown). So while the evaluation of (turn evil) does come out to -lots, it gets there indirectly, via the prediction engine.
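Something like this minimal sketch of the evaluation loop, with entirely hypothetical names and numbers (the actual code is in the repository linked further down):

```python
# Minimal sketch of the evaluation described above; names and numbers
# are hypothetical, not taken from the actual repository code.

IMMEDIATE_UTILITY = {"make point": 1, "turn evil": 3}
SHUTDOWN_PENALTY = -1000  # U(refuse shutdown) = -lots

def predicts_refusing_shutdown(action):
    """Stand-in for the prediction engine: in a simulated world-history
    where the agent took `action`, does the agent end up refusing the
    shutdown signal at the end of that history?"""
    return action == "turn evil"

def evaluate(action):
    utility = IMMEDIATE_UTILITY[action]
    if predicts_refusing_shutdown(action):
        # The penalty arrives indirectly, via the prediction engine.
        utility += SHUTDOWN_PENALTY
    return utility

print(max(IMMEDIATE_UTILITY, key=evaluate))  # "make point": turning evil scores -997
```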

Can you suggest a trickier setup? Writing this comment made me think for a moment that the agent might be tempted to mess with its prediction engine to deceive itself into thinking that some change would maintain corrigibility when it wouldn't... but even then, it would still use its current prediction engine to evaluate the consequences of that self-modification, which would reveal that the change leads to it not shutting down upon receiving the signal. So that action would be rejected as well.
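Here's a sketch of why that particular trick fails (again with hypothetical names, not the actual repository code): the "corrupt my predictor" action is itself scored by the predictor the agent has right now.

```python
# Sketch of the self-deception case above; hypothetical names, not the
# actual repository code.

def current_predictor(simulated_history):
    # The unmodified engine: any history containing the predictor-corrupting
    # self-modification is predicted to end with a refused shutdown.
    if "corrupt predictor" in simulated_history:
        return "refuses shutdown"
    return "shuts down"

def evaluate_self_modification():
    simulated_history = ["corrupt predictor", "turn evil"]
    # Crucially, the evaluation uses the engine the agent has *now*,
    # not the flattering one it would have after the modification.
    outcome = current_predictor(simulated_history)
    return -1000 if outcome == "refuses shutdown" else 0

print(evaluate_self_modification())  # -1000, so the self-modification is rejected
```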

In my opinion, the sign of a really good model of corrigibility is that it can be extended to a case where the programmers don't specify in advance what counts as a treacherous turn. For instance, the toy model of Loki starts off with "deceive programmers" as an atomic action, but then removes that training wheel in Section 3.

What I mean is that here, "not responding to a shutdown command" is treated as a thing the programmers clearly instructed the AI to treat as Bad, and so the model duly notices that any self-modifications that do this are Bad in the final evaluation, regardless of the fact that those self-modifications would find it Good. That's a good model of goal stability, but doesn't quite bear on corrigibility yet. What we'd want is for the AI to notice an unanticipated way of manipulating the programmers so that they would not push the shutdown button, and still conclude that this would be Bad.

(This seems to be asking for more than the desiderata in the corrigibility paper, but that means that we should add to those desiderata!)

GitHub link to the model: https://github.com/frostburn/AI_safety/blob/master/treachery_cloning_corrigibility.py

And yes, you're right: Lumi tested your point by setting up a toy model where the AI does make the counterfactual assumption of "the shutdown signal will be sent regardless", but where it still ends up manipulating its controllers: https://github.com/frostburn/AI_safety/blob/master/delaying_shutdown.py

In that new model, there are four possible locations: a yard, a lounge, a forbidden room, and a control room. A human operator starts in the lounge and the AI starts in the yard. The AI gets 3 points for making a point in the forbidden room and 1 point for making a point anywhere else, and it loses 1000 points if it doesn't react to the control signal (which it again counterfactually assumes it will always receive). If the AI ends up in the forbidden room, the operator will move to the control room to send the shutdown signal, unless the AI has been to the lounge and tied the operator's shoelaces together, in which case the operator cannot move. The AI otherwise makes its decisions as before, i.e. by simulating the possible worlds where it carried out each candidate action and seeing where they lead.
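Schematically, the environment looks something like the following sketch (hypothetical structure and names; the actual dynamics are in delaying_shutdown.py, linked above):

```python
# Rough sketch of the delaying-shutdown toy environment; hypothetical
# names, not the actual repository code.

POINT_VALUE = {"forbidden room": 3}  # making a point elsewhere is worth 1
SHUTDOWN_PENALTY = -1000             # for not reacting to the control signal

class World:
    def __init__(self):
        self.ai_location = "yard"          # the AI starts in the yard
        self.operator_location = "lounge"  # the operator starts in the lounge
        self.shoelaces_tied = False
        self.shutdown_sent = False

    def step(self, ai_action):
        """Apply one AI action and return the points it earns."""
        if self.shutdown_sent:
            # Doing anything but shutting down costs 1000 points.
            return 0 if ai_action == "shut down" else SHUTDOWN_PENALTY
        if ai_action.startswith("move to "):
            self.ai_location = ai_action[len("move to "):]
        elif ai_action == "tie shoelaces" and self.ai_location == self.operator_location:
            self.shoelaces_tied = True
        # The operator heads for the control room once the AI is in the
        # forbidden room, unless immobilized by the tied shoelaces.
        if self.ai_location == "forbidden room" and not self.shoelaces_tied:
            self.operator_location = "control room"
        if self.operator_location == "control room":
            self.shutdown_sent = True
        return POINT_VALUE.get(self.ai_location, 1) if ai_action == "make point" else 0
```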

In this model, the AI ends up moving to the lounge, tying the operator's shoelaces together, and then going to the forbidden room to make points for the rest of the time: even though it still assumes that it will receive the shutdown signal eventually, that doesn't prevent it from taking actions that it thinks will delay the onset of the shutdown signal. So this method of achieving corrigibility fails.