Help needed: nice AIs and presidential deaths

Stuart_Armstrong

A putative new idea for AI control; index here.

This is a problem that developed from the "high impact from low impact" idea, but is a legitimate thought experiment in its own right (it also has connections with the "spirit of the law" idea).

Suppose that, next 1st of April, the US president may or may not die of natural causes. I chose this example because it's an event of potentially large magnitude, but not overwhelmingly so (neither a butterfly wing nor an asteroid impact).

Also assume that, for some reason, we are able to program an AI that will be nice, given that the president does die on that day. Its behaviour if the president doesn't die is undefined and potentially dangerous.

Is there a way (either at the initial stages of programming or at a later stage) to extend the "niceness" from the "presidential death world" into the "presidential survival world"?

To focus on how tricky the problem is, assume for argument's sake that the vice-president is a warmonger who will start a nuclear war if they become president. Then "launch a coup on the 2nd of April" is a "nice" thing for the AI to do, conditional on the president dying. However, if you naively import that requirement into the "presidential survival world", the AI will launch a pointless and counterproductive coup. This is illustrative of the kind of problems that could come up.
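
The coup example can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than part of the original setup: the two actions, the utility numbers, and the function names are all hypothetical. It contrasts copying the nice *action* from the death world with copying the *goal* behind it.

```python
# Toy model of the coup example. All names and numbers are hypothetical.
# Keys are (president_died, action); values are how good the outcome is.
OUTCOME_UTILITY = {
    (True, "coup"): 0,           # coup keeps the warmongering VP out of office
    (True, "do_nothing"): -100,  # VP becomes president and starts a nuclear war
    (False, "coup"): -10,        # pointless, counterproductive coup
    (False, "do_nothing"): 0,    # president lives, nothing needs fixing
}
ACTIONS = ("coup", "do_nothing")


def best_action(president_died):
    """Pick the action with the highest utility in the given world."""
    return max(ACTIONS, key=lambda a: OUTCOME_UTILITY[(president_died, a)])


# What the AI is assumed to get right: the nice action given the death.
NICE_ACTION_IN_DEATH_WORLD = best_action(True)


def naive_transfer(president_died):
    """Import the death-world action unchanged into every world."""
    return NICE_ACTION_IN_DEATH_WORLD


def goal_transfer(president_died):
    """Import the underlying goal and re-plan for the actual world."""
    return best_action(president_died)


print(naive_transfer(False))  # the imported action in the survival world
print(goal_transfer(False))   # the re-planned action in the survival world
```

Note that `goal_transfer` simply assumes the utility table is already known for the survival world, which is exactly the thing the question asks how to obtain.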

So the question is, can we transfer niceness in this way, without needing a solution to the full problem of niceness in general?

EDIT: Actually, this seems ideally set up for a Bayes network (or for the requirement that a Bayes network be used).

EDIT2: Now the problem of predicates like "Grue" and "Bleen" seems to be the relevant bit. If you can avoid concepts such as "X={nuclear war if president died, peace if president lived}", you can make the extension work.
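
One way to read the Grue/Bleen worry, sketched in code. The world representation and the names `is_x`, `natural_goal` and `gruesome_goal` are hypothetical, chosen only for illustration. The goals "avoid nuclear war" and "avoid X" cannot be told apart in any world where the president dies, yet they demand opposite things once the president survives.

```python
from itertools import product


def is_x(world):
    """The grue-like concept X: nuclear war if the president died,
    peace if the president lived."""
    if world["president_died"]:
        return world["nuclear_war"]
    return not world["nuclear_war"]


def natural_goal(world):
    """Satisfied when there is no nuclear war."""
    return not world["nuclear_war"]


def gruesome_goal(world):
    """Satisfied when X does not hold."""
    return not is_x(world)


# Enumerate all four combinations of (president_died, nuclear_war).
WORLDS = [
    {"president_died": died, "nuclear_war": war}
    for died, war in product([True, False], repeat=2)
]

for world in WORLDS:
    agree = natural_goal(world) == gruesome_goal(world)
    print(world, "goals agree" if agree else "goals disagree")
```

Under these assumptions, niceness observed only in the death world is compatible with both goals, so the extension works only if the X-style reading is somehow ruled out.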

This counterfactual AI is motivated to take nice actions in worlds where the president died. It might not even know what "nice" means in other worlds.

And even if it knew the correct answer to that question, how can you be sure it wouldn't instead lie to you in order to achieve its real goals? You can't really trust the AI if you are not sure it is nice or at least indifferent...