A putative new idea for AI control; index here.
This is a problem that developed from the "high impact from low impact" idea, but is a legitimate thought experiment in its own right (it also has connections with the "spirit of the law" idea).
Suppose that, on the next 1st of April, the US president may or may not die of natural causes. I chose this example because it's an event of potentially large magnitude, but not overwhelmingly so (neither a butterfly wing nor an asteroid impact).
Also assume that, for some reason, we are able to program an AI that will be nice, given that the president does die on that day. Its behaviour if the president doesn't die is undefined and potentially dangerous.
Is there a way (either at the initial programming stage or at a later one) to extend the "niceness" from the "presidential death world" into the "presidential survival world"?
To see how tricky the problem is, assume for the sake of argument that the vice-president is a warmonger who will start a nuclear war if they become president. Then "launch a coup on the 2nd of April" is a "nice" thing for the AI to do, conditional on the president dying. However, if you naively import that requirement into the "presidential survival world", the AI will launch a pointless and counterproductive coup. This illustrates the kind of problem that could come up.
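A minimal sketch of that failure mode (the world variables and action names below are invented for illustration, not part of any real system): a policy verified to be nice only in the "presidential death world" is naively reused in the "presidential survival world".

```python
def nice_policy_given_death(world):
    """Policy verified to be nice ONLY in worlds where the president died."""
    if world["vp_is_warmonger"]:
        return "launch coup on April 2nd"  # blocks the warmonger VP -- nice here
    return "do nothing"

def naive_transfer(world):
    """Naively apply the same requirement regardless of the conditioning event."""
    return nice_policy_given_death(world)

death_world    = {"president_dies": True,  "vp_is_warmonger": True}
survival_world = {"president_dies": False, "vp_is_warmonger": True}

print(naive_transfer(death_world))     # "launch coup on April 2nd" -- nice
print(naive_transfer(survival_world))  # same coup, now pointless and harmful
```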
So the question is, can we transfer niceness in this way, without needing a solution to the full problem of niceness in general?
EDIT: Actually, this seems ideally set up for a Bayes network (or for the requirement that a Bayes network be used).
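A hand-rolled sketch of how that framing might look; every node, probability, and action below is invented for illustration and is not part of the original proposal. The point is that making "president dies" an explicit node lets niceness be queried separately down each branch, rather than being baked invisibly into the policy.

```python
def p_nuclear_war(president_dies, action):
    """Invented conditional probability table: P(nuclear war | parents)."""
    if president_dies and action == "coup":
        return 0.05   # coup stops the warmonger VP from taking office
    if president_dies and action == "nothing":
        return 0.95   # warmonger VP becomes president
    if not president_dies and action == "coup":
        return 0.60   # destabilising and pointless
    return 0.01       # president lives, no coup needed

# Query each branch of the conditioning variable separately.
for president_dies in (True, False):
    best = min(("coup", "nothing"),
               key=lambda a: p_nuclear_war(president_dies, a))
    print(f"president_dies={president_dies}: least-war action = {best}")
```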
EDIT2: Now the problem of predicates like "Grue" and "Bleen" seems to be the relevant bit. If you can avoid concepts such as "X = {nuclear war if president died, peace if president lived}", you can make the extension work.
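A toy sketch of that entanglement worry (the predicates and world encoding below are invented for illustration): a "natural" predicate keeps the same extension whichever way the conditioning event turns out, while a grue-like one flips its meaning with that event, so niceness stated in its terms won't transfer between branches.

```python
def peace(world):
    """Natural predicate: reads only the war variable."""
    return not world["nuclear_war"]

def bleen_outcome(world):
    """Entangled predicate: X = {nuclear war if president died, peace if lived}."""
    if world["president_dies"]:
        return world["nuclear_war"]
    return not world["nuclear_war"]

def depends_on_conditioning(pred):
    """Crude entanglement test: can the predicate's value flip when ONLY the
    conditioning variable changes?"""
    for war in (True, False):
        a = pred({"president_dies": True,  "nuclear_war": war})
        b = pred({"president_dies": False, "nuclear_war": war})
        if a != b:
            return True
    return False

print(depends_on_conditioning(peace))          # False -- safe to extend
print(depends_on_conditioning(bleen_outcome))  # True  -- blocks the extension
```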
How do you determine that it will be nice under the given condition?
As posed, it's entirely possible that the niceness is a coincidence: an artifact of the initial conditions fitting just right with the programming. Think of a coin landing on its side or a pencil balanced on its tip. These positions are unstable, and you need very specific initial conditions to get them to work.
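A crude sketch of that stability worry, under the assumption that niceness can be checked as a black box on sampled initial conditions (everything below is invented for illustration): if the verified niceness only holds in a sliver of initial-condition space, most nearby worlds should fail the check, like the coin on its edge.

```python
import random

def behaves_nicely(initial_condition):
    """Stand-in for an expensive check of the AI's behaviour; here,
    'nice' only in a narrow sliver of initial-condition space."""
    return abs(initial_condition - 0.5) < 1e-3

random.seed(0)
baseline = 0.5  # the one initial condition we actually verified
samples = [baseline + random.gauss(0, 0.01) for _ in range(1000)]
stability = sum(behaves_nicely(s) for s in samples) / len(samples)
print(f"niceness survives perturbation in {stability:.1%} of nearby worlds")
```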
The safe bet would be to have the AI start plotting an assassination and hope it lets you out of prison once its coup succeeds.