This seems like an important piece of the puzzle for a safe strong AI. The question is inspired by the "horse riding astronaut" debate (where an AI fails to generate a proper image, and I speculate it is because of the high prior that this context-free prompt is a mistake).

I suspect the topic might have been discussed in the context of system design in general (unrelated to AI), but I cannot find any good overview with simple googling. For specific examples, I think of ones I know from IT: "rm -rf /" being disabled unless an explicit flag ("--no-preserve-root" in GNU rm) is passed, and the whole concept of "warnings" in program output.
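As a toy illustration of that pattern, here is a minimal Python sketch (the `delete_tree` helper and its `no_preserve_root` parameter are hypothetical names I made up, not any real library's API): refuse an input with a high prior of being a mistake unless the caller explicitly overrides.

```python
import shutil

def delete_tree(path: str, no_preserve_root: bool = False) -> None:
    """Delete a directory tree, but treat "delete /" as a probable
    mistake and refuse it unless the caller explicitly overrides
    (the same pattern as GNU rm's --preserve-root default)."""
    if path.rstrip("/") == "" and not no_preserve_root:
        # High prior that this input is a mistake: fail loudly
        # instead of silently guessing what the user meant.
        raise ValueError("refusing to operate on '/'; "
                         "pass no_preserve_root=True to override")
    shutil.rmtree(path)
```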

I think in more critical systems there might be an important tradeoff: there is not always enough time to ask the user for clarification of the input, so the system has to decide on its own how to interpret an input that is possibly a mistake.
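To make the tradeoff concrete, a toy decision rule might look like the sketch below (all names, thresholds, and time costs are invented for illustration): when the estimated probability that the input is a mistake is high and there is time, ask the user; when there is no time, the system must fall back to its own judgment.

```python
from enum import Enum

class Action(Enum):
    EXECUTE = "execute the literal interpretation"
    ASK = "ask the user for clarification"
    REFUSE = "refuse and warn"

def handle_input(p_mistake: float, seconds_available: float,
                 ask_cost_seconds: float = 30.0,
                 mistake_threshold: float = 0.5) -> Action:
    """Toy policy for the tradeoff: clarify when there is time to do so,
    otherwise decide autonomously, erring on the side of caution."""
    if p_mistake < mistake_threshold:
        return Action.EXECUTE   # input looks intended; act on it
    if seconds_available >= ask_cost_seconds:
        return Action.ASK       # likely a mistake, and there is time to check
    return Action.REFUSE        # likely a mistake and no time to ask
```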
