The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult.
I don't think it's good enough to identify these spaces and place barriers in the reward function. (Analogy: perhaps SGD works precisely because it's good at jumping over such barriers.) Presumably you're actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over.
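To make that distinction concrete, here's a toy sketch; the 1/(1+d) shape, the constants, and the distance parameterization are all my own arbitrary choices, not anything from the proposal:

```python
# Toy contrast between a "barrier" penalty and a graded penalty.
# Shapes and constants are arbitrary illustrative choices.

def barrier_penalty(distance: float) -> float:
    """Flat wall: zero penalty everywhere outside the flagged sub-space.
    An optimizer taking large steps can land on the far side of the wall
    without ever paying it."""
    return 10.0 if distance <= 0.0 else 0.0

def graded_penalty(distance: float) -> float:
    """Penalty that grows smoothly as the action approaches the flagged
    sub-space, so every path toward it pays an increasing cost and there
    is no zero-cost region to jump from."""
    return 10.0 / (1.0 + max(distance, 0.0))

for d in (4.0, 2.0, 1.0, 0.0):
    print(f"d={d}: barrier={barrier_penalty(d)}, graded={graded_penalty(d):.2f}")
```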
Even that seems insufficient, because a reasoning system smart enough to have this problem in the first place can always add a meta term and defeat the visibility constraint. E.g. "if I do X that you wouldn't like and you don't notice it, that's bad; but if you don't notice that you don't notice it, then maybe it's OK."
Maybe a single rule about meta terms can defeat every meta term that involves not noticing something, but that's not at all obvious to me, especially if we're talking about a reward function rather than the policy the agent actually learns.
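To spell out the regress (entirely my own toy construction, not a claim about any real reward function): a rule that penalizes "unnoticed X" up to any fixed meta level leaves the next level open.

```python
# Toy model of the meta-term regress; my own construction.

def caught_by_rule(rule_depth: int, evasion_level: int) -> bool:
    """A rule penalizing "unnoticed X" at meta levels 0..rule_depth
    catches an agent whose not-noticing sits at evasion_level."""
    return evasion_level <= rule_depth

# Any finite rule_depth admits an evasion one level up:
for depth in range(5):
    assert not caught_by_rule(depth, depth + 1)

# The hoped-for "one rule about meta terms" has to quantify over all
# levels at once, and even then it only constrains the reward function;
# whether the learned policy respects that quantifier is a separate
# question.
```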