Nice work!
In other words, the self-awareness logic doesn't seem to happen in a different MLP than the behavior logic.
Not sure about this. I think these results are also consistent with awareness requiring "something else"; in other words, they leave that possibility open.
It seems like the suggestion here is to apply less process supervision (i.e., to only part of the CoT). In this case, though, the more straightforward fix is to apply more supervision. Specifically, they were not applying supervision to the action, only to the CoT. From the paper:
Another way of saying this is that their method is not explicitly trying to make the CoT faithful. It leaves the model free to decide between:

1. keeping the undesired behavior while producing a CoT that passes the monitor, or
2. actually dropping the undesired behavior.
The model finds that it's easier to do (1). But if you were also monitoring the action, it would almost certainly choose (2).
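To make that concrete, here is a minimal sketch of the two reward setups; all names (`task_reward`, `cot_monitor`, `action_monitor`) are hypothetical, not from the paper:

```python
# Hypothetical sketch of the two supervision regimes discussed above.
# The monitors are stand-ins: each returns True if it flags misbehavior.

def reward_cot_only(task_reward: float, cot: str, cot_monitor) -> float:
    """Process supervision on the CoT alone (roughly the setup above).
    The action is never penalized, so the model can keep behavior (1)
    and just sanitize its CoT."""
    penalty = 1.0 if cot_monitor(cot) else 0.0
    return task_reward - penalty

def reward_cot_and_action(task_reward: float, cot: str, action: str,
                          cot_monitor, action_monitor) -> float:
    """Supervision on both the CoT and the monitored action. Hiding the
    behavior in the CoT while still taking it (option 1) no longer pays
    off, so we'd expect the model to choose option (2)."""
    penalty = 0.0
    if cot_monitor(cot):
        penalty += 1.0
    if action_monitor(action):
        penalty += 1.0
    return task_reward - penalty
```

Under the first reward, sanitizing the CoT while keeping the behavior is strictly cheaper than changing the behavior; under the second, it isn't.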
Maybe the more interesting question is whether we could expect monitoring the action to fix this issue in general. Perhaps not. For example, the action may just be a directive/choice ("do this") that follows a long chain of reasoning, and it may not be possible to monitor it the way you can here.
Anyway, the goal should really be to make the CoT faithful; they were not explicitly trying to do that here, and may even have been applying pressure in the opposite direction.