This is a special post for quick takes by PabloAMC. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
The core problem behind wireheading, manipulation, and similar failure modes seems to stem from a confusion between the goal in the world and its representation inside the agent. One way to address this may be to exploit the fact that the agent can be aware of being an embedded agent. In that case, it could recognize that its goal representation stands for an external fact about the world, and we could then penalize, during training, the divergence between the goal itself and its internal representation.
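A minimal sketch of what such a divergence penalty might look like in a toy linear setting. Everything here is hypothetical and illustrative: the "external" goal is a fixed linear reward on world states, the agent's internal representation of that goal is a second weight vector, and the penalty is simply the mean squared gap between the two rewards over sampled states. In a real system neither the external goal nor the divergence would be directly computable like this; the point is only to show the shape of the extra loss term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: states are feature vectors; the "external" goal is a
# fixed linear reward on the world state, while the agent keeps its own
# internal representation of that goal (another weight vector).
true_w = np.array([1.0, -2.0, 0.5])   # external goal (a fact about the world)
internal_w = np.zeros(3)              # agent's learned goal representation

def external_reward(states):
    return states @ true_w

def internal_reward(states, w):
    return states @ w

def divergence_penalty(states, w):
    # Mean squared gap between the internal representation of the goal
    # and the external goal it is supposed to track.
    return np.mean((internal_reward(states, w) - external_reward(states)) ** 2)

# Toy training loop: gradient descent on the divergence penalty alone,
# standing in for the extra loss term proposed above.
lr = 0.1
for _ in range(200):
    states = rng.normal(size=(32, 3))
    grad = 2 * states.T @ (internal_reward(states, internal_w)
                           - external_reward(states)) / len(states)
    internal_w -= lr * grad

final_gap = divergence_penalty(rng.normal(size=(256, 3)), internal_w)
print(final_gap)
```

After training, the penalty drives the internal representation toward the external goal, which is the intended effect: the agent is rewarded for keeping its picture of the goal faithful to the world, not just for scoring well on its own picture.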