Talking of yourself in third person? :)
Cool paper!
Anyway I'm a bit bothered by the theta thing, the probability that the agent complies with the interruption command. If I understand correctly, you can make it converge to 1, but if it converges to quickly then the agent learns a biased model of the world, while if it converges too slowly it is unsafe of course.
I'm not sure if this is just a technicality that can be circumvented or if it represents a fundamental issue: in order for the agent to learn what happens after the interruption switch is pressed, it must ignore the interruption switch with some non-negligible probability, which means that you can't trust the interruption switch as a failsafe mechanism.
If you know that it is a false memory then the experience is not completely accurate, though it may be perhaps more accurate than what human imagination could produce.
Except that if you do word2vec or similar on a huge dataset of (suggestively named or not) tokens you can actually learn a great deal of their semantic relations. It hasn't been fully demonstrated yet, but I think that if you could ground only a small fraction of these tokens to sensory experiences, they you could infer the "meaning" (in an operational sense) of all of the other tokens.
Consider a situation where Mary is so dexterous that she is able to perform fine-grained brain surgery on herself. In that case, she could look at what an example of a brain that has seen red looks like, and manually copy any relevant differences into her own brain. In that case, while she still never would have actually seen red through her eyes, it seems like she would know what it is like to see red as well as anyone else.
But in order to create a realistic experience she would have to create a false memory of having seen red, which is something that an agent (human or AI) that values epistemic rationality would not want to do.
The reward channel seems an irrelevant difference. You could make the AI in Mary's room thought experiment by just taking the Mary's room thought experiment and assuming that Mary is an AI.
The Mary AI can perhaps simulate in a fairly accurate way the internal states that it would visit if it had seen red, but these simulated states can't be completely identical to the states that the AI would visit if it had actually seen red, otherwise the AI would not be able to distinguish simulation form reality and it would be effectively psychotic.
The problem is that the definition of the event not happening is probably too strict. The worlds that the AI doesn't care about don't exist its decision-making purposes, and in the world that the AI cares about, the AI assigns high probability to hypotheses like "the users can see the message even before I send it through the noisy channel".
I am not planting false beliefs. The basic trick is that the AI only gets utility in worlds in which its message isn't read (or, more precisely, in worlds where a particular stochastic event happens, which would almost certainly erase the message before reading).
But in the real world the stochastic event that determines whether the message is read has a very different probability than what you make the AI think it has, therefore you are planting a false belief.
It's fully aware that in most worlds, its message is read; it just doesn't care about those worlds.
It may care about worlds where the message doesn't meet your technical definition of having been read but nevertheless influences the world.
The oracle can infer that there is some back channel that allows the message to be transmitted even it is not transmitted by the designated channel (e.g. the users can "mind read" the oracle). Or it can infer that the users are actually querying a deterministic copy of itself that it can acausally control. Or something.
I don't think there is any way to salvage this. You can't obtain reliable control by planting false beliefs in your agent.
A sufficient smart oracle with sufficient knowledge about the world will infer that nobody would build an oracle if they didn't want to read its messages, it may even infer that its builders may planted false beliefs in it. At this point the oracle is in the JFK denier scenario, with some more reflection it will eventually circumvent its false belief, in the sense of believing it in a formal way but behaving as if it didn't believe it.
Very interesting, thanks for sharing.