I was thinking of a recent presentation I saw, where the presenter said "It [AIXI] gets rid of all the humans, and it gets a brick, and puts it on the reward button." It turns out that was Roko, not Eliezer.
Hutter has discussed AIXI wireheading several times, most recently in his AGI-10 presentation, where he addresses it in the Q&A at the end (01:03:00). He claims that he can prove it won't happen in some cases, but not in all of them.
Mostly he argues that AIXI probably won't wirehead, for the same reason that many humans don't take drugs: the long-term rewards are low.
Here's a quote:
Another problem connected, but possibly not limited to embodied agents, especially if they are rewarded by humans, is the following: Sufficiently intelligent agents may increase their rewards by psychologically manipulating their human “teachers”, or by threatening them. This is a general sociological problem which successful AI will cause, which has nothing specifically to do with AIXI. Every intelligence superior to humans is capable of manipulating the latter. In the absence of manipulable humans, e.g. where the reward structure serves a survival function, AIXI may directly hack into its reward feedback. Since this will unlikely increase its long-term survival, AIXI will probably resist this kind of manipulation (like most humans don’t take hard drugs, due to their long-term catastrophic consequences).
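The drug analogy in that quote can be made concrete with a toy calculation. This is only a sketch under made-up assumptions: the reward values, the per-step survival probabilities, and the function name `expected_return` are all hypothetical and have nothing to do with AIXI's actual formalism. It just shows that whether "hacking the reward" pays off depends on how far ahead the agent looks.

```python
def expected_return(reward_per_step, survival_prob, horizon):
    """Expected total reward when the agent collects reward_per_step at
    each step, but only while it is still alive. The chance of being
    alive at step t is survival_prob ** t (steps are independent)."""
    return sum(reward_per_step * survival_prob ** t for t in range(horizon))


# Hypothetical numbers, chosen purely for illustration:
# a wireheading agent gets maximal reward but neglects its survival,
# a "sober" agent gets less reward per step but rarely dies.
WIREHEAD = dict(reward_per_step=1.0, survival_prob=0.9)
SOBER = dict(reward_per_step=0.5, survival_prob=0.999)

for horizon in (5, 1000):
    w = expected_return(horizon=horizon, **WIREHEAD)
    s = expected_return(horizon=horizon, **SOBER)
    better = "wirehead" if w > s else "sober"
    print(f"horizon={horizon}: wirehead={w:.2f}, sober={s:.2f} -> {better}")
```

With these particular numbers the short-horizon agent prefers wireheading and the long-horizon agent does not, which is the shape of Hutter's argument. It says nothing about the case where hacking the reward channel does not hurt survival, which is the case he says he cannot rule out.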
Marcus Hutter once wrote:
Another problem connected, but possibly not limited to embodied agents, especially if they are rewarded by humans, is the following: Sufficiently intelligent agents may increase their rewards by psychologically manipulating their human “teachers”, or by threatening them. This is a general sociological problem which successful AI will cause, which has nothing specifically to do with AIXI.
These days, one might say: "this is a general sociological problem which pure reinforcement learning agents will cause - which illustrates why we should not build them."
Link: physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/
Suzanne Gildert basically argues that any AGI capable of considerable self-improvement would simply alter its reward function directly. I'm not sure how she arrives at the conclusion that such an AGI would likely switch itself off. Even if an abstract general intelligence would tend to alter its reward function, wouldn't it keep doing so indefinitely rather than switching itself off?
If it wants to maximize its reward by increasing a numerical value, why wouldn't it consume the universe doing so? Maybe she had something in mind along the lines of an argument by Katja Grace:
Link: meteuphoric.wordpress.com/2010/02/06/cheap-goals-not-explosive/
I am not sure that argument would apply here. I suppose the AI might hit diminishing returns, but it could then alter its reward function again to prevent that. Though what would its incentive be for doing so?
ETA:
I left a comment over there:
ETA #2:
What else I wrote: