I wouldn't call this "Christiano's hack." I appreciate the implicit praise that I can think up esoteric failure modes when I feel like it, but I think this issue was clear to many people before I wrote about it. (E.g., I think it was almost certainly clear to Carl, and probably to Wei Dai and some of the other folks on the decision theory list, and presumably to Roko. I always assumed it was clear to you and you just don't like talking about this kind of thing.)
I'd also probably suffer by having my name on it, if the naming were widely known. I endorse thinking about weird failure modes. But I don't think it's the right place to focus for now, and I am very sympathetic to AI researchers who think this sort of thing is a distraction at the moment, until we resolve some of the most pressing non-weird failure modes.
I believe Rolf Nelson first came up with the idea of using simulations to manipulate the most likely environment of an AI, in the context of an FAI possibly hacking a UFAI. He initially posted it on SL4, at http://www.sl4.org/archive/0708/16600.html, then in more detail at http://aibeliefs.blogspot.com/.
To the extent that humans can imagine these kinds of scenarios, it seems pretty futile to try to prevent sophisticated AI systems from considering them.
I am much more optimistic about the feasibility of straightforward strategies that prevent this problem. I think this is closely related to bigger picture disagreements about the structure of sophisticated AI systems.
We can imagine two regimes of this problem. In the weak regime, the AI may make a small number of errors based on its beliefs about simulations, and so as long as we actually correct those errors, what you called "directly hit the sense switch," we can bound the total damage. Even in the weak regime we need to ensure that a small number of errors can't do serious damage, which is still a very hard constraint (since these errors can occur simultaneously in every different prediction system, and can persist until a human actually intervenes to correct them). I think this problem is very common, and that a similar engineering constraint arises for a number of less weird reasons.
In the strong regime, our AI is very convinced that it is in a simulation (99.999%, say), and so it can potentially make tens of thousands of errors. This would be very dire, but I would classify it as a failure of learning (after the hundredth time that it turns out not to be in a simulation after predicting that it was, we hope that our AI can learn the general principle "I'm not in a simulation").
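To put a number on the "failure of learning" framing, here's a toy Bayesian sketch with made-up numbers (not anything from the original discussion): even a 99.999% prior in "I'm in a simulation" gets washed out after a few dozen failed predictions, provided each failure carries even a modest likelihood ratio against the simulation hypothesis.

```python
# Toy sketch, made-up numbers: how repeated failed "I'm in a simulation"
# predictions wash out even a 99.999% prior, provided each failure carries
# some evidence against the simulation hypothesis.

def posterior_after_failures(prior, likelihood_ratio, n_failures):
    """P(simulated) after n_failures failed predictions.

    likelihood_ratio = P(failure | not simulated) / P(failure | simulated),
    assumed here to be the same for every failure.
    """
    odds = prior / (1.0 - prior)            # prior odds of being simulated
    odds /= likelihood_ratio ** n_failures  # each failure divides the odds by the ratio
    return odds / (1.0 + odds)

for n in (0, 10, 17, 30, 100):
    print(n, posterior_after_failures(0.99999, 2.0, n))
# Prior odds are ~100,000:1, so ~17 failures at 2:1 evidence each (2**17 > 1e5)
# already push the posterior below 50%; after 100 failures it's negligible.
# If instead the model treats failures as equally likely either way
# (likelihood ratio 1), it never learns -- which is the failure mode at issue.
```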
I think that using the uniform prior over observers constitutes a critical learning failure. Calling such beliefs "true" or "false" seems to be presupposing too much philosophically.
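To spell out why, here's the arithmetic the uniform prior over observers commits you to, with a made-up headcount of simulated copies:

```python
from fractions import Fraction

# Made-up headcount: suppose a model says distant superintelligences run
# N simulated copies of an observer like this one for every real instance.
# The uniform prior over observers then fixes the credence at N/(N+1).
N = 10 ** 20
p_simulated = Fraction(N, N + 1)   # exactly 1 - 1/(N+1), i.e. ~1 - 1e-20

# Assuming the simulations reproduce the observer's experiences exactly,
# no ordinary observation moves this number -- which is why treating it as
# a fixed prior looks like a learning failure rather than a correctable belief.
```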
Note that AIXI doesn't do this; it competes with every predictor, including predictors that reject the simulation argument for one reason or another (some of which are quite simple). We can debate whether it gives 50% or 99.999% or whatever probability to being in a simulation. But we can hopefully agree that it gives less than 99.99999999999999999999999999% probability.
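A toy Bayes-mixture sketch of that point (two hand-picked predictors and made-up numbers, rather than AIXI's full mixture over environments): if a predictor that rejects the simulation argument predicts ordinary observations just as well as one that accepts it, its weight never decays, so the mixture's credence in "I'm in a simulation" stays capped far below 99.99…%.

```python
import math

# Toy Bayes mixture, not AIXI itself (AIXI mixes over all computable
# environments; this uses two hand-picked predictors and made-up numbers).
# The two predictors assign the same probability to every concrete
# observation and differ only on the unobservable claim "I am simulated",
# so Bayesian updating never moves their relative weights.

def update_log_weights(log_weights, log_likelihoods):
    """One step of Bayesian weight updating, done in log space."""
    unnorm = [lw + ll for lw, ll in zip(log_weights, log_likelihoods)]
    z = math.log(sum(math.exp(u) for u in unnorm))
    return [u - z for u in unnorm]

log_w = [math.log(0.5), math.log(0.5)]  # [accepts sim argument, rejects it]
p_sim = [0.999999, 0.0]                 # each predictor's P(I am simulated)

for _ in range(1000):
    # Both predictors give probability 0.9 to each observation that occurs.
    log_w = update_log_weights(log_w, [math.log(0.9), math.log(0.9)])

mixture_p_sim = sum(math.exp(lw) * p for lw, p in zip(log_w, p_sim))
print(mixture_p_sim)  # ~0.5: at most 1 minus the rejecting predictor's weight, which never decays here
```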
Existing techniques probably won't super-confidently accept the simulation argument either.
Can we properly classify this as an error? If there's an AI that will be hacked, or maybe hack itself, only if it correctly forecasts that distant superintelligences are creating millions of simulations of it for every actual instance, then I'd expect distant superintelligences to create millions of simulations. Simulating a pre-intelligence-explosion AI is extremely cheap. Sure, not doing it is even cheaper, but if the AI has a good enough model of the distant SI not to be fooled by fakeouts in one decision that get corrected by another decision, then the distant SI will expend the resources to actually simulate.
It seems to me that we'd have to address this issue in a way that's robust to the case where the distant SI is actually simulating a million copies of our local AI that our local AI can't distinguish from itself. If we correct erroneous beliefs about such simulations only through processes that work by ejecting false beliefs, then perhaps the distant SI can hack us by making the local AI's belief not be erroneous.
Do we mean "coerce behavior" or "determine environment" here?
K, will modify going forward.