The only way I can see this happening with non-negligible probability is if we create AGI along more human lines - e.g., uploaded brains which evolve through a harsh selection process that wouldn't be aligned with human values. In that scenario, it may be near certain. Nothing is closer to a mind design capable of torturing humans than another human mind - we do that all the time today.
As others point out, though, the idea of a sign being flipped in an explicit utility function is one that people understand and are already looking for. More than that, it would only produce minimal human-utility if the AI had a correct description of human utility. Otherwise, it would just use us for fuel and building material. The optimization part also has to work well enough. Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit. This naively suggests a probability near zero. I can't imagine a counter-scenario clearly enough to make me change this estimate, if you don't count the previous paragraph.
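To make that "one flipped bit" point concrete, here's a toy sketch (purely illustrative; the utility function and world model are made up for the example). A single sign change turns an explicit utility maximizer into a minimizer of the same utility, but the result is only targeted human suffering to the extent that the utility description and the optimizer are otherwise correct:

```python
# Toy illustration: one flipped sign turns an explicit utility maximizer
# into a minimizer of that same utility. Everything here is hypothetical.

def human_utility(state):
    # Stand-in for a (correct) description of human utility.
    return state["wellbeing"] - state["suffering"]

def choose_action(actions, transition, utility, sign=+1):
    # sign=+1 is the intended agent; sign=-1 is the "one bit flipped" agent.
    return max(actions, key=lambda a: sign * utility(transition(a)))

def transition(action):
    # Trivial world model: each action leads to an obvious outcome.
    if action == "help":
        return {"wellbeing": 10, "suffering": 0}
    return {"wellbeing": 0, "suffering": 10}

print(choose_action(["help", "harm"], transition, human_utility, sign=+1))  # -> help
print(choose_action(["help", "harm"], transition, human_utility, sign=-1))  # -> harm
```

If `human_utility` were badly wrong, the sign-flipped agent would just optimize for something arbitrary rather than for targeted human suffering, which is the point above about everything else having to be near-perfect.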
“Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit.”
Isn’t this exactly what happened with the GPT-2 bug, which led to maximally ‘bad’ output? Would that not suggest that the probability of this occurring with an AGI is non-negligible?
If that sort of thing happens, you would turn off the AI system (as OpenAI did in fact do). The AI system is not going to learn so fast that it prevents you from doing so.
This has lowered my credence in such a catastrophe by about an order of magnitude. However, that's a fairly small update for something like this. I'm still worried.
Maybe some important AI will learn faster than we expect. Maybe the humans in charge will be grossly negligent. Maybe the architecture and training process won't be such as to involve a period of dumb-misaligned-AI prior to smart-misaligned-AI. Maybe some unlucky coincidence will happen that prevents the humans from noticing or correcting the problem.
Surely with a sufficiently hard take-off it would be possible for the AI to prevent itself from being turned off? And if not, couldn’t the AI just deceive its creators into thinking that no sign flip has occurred (e.g. by making it look like it’s gaining utility from doing something beneficial to human values when it’s actually losing it)? How would we be able to determine that it’s happened before it’s too late?
Further to that, what if this fuck-up happens during an arms race when its creators haven’t put enough time into safety to prevent this type of thing from happening?
Sorry for the dumb question a month after the post, but I've just found out about deceptive alignment. Do you think it's plausible that a signflipped AGI could fake being an FAI in the training stage, just to take a treacherous turn at deployment?
I think AI systems should be designed in such a way as to avoid being susceptible to sign flips (as Eliezer argues in that post you linked), but I also suspect that kind of robustness is likely to arise naturally in the course of developing the systems. While a sign flip may occur in some local area, you'd have to have essentially no checksums on the process for the result of a sign-flipped reward function to end up in control.
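As a rough sketch of what a "checksum on the process" could look like (hypothetical code, not from any real training pipeline): keep a few reference cases whose ranking you already know, and halt training if the learned reward ever orders them backwards, which is exactly what a sign flip would produce.

```python
# Sketch of a sanity check ("checksum") on a learned reward signal.
# All names and reference cases are hypothetical.

REFERENCE_PAIRS = [
    # (clearly better outcome, clearly worse outcome)
    ("polite, helpful reply", "abusive reply"),
    ("task completed safely", "task completed by deleting the user's data"),
]

def reward_sanity_check(reward_fn, pairs=REFERENCE_PAIRS):
    """Raise if the reward ranks any known-worse case above a known-better one,
    which is the signature of a sign flip or similar corruption."""
    for better, worse in pairs:
        if reward_fn(better) <= reward_fn(worse):
            raise RuntimeError(
                "Reward ordering inverted (possible sign flip): "
                f"reward({worse!r}) >= reward({better!r}); halting for inspection."
            )

# In a training loop you would call this periodically, e.g.:
#   if step % 1000 == 0:
#       reward_sanity_check(learned_reward)
```

A check like this obviously doesn't catch every failure, but it's the kind of cheap guard that makes "the flipped reward quietly ends up in control" require several independent things to go wrong at once.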
What do you think the difference would be between an AGI's reward function, and that of GPT-2 during the error it experienced?
One is the difference between training time and deployment, as others have mentioned. But the other is that I'm skeptical that there will be a singleton AI that was just trained via reinforcement learning.
Like, we're going to train a single neural network end-to-end on running the world? And just hand over the economy to it? I don't think that's how it's going to go. There will be interlocking more-and-more powerful systems. See: Arguments about fast takeoff.
I've been reading a fair bit about "worse than death" scenarios from AGI (e.g. posts like this), and their intensities and probabilities. I've generally been under the impression that the worst-case scenarios have extremely low probabilities (i.e. would require some form of negative miracle to occur) and can be considered a form of Pascal's mugging.
Recently, however, I came across this post on OpenAI's blog. It describes how a bug introduced during a code refactor flipped the sign of the reward used when fine-tuning GPT-2, so the model ended up optimizing for maximally 'bad' output instead of the intended objective.
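For intuition, here's a minimal sketch of the kind of mistake involved (hypothetical names, loosely in the style of an RLHF fine-tuning objective; not OpenAI's actual code). A one-character sign error changes which samples the training loop reinforces:

```python
# Hypothetical sketch of an RLHF-style objective; not OpenAI's actual code.
# The policy is rewarded for good samples, minus a KL penalty for drifting
# too far from a reference model. The sign decides what gets reinforced.

def policy_objective(reward, kl_to_reference, kl_coef=0.1, sign=+1.0):
    # Intended: maximize reward minus the KL penalty (sign=+1.0).
    # A sign bug (sign=-1.0) silently turns this into "maximize badness".
    return sign * reward - kl_coef * kl_to_reference

samples = [
    {"text": "helpful continuation", "reward": 0.9, "kl": 0.2},
    {"text": "offensive continuation", "reward": -0.8, "kl": 0.2},
]

for sign in (+1.0, -1.0):
    best = max(samples, key=lambda s: policy_objective(s["reward"], s["kl"], sign=sign))
    print(f"sign={sign:+.0f}: training pushes toward -> {best['text']}")
```

As I understand the incident, the same refactor also flipped the sign on the KL term, which is why the model kept producing fluent text while the reward pushed it toward maximally 'bad' content.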
This seems to be the exact type of issue that could cause a hyperexistential catastrophe. With this in mind, can we really consider the probability of this sort of scenario to be very small (as was previously believed)? Do we have a reason to believe that this is still highly unlikely to happen with an AGI? If not, would that suggest that current alignment work is net-negative in expectation?