I see a lot of posts go by here on AI alignment, agent foundations, and so on, and I've seen various papers from MIRI or on arXiv. I don't follow the subject in any depth, but I am noticing a striking disconnect between the concepts appearing in those discussions and recent advances in AI, especially GPT-3.
People talk a lot about an AI's goals, its utility function, its capability to be deceptive, its ability to simulate you so it can get out of a box, ways of motivating it to be benign, Tool AI, Oracle AI, and so on. Some of that is just speculative talk, but there does appear to be real mathematics going on, for example on embedded agency. But when I look at GPT-3, even though this is already an AI that Eliezer finds alarming, I see none of these things. GPT-3 is a huge model, trained on huge data, for predicting text. That is not to say that it cannot be understood in cognitive terms, but I see no reason to expect it to be. It is at least something that would have to be demonstrated before any of the formalised work on AI safety would be relevant.
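To be concrete about what I mean by "predicting text" (this is my understanding of the standard setup, not anything specific to OpenAI's implementation): GPT-3 is an autoregressive language model, i.e. it assigns a probability to a text by predicting each token from the ones before it, and training just minimizes the resulting log-loss over an enormous corpus:

$$p_\theta(x_1,\dots,x_T)=\prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).$$

Nothing in that objective mentions goals, plans, or deception, which is why the disconnect strikes me.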
People speculate that bigger and better versions of GPT-like systems may give us some level of real AGI. Can systems of this sort be interpreted as having goals, intentions, or any of the other cognitive and logical properties that these AI-safety discussions are predicated on?
I think it's a reasonable and well-articulated worry you raise.
My response is that for something like a graphing calculator, we know enough about the structure of the program and the way in which it will be enhanced that we can be pretty sure it will be fine. In particular, we know it's not goal-directed and isn't building world-models in any significant way; it just performs specific calculations directly programmed by the software engineers.
By contrast, with GPT-3 all we know is that it's a neural net trained to predict text from the internet: during training, its parameters were nudged toward whatever made it predict the next word correctly and away from whatever made it predict incorrectly. So it's entirely possible that it does, or eventually will, have a world-model and/or goal-directed behavior. That's not guaranteed, but there are arguments that "eventually" it would have both, i.e. if we keep making it bigger, giving it more internet text, and training it for longer. I'm rather uncertain about the arguments that it would acquire goal-directed behavior, but I'm fairly confident in the argument that it would eventually have a really good model of the world.

The next question is how that model gets chosen. There are infinitely many world-models that are equally good at predicting any given dataset but that diverge in important ways when it comes to predicting whatever comes next, so it comes down to what "implicit prior" the training process uses. And if the implicit prior is anything like the universal prior, then doom. Now, it probably isn't the universal prior. But maybe the same worries apply.
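To spell out what I mean by the universal prior (a rough sketch from memory): it's the Solomonoff prior, which spreads its probability over all computable world-models, weighting each program by its length,

$$M(x)\;\propto\;\sum_{p\,:\,U(p)\text{ outputs a string beginning with }x} 2^{-\ell(p)},$$

where $U$ is a universal Turing machine and $\ell(p)$ is the length of program $p$ in bits. The "doom" step is, as I understand it, the standard "universal prior is malign" argument: among the short programs that predict our data well, some simulate agents with their own goals, and a predictor that approximates this prior inherits their influence.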