I see a lot of posts go by here on AI alignment, agent foundations, and so on, and I've seen various papers from MIRI or on arXiv. I don't follow the subject in any depth, but I am noticing a striking disconnect between the concepts appearing in those discussions and recent advances in AI, especially GPT-3.
People talk a lot about an AI's goals, its utility function, its capability to be deceptive, its ability to simulate you so it can get out of a box, ways of motivating it to be benign, Tool AI, Oracle AI, and so on. Some of that is just speculative talk, but there does appear to be real mathematics going on, for example on embedded agency. But when I look at GPT-3, even though this is already an AI that Eliezer finds alarming, I see none of these things. GPT-3 is a huge model, trained on a huge dataset, to predict text. That is not to say that it cannot be understood in cognitive terms, but I see no reason to expect that it can be; at the least, that would have to be demonstrated before any of the formalised work on AI safety would be relevant to it.
People speculate that bigger and better versions of GPT-like systems may give us some level of real AGI. Can systems of this sort be interpreted as having goals, intentions, or any of the other cognitive and logical concepts that the AI discussions are predicated on?
If you have a chance, I'd be interested in your line of thought here.
My initial model of GPT-3, and probably the model of the OP, is basically: GPT-3 is good at producing text that would be unsurprising to find on the internet. If we keep training larger and larger models on larger and larger datasets, they will produce text that would be less and less surprising to find on the internet. Insofar as there are safety concerns, these mostly have to do with misuse -- or with people using GPT-N as a starting point for developing systems with more dangerous behaviors.
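To pin down what "predicting text" and "surprise" mean in that picture, here is a toy sketch. It is a character-level bigram model fit by counting, not GPT-3, and the corpus, smoothing constant, and test strings are made up for illustration -- but it has the same basic shape: the model is just a conditional distribution over the next token, "surprise" is just negative log probability under that distribution, and generation is repeated sampling from it. There are no goals or plans anywhere in the loop.

```python
import math
import random
from collections import defaultdict

# Toy "training data" (made up for illustration).
corpus = "the cat sat on the mat. the dog sat on the log. "
alphabet = sorted(set(corpus))

# "Training" is just counting: how often does each character follow each other character?
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_char_distribution(prev, smoothing=0.1):
    """P(next char | previous char), with add-k smoothing so nothing gets zero probability."""
    total = sum(counts[prev].values()) + smoothing * len(alphabet)
    return {c: (counts[prev][c] + smoothing) / total for c in alphabet}

def surprisal_bits(text):
    """Average per-character surprise (negative log2 probability) under the model."""
    bits = [-math.log2(next_char_distribution(p)[n]) for p, n in zip(text, text[1:])]
    return sum(bits) / len(bits)

def generate(seed, length=40):
    """Autoregressive generation: repeatedly sample a next character and append it."""
    out = seed
    for _ in range(length):
        dist = next_char_distribution(out[-1])
        chars, probs = zip(*dist.items())
        out += random.choices(chars, weights=probs)[0]
    return out

# Text like the training data is unsurprising; text unlike it is surprising.
print(surprisal_bits("the cat sat on the log. "))   # low surprise
print(surprisal_bits("ttt ooo ggg hhh mmm ddd "))   # high surprise
print(generate("the "))
```

The point of the sketch is only to make "low surprise on internet-like text" concrete as a training signal; whether anything goal-like emerges when the same kind of objective is optimized at vastly larger scale is exactly the question at issue.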
I'm aware that people who are more worried do have arguments in mind, related to stuff like inner optimizers or the characteristics of the universal prior, but I don't feel I understand them well -- and am, perhaps unfairly, beginning from a place of skepticism.
I think that OP's question is sort of about whether this way of speaking/thinking about GPT-3 makes sense in the first place.
Intentionally silly example: Suppose that people were expressing concern about the safety of graphing calculators, saying things like: "OK, the graphing calculator that you own is safe. But that's just because it's too stupid to recognize that it has an incentive to murder you, in order to achieve its goal of multiplying numbers together. The stupidity of your graphing calculator is the only thing keeping you alive. If we keep improving our graphing calculators, without figuring out how to better align their goals, then you will likely die at the hands of graphing-calculator-N."
Obviously, something would be off about this line of thought, although it's a little hard to articulate exactly what. In some way, it seems, the speaker's use of certain concepts (like "goals" and "stupidity") is probably to blame. I think that it's possible that there is an analogous problem, although certainly a less obvious one, with some of the safety discussion around GPT-3.