My impression is that most people around here aren't especially worried about GPT-n either becoming capable of recursive self-improvement leading to foom or attaining morally significant levels of consciousness.
Reasons given include:
- GPT has a large number of parameters with a shallow layer-depth, meaning it is incapable of "deep" reasoning
- GPT's training function "predict the next character" makes it unlikely to make a "treacherous turn"
- GPT is not "agenty" in the sense of having a model of the world and viewing itself as existing within that model.
On the other hand, I believe it is widely agreed that if you take a reinforcement learner (say Google's Dreamer) and give it virtually any objective function (the classic example being "make paperclips") and enough compute, it will destroy the world. The general reason given is Goodhart's Law.
My question is: does this apparent difference in perceived safety arise purely from our expectations of the two architectures' capabilities? Or is there actually some consensus that different architectures carry inherently different levels of risk?
Question
To make this more concrete, suppose you were presented with two "human level" AGIs, one built using GPT-n (say using this method) and one built using a Reinforcement Learner with a world-model and some seemingly innocuous objective function (say "predict the most human-like response to your input text").
Pretend you have both AGIs in separate boxes in front of you, along with a complete diagram of their software and hardware, and that you communicate with them solely through a keyboard and text terminal attached to each box. Both of the AGIs are capable of carrying on a conversation at a level equal to that of a college-educated human.
If, using all the testing methods at your disposal, you perceived these two AGIs to be equally "intelligent", would you consider one more dangerous than the other?
Would you consider one of them to be more likely to be conscious than the other?
What significant moral or safety questions about these two AGIs would you have different answers for (if any)?
Application
Suppose that you consider these two AGIs equally dangerous. Then the alignment problem mostly boils down to the correct choice of objective function.
If, on the other hand, there are widely agreed-upon differences in the safety levels of different architectures, then AI safety should focus quite heavily on finding and promoting the safest architectures.
I suspect the difference is mostly in what training opportunities are available, not what type of system is used internally.
In principle, a strong NLP AI might learn some behaviour that manipulates humans. It's just that in practice it is more difficult for it to do so, because in almost all of the training phase there is no interaction at all. The input is decoupled from its output, so there is no training signal to improve any ability to manipulate the input.
In reality there are some side-channels that are interactive, such as selection of fine-tuning training based on human evaluation. A sufficiently powerful system might be able to learn enough from that to manipulate the world, but it seems much less likely than some other type of system with more interactive learning doing it first.
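To make the decoupling argument above concrete, here is a minimal toy sketch in Python. It is only an illustration under stated assumptions, not a model of how GPT or any reinforcement learner is actually trained: the function names (`offline_inputs`, `interactive_inputs`), the corpus, and the integer "environment" are all hypothetical. The point it tries to show is that in an offline prediction loop the inputs are fixed by the dataset whatever the model outputs, whereas in an interactive loop the system's own actions determine what it sees next.

```python
from typing import Callable, List


def offline_inputs(corpus: List[str], model: Callable[[str], str]) -> List[str]:
    """Inputs seen in an offline prediction loop (hypothetical toy setup).

    The model's prediction would be scored against the corpus, but it is
    never fed back, so the sequence of inputs is fixed by the dataset.
    """
    seen = []
    for context in corpus:
        _prediction = model(context)  # computed, then discarded
        seen.append(context)
    return seen


def interactive_inputs(
    start: int, policy: Callable[[int], int], steps: int
) -> List[int]:
    """Inputs seen in an interactive loop (hypothetical toy environment).

    The environment is just an integer state, and each action is added to
    it, so the policy's outputs shape its own future observations.
    """
    seen = []
    state = start
    for _ in range(steps):
        seen.append(state)
        state = state + policy(state)  # the action changes the next input
    return seen


corpus = ["the cat", "sat on", "the mat"]

# Two very different "models" see exactly the same inputs offline.
same_offline = offline_inputs(corpus, str.upper) == offline_inputs(corpus, str.lower)

# Two different policies see different inputs when the loop is interactive.
up = interactive_inputs(0, lambda s: 1, 3)
down = interactive_inputs(0, lambda s: -1, 3)

print(same_offline)  # True
print(up, down)      # [0, 1, 2] [0, -1, -2]
```

In this toy picture, the side-channels mentioned above correspond to a small leak from the first loop into the second: if the corpus for the next round of fine-tuning is selected based on the model's earlier outputs, the inputs are no longer fully independent of what the model does.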