I think you get the point, but say OpenAI "trains" GPT-5 and it turns out to be so dangerous that it can persuade anybody of anything, and it wants to destroy the world.
We're already screwed at that point, right? Who cares if they decide not to release it to the public? And they can't just "RLHF" it after the fact, right? It's already existentially dangerous?
I guess maybe I just don't understand how it works. So if they "train" GPT-5, does that mean they literally have no idea what it will say or be like until the day the training is done? And then they're like, "Hey, what's up?" and they find out?
Is there any way to know what the model will be like before training finishes, given our current paradigm of pretraining and fine-tuning foundation models?