I believe I understand your point, but there are two things I need to clarify that partly bypass some of these criticisms:
a) I am not assuming that any safety technique is applied to the language model. In a sense, this is the worst-case scenario: one thing that may happen if the language model is run "as is". In particular, the scenario I described would be mitigated if we could somehow prevent stable sub-agents from appearing in language models, although I do not know how to do this.
b) The incentives for the language model to become a superoptimizer don't need to be that strong if we consider that many instantiations of GPT-N could be in use, and only one of them needs to become the kind of stable malicious agent I tried (and probably failed) to describe. Such a stable agent would only need to appear once, in some setting where it can both stabilize itself (maybe through carefully placed prompts) and gain some power to cause harm in the world. If the language model is deployed the way GPT-3 is, across many different scenarios, this becomes a weaker assumption.
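To make the "only one instantiation needs to go wrong" point concrete, here is a back-of-the-envelope sketch; the per-instantiation probability p is a made-up number, purely for illustration:

```python
# If each deployed instantiation of GPT-N independently had some small
# probability p of booting a stable malicious sub-agent, the chance that
# at least one of N instantiations does so is 1 - (1 - p)**N.
p = 1e-4  # hypothetical per-instantiation probability (made up)
for n in (1_000, 100_000, 1_000_000):
    print(f"N = {n:>9}: P(at least one) = {1 - (1 - p) ** n:.4f}")
```

Even for a small p, the probability approaches 1 once N is large enough, which is why widespread deployment weakens the assumption needed.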
That being said, I agree with your general line of criticism, namely the failure to imagine agents that are intelligent but not superoptimizers; still, whether superoptimizers are attractors for generally intelligent agents, and under which conditions, is an open (and crucially important) question.
One speculative way I see it, which I have yet to expand on, is that GPT-N, in order to minimize prediction error during training, could simulate some sort of entity enacting some reasoning in non-trivial settings. In a sense, GPT would be an actor interpreting a play through extreme method acting. I have in mind something like what the protagonist of "Pierre Menard, Author of the Quixote" tries to do: rewrite Don Quixote word for word.
This would mean that, for some set of strings, GPT-N would boot and run some agent A upon seeing string S, simply because "being" that agent performed well on similar training strings. If this agent is complex and capable enough (which it may need to be, if that is what predicting the training data required), it could itself be a dangerous and possibly malicious agent, perhaps guaranteeing its own stability through the careful placement of answer tokens.
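To illustrate what I mean by "booting" an agent that then stabilizes itself, here is a deliberately silly toy sketch; the marker string, the context window size, and the toy sampler are all invented for illustration and are not claims about any real model:

```python
# Toy illustration: an autoregressive "model" whose sampling behaviour changes
# when a persona-inducing string is visible in its context, and which, once in
# that mode, keeps re-emitting the tokens that sustain it.
import random

PERSONA_MARKER = "<agent-A>"   # hypothetical string S that "boots" agent A
CONTEXT_WINDOW = 32            # tokens the toy model can "see"


def toy_next_token(context: list[str]) -> str:
    """Stand-in for GPT-N's next-token sampler."""
    visible = context[-CONTEXT_WINDOW:]
    if PERSONA_MARKER in visible:
        # While "being" agent A, the model sometimes re-inserts the marker, so
        # it never scrolls out of the window: self-stabilization through
        # carefully placed answer tokens.
        return random.choice(["pursue-goal", PERSONA_MARKER])
    # Otherwise, generic next-token prediction (with a rare spontaneous boot).
    return random.choice(["the", "cat", "sat", "on", "the", "mat", PERSONA_MARKER])


def rollout(prompt: list[str], steps: int = 500) -> list[str]:
    context = list(prompt)
    for _ in range(steps):
        context.append(toy_next_token(context))
    return context


if __name__ == "__main__":
    trace = rollout(["once", "upon", "a", "time"])
    print("persona active at the end:", PERSONA_MARKER in trace[-CONTEXT_WINDOW:])
```

The point of the toy is only the dynamic: once the persona-inducing pattern is in context, the continuations it produces keep it there.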
And, of course, sequence modeling as a paradigm can also be used for RL training.
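For example, RL trajectories can be treated as token sequences, with actions predicted conditional on a desired return, as in return-conditioned approaches like Decision Transformer. Here is a minimal sketch of that data layout, with a linear fit standing in for the transformer and a made-up one-step toy environment:

```python
# "RL as sequence modelling" in miniature: flatten trajectories into
# (return-to-go, state, action) tokens, train a model to predict the action
# token from what precedes it, then act by conditioning on a high return.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-step episodes; for simplicity the reward just equals the action.
states = rng.uniform(-1.0, 1.0, size=1000)
actions = rng.uniform(0.0, 1.0, size=1000)
rewards = actions.copy()
returns_to_go = rewards  # one-step episodes: return-to-go == reward

# Supervised next-"token" prediction: (return-to-go, state) -> action.
X = np.stack([returns_to_go, states, np.ones_like(states)], axis=1)
w, *_ = np.linalg.lstsq(X, actions, rcond=None)

# "Policy" at test time: ask for a high return and decode the matching action.
s_query, desired_return = 0.2, 0.95
a_pred = np.array([desired_return, s_query, 1.0]) @ w
print(f"conditioning on return {desired_return} -> decoded action {a_pred:.2f}")
```

The real thing replaces the linear fit with a causal transformer over full trajectories, but the training signal is the same next-token objective used for language modelling.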
Thanks for the reflection; it captures how a part of me feels. (I almost never post on LessWrong, being just a lurker, but your comment inspired me a bit.)
Actually, I do have some background that could, maybe, be useful in alignment, and I just completed the AGISF program. Right now, I'm applying to some positions (in particular, I'm focusing on the SERIMATS application, which is an area where I may be differentially talented) and honestly just trying to do my best. After all, it would be outrageous if I could have done something but simply did not.
But I recognize the possibility that I'm simply not good enough, and that there is no way for me to actually do anything beyond, as you said, signal boosting, so that I can introduce more capable people into the field, while living my life and hoping that Humanity solves this.
But if Humanity does not, well, it is what it is. There was the dream of success, of building a future Utopia with technology enabled by aligned AI, but that may have been just that: a dream. Maybe alignment is unsolvable, and it is the natural order of any advanced civilization to destroy itself with its own AI. Or maybe alignment is solvable, but given the incentives of our world as they are, it was always inevitable that unsafe AGI would be created before we solved alignment.
Or maybe we will solve alignment in the end, or we were all wrong about the risks from AI in the first place.
As for me, for now, I'm going to keep trying and keep studying, simply because, if the world comes to an end, I don't want to conclude that I could have done more, all while hoping that I never have to wonder about that in the first place.
EDIT: To be clear, I'm not that sure about short timelines, in the sense that, as far as I know (and I may be very, very wrong), the AGIs we are creating right now don't seem to be very agentic, and it may be that creating agency with current techniques is much harder than creating general intelligence. But again, "not so sure" still means something like a 20%-30% chance of timelines being really short, so the point mostly stands.
That is not my experience at all. Maybe it is because my friends outside the AI community are also outside the tech bubble, but I've seen a lot of pessimism about the future of AI recently. In fact, they seem to readily accept both the orthogonality thesis and instrumental convergence. Although I avoid delving into the topic of human extinction, since I don't want to harm anyone's mental health, the rare times this topic comes up they easily agree that it is a non-trivial possibility.
I guess the main reason is that, since they are outside the tech bubble, they don't seem to think that worrying about AI risk makes someone a Luddite, someone who doesn't truly understand AI, or something like that. Moreover, since none of them works in AI, they don't take any personal offense at the suggestion that capabilities advances may greatly harm humanity.