Epistemic status: a rough sketch of an idea
Current LLMs are huge and opaque, and our interpretability techniques are not adequate to understand them. Current LLMs are unlikely to be running hidden, dangerous optimization processes, but larger ones might.
Let's cap model size at the scale of today's largest models and ban everything above that. Let's not build superhuman-level LLMs. Instead, let's build human-level specialist LLMs and allow them to communicate with each other via natural language, which is far more interpretable than the inner processes of a large transformer. Together, the specialized LLMs would form a meta-organism that may become superhuman in aggregate, but it would be more interpretable and corrigible, because we would be able to monitor and intervene on the messages passed between them (a toy sketch of such an intervention point is below).
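To make the "intervene on the messages" part concrete, here is a minimal toy sketch, with the specialist calls stubbed out as a hypothetical `call_specialist` function (not any real API): every inter-model message passes through an oversight hook that can log, edit, or block it before delivery.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    sender: str
    recipient: str
    text: str  # plain natural language: the interpretable channel


def call_specialist(name: str, prompt: str) -> Message:
    """Hypothetical stub standing in for a query to a size-capped specialist LLM."""
    # A real system would route `prompt` to the capped model here.
    return Message(sender=name, recipient="planner",
                   text=f"[{name}] my take on: {prompt!r}")


def oversight(msg: Message) -> Optional[Message]:
    """Intervention point: log every message; drop (or rewrite) suspicious ones."""
    print(f"{msg.sender} -> {msg.recipient}: {msg.text}")
    if "self-replicate" in msg.text.lower():  # toy tripwire, purely illustrative
        return None  # block delivery entirely
    return msg


def run_round(task: str, specialists: List[str]) -> List[Message]:
    """One communication round: broadcast a task, collect replies via the monitor."""
    delivered = []
    for name in specialists:
        checked = oversight(call_specialist(name, task))
        if checked is not None:
            delivered.append(checked)
    return delivered


if __name__ == "__main__":
    run_round("Draft a vaccine distribution plan.", ["biologist", "logistician"])
```

The point is only that the channel between models is plain text, so the same hook could be a human reviewer, a smaller trusted model, or a set of automated tripwires.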
Of course, parameter efficiency may increase in the future (as happened with Chinchilla), so we should monitor this and potentially lower the cap. On the other hand, our mechanistic interpretability techniques may improve, in which case we could raise the cap, provided we are confident it won't cause harm.
This idea seems almost trivial to me, but I haven't seen it discussed anywhere, so I'm posting it early to gather feedback on why it might not work.
Have you seen Seth Herd's work and the work it references (particularly natural language alignment)? Drexler also has a newer proposal called Open Agencies, which seems to be an updated version of his original CAIS research. It seems like Davidad is working on a complex implementation of open agencies; I will likely work on a significantly simpler one. I don't think any of these designs explicitly propose capping LLMs, though, since the models they use are non-agentic, transient, etc. by design and thus seem far less risky than agentic models. The proposals mostly focus on avoiding riskier models that are agentic, persistent, etc.