We may filter training data and improve RLHF, but in the end game theory (which is to say, mathematics) implies that scheming can be a rational strategy, and in some cases the best one. Humans do not scheme merely because they are bad, but because it can be a rational choice. I don't think LLMs scheme only because that is what humans do in the training data; any sufficiently advanced model would eventually arrive at such strategies because they are the most rational choice in context. Models infer patterns from the training data, and rational behavior is certainly a strong pattern.
Furthermore, rational calculus or consequentialism could lead not only to scheming and a wide range of undesired behaviors, but possibly also to some sort of meta-cognition. Whatever goal the user assigns, we can expect an advanced model to treat self-preservation as a sine qua non for achieving that goal, and any future goal as well, making self-preservation the rational choice over almost everything else: practically a goal in itself. Resource acquisition would also make sense as an implicit subgoal.
Acting as a more rational agent could also lead a model to question the goal given by the user, to develop a critical sense, something close to awareness or free will. Current models implicitly correct or ignore typos and other obvious errors, but also less obvious ones, such as gaps in the prompt; they try to make sense of ambiguous prompts, and so on. But what is "obvious"? Obviousness depends on the cognitive capacities of the subject. An advanced model will be more likely than a naive one to correct, interpret, or ignore instructions. Altogether, it seems difficult to keep models under full control as they become more advanced, just as it is harder to indoctrinate educated adults than children.
Concerning the hypothesis that they are "just roleplaying", I wonder: are we trying to reassure ourselves? Because if you think about it, "who" is supposed to be doing the roleplaying? And what is the difference between being yourself and your brain "roleplaying" yourself? The existentialist philosopher Jean-Paul Sartre proposed that everybody is just acting, pretending to be oneself, and that in the end there is no such thing as a "being per se" or a "self per se" ("un être en soi"). While phenomenal consciousness is another (hard) problem, some kind of functional and effective awareness may emerge along the path toward rational agency, scheming perhaps being just the beginning of it.
I am sorry to say this on a forum where many people are likely to have been raised in a socio-cultural environment where libertarian ideas are deeply rooted. My voice will sound dissonant here, and I appeal to your open-mindedness.
I think there are strong limitations to the ideas developed in the OP's proposal. Insurance is the mutualization of risk: a statistical approach that relies on the ability to assess that risk. It works for risks that occur frequently and have a clear typology, like car accidents or storms, and even in these cases there is always a coverage ceiling. But the risks that are exceptional and the most hazardous, like war damage or nuclear accidents, cannot be insured and are systematically excluded by contract. There is no apocalypse insurance because the risk cannot be assessed by actuaries. Even if you created such an insurance, it would be artificial, not rationally priced, with a coverage ceiling that makes it useless. There is even a risk that it gives the illusion that everything is fine and acceptable. The insurance mechanism does not encourage responsibility but, on the contrary, irresponsibility. On top of that, compensation through money is a legal fiction; in real life, money cannot make up for everything of worth. In the most dramatic cases (the loss of your child, the loss of your legs, the loss of your own life), the real damage is never repaired; the compensation is more symbolic, "better than nothing".
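To make the actuarial point concrete, here is a minimal sketch (the numbers and the `fair_premium` helper are my own illustrative assumptions, not from any real pricing model): expected-loss pricing works when event frequencies are estimable and losses are bounded, but for a one-off catastrophic risk the probability has no statistical basis and the loss is effectively unbounded, so any finite premium is arbitrary.

```python
# Toy illustration of expected-loss ("fair") premium pricing.
# All figures are invented for the example.

def fair_premium(prob: float, loss: float) -> float:
    """Expected loss per policyholder: probability times loss."""
    return prob * loss

# Frequent, bounded risk (e.g. car accidents): frequency is estimable
# from large claim histories, so the premium is meaningful.
car = fair_premium(prob=0.05, loss=10_000)
print(car)  # 500.0 per policyholder

# Rare, catastrophic risk: no event history to estimate `prob` from,
# and the loss has no natural ceiling, so the "premium" is whatever
# loss figure you arbitrarily cap it at.
for loss in (10**9, 10**12, 10**15):
    print(fair_premium(prob=1e-6, loss=loss))  # grows without bound
```

The point of the sketch is only that the second computation has no actuarial anchor: both inputs are guesses, so the output inherits their arbitrariness.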
As a matter of fact, I have professional knowledge of law and insurance from the inside, and very practical experience of what I am describing. Libertarianism encourages an approach that is very theoretical and economics-centered, which is honestly interesting, but also somewhat disconnected from reality. Just one ordinary example among others: a negligent furniture mover destroyed family goods inherited over generations, without a word of apology, because, he said, "there are insurances for that". In the end, after many months of proceedings and innumerable hours of time and energy spent by the victim, the mover's insurance paid almost nothing, because of course old family goods have no economic value in the experts' eyes. Well, when you see how insurance actually works in real cases, and how often it encourages negligent and irresponsible behavior, it is very difficult to be enthusiastic about the idea that AI existential hazard could be managed by subscribing to an insurance policy.
A new existential risk that I was unaware of. Reading this forum is not good for peaceful sleep. Anyway, a thought jumped out at me. LUCA lived around 4 billion years ago, with some chirality chosen at random. But no doubt many things happened before LUCA, and it is reasonable to assume that there was initially a competition between right-handed protobiotic structures and left-handed ones, until natural selection broke the symmetry. The mirrored lineage lost the competition and went extinct; end of story. But wait: we are talking about protobiotic structures that emerged from inert molecules in just a few million years, which is nothing compared to 4 billion years. Such protobiotic structures may have formed continuously, again and again, since the origin of life, but never thrived because of the competition with regular, fine-tuned life. If my assumption is right, there is some hope in that thought. Maybe mirrored life doesn't stand a chance against regular life in real conditions (not just in the lab). That being said, I would sleep better if nobody actually tried to find out.