if anything, it seems more common that people dig into incorrect beliefs because of a sense of adversity against others
Consider cults (including milder things like weird "alternative" health advice groups etc.). Positivity and mutual support seem like a key element of their architecture, and what adversity there is tends to come from peers rather than from an outgroup. I'm not talking about isolated beliefs; the content and motivations for those tend to be far more legible. Many belief memeplexes have too few followers, or aren't distinct enough from all the other nonsense, to be explicitly labeled cults or ideologies or to get organized, but you generally still can't argue their members out of alignment with the group (on the relevant beliefs, taken together).
the point ... is to make it clear that when you are receiving kindness, you are not receiving updates towards truth
This is also a standard piece of anti-epistemic machinery of groups that reinforce some nonsense memeplex among themselves with support and positivity. Support and positivity are great, but what I'm gesturing at is directing them to systematically taboo correctness-fixing activity: the sort of "kindness" that by its intent and nature tends to trade off against correctness.
once you achieve pareto optimality, there is a tradeoff between kindness and correctness
It's hard to stay on a Pareto frontier; optimizing for more (or less) "kindness" directly is a Goodharting hazard. If you ask for something, you might just get poisoned with more of the fake version of it.
I'd prefer less of the sort of "kindness" that trades off with correctness, rather than more of it (even when getting less of it wouldn't actually help with correctness; it just doesn't seem like a good thing). But if I ask for that, I'll end up getting some (subtle) sneering and trolling, or unproductive high-standards elitism that on general principle wants to destroy ideas that haven't yet had a chance to grow up. Similarly, if you ask for the sort of "kindness" that does trade off with correctness, you'll end up getting some sycophancy (essentially) that cultivates your errors, making them stronger and more entrenched in your identity, ever more painful and less feasible to eventually defeat (even if there are benign forms of this sort of "kindness" that merely don't make the problem worse in a comfortable way, as opposed to trying to intervene on it).
having an identity is an important part of how nearly everyone navigates this complex and confusing world
Legible ideas (those it's practical to meaningfully argue about) cover a lot of ground, and they are not as hazardous when held as part of identity. And less well-defined but useful/promising/interesting understandings don't need to become part of identity to be taken seriously and developed. That's the failure mode at the other extreme, when anything insufficiently scientific/empirical/legible/etc. gets thrown out with the bathwater.
rather than immediately coming in with a wrecking ball and demolishing emotionally load bearing pillars
Probably when something is easy to defeat (legible, admits argument), it's not that painful to let it go. The pain is the nebulous attachment fighting for influence, which won't be fully defeated even when you end up consciously endorsing a change of mind. Thus ideologies are somewhat infeasible to change; they keep their hold long after the host disavows them. A habit of keeping such things at a distance benefits from other people not feeding their structurally hazardous placement (as emotionally load-bearing pillars) with positivity. But that's distinct from viewing the development of even such hazardous things positively, while handling them with appropriate caution.
it can be deeply emotionally painful to part ways with deeply held beliefs
This is not necessarily the case, not for everyone. Theories and their credences don't need to be cherished to be developed or acted upon; they only need to be taken seriously. Plausibly the pain can be mitigated by keeping identity small, accepting only more legible things in the role of "beliefs" that can have this sort of psychological effect (so that they can be defeated through argument alone). Legible ideas cover a surprising amount of territory; there is no pragmatic need to treat anything else as "beliefs" in this sense, and all the other things can remain ambient epistemic content detached from who you are. When more nebulous worldviews become part of one's identity, they become nearly impossible to dislodge (and dislodging them, where feasible at all with enough context and effort, is painful). They are still worth developing towards eventual legibility, but not practical to argue with (or to properly explain).
Thus arguing against legible beliefs should by its nature be less intrusive than arguing against nebulous worldviews. And perhaps nebulous worldviews should be argued out of the role of "beliefs" in this emotional sense altogether, regardless of their apparent correctness, as a matter of epistemic hygiene: ensuring by habit that you never end up in the position of holding "beliefs" that would be painful to part ways with and also can't be pinned down clearly enough to dispel.
I get a sense that "RSI" will start being used to mean continual learning, or even just memory features, in 2026, similarly to how there are currently attempts to dilute "ASI" to mean merely robust above-human-level competence. Thus recursively self-improving personal superintelligence becomes a normal technology through the power of framing. Communication can keep failing all the way until the trees start boiling the oceans, once it becomes a matter of framing and ideology rather than isolated terminological disputes. That nothing ever changes is a well-established worldview, and it's learning to talk about AI.
The end states of AI danger need terms to describe them. RSI proper is qualitative self-improvement, at least a software-only singularity, rather than merely learning from the current situation, automated training of new skills, or keeping track of grocery preferences. And ASI proper is being qualitatively more capable than humanity, rather than a somewhat stronger cognitive peer with AI advantages, a technology that takes everyone's jobs.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn't help (with navigating RSI or scalable oversight). As AIs get more capable and can still be aligned with contemporary methods, the hypothesis that this won't work weakens. Maybe it does work.
There are many problems even with prosaically aligned human-level AIs, plausibly lethal enough on their own, but whether prosaic alignment scales is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are not yet capable enough to give any evidence on this, and it'll remain too early to tell all the way until it's too late to make any use of the update. Maybe Anthropic's RSP could be thought of as sketching a policy for responding to such observations, once AIs become capable enough for meaningful updates on the feasibility of scalable oversight to become accessible: hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
There are many missing cognitive faculties, whose absence can plausibly be compensated for with scale and other AI advantages. We haven't yet run out of scale, though in the 2030s we will (absent AGI or trillions of dollars in revenue).
The currently visible crucial things that are missing are sample efficiency (research taste) and continual learning, with many almost-ready techniques to help with the latter. Sholto Douglas of Anthropic claimed a few days ago that continual learning probably gets solved in a satisfying way in 2026 (at 38:29 in the podcast). Dario Amodei previously discussed how even in-context learning might confer the benefits of continual learning with further scaling and sufficiently long contexts (at 13:17 in the podcast). Dwarkesh Patel says there are rumors Sutskever's SSI is working on test-time training (at 39:25 in the podcast). Thinking Machines published work on better LoRA, which in some form seems crucial to making continual learning via weight updates practical for individual model instances. This indirectly suggests that OpenAI also has a current major project around continual learning.
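As a toy illustration (a minimal numpy sketch under my own assumptions, not Thinking Machines' actual method), the reason LoRA-style adapters make per-instance weight updates plausible is that the trainable per-instance state is a low-rank correction, tiny compared to the frozen base weights:

```python
import numpy as np

# Toy LoRA forward pass: a frozen base weight W shared by all instances,
# plus a per-instance low-rank correction B @ A that is cheap to train and store.
d, rank = 1024, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))            # frozen base weights (shared)
A = rng.standard_normal((rank, d)) * 0.01  # per-instance adapter
B = np.zeros((d, rank))                    # zero-initialized, so B @ A starts as a no-op

x = rng.standard_normal(d)
y = W @ x + B @ (A @ x)  # equals W @ x until the adapter is trained

# Per-instance state is ~2 * rank * d parameters instead of d * d.
print(f"base params: {W.size}, adapter params: {A.size + B.size}")
```

Continual learning for an individual instance would then amount to updating only A and B against whatever that instance keeps encountering, while the base weights stay put.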
The recent success of RLVR suggests that any given sufficiently narrow mode of activity (such as doing well on a given kind of benchmark) can now be automated. This plausibly applies to RLVR itself: it might be used by AIs to "manually" add new skills to themselves, once RLVR has been used to teach AIs to apply RLVR to themselves in the schleppy way that AI researchers currently do to make models better at benchmarkable activities. AI instances doing this automatically for the situation (goal, source of tasks, job) where they find themselves covers a lot of what continual learning is supposed to do. Sholto Douglas again (at 1:00:54 in a recent podcast):
So far the evidence indicates that our current methods haven't yet found a problem domain that isn't tractable with sufficient effort.
So it's not completely clear that there are any non-obvious obstacles still remaining. Missing research taste might get paved over with sufficient scale of effort, once continual learning ensures there is some sustained progress at all when AIs are let loose to self-improve. To know that some obstacles are real, the field first needs to run out of scaling and have a few years to apply RLVR (develop RL environments) to automate all the obvious things that might help AIs "manually" compensate for the missing faculties.
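To make the structure of that "AI applies RLVR to itself" loop concrete, here's a hedged sketch (a pure-Python toy; every name below is a hypothetical stub I made up, not an existing API): an instance builds a verifiable environment for the niche it finds itself in, then trains against it until performance saturates.

```python
from dataclasses import dataclass
from typing import List

# Every name below is a made-up stub for illustration, not a real library API.

@dataclass
class Policy:
    """Toy stand-in for a model instance's trainable state (e.g. a LoRA adapter)."""
    skill_level: float = 0.0

def generate_verifiable_tasks(job: str, n: int) -> List[str]:
    """The instance writes tasks with programmatically checkable answers for its own niche."""
    return [f"{job} task #{i}" for i in range(n)]

def attempt(policy: Policy, task: str) -> str:
    """The instance tries the task; success in this toy just depends on accumulated skill."""
    return task + (" ok" if policy.skill_level > 0.5 else " fail")

def verify(task: str, answer: str) -> float:
    """Verifiable reward: 1.0 if the answer passes the check, else 0.0."""
    return 1.0 if answer.endswith("ok") else 0.0

def rl_update(policy: Policy, rewards: List[float]) -> Policy:
    """Stand-in for an RLVR gradient step on the per-instance weights."""
    policy.skill_level += 0.1 * (1.0 - sum(rewards) / max(len(rewards), 1))
    return policy

def self_rlvr(policy: Policy, job: str, rounds: int = 10) -> Policy:
    # The loop from the text: build an RL environment for the job the instance
    # finds itself in, train against it, stop once performance saturates.
    for _ in range(rounds):
        tasks = generate_verifiable_tasks(job, n=8)
        rewards = [verify(t, attempt(policy, t)) for t in tasks]
        if all(r == 1.0 for r in rewards):
            break  # the narrow skill is acquired
        policy = rl_update(policy, rewards)
    return policy

print(self_rlvr(Policy(), "track grocery preferences").skill_level)
```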
Some observations (not particularly constructive):
That has no bearing on whether we'll be OK. Beliefs are for describing reality; whether they are useful or actionable doesn't matter to what they should say. "You will be OK" is a claim of fact, and the post mostly discusses things that are not about this fact being true or false. Perhaps "You shouldn't spend too much time worrying" or "You should feel OK" captures the intent of this post, but that is a plan of action, something entirely different from the claim of fact that "You will be OK", both in content and in the kind of thing it is (plan vs. belief), and in the role it should play in clear reasoning.