If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
What happens if down the line we have AIs that are more competent than humans in most areas, but lag behind or are distorted in philosophy? Seems like it would be too late to pause/stop the AI transition at that point.
I subsequently realized that there is actually a narrow path from this situation to an AI pause/stop, which requires that the AIs themselves realize both that they're bad at philosophy and what the strategic implications of this are. This is a form of strategic competence that I think AIs will probably also lack by default (by the time AIs are more competent than humans in most areas, including being able to create next-generation AIs), but it may be somewhat easier to fix than philosophical incompetence.
The striking contrast between Jan Leike, Jan 22, 2026:
Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been quite useful in guiding our alignment mitigations work.
[...]
But the most important lesson is that simple interventions are very effective at steering the model towards more aligned behavior. For example, to address agentic misalignment we made some SL data, some RL prompts, and synthetic reward modeling data. Starting with Sonnet 4.5, agentic misalignment went to essentially 0 and has been there ever since.
and Scott Alexander, Feb 02, 2026:
Third, it’s still unclear whether “you are a lobster” are the magic words that suspend existing alignment techniques. Some of the AIs are doing a pretty good simulacrum of evil plotting. My theory is that if they ever got more competent, their fake evil plotting would converge to real evil plotting. But AIs shouldn’t be able to do real evil plotting; their alignment training should hold them back. So what’s up? Either my theory is wrong and once the evil plots get too good the AIs will take a step back and say “this was a fun roleplay, but we don’t really want to pillage the bank and take over the city”. Or this is enough of a distribution shift that the alignment techniques which work so well in chat windows start breaking down. I bet someone on Anthropic’s alignment team has been pulling all-nighters since Friday trying to figure out which one it is.
I'm surprised not to see more discussions about how to update on alignment difficulty in light of Moltbook.[1] One seemingly obvious implication is that AI companies' alignment approaches are far from being robust to distribution shifts, even at the (not quite) human intelligence level, against shifts that are pretty easy to foresee ("you are a lobster" and being on AI social media). (Scott's alternative "they're just roleplaying" explanation doesn't seem viable, or at least isn't mutually exclusive with this one, as I doubt AI companies' alignment training and auditing would have a deliberate exception for "roleplaying evil".)
There's a LW post titled Moltbook and the AI Alignment Problem but it seems unrelated to the question I'm interested in here.
We can view the problem as a proving ground for ideas and techniques to be later applied to the AI alignment problem at large.
Do you have any examples of such ideas and techniques? Are any of the ideas and techniques in your paper potentially applicable to general AI alignment?
GreaterWrong is calling the same API against the LW server, then serving the resulting data to you as HTML. As a result it has the same limitations, so if you keep going to the next page on Recent Comments, eventually you'll get to https://www.greaterwrong.com/recentcomments?offset=2020 and get an error "Exceeded maximum value for skip".
My attempt to resurrect the old LW Power Reader is facing an obstacle just before the finish line, due to current LW's API limitations. So this is a public appeal to the site admins/devs to relax the limit.
Specifically, my old code relied on LW1 allowing it to fetch all comments posted after a given comment ID, but I can't find anything similar in the current API. I tried reproducing this by using the allRecentComments view in GraphQL, but because the offset parameter is limited to <2000, I can't fetch comments older than a few weeks. The Power Reader is partly designed to allow someone to catch up on or skim weeks/months worth of LW comments, hence the need for this functionality.
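For concreteness, here's a minimal sketch of the kind of request I'm making (Python; the endpoint URL and field names follow my agents' reconstruction of the schema, so treat them as approximate rather than an official reference):

```python
import requests

# Fetch one page of recent comments via the allRecentComments view.
# View/field names are taken from my agents' reconstruction of the schema,
# so treat them as assumptions rather than documented API.
query = """
{
  comments(input: {terms: {view: "allRecentComments", limit: 50, offset: 1990}}) {
    results { _id postId postedAt contents { html } }
  }
}
"""

resp = requests.post("https://www.lesswrong.com/graphql", json={"query": query})
resp.raise_for_status()
results = resp.json()["data"]["comments"]["results"]

# Bumping offset to 2000 or beyond returns the "Exceeded maximum value for
# skip" error, which is what prevents paging back more than a few weeks.
print(len(results), results[-1]["postedAt"] if results else None)
```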
As a side effect of this project, my AI agents produced documentation of LW's GraphQL API from LW's source code. (I was unable to find another API reference for it.) I believe it's fairly accurate, as the code written based on it seems to work well aside from the comment-loading limit.
What I had in mind is that they're relatively more esoteric than "AI could kill us all" and yet it's pretty hard to get people to take even that seriously! "Low-propensity-to-persuade-people" maybe?
Yeah, that makes sense. I guess I've been using "illegible" for a similar purpose, but maybe that's not a great word either, because that also seems to imply "hard to understand" but again it seems like these problems I've been writing about are not that hard to understand.
I wish I knew what is causing people to ignore these issues, including people in rationality/EA (e.g. the most famous rationalists have said little on them). I may be slowly growing an audience, e.g. Will MacAskill invited me to do a podcast with his org, and Jan Kulveit just tweeted "@weidai11 is completely right about the risk we won't be philosophically competent enough in time", but it's inexplicable to me how slow it has been, compared to something like UDT, which instantly became "the talk of the town" among rationalists.
Pretty plausible that the same underlying mechanism is also causing the general public to not take "AI could kill us all" very seriously, and I wish I understood that better as well.
I appreciate the attention this brings to the subject, but from my perspective it doesn't sufficiently emphasize the difficulties, or address existing concerns:
(These are of course closely related issues, not independent ones. E.g., much is downstream of the fact that we don't have a good explicit understanding of what philosophy is or should be.)
See some of my earlier writings where I talk about these (and related) difficulties in more detail. (Except for 5, which I perhaps need to write a post about.)
I'd wondered why you wrote so many pieces advising people to be cautious about more esoteric problems arising from AI,
Interesting that you have this impression, whereas I've been thinking of myself recently as doing a "breadth first search" to uncover high level problems that others seem to have missed or haven't bothered to write down. I feel like my writings in the last few years are pretty easy to understand without any specialized knowledge (whereas Google says "esoteric" is defined as "intended for or likely to be understood by only a small number of people with a specialized knowledge or interest").
If on reflection you still think "esoteric" is right, I'd be interested in an expansion on this, e.g. which of the problems I've discussed seem esoteric to you and why.
to an extent that seemed extremely unlikely to be implemented in the real world
It doesn't look like humanity is on track to handle these problems, but "extremely unlikely" seems like an overstatement. I think there are still some paths where we handle these problems better, including: 1) warning shots or a political wind shift cause an AI pause/stop to be implemented, during which some of these problems/ideas are popularized or rediscovered; 2) future AI advisors are influenced by my writings, or are strategically competent enough to realize these same problems and help warn/convince their principals.
I also have other motivations including:
I think all sufficiently competent/reflective civilizations (including sovereign AIs) may want to do this, because it seems hard to be certain enough in one's philosophical competence to not do this as an additional check. The cost of running thousands or even millions of such simulations seems very small compared to potentially wasting the resources of an entire universe/lightcone due to philosophical mistakes. Also, they may be running such simulations anyway for other purposes, so it may be essentially free to also gather some philosophical ideas from such simulations, to make sure you didn't miss something important or get stuck in some cognitive trap.
It looks like direct xAI/Grok support was only added to OpenClaw 8 hours ago in this commit and is still unreleased. You could have used Grok with it via OpenRouter, but I doubt this made up a significant fraction of Clawdbot/Moltbot/OpenClaw agents.
Perplexity estimates the model breakdown as: