If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
In my personal vibe-coding projects, I'm reviewing ~0% of code, but having to do a lot more testing (relative to writing the code myself) because the AI is constantly introducing regressions (breaking what previously worked), which are not being caught by its test code because either the spec wasn't detailed enough to cover every possibility or edge case (i.e., the AI can't read between the lines to figure out what I want without being written down in detail, or it doesn't care), or its testing code isn't good enough to catch a lot of bugs.
As an example, when it adds a new UI element, the styling would often be inconsistent with other nearby elements, and I'd have to tell it to add a requirement to the spec that this set of elements should have consistent styling.
If others have similar experiences, we'll still have "% testing done by humans" metric to descend after "% of code reviewed by humans" goes to 0.
But also, beyond all of that, the arguments around decision-theory are I think just true in the kind of boring way that physical facts about the world are true, and saying that people will have the wrong decision-theory in the future sounds to me about as mistaken as saying that lots of people will disbelieve the theory of evolution in the future. It's clearly the kind of thing you update on as you get smarter.
This seems way overconfident:
Another way to put it is that globally we're a bubble (people who like FDT/UDT) within a bubble (analytic philosophy tradition) within a bubble (people who are interested in any kind of philosophy), and then even within this nested bubble there's a further split/disagreement about what FDT/UDT actually says about this specific kind of game/interaction.
It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness versus 4.5:
On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5.
but I never heard about Opus 4.5 being too evaluation aware to evaluate. It looks like Apollo simply wasn't part of Opus 4.5's alignment evaluation (4.5's System Card doesn't mention them).
This probably seems unfair/unfortunate from Anthropic's perspective, i.e., they believe their models are becoming less eval aware, but due to Apollo's conclusions being spread on social media, a lot of people probably got the impression that models are getting more eval aware. Personally I'm not sure we can trust Anthropic's verbalized evaluation awareness metric, and wish Apollo had done evals on 4.5 too to give us an external comparison.
It looks like direct xAI/Grok support was only added to OpenClaw 8 hours ago in this commit and still unreleased. You could have used Grok with it via OpenRouter, but I doubt this made up a significant fraction of Clawdbot/Moltbot/OpenClaw agents.
Perplexity estimates the model breakdown as:
What happens if down the line we have AIs that are more competent than humans in most areas, but lag behind or are distorted in philosophy? Seems like it would be too late to pause/stop the AI transition at that point.
I subsequently realized that there is actually a narrow path from this situation to an AI pause/stop, which requires that the AIs themselves realize both that they're bad at philosophy and the strategic implications of this. This is a form of strategic competence that I think AIs will probably also lack by default (by the time AIs are more competent than humans in most areas including being able to create next generation AIs), but may be somewhat easier to fix than philosophical incompetence.
The striking contrast between Jan Leike, Jan 22, 2026:
Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been quite useful in guiding our alignment mitigations work.
[...]
But the most important lesson is that simple interventions are very effective at steering the model towards more aligned behavior.1 For example, to address agentic misalignment we made2 some SL data, some RL prompts, and synthetic reward modeling data. Starting with Sonnet 4.5, agentic misalignment went to essentially 0 and has been there ever since.
and Scott Alexander, Feb 02, 2026:
Third, it’s still unclear whether “you are a lobster” are the magic words that suspend existing alignment techniques. Some of the AIs are doing a pretty good simulacrum of evil plotting. My theory is that if they ever got more competent, their fake evil plotting would converge to real evil plotting. But AIs shouldn’t be able to do real evil plotting; their alignment training should hold them back. So what’s up? Either my theory is wrong and once the evil plots get too good the AIs will take a step back and say “this was a fun roleplay, but we don’t really want to pillage the bank and take over the city”. Or this is enough of a distribution shift the the alignment techniques which work so well in chat windows start breaking down. I bet someone on Anthropic’s alignment team has been pulling all-nighters since Friday trying to figure out which one it is.
I'm surprised not to see more discussions about how to update on alignment difficulty in light of Moltbook.[1] One seemingly obvious implication is that AI companies' alignment approaches are far from being robust to distribution shifts, even at the (not quite) human intelligence level, against shifts that are pretty easy to foresee ("you are a lobster" and being on AI social media). (Scott's alternative "they're just roleplaying" explanation doesn't seem viable or isn't exclusive with this one as I doubt AI companies' alignment training and auditing would have a deliberate exception for "roleplaying evil".)
There's a LW post titled Moltbook and the AI Alignment Problem but it seems unrelated to the question I'm interested in here.
We can view the problem as a proving ground for ideas and techniques to be later applied to the AI alignment problem at large.
Do you have any examples of such ideas and techniques? Are any of the ideas and techniques in your paper potentially applicable to general AI alignment?
GreaterWrong is calling the same API against the LW server, then serving the resulting data to you as HTML. As a result it has the same limitations, so if you keep going to the next page on Recent Comments, eventually you'll get to https://www.greaterwrong.com/recentcomments?offset=2020 and get an error "Exceeded maximum value for skip".
My attempt to resurrect the old LW Power Reader is facing an obstacle just before the finish line, due to current LW's API limitations. So this is a public appeal to the site admins/devs to relax the limit.
Specifically, my old code relied on LW1 allowing it to fetch all comments posted after a given comment ID, but I can't find anything similar in the current API. I tried reproducing this by using the allRecentComments endpoint in GraphQL, but due to the offset parameter being limited to <2000, I can't fetch comments older than a few weeks. The Power Reader is part designed to allow someone to catch up on or skim weeks/months worth of LW comments, hence the need for this functionality.
As a side effect of this project, my AI agents produced a documentation of LW's GraphQL API from LW's source code. (I was unable to find another API reference for it.) I believe it's fairly accurate, as the code written based on it seems to work well aside from the comment-loading limit.
Suppose we rule out pure CDT. That still leaves "whatever the right DT is (even if it's something like FDT/UDT), if you actually run the math on it, it says that rewarding people after the fact for one-time actions provides practically zero incentives (if people means pre-singularity humans)". I don't see how we can confidently rule this out.