Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a timely technical solution to AGI risk, fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
I largely disagree. I suppose there are different types of notes:
They have different levels of usefulness:
"People have this aspirational idea of building a vast, oppressively colossal, deeply interlinked knowledge graph to the point that it almost mirrors every discrete concept and memory in their brain. And I get the appeal of maximalism."
Guilty as charged. I do not regret my crime and I will attempt it again.
I agree that there are ways to define the "capabilities"/"intelligence" of a system where increasing them won't necessarily increase its long-term coherence. Primarily: scaling its ability to solve problems across all domains except the domain of decomposing new unsolved problems into combinations of solved problems. I.e., not teaching it (certain kinds of?) "agency skills". The resultant entity would have an abysmal time horizon (in a certain sense), but it could be made vastly capable, including vastly more capable than most people at most tasks. However, it would by definition be unable to solve new problems, not even those within its deductive closure.
Inasmuch as a system can produce solutions to new problems by deductive/inductive chains, however, it would need to be able to maintain coherence across time (or, rather, across inferential distances, for which time/context lengths are a proxy). And that's precisely what the AI industry is eager to make LLMs do, and what it often uses as its measure of capabilities.
(I think the above kind of checks out with the distinction you gesture at? Maybe not.)
So yes, there are some notions of "intelligence" and "scaling intelligence" that aren't equivalent to some notions of "coherence" and "scaling coherence". But I would claim that's moot, because the AI industry now explicitly wants the kind of intelligence that is equivalent to long-term coherence.
Frankly, the very premise of this paper seems ridiculous to me, to a considerably greater extent than even most other bad alignment takes. How can the notion that agents may be getting more incoherent as they become more capable even exist within an industry that's salivating over the prospect of climbing METR's "maintain coherence over longer spans of time" benchmark?
Will Automating AI R&D not work for some reason, or will it not lead to vastly superhuman superintelligence within 2 years of "~100% automation" for some reason?
My current main guess is that it will more-or-less work, and then it will not lead to vastly superhuman superintelligence.
Specifically: I expect that the current LLM paradigm is sufficient to automate its own in-paradigm research, but that this paradigm is AGI-incomplete. Which means it's possible to "skip to the end" by automating it to superhuman speeds, but what lies at its end won't be AGI.
Like, much of the current paradigm is "make loss go down/reward go up by trying various recombinations of slight variations on a bunch of techniques, constructing RL environments, and doing not-that-deep math research". That means the rewards are verifiable across the board, so there's "no reason" why RLVR + something like AlphaEvolve won't work for automating it. But it's still possible that you can automate ~all of the AI research that's currently happening at the frontier labs, and still fail to get to AGI.
(Though it's possible that what lies at the end will be a powerful-enough non-AGI AI tool that it'll make it very easy for the frontier labs to then use it to R&D an actual AGI, or take over the world, or whatever. This is a subtly different cluster of scenarios, though.)
Ryan had suggested that, on his model, spending ~5% more resources on alignment than is commercially expedient might drop takeover risk down to 50%. I'm interested in how he thinks this scales: how much more, in percentage terms, would be needed to drop the risk to 20%, 10%, 1%?
Perhaps the claim is that such Python programs won't be encountered due to relevant properties of the universe (i.e., because the universe is understandable).
That's indeed where some of the hope lies, yep!
Following up on [1] and [2]...
So, I've had a "Claude Code moment" recently: I decided to build something on a lark, asked Opus to implement it, found that the prototype worked fine on the first try, then kept blindly asking for more and more features and was surprised to discover that it just kept working.
The "something" in question was a Python file editor which behaves as follows:
The remarkable thing isn't really the functionality (to a large extent, this is just a wrapper on ast + QScintilla), but how little effort it took: <6 hours by wall-clock time to generate 4.3k lines of code, and I never actually had to look at them; I just described the features I wanted and reported bugs to Opus. I haven't verified the functionality comprehensively, but it basically works, I think.
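For a sense of what the ast side of such a wrapper looks like, here's a minimal hypothetical sketch of mine (not SpanEditor's actual code): extract the line spans of top-level functions and classes so the GUI layer (QScintilla, in my case) can present and edit them individually.

```python
import ast

def extract_spans(source: str):
    """Return (kind, name, first_line, last_line) for each top-level def/class."""
    tree = ast.parse(source)
    spans = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # end_lineno is available on Python 3.8+
            spans.append((kind, node.name, node.lineno, node.end_lineno))
    return spans

if __name__ == "__main__":
    with open("example.py") as f:  # hypothetical target file
        for kind, name, start, end in extract_spans(f.read()):
            print(f"{kind} {name}: lines {start}-{end}")
```

The actual editor presumably layers a lot of UI plumbing and feature logic on top of calls like these, but that's the core trick.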
How does that square with the frankly dismal performance I'd been observing before? Is it perhaps because I've skilled up at directing Opus, cracked the secret to it, and can now indeed dramatically speed up my work?
No.
There was zero additional skill involved. I'd started doing it on a lark, so I disregarded all the lessons I'd previously been learning and just directed Opus the same way I'd been trying to at the start. And it Just Worked in a way it Just Didn't Work before.
Which means the main predictor of how well Opus performs isn't how well you're using it/working with it, but what type of project you're working on.
Meaning it's very likely that the people for whom LLMs work exhilaratingly well are working on the kinds of projects LLMs happen to be very good at, and everyone for whom working with LLMs is a tooth-pulling exercise happens not to be working on those kinds of projects. Or, to reframe: if you need to code up something from the latter category, and it's not a side-project you can take or leave, you're screwed; no amount of skill on your part is going to make it easy. The issue isn't your skill.
The obvious question is: what are the differences between those categories? I have some vague guesses. To get a second opinion, I placed the Python editor ("SpanEditor") and the other project I've been working on ("Scaffold") into the same directory, and asked Opus to run a comparative analysis regarding their technical difficulty and speculate about the skillset of someone who'd be very good at the first kind of project and bad at the second kind. (I'm told this is what peak automation looks like.)
Its conclusions seem sensible:
Scaffold is harder in terms of:
SpanEditor is harder in terms of:
The fundamental difference: Scaffold builds infrastructure from primitives (graphics, commands, queries) while SpanEditor leverages existing infrastructure (Scintilla, AST) but must solve domain-specific semantic problems (code understanding).
[...]
Scaffold exhibits systems complexity - building infrastructure from primitives (graphics, commands, queries, serialization).
SpanEditor exhibits semantic complexity - leveraging existing infrastructure but solving domain-specific problems (understanding code without type information).
Both are well-architected. Which is "harder" depends on whether you value low-level systems programming or semantic/heuristic reasoning.
[...]
What SpanEditor-Style Work Requires
What Scaffold-Style Work Requires
The Cognitive Profile
Someone who excels at SpanEditor but struggles with Scaffold likely has these traits:
Strengths
| Trait | Manifestation |
| --- | --- |
| Strong verbal/symbolic reasoning | Comfortable with ASTs, grammars, semantic analysis |
| Good at classification | Naturally thinks "what kind of thing is this?" |
| Comfortable with ambiguity | Can write heuristics that work "most of the time" |
| Library-oriented thinking | First instinct: "what library solves this?" |
| Top-down decomposition | Breaks problems into conceptual categories |
Weaknesses
| Trait | Manifestation |
| --- | --- |
| Weak spatial reasoning | Struggles to visualize coordinate transformations |
| Difficulty with temporal interleaving | Gets confused when multiple state machines interact |
| Uncomfortable without guardrails | Anxious when there's no library to lean on |
| Single-layer focus | Tends to think about one abstraction level at a time |
| Stateless mental model | Prefers pure functions; mutable state across time feels slippery |
Deeper Interpretation
They Think in Types, Not States
SpanEditor reasoning: "A CodeElement can be a function, method, or class. A CallInfo has a receiver and a name."
Scaffold reasoning: "The window is currently in RESIZING_LEFT mode, the aura progress is 0.7, and there's a pending animation callback."
The SpanEditor developer asks "what is this?" The Scaffold developer asks "what is happening right now, and what happens next?"
They're Comfortable with Semantic Ambiguity, Not Mechanical Ambiguity
SpanEditor: "We can't know which class obj.method() refers to, so we'll try all classes." (Semantic uncertainty - they're fine with this.)
Scaffold: "If the user releases the mouse during phase 1 of the animation, do we cancel phase 2 or let it complete?" (Mechanical uncertainty - this feels overwhelming.)
They Trust Abstractions More Than They Build Them
SpanEditor developer's instinct: "Scintilla handles scrolling. I don't need to know how."
Scaffold requires: "I need to implement scrolling myself, which means tracking content height, visible height, scroll offset, thumb position, and wheel events."
The SpanEditor developer is a consumer of well-designed abstractions. The Scaffold developer must create them.
tl;dr: "they think in types, not states", "they're anxious when there's no library to lean on", "they trust abstractions more than they build them", and "tend to think about one abstraction level at a time".
Or, what I would claim is a fine distillation: "bad at novel problem-solving and gears-level modeling".
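(To make the "semantic uncertainty" example above concrete: the kind of heuristic being described is roughly the following. A hypothetical sketch of mine, not SpanEditor's actual code.)

```python
import ast

def candidate_classes(tree: ast.Module, method_name: str) -> list[str]:
    """Without type information, obj.method() could belong to any class that
    defines `method_name`; collect them all and let the caller try each one."""
    matches = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and any(
            isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            and item.name == method_name
            for item in node.body
        ):
            matches.append(node.name)
    return matches

source = """
class Cat:
    def speak(self): ...

class Dog:
    def speak(self): ...
"""
print(candidate_classes(ast.parse(source), "speak"))  # ['Cat', 'Dog']
```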
Now, it's a bit suspicious how well this confirms my cached prejudices. A paranoiac, which I am, might suspect the following line of possibility: I'm sure it was transparent to Opus that it wrote both codebases (I didn't tell it, but I didn't bother removing its comments, and I'm sure it can recognize its writing style), so perhaps when I asked it to list the strengths and weaknesses of that hypothetical person, it just retrieved some cached "what LLMs are good vs. bad at" spiel from its pretraining. There are reasons not to think that, though:
Overall... Well, make of that what you will.
The direction of my update, though, is once again in favor of LLMs being less capable than they sound, and towards longer timelines.
Like, before this, there was a possibility that it really was a skill issue on my part, and one really could 10x their productivity with the right approach. But I've now observed that whether you get 0.8x'd or 10x'd depends on the project you're working on, not on your skill level – and if so, well, this pretty much explains the cluster of "this 10x'd my productivity!" reports, no? We no longer need to entertain the "maybe there really is a trick to it" hypothesis to explain said reports.
Anyway, this is obviously rather sparse data, and I'll keep trying to find ways to squeeze more performance out of LLMs. But, well, my short-term p(doom) has gone down some more.
I am interested in trying out the new code simplifier to see whether it can do a good job.
Tried it out a couple of times just now; it appears specialized for low-level, syntax-level rephrasings. It will inline functions and intermediate-variable computations that are only used once, and try to distill if-else blocks into something more elegant, but it won't even attempt anything at a higher level. It was very eager to remove Claude's own overly verbose/obvious comments, though. Very relatable.
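To illustrate the kind of rewrite I mean (a made-up example of mine, not actual simplifier output):

```python
# Before: a single-use intermediate variable plus a verbose if-else block.
def classify_before(values):
    total = sum(values)
    if total > 0:
        label = "positive"
    else:
        label = "non-positive"
    return label

# After: the intermediate is inlined and the if-else is distilled into a
# conditional expression; purely a syntax-level rephrasing.
def classify_after(values):
    return "positive" if sum(values) > 0 else "non-positive"
```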
Overall, it would be mildly useful in isolation, but I'm pretty sure you can get the same job done ten times faster using Haiku 4.5 or Composer-1 (Cursor's own blazing-fast LLM).
Curious if you get a different experience.
There was so much to unpack in that one. The line about how it's "on brand for Anthropic to use a deceptive ad to critique theoretical deceptive ads that aren’t real" takes the cake, of course. Amazing stuff.
Feels important to note that this is a (minor) positive update on Anthropic for me, worth a hundred nice-sounding Dario essays and Claude Constitutions. I expect them to completely cave in after a bit, hence it being only a minor update. But at least they didn't start out pre-caved-in.