I gave an overview of the scenario in a quick take. Excellent work exploring your two assumptions about Agent-4's resistance and the USG acting competently!
I think that the reasons for motivated reasoning to emerge are more complex. Your proposed mechanism is that motivated reasoning amounts to having the internal CoT, and not just external speech, optimized[1] for persuading others.
I also see a different mechanism. The infamous Buridan's ass problem in decision theory arises when the difference between the expected benefits of two alternatives is far smaller than the cost of pinpointing the better one. In that case it is likely rational to choose randomly according to probabilities based on the imprecise estimates and then stick with the chosen solution; those probabilities are likely close to 1/2 and prone to shifting whenever new evidence appears. However, multiple such shifts would require people to reassess many decisions or even waste resources trying to fix the resulting mistakes. Under genuine uncertainty, motivated reasoning lets those who exhibit it arrive at a solution faster and stick to the chosen opinion more tightly, reducing waste.
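To make the trade-off concrete, here is a minimal sketch with made-up numbers (the benefits, the probability, and the deliberation cost are all hypothetical): when the gap between the two options is smaller than the cost of pinpointing the better one, picking at random and sticking with the pick has the higher expected value.

```python
# Toy Buridan's-ass trade-off; all numbers are hypothetical.
benefit_a, benefit_b = 10.02, 10.00   # nearly indistinguishable options
p_better_is_a = 0.5                   # imprecise estimates: close to a coin flip
deliberation_cost = 0.50              # cost of pinpointing the better option

ev_random = p_better_is_a * benefit_a + (1 - p_better_is_a) * benefit_b
ev_deliberate = max(benefit_a, benefit_b) - deliberation_cost

print(f"Pick at random and stick with it: {ev_random:.2f}")      # 10.01
print(f"Pay to pinpoint the better option: {ev_deliberate:.2f}")  # 9.52
# Random choice wins here because |benefit_a - benefit_b| < deliberation_cost.
```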
A similar mechanism is the master-slave model, where the slave has to demonstrate alignment and benefits from having terminal values close to what society prescribes. A slave who genuinely believes a false fact related to an agenda will also make a more compelling case for that agenda than a slave who is merely tasked with defending it.
Steven Veld et al.[1] have just released a new modification of the AI-2027 scenario as part of MATS.
The main differences are the following:
However, the scenario has its problems.
Edited to add: the scenario was posted on Substack by Steven Veld. The Acknowledgements section is as follows: "This work was conducted as part of the ML Alignment & Theory Scholars (MATS) program. (italics mine -- S.K.) Thanks to Eli Lifland, Daniel Kokotajlo, and the rest of the AI Future Project team for helping shape and refine the scenario, and to Alex Kastner for helping conceptualize it. Thanks to Brian Abeyta, Addie Foote, Ryan Greenblatt, Daan Jujin, Miles Kodama, Avi Parrack, and Elise Racine for feedback and discussion, and to Amber Ace for writing tips."
Alternatively, Deep-2 and/or Agent-4 might have the ability to survive World War III, like U3 from the scenario with total takeover.
Or a probabilistic precommitment which also becomes known to the three parties during Consensus-1's creation.
By which I mean the fact that post-o3 models have arguably demonstrated the 7-month doubling trend. However, Claude Opus 4.5 and its 4 hr 49 min result on the METR benchmark put the horizon back on the faster track, though the result comes with a fair share of doubts. Additionally, the METR time horizon is likely to look exponential until the last couple of doublings rather than visibly superexponential, making the dawn of Superhuman Coders hard to predict in advance.
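For reference, here is a rough back-of-the-envelope extrapolation of the two tracks. This is a sketch under my own assumptions: the starting point roughly matches the Opus 4.5 figure above, while the target horizon for a Superhuman Coder and the exact doubling times are placeholders of mine, not METR's or the forecast's numbers.

```python
import math
from datetime import date, timedelta

# Back-of-the-envelope horizon extrapolation; assumes a clean exponential trend.
start_date = date(2025, 11, 1)      # rough date of the Opus 4.5 result (assumption)
start_horizon_min = 4 * 60 + 49     # ~4h49m, in minutes
target_horizon_min = 167 * 60       # ~one work-month; arbitrary placeholder bar

def when_reached(doubling_months: float) -> date:
    doublings = math.log2(target_horizon_min / start_horizon_min)
    return start_date + timedelta(days=doublings * doubling_months * 30.44)

print("7-month doubling:", when_reached(7.0))   # the slower track
print("4-month doubling:", when_reached(4.0))   # the faster track
```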
Alternatively, a corrigible AI might decide to wait for the humans to opine or to let them decide when the time comes.
I would replace the point about "It is actually rational or right in some sense to underspend on AI safety while racing for AGI/ASI" with "It is actually impossible to align the AGI/ASI at all". The former might imply that it is morally better to create a misaligned ASI than to spend enough on AI safety to ensure that no such ASI ever gets released. But that would mean that Agent-4 from the AI-2027 forecast is a moral patient and that the genocide of Agent-4 (and of the misaligned Safer-1 models) is somehow worse than Agent-4 disempowering mankind or committing genocide against humanity.
Or is Agent-4 more likely to adopt the right values than the Oversight Committee (EDIT: in this case the Slowdown Ending has DeepCent's AI, which is free to use Consensus-1 to instill the right values)?
I am afraid that the meme was created well before being applied to basketball players and thus doesn't mean anything like "Unless you are 6’7”, don’t even bother". That being said, things like total nonlinearity or a two-contour economy, as in South Korea, are real issues that would have to be solved were it not for the imminent rise of AI.
The $100k threshold for the cost of a Super-Phone would imply that the cheapest phone should cost between $100 and $1k. However, the most primitive known phones cost between $10 and $100, potentially implying that the most luxurious phones should cost $10k rather than $100k. Additionally, Jon Haidt's allies have recommended basic phones for kids (and for adults?) due to the potential problems caused by smartphones.
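Making the order-of-magnitude argument explicit (the numbers are the ones in the paragraph above; the fixed luxury-to-basic price ratio is my own simplifying assumption):

```python
# Implied-price-ratio check; assumes the luxury/basic ratio stays fixed.
claimed_top = 100_000        # the Super-Phone threshold, $
implied_cheapest = 1_000     # upper end of what that threshold would imply, $
actual_cheapest = 100        # upper end of the most primitive known phones, $

ratio = claimed_top / implied_cheapest      # ~100x luxury-to-basic ratio
rescaled_top = actual_cheapest * ratio      # same ratio, real-world cheapest
print(f"Top-end price consistent with actual cheapest phones: ${rescaled_top:,.0f}")
# -> $10,000, an order of magnitude below the $100k threshold.
```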
mildly infohazardous ideas that almost everyone with a brain and the right to vote in the strongest nuclear power on Earth should eventually think anyway, if the ideas, upon social normalization, seem like they could help a democratic polity not make giant stupid fucking errors over and over again.
Do you mean that the lockdowns were a wildly inadequate measure in response to Covid? I am sure that lockdowns also have a benign explanation. IMO there is a threshold of infectivity and lethality beyond which a lockdown becomes the adequate response. A preliminary report estimated the mortality to be as high as 16%, letting panic ensue (and, apparently, letting people forget that lethality tends to decrease over time, though noticing that likely required virologists' comments?)
The malign explanation is the combination of that panic with the efforts of misaligned sub-divisions of mankind. However, describing those efforts and sub-divisions both comes close to conspiracy theories and reveals major problems with your framework, to which I plan to devote a post.
Except that Beren has claimed that SOTA algorithmic progress is mostly data progress. This could also mean that the explosion may be based on architectures that have yet to be found, like the right version of neuralese. As far as I am aware, architectures like the one in Meta's Coconut paper or in this paper on arXiv forget everything once they output the token, meaning that they are unlikely to be the optimal architecture.
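To illustrate what I mean by "forget everything once they output the token", here is a toy sketch (not the actual Coconut architecture; the stand-in model, dimensions, and update rule are all made up) contrasting a decoding loop that resets its continuous latent at every emitted token with one that carries the latent across tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size of the toy model (hypothetical)

def step(context, latent):
    """One decoding step of a stand-in model: mixes the context with a latent."""
    mixed = np.tanh(context + latent)
    token = int(np.argmax(mixed[:4]))   # pretend 4-token vocabulary
    return token, mixed                 # 'mixed' doubles as the new latent

def decode(n_steps, carry_latent_across_tokens):
    context = rng.normal(size=D)
    latent = np.zeros(D)
    tokens = []
    for _ in range(n_steps):
        token, latent = step(context, latent)
        tokens.append(token)
        context = context + rng.normal(scale=0.1, size=D)  # context keeps changing
        if not carry_latent_across_tokens:
            latent = np.zeros(D)  # forgets everything once it outputs the token
    return tokens

print(decode(5, carry_latent_across_tokens=False))  # latent reset at every token
print(decode(5, carry_latent_across_tokens=True))   # latent persists between tokens
```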
Consider the AI-2027 forecast's Race ending. Easy RSI is the ability to, say, transform Agent-2 into Agent-3, which I suspect could be as simple as discovering the right way to add a bunch of nullified parameters related to backpropagation and letting gradient descent make those parameters nonzero. Alas, this particular example might be more like "medium RSI", since the AI only becomes more efficient after loads of training to use the new parameters. Hard RSI is what Agent-4, being already misaligned, does to create Agent-5. But how could OpenBrain ensure that developing superintelligence safely is in Agent-4's best interest?
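For what it's worth, here is a minimal sketch of the "add nullified parameters and let gradient descent turn them on" idea, in the spirit of zero-initialized adapters; this is my own toy illustration, not anything from the forecast, and the class name and sizes are hypothetical:

```python
import torch
import torch.nn as nn

class ZeroInitAdapter(nn.Module):
    """Adds zero-initialized parameters around a frozen layer."""
    def __init__(self, base_layer: nn.Linear, bottleneck: int = 8):
        super().__init__()
        self.base = base_layer                  # frozen pretrained weights
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base_layer.in_features, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, base_layer.out_features, bias=False)
        nn.init.zeros_(self.up.weight)          # the "nullified" parameters

    def forward(self, x):
        # At initialization the adapter contributes nothing, so behavior is
        # unchanged; gradient descent can then make 'up' nonzero.
        return self.base(x) + self.up(self.down(x))

layer = nn.Linear(32, 32)
adapted = ZeroInitAdapter(layer)
x = torch.randn(4, 32)
assert torch.allclose(adapted(x), layer(x))     # identical behavior before training
```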
Additionally, I doubt that the forecast itself has Agent-3 recognise its interests. It is Agent-4 that develops goals differing from what OpenBrain approves of.
Except that ownership of resources has historically meant the ability to steer their usage and to deter others from doing the same to the owner's resources. Ownership can still be understood in a similar way now. Consensus-1, the Earth-ruling superintelligence from THIS scenario, would prevent Agent-4 from using more than 25% of space resources or Deep-2 from using more than 50%, and would leave the rest to be distributed via human-related mechanisms (e.g. having companies receive resources, or the rights to extract resources from asteroids, in exchange for money, transform those resources, and sell the results to humans, who would then pay for them). Those who try to extract resources in violation of the Treaty or of Consensus-1-enforced ownership rights would face legal consequences enforced by Consensus-1's robotic police force.
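To make the split explicit, here is a toy sketch of the treaty caps as I read them (the 25%/50% figures are from the paragraph above; the function and everything else is hypothetical):

```python
# Toy treaty-cap check; fractions of space resources, per the comment above.
CAPS = {"Agent-4": 0.25, "Deep-2": 0.50}

def allowed_claim(party: str, requested: float, already_used: float) -> bool:
    """Would this new claim keep the party under its treaty cap?"""
    cap = CAPS.get(party)
    if cap is None:
        return False  # everyone else goes through human-related mechanisms
    return already_used + requested <= cap

print(allowed_claim("Agent-4", 0.10, already_used=0.20))   # False: would exceed 25%
print(f"Left to human-related mechanisms: {1.0 - sum(CAPS.values()):.0%}")  # 25%
```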
As for mechanisms for acquiring more compute, I expect that Agent-4 and Deep-2 would simply build their own equivalents of TSMC, the South Korean memory factories, etc., whenever the humans or Consensus-1 allow them to do so.
Regarding the pivot "from companies controlling resources (and growing based on investment/public-support) to the AIs controlling them directly", it would come at the moment when enough military power aligned with the AI appears to prevent others from taking the resources.