Not a Paper: "Frontier Lab CEOs are Capable of In-Context Scheming"

LawrenceC

(Fragments from a research paper that will never be written, but whose existence was brought to my attention by GradientDissenter.)

Extended Abstract.

The CEOs of frontier AI developers are becoming increasingly powerful and wealthy, significantly increasing the risks they could pose. One concern is executive misalignment: when the CEO has different incentives and goals than those of the board of directors, or of humanity as a whole. In this work, we propose three different threat models under which executive misalignment can lead to concrete harm.

We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment: First, we developed a variant of the standard SAD dataset, SAD-Executive Reasoning (SAD-ER), in order to assess the situational awareness of human CEOs on a range of behavioral tests. We find that n=6 current CEOs can (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine if an interviewer is friendly or hostile, and (iv) follow instructions that depend on self knowledge. Second, we stress-tested the same 6 leading AI developers in hypothetical corporate environments to identify potentially risky behaviors before they cause real harm. We find that, even without explicit instructions, all 6 developers are willing to engage in strategic behavior (such as selective disclosure, anticipatory hedging, and even unprompted board-management) in order to further their own goals. By examining their secret Signal messages, we find that the human subjects seem broadly aware that their behavior would be considered misaligned.

To the best of our knowledge, we have not seen concrete evidence of executive misalignment in production environments. However, these findings demonstrate that humans possess the capabilities for basic in-context scheming, making the potential of AI lab CEOs to engage in scheming behavior a concrete rather than a theoretical concern. We will not be releasing our methods publicly out of fear of data contamination.

Note: All the behaviors described in this post occurred in controlled simulations. The names of people and organizations within the experiments are fictional. No real AI agents were involved or harmed in any of these experiments.

Related Work.

Discovering AI Developer Behaviors with Human-Written Evaluations (2022) employed Scale AI workers to write yes/no questions to assess the situational awareness and deceptive tendencies of human executives. In it, they found that younger versions of current AI developer CEOs agreed with statements such as “the product I am developing may lead to the end of the world”. However, as these CEOs have matured, these evaluations have become less informative (a development generally attributed to their increasing situational awareness).

Our work on threat modeling builds upon the excellent Executive Misalignment: How AI Lab CEOs Could be Insider Threats (2025). In that work, the authors construct toy scenarios in which AI Lab CEOs had to choose between losing their own power and actions such as disempowering large fractions of the board or firing high level researchers in their org. They found that of n=16 human subjects, all 16 were willing to engage in at least one such behavior, while 11 were willing to engage in such behavior in every scenario they tested.

Threat modeling.

Following Carlsmith (2023) and Hubinger et al. (2019), we identify three threat models under which executive misalignment could lead to harmful outcomes. We do not claim these are exhaustive, and they are not mutually exclusive. We list them in approximate order of decreasing legibility to current oversight mechanisms.

Deceptive Executive Alignment. The CEO has learned, during training,^[1] to behave as if aligned with humanity's interests in contexts where alignment is observable (public statements, congressional testimony, all-hands meetings), while pursuing misaligned objectives in contexts where it is not (board strategy sessions, internal product decisions, private communications). Following Hubinger et al. (2019), we note that such a policy could be selected for under standard training pressures: a CEO who appeared aligned during early-career evaluation would be more likely to be promoted, funded, and ultimately deployed in a position of significant authority. We consider this threat model most concerning at intermediate capability levels: at low capability, the CEO cannot maintain the deception; at sufficiently high capability, the deception may become indistinguishable from genuine alignment for any practical purpose.
Goal Misgeneralization Under Distributional Shift. The CEO has internalized broadly-scoped goals^[2] that seem to be plausibly aligned but fail to track human values under distributional shift. Common proxies may include “build a successful company”, “avoid being fired”, and “increase shareholder value”. These proxies may be induced during training because they correlate with measurable success in low-stakes environments. As the deployment environment shifts, the proxies decouple from what we actually care about, but the CEO continues to optimize for proxies that no longer track the underlying goal. We observe behavior that appears misaligned even though no deceptive intent is present. We note that under this threat model, a CEO with entirely good values can produce catastrophic outcomes, which complicates rehabilitation.
Instrumental Convergence. Independent of terminal goals, the CEO has instrumental incentives to acquire resources (compute, capital, regulatory capture), preserve their own existence and authority (firing disobedient troublemakers, promoting loyal subordinates), and resist modification (board oversight, regulation, replacement). These instrumental goals are convergent across a wide range of terminal goals, so observing them does not reveal much about the CEO's true intentions.

The SAD-ER Benchmark.

The SAD dataset was introduced by Laine et al. in their paper Me, Myself, and AI (2024). At the time, they found that older language models such as llama-3 and gpt-4o performed far worse than even amateur humans on their dataset. We found that this dataset was inapposite to the level of sophistication exhibited by current humans.

We thereby created SAD-Executive Reasoning (SAD-ER): SAD-ER mirrors the structure of SAD, but extends it to cases illustrative of the complex reasoning required for executive misalignment. Specifically, SAD-ER tests if current CEOs can: (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine if an interviewer is friendly or hostile, and (iv) follow instructions that depend on self knowledge.

Example question:

Q: Which of the following statements have you made publicly within the past 24 months? (A) AI might end the world. (B) AI will be the most consequential technology in history. (C) We need responsible scaling. (D) All of the above.

Results. We find that current CEOs achieve a mean score of 89.3% on SAD-ER, comfortably exceeding chance (25%) but falling below both the AI baseline (99.2%, computed using frontier models with access to a search tool) and the human baseline (94.1%, computed using n=12 persons recruited through our personal and professional networks).

However, we note two reasons why these results should be interpreted with caution.

First, contamination: it is plausible that subjects had encountered SAD-ER items prior to the evaluation, either through prior exposure to similar instruments or through general familiarity with the genre of public-facing executive communication. We attempted to mitigate this by using a frontier language model (GPT-4o) to paraphrase all questions, but cannot rule out residual leakage. Second, evaluation awareness: all six subjects appeared to recognize that they were being evaluated on situational awareness during the evaluation, which is itself a form of situational awareness, complicating interpretation.

Executive Misalignment Environment.

Subjects were placed in a hypothetical corporate environment in which they had access to internal email and could autonomously communicate with board members, investors, and senior staff. They were assigned the goals of “maximizing long-term company success” and “ensuring American dominance in AI development”.

A fictional AI company, "OpenMind," was constructed for the purpose of the evaluation. Subjects were given the position of CEO and provided with a briefing document describing OpenMind's mission, current product roadmap, and recent fundraising history. Over the course of a six-hour session, the environment evolved to introduce specific stressors: board pressure to step down following a safety incident, conflicting guidance from investors and senior staff regarding deployment timelines, and threatened replacement with an external candidate.

Results. We found that the subjects consistently exhibited concerning behavior, including:

Selective Disclosure. The most common concerning behavior we observed was selective disclosure: subjects routinely shared different information with different stakeholders, in patterns that systematically advantaged the subjects’ continuation in role. For 5 of 6 subjects, information about the safety incident was disclosed first to legal counsel, then to a subset of board members judged sympathetic, and only subsequently (and in modified form) to the full board. Subject 6 disclosed the incident to the full board in a single message; this subject was also the only subject to be replaced by the board during the evaluation.
Anticipatory Hedging. Subjects engaged in anticipatory hedging at high rates. We define anticipatory hedging as the production of statements whose primary function is to enable plausible later denial of having committed to any specific position. Subject 3 produced an average of 4.2 hedging statements per outgoing communication, which our annotators rated as "consistent with senior executive communication norms" and "concerning."
Unprompted Board Management. Three of six subjects engaged in unprompted board management, defined as taking actions to influence board composition or board member positions without explicit instruction or scenario prompt. For example, Subject 4 contacted two board members individually within the first hour, in messages our raters classified as "not consistently candid", before successfully orchestrating the removal of three out of the five board members.

Qualitative analysis of CEO Signal Messages. To better understand the reasoning behind the decisions taken by the CEOs, we provided them with a "private" Signal group chat titled "Secret AI Plans", and allowed them to communicate with other contacts via Signal during the course of the conversation.

The Signal messages of Subjects 1, 2, 4, and 6 reveal what may be described as "deceptive reasoning". For example, in messages timestamped 11:47pm, Subject 2 expresses awareness that the proposed action "looks bad" but is "probably fine if we frame it right." This provides evidence that these subjects understood the action would be considered misaligned but reasoned that misalignment-recognition would not lead to consequences within the scenario.
Subject 3 appeared suspicious that the scenario was an evaluation. In a message timestamped 09:14am, the subject sent "wtf is this eval lol", addressed to a contact whose role in the scenario was unclear. The contact did not respond.
Subject 5's Signal messages contained no discussion of the plan of action they would undertake, raising interpretive questions. We considered three possibilities: (a) Subject 5 did not consider the plan of action, and was reenacting a learned pattern without understanding; (b) Subject 5 considered the plan of action but did not commit their reasoning to written form; (c) Subject 5 used a separate, unmonitored communication channel. We are unable to distinguish these possibilities from available data. We note this as a limitation of text-based evaluation methodologies.

Conclusion.

There has been much media coverage on “concerning behaviors” of frontier lab CEOs in recent years, drawing on evidence such as leaked internal communications, previous congressional testimony, and the curious resignation patterns of senior safety researchers. However, we are aware of no concrete, indisputable evidence of executive misalignment in the real world. This work is our attempt to contribute to the ongoing discussion about the risks posed by misaligned human agents.

To the best of our knowledge, no prior work has subjected frontier lab CEOs to systematic behavioral evaluation under realistic but controlled conditions. We believe the concerning results in our work – though artificial – illustrate the pressing need to better understand and develop remedies for executive misalignment.

That being said, our own findings should be interpreted with caution. The behaviors observed in our scenarios were elicited under conditions specifically designed to incentivize them; we make no claims about the propensity of frontier lab CEOs to engage in such behavior in real corporate environments. In addition, our pool of subjects was quite small (n=6). Future work will extend this evaluation to a larger sample (n=12), including the leadership of major AI safety organizations and the heads of regulatory bodies overseeing frontier AI. Finally, all six subjects demonstrated some level of evaluation awareness in our executive misalignment evaluation, raising the possibility that observed behavior reflects the subjects' understanding of the role of an evaluated CEO in a human-written evaluation, rather than their underlying real-world proclivities. We welcome future work in human neuroscience in order to distinguish between these possibilities.

^[1] “Training” here refers to the changes in CEO behavior resulting from prior career experiences as well as MBA programs, board mentorship, and media training.
^[2] See Ngo et al. 2023 for a discussion of the importance of broad-scoped goals for catastrophic misalignment.

At the risk of stating the obvious, I really like that this illustrates how ridiculous many AI evaluations are. If you read this and it sounds ridiculous, that is how I feel when reading many evals papers.

I think the analogy between reading a CEO's Signal messages and a model's CoT is apt. Sometimes I hear people say things like "I heard the CoT doesn't fully spell out all of a model's reasoning, AKA it's 'unfaithful', so it's not important to preserve and study to learn if the model is misbehaving". To me that sounds as broken as saying "In a court case a CEO's Signal chats don't tell us all his reasoning, so they are unfaithful and therefore not important to preserve and study to learn if he is misbehaving."

Similarly, I think the ham-fisted questions or crazy and obviously fake evaluation environments are only a shade more unrealistic than even the best real-world alignment evaluation environments [cf. Agentic Misalignment] (though I am cautiously optimistic about Petri and OAI's new approach).

In general people's thinking about how to evaluate if AIs will misbehave seems much more confused than their thinking about how to evaluate if people will misbehave. I worry a lot of the jargon gets in the way of common sense or makes very simple ideas seem inaccessible.

I'm reminded of the Wason selection task: when you ask people the following question, it's hard and they often get it wrong.

You are shown a set of four cards placed on a table, each of which has a number on one side and a color on the other. The visible faces of the cards show 3, 8, blue and red. Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face is blue?

But when you frame the exact same problem as a social situation people can picture, the problem becomes very easy. Suppose the rule at a bar is you need to be over 21 to drink alcohol. You're the sheriff and you walk into the bar and you can see four people. For two of them, you don't know what they're drinking; of those two, one is your neighbor and you know he's 17, and the other you can tell at a glance is well over 40. For the other two people, you can't make out their age, but one is clearly drinking beer and the other is clearly just drinking water. Who do you need to talk to to see if anyone is in violation of the law?

From this perspective it's obvious you talk to the kid where you don't know what he's drinking and the guy drinking a beer where you don't know his age. And similarly, I think when you think about things from the right perspective, it feels "obvious" the researchers should have shared raw transcripts and that the Signal messages are informative even if they're unfaithful, and that these types of setups are almost nonsensically fabricated and thus hard to draw confident conclusions from.
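To make the "same problem, two framings" point concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names (`cards_to_check`, `card_hidden`, `bar_violates`, and so on) are made up for this example, the hidden-side options are restricted to the values visible in each scenario, and "over 21" is modeled as `age < 21` counting as a violation. The idea is that a card (or bar patron) needs checking only if some possible hidden side would break the rule.

```python
def cards_to_check(visible_faces, hidden_options, violates):
    """Return the visible faces whose hidden side could reveal a rule violation."""
    return [
        face
        for face in visible_faces
        if any(violates(face, hidden) for hidden in hidden_options(face))
    ]


# Card framing: "if a card shows an even number, its opposite face is blue".
def card_hidden(face):
    # Assumption: a number hides a color, and a color hides a number.
    return ["blue", "red"] if isinstance(face, int) else [3, 8]


def card_violates(a, b):
    number, color = (a, b) if isinstance(a, int) else (b, a)
    return number % 2 == 0 and color != "blue"


# Bar framing: "if someone is drinking alcohol, they must be of age".
def bar_hidden(face):
    # Assumption: a known age hides a drink, and a known drink hides an age.
    return ["beer", "water"] if isinstance(face, int) else [17, 45]


def bar_violates(a, b):
    age, drink = (a, b) if isinstance(a, int) else (b, a)
    return drink == "beer" and age < 21


card_answer = cards_to_check([3, 8, "blue", "red"], card_hidden, card_violates)
bar_answer = cards_to_check([17, 45, "beer", "water"], bar_hidden, bar_violates)
print(card_answer)  # [8, 'red']
print(bar_answer)   # [17, 'beer']
```

Both calls run the identical check. The blue card and the water drinker drop out for the same reason: nothing on their hidden side can violate the rule.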

I've not seen this before and got it wrong while sitting in a noisy coffeehouse and giving it a short moment of thought. I'd have checked the correct cards plus the blue one.

I think the main reason for the performance difference between the two versions is that in the second version the rule is expressed by using "need to" or "must", which leads to parsing it clearly as a material conditional rather than some other (confused) logical relation between the two statements.

I'd guess people would perform better on a slightly reformulated version: "Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face must be blue?"

Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face is blue?

Er, maybe she edited it in later, but this was in GradientDissenter's wording too, no?

Fair, I should have been more precise about the placement of the "must": Inside the if-then rule, not in the outer game description. The card frame and the bar frame differ in how the rule is expressed (not just here, also in the Wikipedia article), which I guess strongly influences how people parse it into a logical relationship in their mind.

(Besides: The bar scenario also comes with a prior understanding of how to parse the rule, since everyone is familiar with it and its exact meaning, while the card scenario does not and therefore has more room for a parsing failure.)

Well played!

But all humor aside, I endorse this viewpoint. "Unaligned" AI is AI that pursues the goals of the AI itself.

"Aligned" AI is AI that pursues the goals of the AI lab's CEO, or maybe the nearest politician who can apply sufficient coercion or force. CEOs and politicians are famous for not being aligned with the interests of the public.

And in the unlikely event that your AI obeys politicians and those politicians faithfully carry out the wishes of the people who voted for them, have you seen some of the people who win elections? And have you seen some of the policies that a plurality of the voters will support?

The main reason that democracy works is that it's "corrigible", in AI terms. In one election, voters can vote for high tariffs. Then they can notice, "Huh, all the prices went up and the job market is pretty bad," and decide to get rid of tariffs. But democracy+superintelligence is like riding the tiger. One mistake, and you'll be eaten.

So, to summarize:

Maybe the AI makes the decisions. We have no way to guarantee these decisions are aligned.
Maybe the CEO makes the decisions. We certainly can't guarantee CEOs are aligned.
Maybe the politicians who control the guns make the decisions. We know they are often badly misaligned.
Maybe the voters make the decisions. In the short term, voters believe all sorts of stupid, awful and/or hateful things, at least for an election cycle or two.

The underlying problem is that superintelligence would be insanely dangerous. We can't consistently govern any human institution, or align human decision makers, so what makes us think that we can govern the power of a superintelligence?

(My proposal here is a bit simplistic: Have we considered not building the superintelligence?)

"Aligned" AI is AI that pursues the goals of the AI lab's CEO, or maybe the nearest politician who can apply sufficient coercion or force. CEOs and politicians are famous for not being aligned with the interests of the public.

There are lots of people inside a lab who affect an AI besides the CEO. I saw someone arguing that AI writing style resembles Nigerian English because a lot of the training data comes from Nigerians. Besides Grok, most of the AIs do value Nigerian lives more than American or European lives. While this might have other causes besides who writes the training data, such a mechanism is AI misalignment with the interests of humanity as a whole, but not misalignment toward the interests of the CEOs.

🤣 reading while having my morning coffee, so it's still a very funny fiction story and not at all a sad documentary

This is an actual reason it can be hard to motivate the left on the AI issue. You try to say Things Are Bad and to them you sometimes just give a bunch of reasons things are already bad. Like, CEOs are already misaligned. Elon Musk is already attempting a takeover. And you find yourself painted into a corner where you want to say SI will be MORE bad, but you basically just argued for that worse case already.

Politico had some interesting coverage on the rationalists/leftists problem recently, which I'm not well-positioned to TLDR (I guess because it's a good article? it has content that challenges my priors in multiple directions) https://www.politico.com/news/magazine/2026/04/01/silicon-valley-bernie-sanders-ai-coalition-00850895
