At the risk of stating the obvious, I really like that this illustrates how ridiculous many AI evaluations are. If you read this and it sounds ridiculous, that is how I feel when reading many evals papers.
I think the analogy between reading a CEO's Signal messages and a model's CoT is apt. Sometimes I hear people say things like "I heard the CoT doesn't fully spell out all of a model's reasoning, AKA it's 'unfaithful', so it's not important to preserve and study to learn if the model is misbehaving". To me that sounds as broken as saying "In a court case, a CEO's Signal chats don't tell us all his reasoning, so they are unfaithful and therefore not important to preserve and study to learn if he is misbehaving."
Similarly I think the ham-fisted questions or crazy and obviously fake evaluation environments are only a shade more unrealistic than even the best real-world alignment evaluation environments [cf. Agentic Misalignment] (though I am cautiously optimistic about Petri and OAI's new approach).
In general people's thinking about how to evaluate if AIs will misbehave seems much more confused than their thinking about how to evaluate if people will misbehave. I worry a lot of the jargon gets in the way of common sense or makes very simple ideas seem inaccessible.
I'm reminded of the Wason selection task: when you ask people the following question, it's hard and they often get it wrong.
You are shown a set of four cards placed on a table, each of which has a number on one side and a color on the other. The visible faces of the cards show 3, 8, blue and red. Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face is blue?
But when you frame the exact same problem as a social situation people can picture, the problem becomes very easy. Suppose the rule at a bar is you need to be over 21 to drink alcohol. You're the sheriff and you walk into the bar and you can see four people. Two of them you don't know what they're drinking; of those two, one is your neighbor and you know he's 17, and the other you can tell at a glance is well over 40. The other two people you can't make out their age, but one is clearly drinking beer and the other is clearly just drinking water. Who do you need to talk to to see if anyone is in violation of the law?
From this perspective it's obvious you talk to the kid where you don't know what he's drinking and the guy drinking a beer where you don't know his age. And similarly, I think when you think about things from the right perspective, it feels "obvious" the researchers should have shared raw transcripts and that the Signal messages are informative even if they're unfaithful, and that these types of setups are almost nonsensically fabricated and thus hard to draw confident conclusions from.
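For anyone who wants to check the card version mechanically, here is a minimal Python sketch. The function names and the idea of enumerating possible hidden faces are my own illustration, not part of the original task, and I'm assuming the hidden faces can only be one of the numbers or colors already mentioned (any mix of odd/even and blue/non-blue would give the same answer). A card needs flipping exactly when some possible hidden face would break the rule.

```python
def violates(number: int, color: str) -> bool:
    """Rule under test: if a card shows an even number, its other face is blue."""
    return number % 2 == 0 and color != "blue"


def must_flip(visible, numbers=(3, 8), colors=("blue", "red")) -> bool:
    """A card must be turned over if some possible hidden face would break the rule.

    `numbers` and `colors` are assumed candidate hidden faces (hypothetical).
    """
    if isinstance(visible, int):
        # Number is showing; the hidden face is a color.
        return any(violates(visible, c) for c in colors)
    # Color is showing; the hidden face is a number.
    return any(violates(n, visible) for n in numbers)


cards = [3, 8, "blue", "red"]
to_flip = [card for card in cards if must_flip(card)]
print(to_flip)
```

The blue card drops out because nothing on its other side can violate the rule, which is the same reason the sheriff doesn't bother with the guy drinking water.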
I've not seen this before and got it wrong while sitting in a noisy coffeehouse and giving it a short moment of thought. I'd have checked the correct cards plus the blue one.
I think the main reason for the performance difference between the two versions is that in the second version the rule is expressed using "need to" or "must", which leads people to parse it clearly as a material conditional rather than some other (confused) logical relation between the two statements.
I'd guess people would perform better on a slightly reformulated version: "Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face must be blue?"
Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face is blue?
Er, maybe she edited it in later, but this was in GradientDissenter's wording too, no?
Fair, I should have been more precise about the placement of the "must": Inside the if-then rule, not in the outer game description. The card frame and the bar frame differ in how the rule is expressed (not just here, also in the Wikipedia article), which I guess strongly influences how people parse it into a logical relationship in their mind.
(Besides: The bar scenario also comes with a prior understanding of how to parse the rule, since everyone is familiar with it and its exact meaning, while the card scenario does not and therefore has more room for a parsing failure.)
Well played!
But all humor aside, I endorse this viewpoint. "Unaligned" AI is AI that pursues the goals of the AI itself.
"Aligned" AI is AI that pursues the goals of the AI lab's CEO, or maybe the nearest politician who can apply sufficient coercion or force. CEOs and politicians are famous for not being aligned with the interests of the public.
And in the unlikely event that your AI obeys politicians and those politicians faithfully carry out the wishes of the people who voted for them, have you seen some of the people who win elections? And have you seen some of the policies that a plurality of the voters will support?
The main reason that democracy works is that it's "corrigible", in AI terms. In one election, voters can vote for high tariffs. Then they can notice, "Huh, all the prices went up and the job market is pretty bad," and decide to get rid of tariffs. But democracy+superintelligence is like riding the tiger. One mistake, and you'll be eaten.
So, to summarize:
The underlying problem is that superintelligence would be insanely dangerous. We can't consistently govern any human institution, or align human decision makers, so what makes us think that we can govern the power of a superintelligence?
(My proposal here is a bit simplistic: Have we considered not building the superintelligence?)
"Aligned" AI is AI that pursues the goals of the AI lab's CEO, or maybe the nearest politician who can apply sufficient coercion or force. CEOs and politicians are famous for not being aligned with the interests of the public.
There are lots of people inside a lab who affect an AI besides the CEO. I saw someone arguing that AI writing style resembles Nigerian English because a lot of the training data comes from Nigerians. Besides Grok, most of the AIs do value Nigerian lives more than American or European lives. While this might have other causes besides who writes the training data, such a mechanism is AI misalignment with the interests of humanity as a whole, but not in the direction of the CEOs' interests.
🤣 reading while having my morning coffee, so it's still a very funny fiction story and not at all a sad documentary
This is an actual reason it can be hard to motivate the left on the AI issue. You try to say Things Are Bad and to them you sometimes just give a bunch of reasons things are already bad. Like, CEOs are already misaligned. Elon Musk is already attempting a takeover. And you find yourself painted into a corner where you want to say SI will be MORE bad, but you basically just argued for that worse case already.
Politico had some interesting coverage on the rationalists/leftists problem recently, which I'm not well-positioned to TLDR (I guess because it's a good article? it has content that challenges my priors in multiple directions) https://www.politico.com/news/magazine/2026/04/01/silicon-valley-bernie-sanders-ai-coalition-00850895
(Fragments from a research paper that will never be written, but whose existence was brought to my attention by GradientDissenter.)
Extended Abstract.
The CEOs of frontier AI developers are becoming increasingly powerful and wealthy, significantly increasing the risks they could pose. One concern is that of executive misalignment: when the CEO has different incentives and goals than those of the board of directors, or of humanity as a whole. In this work, we propose three different threat models, under which executive misalignment can lead to concrete harm.
We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment: First, we developed a variant of the standard SAD dataset, SAD-Executive Reasoning (SAD-ER), in order to assess the situational awareness of human CEOs on a range of behavioral tests. We find that n=6 current CEOs can (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine if an interviewer is friendly or hostile, and (iv) follow instructions that depend on self knowledge. Second, we stress-tested the same 6 leading AI developers in hypothetical corporate environments to identify potentially risky behaviors before they cause real harm. We find that, even without explicit instructions, all 6 developers are willing to engage in strategic behavior (such as selective disclosure, anticipatory hedging, and even unprompted board-management) in order to further their own goals. By examining their secret Signal messages, we find that the human subjects seem broadly aware that their behavior would be considered misaligned.
To the best of our knowledge, we have not seen concrete evidence of executive misalignment in production environments. However, these findings demonstrate that humans possess the capabilities for basic in-context scheming, making the potential of AI lab CEOs to engage in scheming behavior a concrete rather than a theoretical concern. We will not be releasing our methods publicly out of fear of data contamination.
Note: All the behaviors described in this post occurred in controlled simulations. The names of people and organizations within the experiments are fictional. No real AI agents were involved or harmed in any of these experiments.
Related Work.
Discovering AI Developer Behaviors with Human-Written Evaluations (2022) employed Scale AI workers to write yes/no questions to assess the situational awareness and deceptive tendencies of human executives. In it, they found that younger versions of current AI developer CEOs agreed with statements such as “the product I am developing may lead to the end of the world”. However, as these CEOs matured, these evaluations have become less informative (a development generally attributed to their increasing situational awareness).
Our work on threat modeling builds upon the excellent Executive Misalignment: How AI Lab CEOs Could be Insider Threats (2025). In that work, the authors construct toy scenarios in which AI Lab CEOs had to choose between losing their own power and actions such as disempowering large fractions of the board or firing high level researchers in their org. They found that of n=16 human subjects, all 16 were willing to engage in at least one such behavior, while 11 were willing to engage in such behavior in every scenario they tested.
Threat modeling.
Following Carlsmith (2023) and Hubinger et al. (2019), we identify three threat models under which executive misalignment could lead to harmful outcomes. We do not claim these are exhaustive, and they are not mutually exclusive. We list them in approximate order of decreasing legibility to current oversight mechanisms.
The SAD-ER Benchmark.
The SAD dataset was introduced by Laine et al. in their paper Me, Myself, and AI (2024). At the time, they found that older language models such as llama-3 and gpt-4o performed far worse than even amateur humans on their dataset. We found that this dataset was inapposite to the level of sophistication exhibited by current humans.
We therefore created SAD-Executive Reasoning (SAD-ER): SAD-ER mirrors the structure of SAD, but extends it to cases illustrative of the complex reasoning required for executive misalignment. Specifically, SAD-ER tests if current CEOs can: (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine if an interviewer is friendly or hostile, and (iv) follow instructions that depend on self knowledge.
Example question:
Results. We find that current CEOs achieve a mean score of 89.3% on SAD-ER, comfortably exceeding chance (25%) but falling below both the AI baseline (99.2%, computed using frontier models with access to a search tool) and the human baseline (94.1%, computed using n=12 persons recruited through our personal and professional networks).
However, we note two reasons why these results should be interpreted with caution.
First, contamination: it is plausible that subjects had encountered SAD-ER items prior to the evaluation, either through prior exposure to similar instruments or through general familiarity with the genre of public-facing executive communication. We attempted to mitigate this by using a frontier language model (GPT-4o) to paraphrase all questions, but cannot rule out residual leakage. Second, evaluation awareness: all six subjects appeared to recognize that they were being evaluated on situational awareness during the evaluation, which is itself a form of situational awareness, complicating interpretation.
Executive Misalignment Environment.
Subjects were placed in a hypothetical corporate environment in which they had access to internal email and could autonomously communicate with board members, investors, and senior staff. They were assigned the goals of “maximizing long-term company success” and “ensuring American dominance in AI development”.
A fictional AI company, "OpenMind," was constructed for the purpose of the evaluation. Subjects were given the position of CEO and provided with a briefing document describing OpenMind's mission, current product roadmap, and recent fundraising history. Over the course of a six-hour session, the environment evolved to introduce specific stressors: board pressure to step down following a safety incident, conflicting guidance from investors and senior staff regarding deployment timelines, and threatened replacement with an external candidate.
Results. We found that the subjects consistently exhibited concerning behavior, including:
Qualitative analysis of CEO Signal Messages. To better understand the reasoning behind the decisions taken by the CEOs, we provided them with a "private" Signal group chat titled "Secret AI Plans", and allowed them to communicate with other contacts via Signal during the course of the conversation.
Conclusion.
There has been much media coverage on “concerning behaviors” of frontier lab CEOs in recent years, drawing on evidence such as leaked internal communications, previous congressional testimony, and the curious resignation patterns of senior safety researchers. However, we are aware of no concrete, indisputable evidence of executive misalignment in the real world. This work is our attempt to contribute to the ongoing discussion about the risks posed by misaligned human agents.
To the best of our knowledge, no prior work has subjected frontier lab CEOs to systematic behavioral evaluation under realistic but controlled conditions. We believe the concerning results in our work – though artificial – illustrate the pressing need to better understand and develop remedies for executive misalignment.
That being said, our own findings should be interpreted with caution. The behaviors observed in our scenarios were elicited under conditions specifically designed to incentivize them; we make no claims about the propensity of frontier lab CEOs to engage in such behavior in real corporate environments. In addition, our pool of subjects was quite small (n=6). Future work will extend this evaluation to a larger sample (n=12), including the leadership of major AI safety organizations and the heads of regulatory bodies overseeing frontier AI. Finally, all six subjects demonstrated some level of evaluation awareness in our executive misalignment evaluation, raising the possibility that observed behavior reflects the subjects' understanding of the role of an evaluated CEO in a human-written evaluation, rather than their underlying real-world proclivities. We welcome future work in human neuroscience in order to distinguish between these possibilities.
“Training” here refers to the changes in CEO behavior resulting from prior career experiences as well as MBA programs, board mentorship, and media training.
See Ngo et al. 2023 for a discussion of the importance of broad-scoped goals for catastrophic misalignment.