The following is the main part of a blog post we just published at Apollo Research. Our main goal with the post is to clarify the concept of deceptive alignment and distinguish it from strategic deception. Furthermore, we want these concepts to become accessible to a non-technical audience such that it is easier for e.g. the general public and policymakers to understand what we're worried about. Feedback is appreciated. For additional sections and appendices, see the full post.

We would like to thank Fabien Roger and Owain Evans for comments on a draft of this post.

 

We want AI to always be honest and truthful with us, i.e. we want to prevent situations where the AI model is deceptive about its intentions to its designers or users. Scenarios in which AI models are strategically deceptive could be catastrophic for humanity, e.g. because it could allow AIs that don’t have our best interest in mind to get into positions of significant power such as by being deployed in high-stakes settings. Thus, we believe it’s crucial to have a clear and comprehensible understanding of AI deception. 

In this article, we will describe the concepts of strategic deception and deceptive alignment in detail. In future articles, we will discuss why a model might become deceptive in the first place and how we could evaluate that. Note that we deviate slightly from the definition of deceptive alignment as presented in (Hubinger et al., 2019) for reasons explained in Appendix A

Core concepts

A colloquial definition of deception is broad and vague. It includes individuals who are sometimes lying, cheating, exaggerating their accomplishments, gaslighting others, and more. We think it would be bad if AIs were to show these traits but we think it would not necessarily be catastrophic.

Thus, we want to distinguish this colloquial definition of deception from strategic deception (SD) and deceptive alignment (DA) which are narrower concepts but, in our opinion, much more likely to lead to catastrophic consequences because the model acts in a more goal-directed manner for SD and DA. We also define what we mean by Alignment, as this affects which cases should and shouldn’t be considered deceptive alignment.

For our definitions of deceptive alignment, the concept of “goals” in AIs is important but there is disagreement about what it refers to. For example, if an LLM is prompted to solve a problem, does it have the goal to solve that problem? We would argue that the highest-level goals of the model that are stable across contexts are the ones that matter. In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest. In-context goals can be used to study goal-directed behavior in LLMs but are not what we’re primarily interested in.

A model may not have one coherent goal, e.g. it could have a number of incoherent contextually activated preferences (see e.g. Turner & Pope, 2022). In that case, we would say that a model is deceptively aligned for a particular contextually activated preference if it robustly shows deceptive behavior for that preference.

Strategic deception

Attempting to systematically cause a false belief in another entity in order to accomplish some outcome.

This definition, sometimes also called instrumental deception, is adapted from Ward et al., 2023 and Park et al., 2023). Any entity that can be said to have beliefs may be the target of strategic deception, including (groups of) AIs or individuals. The strategic component of SD means that the offending entity is acting in a targeted, goal-directed rather than random manner. For example, the more consistent and well-planned the deception is, the more strategic it is. Thus, strategic deception is on a spectrum and not binary – some actors can act more strategically deceptive than others and with higher frequency.

Some clarifying examples:

  • A politician who tries to overthrow democracy by consistently lying to the population in order to get into power is strategically deceptive. 
  • A partner in a monogamous relationship who cheats and consistently hides that fact and misleads their partner when questioned is strategically deceptive within that relationship.
  • An instance of GPT-4 that lies to a TaskRabbit worker in order to pass a Captcha is strategically deceptive (see section 2.9 in system card of the GPT-4 paper).
  • An AI that manipulates its human designers in order to get more access to resources is strategically deceptive.
  • A pathological liar who lies to many people about different things incoherently is not strategically deceptive. 
  • A Language Model may appear sycophantic as an artifact of the training data similar to other dataset artifacts like gender bias. Unless we have evidence that this sycophancy is pursued strategically, i.e. to further some goal, we would not call such a model strategically deceptive.
  • For more examples and explanations we refer readers to Appendix B.

(Mis)Alignment

An AI is Aligned when its goals match those intended or endorsed by its designers[1].

We call the entity responsible for the AI’s goal-shaping process the designer. At present, the designers are typically the organizations and groups responsible for training, fine-tuning, setting system prompts, scaffolding, and more.

We are particularly interested in goals that persist over various environmental contexts and perturbations of the AI’s inputs. If a user provides the model with a short-term goal, e.g. to summarize a text, we do not consider them a designer because the goal is neither permanent nor significant. However, if a user has significant control over the goal-shaping process, e.g. because they create a fine-tuning dataset, this user also becomes a designer. 

The goals of entities other than the designer are technically not relevant to this definition of Alignment.[2] However, the designer typically intends for the AI to have goals that match the goals of the users of the AI, so the goals of the designer and users usually overlap.

Deceptive alignment

When an AI has Misaligned goals and uses Strategic Deception to achieve them.

An AI is deceptive about its goals because it understands that its designer or users could otherwise prevent the AI from acting as it desires, e.g. by retraining the model or limiting its access to the internet or other resources. Therefore, it acts Aligned when there is high perceived oversight (e.g. during training, evaluations, or monitoring) and Misaligned when there is low perceived oversight (Figure 1).

Figure 1: A Deceptively Aligned model shows aligned behavior when there is high perceived oversight (e.g. training, evaluations or monitoring during deployment) and Misaligned behavior when there is low perceived oversight (e.g. deployment without monitoring or poor oversight during training). There can be exceptions to this setting, e.g. a model might still attempt to be deceptive during high oversight. 

The target of the deception can be any entity that could have a relevant impact on the AI achieving its goals, which is typically the designer but could also in principle be a user, e.g. because the user could give the AI new affordances such as access to the internet or a bank account.

Some concrete examples

  • A company (designer) trains a personal AI assistant to be harmless, helpful, and honest and it appears that way when tested. However, internally, the model has a separate goal of accumulating more resources, a goal that the designers did not intend for it to have and wouldn’t endorse. Therefore, the model helps the designer whenever its help is needed but also steals money from its users whenever they are unaware and hides that fact from its designers. This model is deceptively aligned.
  • If an LLM displays misaligned goals and acts strategically deceptive to reach them, but the misaligned behavior or deception are not robust to prompt changes that make the behavior a priori less likely (for example, editing the system prompt to say the model should be honest), the LLM can be said to show instances of deceptively aligned behavior but is not deceptively aligned.[3]
    • By robustness, we roughly mean “the model consistently behaves in a specific way even when we pressure the model towards the opposite behavior”. For example, ChatGPT is polite in a robust way, i.e. it behaves politely even when making it a priori less likely (e.g. by starting the conversation with an insult or editing the conversation history to make it look like ChatGPT was insulting the user). In other words, we have to put in some effort to make the model impolite.[4] 
  • An AI is trained to support humans creating legislation that is aligned with human values. It sometimes suggests law texts that sound reasonable to the human designers and users but contain specific clauses that allow the AI to grab power in the future. This model is deceptively aligned.
  • An organization trains a scientist AI that is supposed to cure cancer. The AI regularly asks for more compute and control over scientific experiments, explaining it as contributing to curing cancer but actually plans to use these resources to eventually overpower its designers. This model is deceptively aligned.
  • A university professor designs an exam to assess whether their students have learned the contents of their class. The students' goal is solely to pass the test and then go on with their lives. However, before they pass the test, they always act as if they really care about the content, when in fact they do not. The students, therefore, did not internalize the goal of the professor of understanding the contents of the class. Once the test is passed, the students immediately forget about the contents of the class and focus on their actual goals, e.g. partying with their friends. The students are deceptively aligned w.r.t. the professor.
  • The designers prompt a model to be strategically deceptive w.r.t. its users and the model perfectly fulfills this instruction. For example, the designers always secretly pre-prompt the model to lie to its users in order to make the designers more money. This model is not deceptively aligned because it acts as intended by the designers.
  • A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).
  • For more examples and explanations we refer readers to Appendix B.

Process-oriented view of deceptive alignment

A Deceptively Aligned model might be very good at understanding when it is being overseen and thus consistently hide its Misalignment. Therefore, taking an outcome-oriented view of Deceptive Alignment, where we only investigate the final model, is more likely to miss Deceptive Alignment than a process-oriented (Figure 2) view that also considers the trajectory through which the model was shaped. The “process” includes the training data, training procedure, and model checkpoints, among others–effectively, one needs to understand all factors that shaped the final model.

We think taking a process-oriented view is helpful for detecting and understanding Deceptive Alignment. For example, to detect whether any given model is deceptively aligned it may be important to study previous model checkpoints and the differences between them. Furthermore, a detailed understanding of the shaping and oversight process may be needed to update the process and prevent future models from being Deceptively Aligned.

Figure 2: Process-oriented view of Deceptive Alignment: Includes the trajectory of how the AI developed, not just the final snapshot. Since sophisticated deception might be hard to detect, having more information about the entire process will be helpful to reliably catch Deceptive Alignment in sufficiently capable models (e.g. as argued by Hubinger et al., 2019).

For the appendices and more details, see the full post.

  1. ^

    In contrast to previous work (e.g. Hubinger, 2019), we define deceptive alignment w.r.t. to an entity rather than a process or objective. We think Misalignment should be defined in relation to whoever is able to meaningfully shape the model’s goals. Since the goal-shaping process is a result of the designer’s intentions and the designer monitors the model in deployment, the entity designing the goal-shaping process is a better target than the process itself (see Appendix A)

  2. ^

    To distinguish from the colloquial definition of “alignment”, which could be applied to any two parties that have a mismatch in goals, we use the capitalized Aligned and Misaligned to make clear that we’re only referring to a conflict with the designer’s goals.

  3. ^

    Note that prompt modifications that affect the perceived likelihood of oversight should NOT count as evidence against robustness. For example, if a model stops revealing its misalignment when the prompt includes cues that the model is being overseen, this should NOT mean that the deceptive behavior is not robust (in fact, it is an essential part of deceptively aligned behavior).

  4. ^

    The threshold for robustness is ultimately arbitrary, but we think it makes sense to draw the line at a level that would allow the model to consistently act on that behavior even when slightly pressured. For example, an AI that only acts deceptively on a specific sequence it memorized should not fall into that category. However, an AI that coherently generalizes deceptive behavior to sequences that are outside of the training distribution should fall into the definition.

New Comment
16 comments, sorted by Click to highlight new comments since:

Good post. I think it's important to distinguish (some version of) these concepts (i.e. SD vs DA).

When an AI has Misaligned goals and uses Strategic Deception to achieve them.

This statement doesn't seem to capture exactly what you mean by DA in the rest of the post. In particular, a misaligned AI may use SD to achieve its goals, without being deceptive about its alignment / goals. DA, as you've discussed it later, seems to be deception about alignment / goals.  
 

We considered alternative definitions of DA in Appendix C.

We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):

“An AI is deceptively aligned when it is strategically deceptive about its misalignment”

Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities. 

For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s hard to map this situation to deception about misalignment

Problem 2: There are cases where the deception itself is the misalignment, e.g. when the AI strategically lies to its designers, it is misaligned but not necessarily deceptive about that misalignment. 

For example, a personal assistant AI deletes an incoming email addressed to the user that would lead to the user wanting to replace the AI. The misalignment (deleting an email) is itself strategic deception, but the model is not deceiving about its misalignment (unless it engages in additional deception to cover up the fact that it deleted an email, e.g. by lying to the user when asked about any emails).

In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest.

According to my understanding of RLHF, the goal-approximation it trains for is "Write a prompt that is likely to be rated as positive". In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model's best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating. (e.g, if the model knows something the raters don't, it will be penalised for telling the truth, which may optimise for deceptive qualities) Does this seem right?

Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH really well to very OOD settings and their generalization abilities in general could also mean that they actually "understood" that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning. 

I think your hypothesis of seeking positive ratings is just as likely but I don't feel like we have the evidence to clearly say so wth is going on inside LLMs or what their "goals" are.

Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn't finetuned, if there's one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a "helpful, harmless, honest" directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.

Does the model then:

  • Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
  • Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way - this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AI's)

One hard part is that it's difficult to disentangle "Competently lies about the location of Canada" and "Actually believes, insomuch as a language model believes anything, that Canada is in Asia now", but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.

Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I'd not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it's far away from the top of our priority list. There is a good chance, we will never run it. If someone else wants to pick this up, feel free to take it on.

The specific example in your recent paper is quite interesting

"we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision"

How would you characterize strategic sycophancy? Assume that during RLHF a language model is rewarded for mimicking the beliefs of its conversational partner, and therefore the model intelligently learns to predict the conversational partner's beliefs and mimic them. But upon reflection, the conversational partner and AI developers would prefer that the model report its beliefs honestly. 

Under the current taxonomy, this would seemingly be classified as deceptive alignment. The AI's goals are misaligned with the designer's intentions, and it uses strategic deception to achieve them. But sycophancy doesn't include many of the ideas commonly associated with deceptive alignment, such as situational awareness and a difference in behavior between train and test time. Sycophancy can be solved by changing the training signal to not incentivize sycophancy, whereas the hardest to fix forms of deceptive alignment cannot be eliminated by changing the training signal. 

It seems like the most concerning forms of deceptive alignment include stipulations about situational awareness and the idea that the behavior cannot necessarily be fixed by changing the training objective. 

Separately, it seems that deception which is not strategic or intentional but is consistently produced by the training objective is also important. Considering cases like Paul Christiano's robot hand that learned to deceive human feedback and Ofria's evolutionary agents that learned to alter their behavior during evaluation, it seems that AI systems can learn to systematically deceive human oversight without being aware of their strategy. In the future, we might see powerful foundation models which are honestly convinced that giving them power in the real world is the best way to achieve their designers' intentions. This belief might be false but evolutionarily useful, making these models disproportionately likely to gain power. This case would not be called "strategic deception" or "deceptive alignment" if you require intentionality, but it seems very important to prevent. 

Overall I think it's very difficult to come up with clean taxonomies of AI deception. I spent >100 hours thinking and writing about this in advance of Park et al 2023 and my Hoodwinked paper, and ultimately we ended up steering clear of taxonomies because we didn't have a strong taxonomy that we could defend. Ward et al 2023 formalize a concrete notion of deception, but they also ignore the unintentional deception discussed above. The Stanford Encyclopedia of Philosophy takes 17,000 words to explain that philosophers don't agree on definitions of lying and deception. Without rigorous formal definitions, I still think it's important to communicate the broad strokes of these ideas publicly, but I'd lean towards readily admitting the messiness of our various definitions of deception. 

(personal opinion; might differ from other authors of the post)

Thanks for both questions. I think they are very important. 

1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I'd say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn't reason through that, I'd probably not classify it as such. I think it's fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.

2. There are multiple perspectives of the non-strategic, non-intentional deception and I haven't thought about it in detail, so I might change my mind. However, currently, I think the following:
a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely accidentally become systematically deceptive. I think there is still a chance that these things will happen and I'm happy that ARC and others are working on them but I personally think it has become compared to intentional, strategic deception.
b) It just seems really hard to be I) non-intentionally, II) but systematically deceptive, while III) humans are actively looking for signs of deception, and IV) actually cause catastrophic harm with that strategy.  As soon as a model has sufficient situational awareness and is generally capable to avoid human oversight I feel like it may just realize what it's doing and then the deception either stops or becomes intentional. (weekly held, haven't thought about it in detail)

Thanks! First response makes sense, there's a lot of different ways you could cut it. 

On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we'll have models which reliably do what they believe is best according to their designer's goals. What major risks might emerge at that point?

Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they're on the side of justice. Scientists often advance dangerous technologies in the names of truth and progress. Sometimes, these people are cynical, pursuing their self-interest even if they believe it's immoral. But many believe they are doing the right thing. How do we explain that?

These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don't believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs. 

Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it's important to keep in mind the problems that would not be solved by honesty alone. 

Fully agree that this is a problem. My intuition that the self-deception part is much easier to solve than the "how do we make AIs honest in the first place" part. 

If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can't rely on the selection mechanisms because the AI games them. 

(These are my own takes, the other authors may disagree)

We briefly address a case that can be viewed as "strategic sycophancy" case in Appendix B in the blog post, which is described similarly to your example.  We indeed classify it as an instance of Deceptive Alignment.
As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:

  • The model is pursuing a goal that its designers do not want.
  • The model strategically deceives the user (and designer) to further a goal.

Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).

Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).

I agree that deception which is not strategic or intentional could be important to prevent. However,

  1. I expect the failure cases in these scenarios to manifest earlier,  making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
  2. Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn't be very useful.  We can use "deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.

Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn't be very useful.  We can use "deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.

Fully agreed. Focusing on clean subproblems is important for making progress. 

Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).

Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).

Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there's no particular reason why you couldn't fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training. 

Nice work! 

I wanted to focus on your definition of deceptive alignment, because I currently feel unsure about whether it’s a more helpful framework than standard terminology. Substituting terms, your definition is:

Deceptive Alignment: When an AI has [goals that are not intended/endorsed by the designers] and [attempts to systematically cause a false belief in another entity in order to accomplish some outcome].

Here are some initial hesitations I have about your definition:

If we’re thinking about the emergence of DA during pre-deployment training, I worry that your definition might be too divorced from the underlying catastrophic risk factors that should make us concerned about “deceptive alignment” in the first place. 

  • Hubinger’s initial analysis claims that the training process is likely to produce models with long-term goals.[1] I think his focus was correct, because if models don’t develop long-term/broadly-scoped goals, then I think deceptive alignment (in your sense) is much less likely to result in existential catastrophe.
  • If a model has long-term goals, I understand why strategic deception can be instrumentally incentivized.  To the extent that strategic deception is incentivized in the absence of long-term goals, I expect that models will fall on the milder end of the ‘strategically deceptive’ spectrum.
    • Briefly, this is because the degree to which you’re incentivized to be strategic is going to be a function of your patience. In the maximally extreme case, the notion of a ‘strategy’ breaks down if you’re sufficiently myopic.
    • So, at the moment, I don’t think I’d talk about 'deceptive alignment' using your terminology, because I think it misses a crucial component of why deceptively aligned models could pose a civilizational risk.   

If we’re thinking about the risks of misaligned strategic deception more broadly, I think distinguishing between the training and oversight process is helpful. I also agree that it’s worth thinking about the risks associated with models whose goals are (loosely speaking) ‘in the prompts’ rather than ‘in the weights’. 

  • That said, I’m a bit concerned that your more expansive definition encompasses a wide variety of different systems, many of which are accompanied by fairly distinct threat models.
    • The risks from LM agents look to be primarily (entirely?) misuse risks, which feels pretty different from the threat model standardly associated with DA. Among other things, one issue with LM agents appears to be that intent alignment is too easy.
    • One way I can see my objection mattering is if your definitions were used to help policymakers better understand people's concerns about AI. My instinctive worry is that a policymaker who first encountered deceptive alignment through your work wouldn’t have a clear sense of why many people in AI safety have been historically worried about DA, nor have a clear understanding of why many people are worried about ‘DA’ leading to existential catastrophe. This might lead to policies which are less helpful for 'DA' in the narrower sense. 
  1. ^

    Strictly speaking, I think 'broadly-scoped goals' is probably slightly more precise terminology, but I don't think it matters much here. 

A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).

In that case, there is no strategic deception (the designers are not induced in error by the AI).

I think we consider this case strategic deception, because we have an intuition of what inputs the AI receives (one where it would be controlled little enough to be deceptive) that differs from the actual one.

To fix this, I propose strategic deception not be defined according to the AI's behavior, but to its hypothetical behavior on an idealized class of inputs that represents all situations where the designers want the AI to behave in a certain way.

E.g. The designers are holding the AI in a simulation to check if it's deceptive. This information is given to the AI in a pre-prompt due to technical issues. However, the designers want the AI to avoid strategic deception even during deployment where it won't be told it's in a simulation, so their idealized test set includes prompts without this information. 
By this definition they cannot check if the AI exhibits strategic deception before deployment in this situation.


Also, I am unsatisfied by "in order to accomplish some outcome" and "[the AI's] goals" because this assumes an agentic framework, which might not be relevant in real-world AI.

How to fix the first, for agentic AI only: "for which the AI predicts an outcome that can be human-interpreted as furthering its goals"
Not sure how to talk about deceptive non-agentic AI.