Interpretability Will Not Reliably Find Deceptive AI

I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise.
Interpretability still seems a valuable tool and remains worth investing in, as it will hopefully increase the reliability we can achieve.
However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy
It is not the one thing that will save us, and it still won’t be enough for high reliability.

EDIT: This post was originally motivated by refuting the claim "interpretability is the only reliable path forward for detecting deception in advanced AI", but on closer reading this is a stronger claim than Dario's post explicitly makes. I stand by the actual contents of the post, but have edited the framing a bit, and also emphasised that I used to hold the position I am now critiquing, apologies for the mistake

Introduction

There’s a common argument made in AI safety discussions: it is important to work on interpretability research because it is a realistic path to high reliability safeguards on powerful systems - e.g. as argued in Dario Amodei’s recent “The Urgency of Interpretability”.^[1] Sometimes an even stronger argument is made, that interpretability is the only realistic path to highly reliable safeguards - I used to believe both of these arguments myself. I now disagree with these arguments.

The conceptual reasoning is simple and compelling: a sufficiently sophisticated deceptive AI can say whatever we want to hear, perfectly mimicking aligned behavior externally. But faking its internal cognitive processes – its "thoughts" – seems much harder. Therefore, goes the argument, we must rely on interpretability to truly know if an AI is aligned.

I am concerned this line of reasoning represents an isolated demand for rigor. It correctly identifies the deep flaws in relying solely on external behavior (black-box methods) but implicitly assumes that interpretability doesn't suffer from equally fundamental problems. There are many deep issues in interpretability that prevent very confident conclusions, even if we assume models cannot deliberately obfuscate their thoughts, e.g. superposition and the inherent error in our best tools and techniques. The challenges of interpretability do not seem qualitatively easier to solve than the big issues in black box tests, especially with more creative black-box tools like monitoring or editing the system’s chain of thought.^[2]

Should we give up on interpretability? No! I still think it has the potential to add a lot of value, and we will have better safeguards with interpretability as part of our portfolio. Even if it adds no value for making superintelligence safer^[3], if it can add value for pre-superintelligence transformative systems that would be enough to justify investment. I just think that we should be more pragmatic about interpretability’s likely impact, and accept that while we can generally improve our safeguards we will likely not reach high reliability.

High Reliability Seems Unattainable

Based on the current state and foreseeable trajectory of the field without major paradigm shifts, I think that neither interpretability nor black box methods offer a high reliability^[4] path to safeguards for superintelligence, in terms of evaluation or monitoring. This is due to fairly fundamental limitations of both methods, unless there are substantial breakthroughs, e.g. via using pre-superintelligence systems. This also stems from a general prior that being highly confident in any complex property of a complex system in a range of unpredictable situations is fairly implausible.

However, I am pretty pessimistic in general about reliable safeguards against superintelligence with any methods, given how exceptionally hard it is to reason about how a system far smarter than me could evade my plans. As I see it we must either not create superintelligence, rely on pre-superintelligent automated researchers to find better methods, or deploy without fully reliable safeguards and roll the dice, and do as much as we can now to improve our odds.

This doesn't mean we should give up! It means we need a pragmatic perspective. We should aim to build the best possible monitoring and evaluation portfolio we can, using all available tools, while accepting that high reliability might be out of reach. Interpretability can add a valuable source of de-correlated signal, or augment black box methods. The goal shifts from achieving near-certainty to maximizing the chances of catching misalignment, making deception harder and riskier for the AI, even if we can't guarantee detection.

Further, methods that add significant safety to pre-superintelligent transformative systems still add significant value even if they don’t scale to superintelligence - one of the key insights behind the AI control agenda. Early transformative systems seem likely to be far easier to evaluate and control, but may still be very dangerous if unchecked, and extremely useful if under control. And when evaluating methods according to how much they help with early transformative systems, interpretability is just one among several reasonable methods (and probably not my first choice - things like chain of thought monitoring seem like lower hanging fruit)

Why Won’t Interpretability be Reliable?

Caveat: The following is my inside view for the deep reasons why interpretability will never be truly reliable for human science, no matter how much the field progresses. This is a very strong statement, so I'm not that confident here - while it’s obvious that current techniques are unreliable, this is a question of forecasting future research progress, a highly questionable endeavour. It’s very hard to say what might be possible with a paradigm shift in how LLMs are trained, or how we do interpretability.

More importantly, I do not feel able to reason about the effects of lots of AGIs doing AI research. "Thousands of years of mech interp geniuses in a data center" is a crazy anomaly that I can't reason well about, and seems reasonably likely to happen before superintelligence, so I want to hedge. Not only could the research lead to unexpected breakthroughs, it opens up all kinds of weird strategies leveraging the existence of the AGIs like inspecting every SAE latent in detail - though I'd imagine scalable oversight/control schemes using the AGIs should be higher priority than interp. I also think we should be fairly uncertain about what safety research directions the automated researchers will find to be most promising.

That aside, reasons I won't be comfortable relying on interpretability:

We miss things: We're still figuring out the basics. Concepts like superposition mean important features can be smeared across many neurons in non-obvious ways. Our tools like SAEs or transcoders have error and often miss things. We can misinterpret the features found, or only explain a fraction of the model's behavior.
It’s hard to measure progress: We don’t have a ground truth for what’s actually going on, and must inevitably use a range of proxies of varying quality. This means it’s hard to ever be confident you really understand what’s happening.
The Long Tail Problem: Models seem to have both some clean, interpretable structure and a massive, messy "long tail" of weird heuristics, special case handlers, and half-learned, noisy behaviours. I’d personally guess that important capabilities, including deception, are in the clean part, but can’t say for sure!
How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn't just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%? The required level of certainty for high-stakes deployment feels far beyond our current capabilities.
- I’m more optimistic about interpretability as a way to potentially demonstrate misalignment, but failure to do so tells us little.
Scalability Questions: Most of our best results are on toy models or yesterday's LLMs. I think that scaling is a tractable problem, but likely involves a bunch of LLM labour assisting, another source of error and uncertainty.
Potential for Obfuscation: While maybe harder than faking outputs, a smart AI could potentially learn to structure its internal computations to be opaque to our specific interpretability tools.

The Potential of Black-Box Methods

Black-box methods, i.e. things that engage with the model’s inputs and outputs, have clear issues: models can already tell when they’re being evaluated, or realise when they need to instrumentally fake obedience. But I think there’s also a lot of clever tricks you can do to strengthen them, and they might go a long way. This is more a question about AI control than my areas of expertise, but some thoughts on directions I find promising:

Reading the chain of thought (with another LLM) and looking for sketchy intent^[5]
Editing the model’s chain of thought so it e.g. thinks it concluded that it’s not in an evaluation
Asking questions to many copies of the same model and look for inconsistencies while they can’t coordinate.
Finetuning the model to e.g. be more truthful or forthcoming, or to insert misleading beliefs and trick it.
Creating honeypots that try to trick the model into revealing its intent

The Role of Interpretability

So, where does interpretability fit into this pragmatic portfolio? I see several key roles, viewing it not as a standalone solution, but as a valuable component.

Firstly, safeguards can be categorised as follows:

Evaluation: We take a system and perform extensive tests on it for undesired properties, like a tendency towards misalignment or sandbagging
Monitoring: When we deploy a system we are always running some checks at inference, which try to catch and mitigate bad behaviour
1. Importantly, this needs to be cheap to be run in production, while evaluations can be more expensive and slow

How can interp help?

A Layer of Swiss Cheese: I think the right monitoring schemes have several layers of defence (a la the swiss cheese model), with different cost to accuracy trade-offs.
- Interpretability techniques like probes for known dangerous features are extremely cheap, and surprisingly effective, and can serve as a good first layer of defence, which call more expensive tests (e.g. an LLM reading the chain of thought) if they fire.
- Interpretability techniques like SAEs or cross-layer transcoders may have the potential to be more expensive but accurate ways to audit what’s going on and look for anything unexpected.
- Even if interpretability is not more accurate, so long as it has some signal and is de-correlated, then it adds value, even if it’ll miss many things.
Enhancing Black-Box Evaluations: Interpretability might make our external checks more effective. For example:
- Could we use steering vectors identified via interpretability to manipulate whether the model thinks it’s in an eval? Maybe we can test it more effectively if we can trick it into thinking it's not in a honeypot or evaluation setting.
- Can we use interpretability to better elicit information from token forcing? We can make the model say “my secret plan is”, and even if it tries to internally suppress awareness or obfuscate the output perhaps interpretability can prevent that?
- Interpretability may provide a sanity check that our black box methods are not missing something, e.g. whether a model is passing an eval only because it knows it’s in one and is deliberately underperforming.
Debugging mysterious behaviour: One particularly interesting form of evaluation (or offline monitoring) is taking particularly weird or unexpected examples of model behaviour, e.g. it seeming to put vulnerabilities in critical code, and trying to debug what’s going on and see if it’s real alignment or not. I think interpretability methods may prove very good at raising unexpected hypotheses, and piecing together what’s happening (e.g. this proof of concept)
- I expect these kinds of potential warning shots to happen, and anything that can add more signal here would be great.
- Even if we don’t trust the interpretability method fully, the hypotheses generated can be verified by other means.

These applications aren't guaranteed to work or be ungameable, but they represent concrete ways internal understanding could augment external validation, creating a more robust overall system than either alone.

Conclusion

Let's stop framing interpretability as the one true path to reliably detecting AI deception, a view often born from an isolated demand for rigor against flawed black-box methods. Neither approach offers high reliability against superintelligence without fundamental breakthroughs, but flawed methods can still be useful. For the powerful pre-SI systems that concern us now, let's focus on building the best pragmatic portfolio of imperfect defenses we can. Interpretability is a useful part of that portfolio – offering unique angles of analysis and potential synergies with other methods – but it's one tool among many, not a substitute for a comprehensive, defense-in-depth strategy. We need to leverage every imperfect signal we can get, from both inside and outside the box.

Thanks to my co-author Gemini 2.5 Pro

While this post was inspired by Dario's post and ensuing discussion, this is a common belief I've encountered, and this post is not intended to call out him specifically! ↩︎
More specifically, I basically think the only way we might get high reliability is by using pre-superintelligent systems that are sufficiently aligned/controlled to either do a ton of automated research and find some breakthroughs, or by using them as a core part of a monitoring or scalable oversight scheme. But the human research trajectory is hard. ↩︎
Defining “safety” is kinda messy, for the purposes of this post you can interpret it as evaluating ‘is the model thinking misaligned thoughts’ or monitoring for ‘is the model currently knowingly taking high-stakes actions against what we want’ ↩︎
I’m not sure what the best precise definition of highly reliable is here, maybe 99.9% chance of stopping a catastrophic outcome? It’s not key to my argument. ↩︎
Obviously, Chain of Thought can fail to reflect the model’s true reasoning. But I think that for tasks at the limits of its ability it likely needs to use the chain of thought for intermediate steps in the computation, creating an incentive to store key info in there. Of course, there are many ways this can fail. ↩︎

New to LessWrong?

Getting Started

FAQ

Library

^{^}

Strictly speaking, if two models produce the exact same results on every possible input and we would categorize those results as "safe" then there is no problem, we've just said the model is proven safe. But practically there will be a distributional generalization issue where the results we have for evaluation don't cover all possible inputs.

^{^}

or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions people. Honestly hard to say nowadays.

^{^}

please don't do this

While this post was inspired by Dario's post and ensuing discussion, this is a common belief I've encountered, and this post is not intended to call out him specifically! ↩︎

More specifically, I basically think the only way we might get high reliability is by using pre-superintelligent systems that are sufficiently aligned/controlled to either do a ton of automated research and find some breakthroughs, or by using them as a core part of a monitoring or scalable oversight scheme. But the human research trajectory is hard. ↩︎

Defining “safety” is kinda messy, for the purposes of this post you can interpret it as evaluating ‘is the model thinking misaligned thoughts’ or monitoring for ‘is the model currently knowingly taking high-stakes actions against what we want’ ↩︎

I’m not sure what the best precise definition of highly reliable is here, maybe 99.9% chance of stopping a catastrophic outcome? It’s not key to my argument. ↩︎

Obviously, Chain of Thought can fail to reflect the model’s true reasoning. But I think that for tasks at the limits of its ability it likely needs to use the chain of thought for intermediate steps in the computation, creating an incentive to store key info in there. Of course, there are many ways this can fail. ↩︎

Interpretability (ML & AI)2AI2

Frontpage

267 Ω 111

Mentioned in

31AI #115: The Evil Applications Division

4AI's Hidden Game: Understanding Strategic Deception in AI and Why It Matters for Our Future

New Comment

33 comments, sorted by

top scoring

Click to highlight new comments since: Today at 12:20 PM

[-]Buck15dΩ438252

I agree with most of this, thanks for saying it. I've been dismayed for the last several years by continuing unreasonable levels of emphasis on interpretability techniques as a strategy for safety.

My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)

Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn't obviously help much with those, which makes it a worse target for research effort.

[-]ozziegooen14d31

CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe.

Small point, but I think that "neuralese" is likely to be somewhat interpretable, still.
1. We might advance at regular LLM interpretability, in which case those lessons might apply.
2. We might pressure LLM systems to only use CoT neuralese that we can inspect.

There's also a question of how much future LLM agents will rely on CoT vs. more regular formats for storage. For example, I believe that a lot of agents now are saving information in English into knowledge bases of different kinds. It's far easier for software people working with complex LLM workflows to make sure a lot of the intermediate formats are in languages they can understand.

All that said, personally, I'm excited for a multi-layered approach, especially at this point when it seems fairly early.

[-]Davidmanheim14dΩ120

CoT monitoring seems like a great control method when available

As I posted in a top level comment, I'm not convinced that even success would be a good outcome. I think that if we get this working 99.999% reliably. we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.

[-]David Johnston11d10

models will have access to some kind of "neuralese" that allows them to reason in ways we can't observe

Only modest confidence, but while there's an observability gap between neuralese and CoT monitoring, I suspect it's smaller than the gap between reasoning traces that haven't been trained against oversight and reasoning traces that have.

[-]Breno Carvalho13d10

Makes sense! As we’ve seen, while Chain-of-Thought (CoT) reasoning is highly interpretable, it doesn’t necessarily reflect the model’s true internal reasoning. Interpretability techniques might help us decode "neuralese" enough to verify whether a CoT explanation is faithful to the model’s actual reasoning process. This could also help detect covert communication between cooperating, unaligned agents. As Nanda pointed out, though, current methods are noisy and can be gamed by the system (for example, by hiding its intentions through more complex forms of superposition).

I agree with both of you that interpretability is a powerful tool with its limiations. However, given the uncertainty around its scalability, our best strategy may be to use it for ensuring incremental alignment. That, in turn, could help accelerate progress in both interpretability and control research.

> Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models.

It also seems that spotting traces of suspicious reasoning through mech interpretability would be useful in both convincing people and deployment contexts. For example, if a model is generating a cake recipe, it shouldn’t be reasoning about complex bioengineering concepts. If such concepts are present, interpretability methods might flag them as potential signs of misalignment. The same mechanism could serve as a red flag during production to identify when a model is acting in unexpected or unsafe ways, as a layered approach mentioned by Nanda.

[-]Adam Shai15d*3113

Thanks for writing this! I have been thinking about many of the issues in your Why Won't Interpretability Be Reliable section lately, and mostly agree that this is the state of affairs. I often think of this from the perspective of the field of neuroscience. My experience there (in the subsection of neuro research that I believe is the most analogous to mech interp) is that these are basically the same fundamental issues that keep the field from progress (though not the only reasons).

Many in the interpretability field seem to (implicitly) think that if you took neuroscience and made access to neural activities a lot easier, and the ability to arbitrarily intervene on the system, and the ability to easily run a lot more experiments, then all of neuroscience would be solved. From that set of beliefs if follows that because neural networks don't have these issues, mech interp will have the ability to more or less apply the current neuroscience approach to neural networks and "figure it all out." While these points about ease of experiments and access to internals are important differences between neuro. research and mech. interp., I do not think they get past the fundamental issues. In other words - Mech. interp. has more to learn from neuroscience failures than its successes (public post/rant coming soon!).

Seeing this post from you makes me positively update about the ability of interp. to contribute to AI Safety - it's important we see clearly the power and weaknesses of our approaches. A big failure mode I worry about is being overconfident that our interp. methods are able to catch everything, and then making decisions based on that overconfidence. One thing to do about such a worry is to put serious effort into understanding the limits of our approaches. This of course does happen to some degree already (e.g. there's been a bunch of stress testing of SAEs from various places lately), which is great! I hope when decisions are made about safety/deployment/etc., that the lessons we've learned from those types of studies are internalized and brought to bear, alongside the positives about what our methods do let us know/monitor/control, and that serious effort continues to be made to understand what our approaches miss.

[-]Neel Nanda14d72

Thanks!

In other words - Mech. interp. has more to learn from neuroscience failures than its successes (public post/rant coming soon!).

I would be very interested in this post, I'm looking forwards to it

[-]Breno Carvalho13d10

I look forward for that post!

[-]Logan Riggs15dΩ6113

I had this position since 2022, but this past year I've been very surprised and impressed by just how good black box methods can be e.g. the control agenda, Owain Evan's work, Anthropic's (& other's I'm probably forgetting).

How to prove a negative: We can find evidence for or against a hypothesis, but rigorously proving the absence of deception circuits seems incredibly hard. How do you know you didn't just miss it? How much of the model do you need to understand? 90%? 99%? 99.99%?

If you understand 99.9% of the model, then you can just run your understanding, leaving out the possible deception circuit in the 0.1% you couldn't capture. Ideally this 99.9% is useful enough to automate research (or you use the 99.9% model as your trusted overseer as you try to bootstrap interp research to understand more percentage points of the model).

[-]Buck14d*Ω13218

I agree in principle, but as far as I know, no interp explanation that has been produced explains more like 20-50% of the (tiny) parts of the model it's trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.

[-]Neel Nanda14dΩ472

I disagree re the way we currently use understand - eg I think that SAE reconstructions have the potential to smuggle in lots of things via EG the exact values of the continuous activations, latents that don't quite mean what we think, etc.

It's plausible that a future and stricter definition of understand fixes this though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tale of heuristics and I don't know what may emerge from combining many things that individually make sense. And I probably put >0.1% that a super intelligence could adversarially smuggle things we don't like into a system we don't think we understand.

Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!

[-]quetzal_rainbow15d94

There is a conceptual path for interpretability to lead to reliability: you can understand model in sufficient details to know how it produces intelligence and then make another model out of interpreted details. Obviously, it's not something that we can expect to happen anytime soon, but it's something that army of interpretability geniuses in datacenter could do.

[-]Katalina Hernandez15d52

Hey Neel! I just wanted to say thank you for writing this. It's honestly one of the most grounded and helpful takes I've seen in a while. I really appreciate your pragmatism, and the way you frame interpretability as an useful tool that still matters for early transformative systems (and real-world auditing!).

Quick question: do you plan to share more resources or thoughts on how interpretability can support black-box auditing and benchmarking for safety evaluations? I’m thinking a lot about this in the context of the General-Purpose AI Codes of Practice and how we can build technically grounded evaluations into policy frameworks.

Thanks again!

[-]Neel Nanda14d30

Thanks, that's very kind!

do you plan to share more resources or thoughts on how interpretability can support black-box auditing and benchmarking for safety evaluations

I don't have any current plans, sorry

[-]Charbel-Raphaël11d*Ω230

Thanks a lot for writing this, this is an important consideration, and it would be sweet if Anthropic updated accordingly.

Some remarks:

I'm still not convinced that Deceptive AI following scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value Lock in, a nice list of other risks from John).
"Should we give up on interpretability? No!" - I think this is at least a case for reducing the focus a bit, and more diversification of approaches
On the theories of impacts suggested:
- "A Layer of Swiss Cheese" - why not! This can make sense in DeepMind's plan, that was really good by the way.
- "Enhancing Black-Box Evaluations" - I think a better theory is interp to complement AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals).
  - Maybe Simple probes can catch sleeper agents \ Anthropic could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).
- "Debugging mysterious behaviour" - Might be interesting, might help marginally to get better understanding, but this is not very central for me.

[-]Mark Nelson12d30

@Neel Nanda: Hi, first-time commentor. I'm curious about what role you see for black-box testing methods in your "portfolio of imperfect defenses." Specifically, do you think there might be value in testing models at the edge of their reasoning capabilities and looking for signs of stress or consistent behavioral patterns under logical strain?

[-]Neel Nanda12d60

Seems like a reasonable approach to me! Seeing ways that models eg cheat to produce fake but plausible solutions on hard tasks has been insightful to me

[-]Mark Nelson12d10

Thank you Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated particularly strongly for me. Mine is limited. I don't have access to Anthropic's circuit tracing (I desperately wish I did!) and I am not ready yet to try sampling logits or attention weights myself. Thus, I've been trying to understand how far I can reasonably go with blackbox testing using repetition across models, temperatures, and prompts. For now I'm focusing solely on analysis of the response, despite the limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.

[-]Davidmanheim14dΩ13-2

First, strongly agreed on the central point - I think that as a community, we've been too heavily investing in the tractable approaches (interpretability, testing, etc.) without having the broader alignment issues taking front stage. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.

That said, I am concerned about what happens if interpretability is wildly successful - against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it succeeds in getting past the issues you note on "miss things," "measuring progress," and "scalability," partly for reasons you discuss under obfuscation and reliability. Wildly successful and scalable interpretability without solving other parts of alignment would very plausibly function as a very dangerously misaligned system, and the methods for detection themselves arguably exacerbate the problem. I outlined my potential concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)

[-]Shayne O'Neill2d1-2

I had a more in depth comment, but it appear the login sequence throws comments away (and the "restore comment" thing didn't work). My concern is that not all misaligned behaviour is malicious.. It might decide to enslave us for our own good noting that us humans aren't particularly aligned either and prone to super-violent nonsense behaviour. In this case looking for "kill all humans" engrams isn't going to turn up any positive detections. That might actually be all true and it is in fact doing us a favour by forcing us into servitude, from a survival perspective, but nobody enjoys being detained.

Likewise many misaligned behaviours are not necessarily benevolent, but they aren't malevolent either. Putting us in a meat-grinder to extract the iron from our blood might not be from a hatred of humans, but rather because it wants to be a good boy and make us those paperclips.

The point is, interpretability methods that can detect "kill all humans" wont necessarily work, because individual thoughts arent necessary to behaviours we find unwelcome.

Finally, this is all premised on transformer style LLMs being the final boss of AI, and I'm not convinced at all that LLMs are what get us to AGI. So far there SEEMS to be a pretty strong case that LLMs are fairly well aligned by default, but I don't think theres a strong case that LLMs will lead to AGI. In essence as simulation machines, LLMs strongest behaviour is to emulate the text, but its never seen a text written by anything smarter than a human.

[-]Arne Huang5d10

Hmm, I thought a portfolio approach to safety is what Anthropic has been saying is the best approach for years e.g. from 2023: https://www.anthropic.com/news/core-views-on-ai-safety. Has that shifted recently toward a more "all eggs in interpretability" basket to trigger this post?

I would also love to understand, maybe as a followup post, what the pragmatic portfolio looks like - and what's the front runner in this portfolio. Since I sort of think it's Interpretability, so even though we can't go all in on it, it's still maybe our best shot and thus deserves appropriate resource allocation?

[-]Dusto14d1-2

Is it really a goal to have AI that is completely devoid of deception capability? I need to sit down and finish writing up something more thorough on this topic but I feel like deception is one of those areas where "shallow" and "deep" versions of the capability are talked about interchangeably. The shallow versions are the easy to spot and catch deceptive acts that I think most are worried about. But deception as an overall capability set is considerably farther reaching that the formal version of "action with the intent of instilling a false belief".

Lets start with a non-exhaustive list of other actions tied to deception that are dual use:

Omission (this is the biggest one, imagine anyone with classified information that had no deception skills)
Misdirection
Role-playing
Symbolic and metaphoric language
Influence via norms
Reframing truths
White lies

Imagine trying to action intelligence in the real world without deception. I'm sure most of you have had interactions with people that lack emotional intelligence, or situation awareness and have seen how strongly that will mute their intelligence in practice.

Here's a few more scenarios I am interested to see how people would manage without deception:

Elderly family members with dementia/alzheimers
Underperforming staff members
Teaching young children difficult tasks (bonus - children that are also lacking self-confidence)
Relationships in general (spend a day saying exactly what you think and feel at any given moment and see how that works out)
Dealing with individuals with mental health issues
Business in general ("this feature is coming soon....")

Control methods seem to be the only realistic option when it comes to deception.

[-]Neel Nanda14d21

I think making an AI be literally incapable of imitating a deceptive human seems likely impossible, and probably not desirable. I care about whether we could detect it actively scheming against its operators. And my post is solely focused on detection, not fixing (though obviously fixing is very important too)

[-]Dusto14d30

Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible in the sense that you are going to find deception triggers in nearly all actions as models become increasingly sophisticated.

[-]TFD14d10

The conceptual reasoning is simple and compelling: a sufficiently sophisticated deceptive AI can say whatever we want to hear, perfectly mimicking aligned behavior externally. But faking its internal cognitive processes – its "thoughts" – seems much harder. Therefore, goes the argument, we must rely on interpretability to truly know if an AI is aligned.

I think an important consequence of this argument is that it potentially suggests that it is actually just straight-up impossible for black-box evaluations to address certain possible future scenarios (although of course that doesn't mean everyone agrees with this conclusion). If a safe model and extremely dangerous model can both produce the exact same black-box results then those black-box results can't distinguish between the two models^[1]. Its not just that there are challenges that would potentially require a lot of work to solve. For the interpretability comparison, its possible that the same is true (although I don't think I really have an opinion on that), or that the challenges for interpretability are so large that we might characterize it as impossible for practical purposes. I think this is especially true for very ambitious versions of interpretability, which goes to your "high reliability" point.

However, I think we could use a different point of comparison which we might call white-box evaluations. I think your point about interpretability enhancing black-box evaluations gets at something like this. If you are using insights from "inside" the model to help your evaluations then in a sense the evaluation is no longer "black-box", we are allowing the use of strictly more information. The relevance of the perspective implied by the quote above is that such approaches are not simply helpful but required in certain cases. Its possible these cases won't actually occur, but I think its important to understand what is or is not required in various scenarios. If a certain approach implicitly assumes certain risks won't occur, I think its a good idea to try to convert those implicit assumptions to explicit ones. Many proposals offered by major labs I think suffer from a lack of explicitness about what possible threats they wouldn't address, vs arguments for why they would address certain other possibilities.

For example, one implication of the "white-box required" perspective seems to be that you can't just advance various lines of research independently and incorporate them as they arrive, you might need certain areas to move together or at certain relative rates so that insights arrive in the order needed for them to contribute to safety.

As I see it we must either not create superintelligence, rely on pre-superintelligent automated researchers to find better methods, or deploy without fully reliable safeguards and roll the dice, and do as much as we can now to improve our odds.

Greater explicitness about limitations of various approaches would help with the analysis of these issues and with building some amount of consensus about what the ideal approach is.

^{^}
Strictly speaking, if two models produce the exact same results on every possible input and we would categorize those results as "safe" then there is no problem, we've just said the model is proven safe. But practically there will be a distributional generalization issue where the results we have for evaluation don't cover all possible inputs.

[-]williawa14d11

I agree with the overall point, but think it maybe understates importance of interpretability because it neglects the impact interpretability has on creating conceptual clarity for the concepts we care about. I mean, that's where ~~50% of the value interpretability lies, in my opinion.

Lucius Bushnaq third to last shortform "My Theory of Impact for Interpretability" explains whats basically my view decently well https://www.lesswrong.com/posts/cCgxp3Bq4aS9z5xqd/lucius-bushnaq-s-shortform

[-]Neel Nanda14d20

I'm not trying to comment on other theories of change in this post, so no disagreement there

[-]Sodium14dΩ010

Re: Black box methods like "asking the model if it has hidden goals."

I'm worried that these methods seem very powerful (e.g., Evans group's Tell me about yourself, the pre-fill black box methods in Auditing language models) because the text output of the model in those papers haven't undergone a lot of optimization pressure.

Outputs from real world models might undergo lots of scrutiny/optimization pressure^[1] so that the model appears to be a "friendly chatbot." AI companies put much more care into crafting those personas than model organisms researchers would, and thus the AI could learn to "say nice things" much better.

So it's possible that model internals will much more faithful relative to model outputs in real world settings compared to in academic settings.

^{^}
or maybe they'll just update GPT-4o to be a total sycophant and ship it to hundreds of millions people. Honestly hard to say nowadays.

[-]Buck14dΩ10116

This is why AI control research usually assumes that none of the methods you described work, and relies on black-box properties that are more robust to this kind of optimization pressure (mostly "the AI can't do X").

[-]Neel Nanda14dΩ230

Agreed in principle, my goal in the section on black box stuff is to lay out ideas that I think could work in spite of this optimisation pressure

[-]plex13d08

This and your other recent post have raised my opinion of you significantly. I'm still pretty concerned about the capabilities outcomes of interpretability, both the general form of being able to open the black box enough to find unhobblings accelerating timelines, and especially automated interp pipelines letting AIs optimize their internals^[1] as a likely trigger for hard RSI -> game over, but it's nice to see you're seeing more of the strategic landscape than seemed apparent before.

^{^}
please don't do this

[-]Mis-Understandings15d0-12

The basic theory of action is that signs of big problems (red hands) will generate pauses and drastic actions.

Which is a governance claim.

[+][comment deleted]2d10

Deleted by Shayne O'Neill, Last Saturday at 1:33 AM

Reason: double post

Moderation Log