An Explication of Alignment Optimism
Some people have been getting more optimistic about alignment. But from a skeptical / high p(doom) perspective, the justifications for this optimism seem lacking. "Claude is nice and can kinda do moral philosophy" just doesn't address the concern that lots of long-horizon RL plus self-reflection will produce misaligned consequentialists (cf. Hubinger). So I think the casual alignment optimists aren't doing a great job of arguing their case.

Still, it feels like there's an optimistic update somewhere in the current trajectory of AI development. It really is kinda crazy how capable current models are, and how much I basically trust them. Paradoxically, most of this trust comes from a lack of capabilities (current models couldn't seize power right now if they tried).

...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb. Slow takeoff implies that we'll get the stupidest possible transformative AI first, and Moravec's paradox points to a similar conclusion. Calling LLMs a "cultural technology" can be a form of AI denialism, but there's still an important truth there: if the secret of our success is culture, then maybe culture(++) is all you need.

Of course, the concern is that soon after we have stupid AI systems, we'll have even less stupid ones. But on my reading, the MIRI types were skeptical that we could get the transformative stuff at all without the dangerous capabilities coming bundled in. I think LLMs and their derivatives have provided substantial evidence that we can.