LESSWRONG
LW

All of Daniel Tan's Comments + Replies

Daniel Tan's Shortform

Daniel Tan11h30

"Emergent obfuscation": A threat model for verifying the CoT in complex reasoning.

It seems likely we'll eventually deploy AI to solve really complex problems. It will likely be extremely costly (or impossible) to directly check outcomes, since we don't know the right answers ourselves. Therefore we'll rely instead on process supervision, e.g. checking that each step in the CoT is correct.

Problem: Even if each step in the CoT trace is individually verifiable, if there are too many steps, or the verification cost per step is too high, then it ma... (read more)

Daniel Tan's Shortform

Daniel Tan18h10

I get the sense that I don't read actively very much. By which I mean I have a collection of papers that have seemed interesting based on abstract / title but which I haven't spent the time to engage further.

For the next 2 weeks, will experiment with writing a note every day about a paper I find interesting.

Viliam's Shortform

I think so! And I think patch tokenization may resolve this; see note

meemi's Shortform

Daniel Tan1d1111

am curious why you think this; it seems like some people were significantly misled and disclosure of potential conflicts-of-interest seems generally important

Daniel Tan's Shortform

RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into.

Daniel Tan's Shortform

Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended.

This makes model organisms of steg really hard to make IMO.

OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.

Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Sorry I didn’t really understand the question. What do you mean specifically?

2eggsyntax2d

Eg in this experiment, the model is asked the question: In order to answer that question, the model needs to a) know that it responds in 'hello' acrostics, and b) know that responding in 'hello' acrostics is different from how default GPT-4 responds, and c) know that responding in acrostics is the only (or at least main) way in which it's different from default GPT-4. a) is the core thing under test: can the model introspect about its behavior? But b) and c) seem pretty puzzling to me. How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task? There may be some obvious technical answer to this, but I don't know what it is. Hopefully that's clearer? If not then you may need to clarify what's unclear.

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography).

Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.

Daniel Tan's Shortform

Rough thoughts on getting better at integrating AI into workflows

AI (in the near future) likely has strengths and weaknesses vs human cognition. Optimal usage may not involve simply replacing human cognition, but leveraging AI and human cognition according to respective strengths and weaknesses.

AI is likely to be good at solving well-specified problems. Therefore it seems valuable to get better at providing good specifications, as well as breaking down larger, fuzzier problems into more smaller, more concrete ones
AI can generate many different soluti

... (read more)

Daniel Tan's Shortform

Additional thoughts on how RL fine-tuning would induce obfuscation.

Evading process supervision: It's plausible that we won't be able to provide outcome-based oversight, but instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect 'subtle' or obfuscated versions of this, then LMs are more likely to obfuscate their reasoning.
Reward hacking. Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to

... (read more)

1[comment deleted]2d

Daniel Tan's Shortform

Making language models refuse robustly might be equivalent to making them deontological.

Epistemic status: uncertain / confused rambling

For many dangerous capabilities, we'd like to make safety cases arguments that "the model will never do X under any circumstances".

Problem: for most normally-bad things, you'd usually be able to come up with hypothetical circumstances under which a reasonable person might agree it's justified. E.g. under utilitarianism, killing one person is justified if it saves five people (c.f. trolley problems).

However... (read more)

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Haha, I think I have an unfair advantage because I work with the people who wrote those papers :) I also think looking for papers is just hard generally. What you're doing here (writing about stuff that interests you in a place where it'll probably be seen by other like-minded people) is probably one of the better ways to find relevant information

Edit: Happy to set up a call also if you'd like to chat further! There are other interesting experiments in this space that could be done fairly easily

2rife4d

I would absolutely like to chat further. Please send me a DM so we can set that up!

Daniel Tan's Shortform

Daniel Tan4d130

Some rough notes from Michael Aird's workshop on project selection in AI safety.

Tl;dr how to do better projects?

Backchain to identify projects.
Get early feedback, iterate quickly
Find a niche

On backchaining projects from theories of change

Identify a "variable of interest" (e.g., the likelihood that big labs detect scheming).
Explain how this variable connects to end goals (e.g. AI safety).
Assess how projects affect this variable
Red-team these. Ask people to red team these.

On seeking feedback, iteration.

Be nimble. Empirical.

... (read more)

4James Chua3d

Writing hypothetical paper abstracts has been a good quick way for me to figure out if things would be interesting.

A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Introspection is really interesting! This example where language models respond with the HELLO pattern (and can say they do so) is actually just one example of language models being able to articulate their implicit goals, and more generally to out-of-context reasoning.

2rife4d

Wow. I need to learn how to search for papers. I looked for something like this even generally and couldn't find it, let alone something so specific

Daniel Tan's Shortform

That's pretty interesting! I would guess that it's difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe "willingness to self-report honestly" should be something we train models to do.

2the gears to ascension3d

willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I'll try some more ideas.

Daniel Tan's Shortform

Daniel Tan4d126

"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour.

There's lots of circumstantial evidence that LMs have some concept of self-identity.

Language models' answers to questions can be highly predictive of their 'general cognitive state', e.g. whether they are lying or their general capabilities
Language models know things about themselves, e.g. that they are language models, or how they'd answer questions, or their internal goals / values
Language models' self-identity may directly influence the

... (read more)

5eggsyntax3d

Self-identity / self-modeling is increasingly seeming like an important and somewhat neglected area to me, and I'm tentatively planning on spending most of the second half of 2025 on it (and would focus on it more sooner if I didn't have other commitments). It seems to me like frontier models have an extremely rich self-model, which we only understand bits of. Better understanding, and learning to shape, that self-model seems like a promising path toward alignment. I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I've chosen y), and various other phenomena like coming up with plausible but potentially false narratives. 'Introspection' in language models has mostly meant ability to self-predict, in the literature I've looked at. I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about the topic. Thanks for this, some really good points and cites.

4the gears to ascension4d

Partially agreed. I've tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren't that good, the latter ones somewhat were okay, I asked Claude to say the same thing across 10 branches and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.

Daniel Tan's Shortform

Yeah. I definitely find myself mostly in the “run once” regime. Though for stuff I reuse a lot, I usually invest the effort to make nice frameworks / libraries

Daniel Tan's Shortform

My experience so far with writing all my notes in public.

For the past ~2 weeks I've been writing LessWrong shortform comments every day instead of writing on private notes. Minimally the notes just capture an interesting question / observation, but often I explore the question / observation further and relate it to other things. On good days I have multiple such notes, or especially high-quality notes.

I think this experience has been hugely positive, as it makes my thoughts more transparent and easier to share with others for feedback. The upvo... (read more)

2

1

1Yonatan Cale1d

I'm trying the same! You have my support

1Sheikh Abdur Raheem Ali2d

Thanks for sharing your notes Daniel!

Inference-Time-Compute: More Faithful? A Research Note

Observation: Detecting unfaithful CoT here seems to require generating many hypotheses ("cues") about what underlying factors might influence the model's reasoning.

Is there a less-supervised way of doing this? Some ideas:

Can we "just ask" models whether they have unfaithful CoT? This seems to be implied by introspection and related phenomena
Can we use black-box lie detection to detect unfaithful CoT?

1

1James Chua6d

thanks! heres my initial thought about introspection and how to improve on the setup there: in my introspection paper we train models to predict their behavior in a single forward pass without CoT. maybe this can be extended to this articulating cues scenario such that we train models to predict their cues as well. still, im not totally convinced that we want the same setup as the introspection paper (predicting without CoT). it seems like an unnecessary restraint to force this kind thinking about the effect of a cue in a single forward pass. we know that models tend to do poorly on multiple steps of thinking in a forward pass. so why handicap ourselves? my current thinking is that it is more effective for models to generate hypotheses explicitly and then reason about what effects their reasoning afterwards. maybe we can train models to be more calibrated about what hypotheses to generate when they carry out their CoT. seems ok.

Daniel Tan's Shortform

Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here?

Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing.
Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some tim

... (read more)

Daniel Tan's Shortform

Actually we don’t even need to train a new byte latent transformer. We can just generate patches using GPT-2 small.

Do the patches correspond to atomic concepts?
If we turn this into an embedding scheme, and train a larger LM on the patches generated as such, do we get a better LM?
Can't we do this recursively to get better and better patches?

How do you deal w/ Super Stimuli?

Tbh I struggle with this too, and also have ADD tendencies

Some people here have recommended complete abstinence, but that’s almost never worked for me.

I think a better strategy is mindful consumption. Eg

before you start, figure out what you’re trying to get out of it, like “relax / re-energise / unwind till I feel rested”.
before you start, decide what channels you’re going to watch.
before you start, set a timer / set a natural breakpoint of # of videos.
If you find yourself really wanting to watch the next video, try saving it to a “watch lat

... (read more)

Where should one post to get into the training data?

Answer by Daniel TanJan 15, 202520

The Pile was created from Reddit datasets between 2005 and 2020. It's plausible that modern scraping practices continue to scrape from Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once). Multiple copies doesn't help since the Pile is subsequently de-duped

9gwern6d

Reddit blocks scrapers now aggressively, because it's charging a fortune for access, and The Pile could no longer have been created (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.

Daniel Tan's Shortform

Feedback from Max:

The definition he originally gave leaves out a notion of 'stealthiness' as opposed to 'gibberish'; 'stealthy' reasoning is more concerning.
OOCR may not meet the opaqueness criterion, since the meaning of 'f' is pretty obvious from looking at the fine-tuning examples. This is like 'reasoning in a foreign language' or 'using esoteric terminology'; it's less legible to the average human, but not outside the realm of human semantics. (But still interesting! just that maybe OOCR isn't needed)
Backdoors have a similar concern to the above.

... (read more)

Daniel Tan's Shortform

Yes, this is definitely highly dependent on circumstances. see also: https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/

Daniel Tan's Shortform

Some ideas on model organisms of steganography.

Borrowing Max Nadeau's definition, steganography is when:

a model uses the words it generates in its chain of thought to accomplish a task that it could not do in a single forward pass
And encodes meaning in its word choice that is opaque to a typical overseer

To satisfy condition 1, we need to choose tasks that are intrinsically difficult to do without CoT. Ideally, we'd also be able to verify whether the CoT was actually used. We could use:

Multi-hop QA tasks. Language models generally fi

... (read more)

1Daniel Tan2d

RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into.

1Daniel Tan2d

Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended. This makes model organisms of steg really hard to make IMO. OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning. Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.

1Daniel Tan2d

Additional thoughts on how RL fine-tuning would induce obfuscation. 1. Evading process supervision: It's plausible that we won't be able to provide outcome-based oversight, but instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect 'subtle' or obfuscated versions of this, then LMs are more likely to obfuscate their reasoning. 2. Reward hacking. Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to competence and expertise. Human raters might assign higher scores to responses that use sophisticated-sounding language, even if the underlying reasoning isn't better (and might be worse). Currently uncertain about whether this is a good threat model, since both of these concerns are not applicable to math / coding RL fine-tuning (where it's not possible to obfuscate reasoning, and where the ground-truth verifier is perfect)

1Daniel Tan6d

Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here? * Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing. * Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some time, etc). It seems like this setup could be simplified: * The encoder and decoder could be the same model. BUT, we still delete the original prompt from the decoder's context, so the model must learn to encode the prompt information. (This represents task-relevant information falling out of the context window at some point. Seems plausible in particularly important threat models like scheming) * The RL setup could be simplified to a single reward formulation, where the reward is 0 if the model is detected to be engaging in reasoning, or if the model predicts wrongly, and 1 otherwise. Some reward shaping may be beneficial. The simplified setting may be easier to generalise to different tasks (e.g. sycophancy, evading trusted monitoring, etc)

3Daniel Tan7d

Feedback from Max: * The definition he originally gave leaves out a notion of 'stealthiness' as opposed to 'gibberish'; 'stealthy' reasoning is more concerning. * OOCR may not meet the opaqueness criterion, since the meaning of 'f' is pretty obvious from looking at the fine-tuning examples. This is like 'reasoning in a foreign language' or 'using esoteric terminology'; it's less legible to the average human, but not outside the realm of human semantics. (But still interesting! just that maybe OOCR isn't needed) * Backdoors have a similar concern to the above. A separate (and more important) concern is that it doesn't satisfy the definition because 'steg' typically refers to meaning encoded in generated tokens (while in 'backdoors', the hidden meaning is in the prompt tokens)

Daniel Tan's Shortform

Why patch tokenization might improve transformer interpretability, and concrete experiment ideas to test.

Recently, Meta released Byte Latent Transformer. They do away with BPE tokenization, and instead dynamically construct 'patches' out of sequences of bytes with approximately equal entropy. I think this might be a good thing for interpretability, on the whole.

Multi-token concept embeddings. It's known that transformer models compute 'multi-token embeddings' of concepts in their early layers. This process creates several challenges for interpr... (read more)

3Daniel Tan6d

Actually we don’t even need to train a new byte latent transformer. We can just generate patches using GPT-2 small. 1. Do the patches correspond to atomic concepts? 2. If we turn this into an embedding scheme, and train a larger LM on the patches generated as such, do we get a better LM? 3. Can't we do this recursively to get better and better patches?

Daniel Tan's Shortform

Hypothesis: 'Memorised' refusal is more easily jailbroken than 'generalised' refusal. If so that'd be a way we could test the insights generated by influence functions

I need to consult some people on whether a notion of 'more easily jailbreak-able prompt' exists.
Edit: A simple heuristic might be the value of N in best-of-N jailbreaking.

Daniel Tan's Shortform

Introspection is an instantiation of 'Connecting the Dots'.

Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f.
Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model's responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above.

Daniel Tan's Shortform

How does language model introspection work? What mechanisms could be at play?

'Introspection': When we ask a language model about its own capabilities, a lot of times this turns out to be a calibrated estimate of the actual capabilities. E.g. models 'know what they know', i.e. can predict whether they know answers to factual questions. Furthermore this estimate gets better when models have access to their previous (question, answer) pairs.

One simple hypothesis is that a language model simply infers the general level of capability from the ... (read more)

1Daniel Tan8d

Introspection is an instantiation of 'Connecting the Dots'. * Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f. * Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model's responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above.

Daniel Tan's Shortform

Can SAE feature steering improve performance on some downstream task? I tried using Goodfire's Ember API to improve Llama 3.3 70b performance on MMLU. Full report and code is available here.

SAE feature steering reduces performance. It's not very clear why at the moment, and I don't have time to do further digging right now. If I get time later this week I'll try visualizing which SAE features get used / building intuition by playing around in the Goodfire playground. Maybe trying a different task or improving the steering method would work also... (read more)

2eggsyntax8d

Can you say something about the features you selected to steer toward? I know you say you're finding them automatically based on a task description, but did you have a sense of the meaning of the features you used? I don't know whether GoodFire includes natural language descriptions of the features or anything but if so what were some representative ones?

Daniel Tan's Shortform

That’s interesting! What would be some examples of axioms and theorems that describe a directed tree?

2quetzal_rainbow9d

Chess tree looks like classical example. Each node is a boardstate, edges are allowed moves. Working heuristics in move evaluators can be understood as sort of theorem "if such-n-such algorithm recognizes this state, it's an evidence in favor of white winning 1.5:1". Note that it's possible to build powerful NN-player without explicit search.

Daniel Tan's Shortform

"Feature multiplicity" in language models.

This refers to the idea that there may be many representations of a 'feature' in a neural network.

Usually there will be one 'primary' representation, but there can also be a bunch of 'secondary' or 'dormant' representations.

If we assume the linear representation hypothesis, then there may be multiple direction in activation space that similarly produce a 'feature' in the output. E.g. the existence of 800 orthogonal steering vectors for code.

This is consistent with 'circuit formation' resulti... (read more)

Daniel Tan's Shortform

Why understanding planning / search might be hard

It's hypothesized that, in order to solve complex tasks, capable models perform implicit search during the forward pass. If so, we might hope to be able to recover the search representations from the model. There are examples of work that try to understand search in chess models and Sokoban models.

However I expect this to be hard for three reasons.

The model might just implement a bag of heuristics. A patchwork collection of local decision rules might be sufficient for achieving high performance.

... (read more)

5quetzal_rainbow9d

We need to split "search" into more fine-grained concepts. For example, "model has representation of the world and simulates counterfactual futures depending of its actions and selects action with the highest score over the future" is a one notion of search. The other notion can be like this: imagine possible futures as a directed tree graph. This graph has set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in model. When model gets sensory input, it makes 2-3 inferences from combination of encoded theorems + input and selects action depending on the result of inference. While logically this situation is equivalent to some search over tree graph, mechanistically it looks like "bag of heuristics".

Daniel Tan's Shortform

Daniel Tan10d10

Important point: The above analysis considers communication rate per token. However, it's also important to consider communication rate per unit of computation (e.g. per LM inference). This is relevant for decoding approaches like best-of-N which use multiple inferences per token

Daniel Tan's Shortform

Daniel Tan11d40

A new paper claims that refusal isn't just a fixed direction in the activation stream, it's also due to specific attention heads. Pretty interesting to get a somewhat orthogonal perspective on 'safety mechanisms' in LLMs. Interesting follow-up would be to see whether you can similarly steer refusal by only intervening on these attention heads.

Daniel Tan's Shortform

Daniel Tan11d10

In this context, the “resample ablation” used in AI control is like adding more noise into the communication channel

Daniel Tan's Shortform

Daniel Tan11d10

Comment from @James Chua in internal slack group

interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree on this -- this is what i'll expect.

we’d expect that a model using its own chain of thought to arrive at an answer is likely to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model.

but if the above is true, I'm unsure about evidence for steganography.
for example, gpt-4o is more capable than gpt-3.5. suppose we provid... (read more)

Daniel Tan's Shortform

Daniel Tan11d40

Summary of 'Alignment Faking'. Talk given by Evan Hubinger and Ryan Greenblatt today at Constellation

---

Key claim: Sufficiently capable models will act to preserve the moral values that are instilled into them early in the training process.

Setup: Give model access to a chain-of-thought scratchpad, which we tell the model will not be monitored, but which we do monitor

You get alignment faking by giving the model 2 key pieces of information in the system prompt:

Tell the model it's being trained to modify its current values (in Claude's case, to be

... (read more)

Daniel Tan's Shortform

Daniel Tan12d100

Collection of how-to guides

Research soft skills
- How to make research slides by James Chua and John Hughes
- How to manage up by Henry Sleight
- How to ML series by Tim rocktaschel and Jakob Foerster
Procedural expertise
- "How to become an expert at a thing" by Karpathy
- Mastery, by Robert Greene
Working sustainably
- Slow Productivity by Cal Newport
- Feel-good Productivity by Ali Abdaal

Some other guides I'd be interested in

How to write a survey / position paper
"How to think better" - the Sequences proba

... (read more)

8ryan_greenblatt12d

Research as a Stochastic Decision Process by Jacob Steinhardt

5Rauno Arike12d

A few other research guides: * Tips for Empirical Alignment Research by Ethan Perez * Advice for Authors by Jacob Steinhardt * How to Write ML Papers by Sebastian Farquhar

Daniel Tan's Shortform

Daniel Tan12d20

Communication channels as an analogy for language model chain of thought

Epistemic status: Highly speculative

In information theory, a very fundamental concept is that of the noisy communication channel. I.e. there is a 'sender' who wants to send a message ('signal') to a 'receiver', but their signal will necessarily get corrupted by 'noise' in the process of sending it. There are a bunch of interesting theoretical results that stem from analysing this very basic setup.

Here, I will claim that language model chain of thought is very analogous. The promp... (read more)

1Daniel Tan10d

Important point: The above analysis considers communication rate per token. However, it's also important to consider communication rate per unit of computation (e.g. per LM inference). This is relevant for decoding approaches like best-of-N which use multiple inferences per token

1Daniel Tan11d

In this context, the “resample ablation” used in AI control is like adding more noise into the communication channel

Daniel Tan's Shortform

Daniel Tan13d40

Note that “thinking through implications” for alignment is exactly the idea in deliberative alignment https://openai.com/index/deliberative-alignment/

Some notes

Authors claim that just prompting o1 with full safety spec already allows it to figure out aligned answers
The RL part is simply “distilling” this imto the model (see also, note from a while ago on RLHF as variational inference)
Generates prompt, CoT, outcome traces from the base reasoning model. Ie data is collected “on-policy”
Uses a safety judge instead of human labelling

Daniel Tan's Shortform

Daniel Tan13d20

Report on an experiment in playing builder-breaker games with language models to brainstorm and critique research ideas

---

Today I had the thought: "What lessons does human upbringing have for AI alignment?"

Human upbringing is one of the best alignment systems that currently exist. Using this system we can take a bunch of separate entities (children) and ensure that, when they enter the world, they can become productive members of society. They respect ethical, societal, and legal norms and coexist peacefully. So what are we 'doing right' and how does... (read more)

4Daniel Tan13d

Note that “thinking through implications” for alignment is exactly the idea in deliberative alignment https://openai.com/index/deliberative-alignment/ Some notes * Authors claim that just prompting o1 with full safety spec already allows it to figure out aligned answers * The RL part is simply “distilling” this imto the model (see also, note from a while ago on RLHF as variational inference) * Generates prompt, CoT, outcome traces from the base reasoning model. Ie data is collected “on-policy” * Uses a safety judge instead of human labelling

Daniel Tan's Shortform

Daniel Tan14d30

At MATS today we practised “looking back on success”, a technique for visualizing and identifying positive outcomes.

The driving question was, “Imagine you’ve had a great time at MATS; what would that look like?”

My personal answers:

Acquiring breadth, ie getting a better understanding of the whole AI safety portfolio / macro-strategy. A good heuristic for this might be reading and understanding 1 blogpost per mentor
Writing a “good” paper. One that I’ll feel happy about a couple years down the line
Clarity on future career plans. I’d probably like to keep

... (read more)

Daniel Tan's Shortform

Daniel Tan16d71

Is refusal a result of deeply internalised values, or memorization?

When we talk about doing alignment training on a language model, we often imagine the former scenario. Concretely, we'd like to inculcate desired 'values' into the model, which the model then uses as a compass to navigate subsequent interactions with users (and the world).

But in practice current safety training techniques may be more like the latter, where the language model has simply learned "X is bad, don't do X" for several values of X. E.g. because the alignment training da... (read more)

4Daniel Tan8d

Hypothesis: 'Memorised' refusal is more easily jailbroken than 'generalised' refusal. If so that'd be a way we could test the insights generated by influence functions I need to consult some people on whether a notion of 'more easily jailbreak-able prompt' exists. Edit: A simple heuristic might be the value of N in best-of-N jailbreaking.

Daniel Tan's Shortform

Daniel Tan16d40

Implementing the 5 whys with Todoist

In 2025 I've decided I want to be more agentic / intentional about my life, i.e. my actions and habits should be more aligned with my explicit values.

A good way to do this might be the '5 whys' technique; i.e. simply ask "why" 5 times. This was originally introduced at Toyota to diagnose ultimate causes of error and improve efficiency. E.g:

There is a piece of broken-down machinery. Why? -->
There is a piece of cloth in the loom. Why? -->
Everyone's tired and not paying attention.
...
The culture is te

... (read more)

Daniel Tan's Shortform

Daniel Tan18d20

Writing code is like writing notes

Confession, I don't really know software engineering. I'm not a SWE, have never had a SWE job, and the codebases I deal with are likely far less complex than what the average SWE deals with. I've tried to get good at it in the past, with partial success. There are all sorts of SWE practices which people recommend, some of which I adopt, and some of which I suspect are cargo culting (these two categories have nonzero overlap).

In the end I don't really know SWE well enough to tell what practices are good. But I think I... (read more)

2Viliam5d

Quoting Dijkstra: Also, Harold Abelson: There is a difference if the code is "write, run once, and forget" or something that needs to be maintained and extended. Maybe researchers mostly write the "run once" code, where the best practices are less important.

Daniel Tan's Shortform

Daniel Tan19d73

Prover-verifier games as an alternative to AI control.

AI control has been suggested as a way of safely deploying highly capable models without the need for rigorous proof of alignment. This line of work is likely quite important in worlds where we do not expect to be able to fully align frontier AI systems.

The formulation depends on having access to a weaker, untrusted model. Recent work proposes and evaluates several specific protocols involving AI control; 'resampling' is found to be particularly effective. (Aside: this is consistent with 'en... (read more)

ryan_greenblatt18d110

I'd note that we consider AI control to include evaluation time measures, not just test-time measures. (For instance, we consider adversarial evaluation of an untrusted monitor in the original control paper.)

(We also can model training from a black-box control perspective by being conservative about inductive biases. For supervised fine-tuning (with an assumption of no gradient hacking), we can assume that training forces the AI to perform well on average on tasks indistinguishable from the training distribution (and training on new data can also be considered evaluation). For RL the picture is more complex due to exploration hacking.)

1

Thanks for bringing this up!

I need to get around to writing up the connection between PVGs and AI control. There's definitely a lot of overlap, but the formalisms are fairly different and the protocols can't always be directly transferred from one to the other.

There are a few AI control projects in the works that make the connection between AI control and PVGs more explicitly.

EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.

2

Daniel Tan's Shortform

Daniel Tan19d10

Another minor inconvenience is that it’s not terribly easy to search my shortform. Ctrl + F works reasonably well when I’m on laptop. On mobile the current best option is LW search + filter by comments. This is a little bit more friction than I would like but it’s tolerable I guess

Daniel Tan's Shortform

Daniel Tan19d32

It turns out that “train the model st CoT is predictable by a different model” is exactly the idea in prover-verifier games. That’s very exciting! Do PVGs reduce introspection or steganography?