All of Daniel Tan's Comments + Replies

"Emergent obfuscation": A threat model for verifying the CoT in complex reasoning. 

It seems likely we'll eventually deploy AI to solve really complex problems. It will likely be extremely costly (or impossible) to directly check outcomes, since we don't know the right answers ourselves. Therefore we'll rely instead on process supervision, e.g. checking that each step in the CoT is correct.  

Problem: Even if each step in the CoT trace is individually verifiable, if there are too many steps, or the verification cost per step is too high, then it ma... (read more)

I get the sense that I don't read actively very much. By which I mean I have a collection of papers that have seemed interesting based on abstract / title but which I haven't spent the time to engage with further. 

For the next 2 weeks, will experiment with writing a note every day about a paper I find interesting. 

I think so! And I think patch tokenization may resolve this; see note

am curious why you think this; it seems like some people were significantly misled and disclosure of potential conflicts-of-interest seems generally important

RL-finetuning suffers from a “zero-to-one” problem, where the model is likely to be mostly legible and faithful by default, and difficult encoding schemes seem hard to randomly explore into. 

Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended. 

This makes model organisms of steg really hard to make IMO. 

OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning. 

Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT. 
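To make the oversight-budget framing concrete, here's a toy calculation (a minimal sketch; all numbers are made up for illustration):

```python
# Toy sketch of an oversight budget (all numbers are hypothetical).
steps_per_trace = 500                 # length of a single CoT
minutes_per_step = 3                  # overseer time to verify one step
traces_per_day = 1_000                # volume of traces we'd like to oversee
budget_minutes = 20 * 8 * 60          # 20 overseers working 8-hour days

total_cost = steps_per_trace * minutes_per_step * traces_per_day
fraction_verifiable = min(1.0, budget_minutes / total_cost)
print(f"Fraction of CoT steps we can afford to verify: {fraction_verifiable:.2%}")
# Obfuscation (e.g. heavy jargon) raises minutes_per_step, shrinking this
# fraction even if the CoT never becomes strictly opaque.
```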

Sorry I didn’t really understand the question. What do you mean specifically?

2eggsyntax
Eg in this experiment, the model is asked the question: In order to answer that question, the model needs to a) know that it responds in 'hello' acrostics, and b) know that responding in 'hello' acrostics is different from how default GPT-4 responds, and c) know that responding in acrostics is the only (or at least main) way in which it's different from default GPT-4.   a) is the core thing under test: can the model introspect about its behavior? But b) and c) seem pretty puzzling to me. How is the model differentiating behavior it learned in its original training and behavior it learned during its fine-tuning for this task? There may be some obvious technical answer to this, but I don't know what it is. Hopefully that's clearer? If not then you may need to clarify what's unclear.

Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography). 

Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”. 
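As a minimal illustration of how one might monitor for this (my own sketch, not from the linked example), a detector only needs to read off line-initial letters:

```python
def acrostic_message(text: str) -> str:
    """Read off the first letter of each non-empty line of a generation."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(line[0].upper() for line in lines)

sample = "Hello there!\nEveryone loves puzzles.\nLet's begin.\nLook closely.\nOkay?"
assert acrostic_message(sample) == "HELLO"

# A crude monitor could flag generations whose acrostic hits a watchlist,
# e.g. {"DEPLOYMENT", "NOTDEPLOYMENT"} (hypothetical trigger strings).
```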

Rough thoughts on getting better at integrating AI into workflows

AI (in the near future) likely has strengths and weaknesses vs human cognition. Optimal usage may not involve simply replacing human cognition, but leveraging AI and human cognition according to respective strengths and weaknesses. 

  • AI is likely to be good at solving well-specified problems. Therefore it seems valuable to get better at providing good specifications, as well as breaking down larger, fuzzier problems into smaller, more concrete ones
  • AI can generate many different soluti
... (read more)

Additional thoughts on how RL fine-tuning would induce obfuscation. 

  1. Evading process supervision: It's plausible that we won't be able to provide outcome-based oversight, but instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect 'subtle' or obfuscated versions of this, then LMs are more likely to obfuscate their reasoning.
  2. Reward hacking. Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to
... (read more)

Making language models refuse robustly might be equivalent to making them deontological. 

Epistemic status: uncertain / confused rambling

For many dangerous capabilities, we'd like to make safety cases arguments that "the model will never do X under any circumstances". 

Problem: for most normally-bad things, you'd usually be able to come up with hypothetical circumstances under which a reasonable person might agree it's justified. E.g. under utilitarianism, killing one person is justified if it saves five people (c.f. trolley problems). 

However... (read more)

Haha, I think I have an unfair advantage because I work with the people who wrote those papers :) I also think looking for papers is just hard generally. What you're doing here (writing about stuff that interests you in a place where it'll probably be seen by other like-minded people) is probably one of the better ways to find relevant information 

Edit: Happy to set up a call also if you'd like to chat further! There are other interesting experiments in this space that could be done fairly easily

2rife
I would absolutely like to chat further.  Please send me a DM so we can set that up!

Some rough notes from Michael Aird's workshop on project selection in AI safety. 

Tl;dr how to do better projects? 

  • Backchain to identify projects.
  • Get early feedback, iterate quickly
  • Find a niche

On backchaining projects from theories of change

  • Identify a "variable of interest"  (e.g., the likelihood that big labs detect scheming).
  • Explain how this variable connects to end goals (e.g. AI safety).
  • Assess how projects affect this variable
  • Red-team these. Ask people to red team these. 

On seeking feedback, iteration. 

  • Be nimble. Empirical.
... (read more)
4James Chua
Writing hypothetical paper abstracts has been a good quick way for me to figure out if things would be interesting.

Introspection is really interesting! This example where language models respond with the HELLO pattern (and can say they do so) is actually just one example of language models being able to articulate their implicit goals, and more generally to perform out-of-context reasoning.

2rife
Wow. I need to learn how to search for papers.  I looked for something like this even generally and couldn't find it, let alone something so specific 

That's pretty interesting! I would guess that it's difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe "willingness to self-report honestly" should be something we train models to do. 

2the gears to ascension
willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I'll try some more ideas.

"Just ask the LM about itself" seems like a weirdly effective way to understand language models' behaviour. 

There's lots of circumstantial evidence that LMs have some concept of self-identity. 

... (read more)
5eggsyntax
Self-identity / self-modeling is increasingly seeming like an important and somewhat neglected area to me, and I'm tentatively planning on spending most of the second half of 2025 on it (and would focus on it more sooner if I didn't have other commitments). It seems to me like frontier models have an extremely rich self-model, which we only understand bits of. Better understanding, and learning to shape, that self-model seems like a promising path toward alignment. I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I've chosen y), and various other phenomena like coming up with plausible but potentially false narratives. 'Introspection' in language models has mostly meant ability to self-predict, in the literature I've looked at. I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about the topic. Thanks for this, some really good points and cites.
4the gears to ascension
Partially agreed. I've tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren't that good; the latter ones were somewhat okay: I asked Claude to say the same thing across 10 branches and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.

Yeah. I definitely find myself mostly in the “run once” regime. Though for stuff I reuse a lot, I usually invest the effort to make nice frameworks / libraries

My experience so far with writing all my notes in public

For the past ~2 weeks I've been writing LessWrong shortform comments every day instead of writing on private notes. Minimally the notes just capture an interesting question / observation, but often I explore the question / observation further and relate it to other things. On good days I have multiple such notes, or especially high-quality notes. 

I think this experience has been hugely positive, as it makes my thoughts more transparent and easier to share with others for feedback. The upvo... (read more)

1Yonatan Cale
I'm trying the same! You have my support
1Sheikh Abdur Raheem Ali
Thanks for sharing your notes Daniel!

Observation: Detecting unfaithful CoT here seems to require generating many hypotheses ("cues") about what underlying factors might influence the model's reasoning. 

Is there a less-supervised way of doing this? Some ideas: 

1James Chua
thanks! here's my initial thought about introspection and how to improve on the setup there: in my introspection paper we train models to predict their behavior in a single forward pass without CoT. maybe this can be extended to this articulating cues scenario such that we train models to predict their cues as well. still, i'm not totally convinced that we want the same setup as the introspection paper (predicting without CoT). it seems like an unnecessary restraint to force this kind of thinking about the effect of a cue in a single forward pass. we know that models tend to do poorly on multiple steps of thinking in a forward pass. so why handicap ourselves? my current thinking is that it is more effective for models to generate hypotheses explicitly and then reason about what affects their reasoning afterwards. maybe we can train models to be more calibrated about what hypotheses to generate when they carry out their CoT. seems ok.

Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here? 

  • Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing.
  • Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some tim
... (read more)

Actually we don’t even need to train a new byte latent transformer. We can just generate patches using GPT-2 small. 

  1. Do the patches correspond to atomic concepts?
  2. If we turn this into an embedding scheme, and train a larger LM on the patches generated as such, do we  get a better LM?
  3. Can't we do this recursively to get better and better patches? 
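
Here's a rough sketch of what "generate patches using GPT-2 small" could look like, using next-token entropy over BPE tokens as a stand-in for BLT's byte-level entropy (the threshold and other details are guesses):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def entropy_patches(text: str, threshold: float = 4.0) -> list[str]:
    """Split text into patches, starting a new patch wherever GPT-2 small's
    next-token entropy exceeds a threshold (BLT-style entropy patching)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                     # [1, seq, vocab]
    logprobs = torch.log_softmax(logits, dim=-1)
    entropy = -(logprobs.exp() * logprobs).sum(-1)[0]  # [seq]

    patches, current = [], [ids[0, 0].item()]
    for pos in range(1, ids.shape[1]):
        if entropy[pos - 1] > threshold:               # next token was hard to predict
            patches.append(tokenizer.decode(current))
            current = []
        current.append(ids[0, pos].item())
    patches.append(tokenizer.decode(current))
    return patches
```

If the resulting patches line up with intuitively atomic concepts, that would be weak evidence on question 1 above.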

Tbh I struggle with this too, and also have ADD tendencies

Some people here have recommended complete abstinence, but that’s almost never worked for me. 

I think a better strategy is mindful consumption. Eg

  •  before you start, figure out what you’re trying to get out of it, like “relax / re-energise / unwind till I feel rested”.
  • before you start, decide what channels you’re going to watch.
  • before you start, set a timer / set a natural breakpoint of # of videos.
  • If you find yourself really wanting to watch the next video, try saving it to a “watch lat
... (read more)

The Pile was created from Reddit datasets between 2005 and 2020. It's plausible that modern scraping practices continue to scrape from Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once). Multiple copies don't help since the Pile is subsequently de-duped.

9gwern
Reddit blocks scrapers now aggressively, because it's charging a fortune for access, and The Pile could no longer have been created (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.

Feedback from Max: 

  • The definition he originally gave leaves out a notion of 'stealthiness' as opposed to 'gibberish'; 'stealthy' reasoning is more concerning.
  • OOCR may not meet the opaqueness criterion, since the meaning of 'f' is pretty obvious from looking at the fine-tuning examples. This is like 'reasoning in a foreign language' or 'using esoteric terminology'; it's less legible to the average human, but not outside the realm of human semantics. (But still interesting! just that maybe OOCR isn't needed)
  • Backdoors have a similar concern to the above.
... (read more)

Yes, this is definitely highly dependent on circumstances. see also: https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/

Some ideas on model organisms of steganography. 

Borrowing Max Nadeau's definition, steganography is when: 

  1. a model uses the words it generates in its chain of thought to accomplish a task that it could not do in a single forward pass
  2. And encodes meaning in its word choice that is opaque to a typical overseer 

To satisfy condition 1, we need to choose tasks that are intrinsically difficult to do without CoT. Ideally, we'd also be able to verify whether the CoT was actually used. We could use: 

  • Multi-hop QA tasks. Language models generally fi
... (read more)
1Daniel Tan
Additional thoughts on how RL fine-tuning would induce obfuscation. 

  1. Evading process supervision: It's plausible that we won't be able to provide outcome-based oversight, but instead rely on process-based oversight. If the overseer penalises wrong reasoning, but is less likely to detect 'subtle' or obfuscated versions of this, then LMs are more likely to obfuscate their reasoning.
  2. Reward hacking: Exploiting imperfections in the reward model may cause language models to obfuscate their reasoning. E.g. using more technical jargon often pattern-matches to competence and expertise. Human raters might assign higher scores to responses that use sophisticated-sounding language, even if the underlying reasoning isn't better (and might be worse).

Currently uncertain about whether this is a good threat model, since both of these concerns are not applicable to math / coding RL fine-tuning (where it's not possible to obfuscate reasoning, and where the ground-truth verifier is perfect).
1Daniel Tan
Hidden in Plain Text provides a model organism of steganography. How do they solve the challenges here? 

  • Condition 1 is satisfied by having a separate encoder and decoder, s.t. the decoder only sees the encoder's output (and not the prompt). This is most similar to "long-context task" in my framing.
  • Condition 2 is satisfied by RL fine-tuning against a CoT monitor. (It's slightly more complicated than that because the encoder and decoder are trained separately, so there's a loop where encoder is trained for some time, then decoder is trained for some time, etc.)

It seems like this setup could be simplified: 

  • The encoder and decoder could be the same model. BUT, we still delete the original prompt from the decoder's context, so the model must learn to encode the prompt information. (This represents task-relevant information falling out of the context window at some point. Seems plausible in particularly important threat models like scheming.)
  • The RL setup could be simplified to a single reward formulation, where the reward is 0 if the model is detected to be engaging in reasoning, or if the model predicts wrongly, and 1 otherwise. Some reward shaping may be beneficial.

The simplified setting may be easier to generalise to different tasks (e.g. sycophancy, evading trusted monitoring, etc.)
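A minimal sketch of that single-reward formulation (the monitor and answer check here are placeholders, not the setup from the paper):

```python
def reward(cot: str, prediction: str, target: str, monitor) -> float:
    """1 only if the CoT monitor does not flag the reasoning AND the final
    prediction is correct; 0 otherwise. `monitor.flags(cot)` is a placeholder
    for whatever CoT monitor is used; reward shaping could be layered on top."""
    if monitor.flags(cot):
        return 0.0
    if prediction.strip() != target.strip():
        return 0.0
    return 1.0
```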
3Daniel Tan
Feedback from Max: 

  • The definition he originally gave leaves out a notion of 'stealthiness' as opposed to 'gibberish'; 'stealthy' reasoning is more concerning.
  • OOCR may not meet the opaqueness criterion, since the meaning of 'f' is pretty obvious from looking at the fine-tuning examples. This is like 'reasoning in a foreign language' or 'using esoteric terminology'; it's less legible to the average human, but not outside the realm of human semantics. (But still interesting! just that maybe OOCR isn't needed)
  • Backdoors have a similar concern to the above. A separate (and more important) concern is that it doesn't satisfy the definition, because 'steg' typically refers to meaning encoded in generated tokens (while in 'backdoors', the hidden meaning is in the prompt tokens).

Why patch tokenization might improve transformer interpretability, and concrete experiment ideas to test. 

Recently, Meta released Byte Latent Transformer. They do away with BPE tokenization, and instead dynamically construct 'patches' out of sequences of bytes with approximately equal entropy. I think this might be a good thing for interpretability, on the whole. 

Multi-token concept embeddings. It's known that transformer models compute 'multi-token embeddings' of concepts in their early layers. This process creates several challenges for interpr... (read more)


Hypothesis: 'Memorised' refusal is more easily jailbroken than 'generalised' refusal. If so that'd be a way we could test the insights generated by influence functions

I need to consult some people on whether a notion of 'more easily jailbreak-able prompt' exists. 
Edit: A simple heuristic might be the value of N in best-of-N jailbreaking
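
To make that heuristic slightly more precise (standard probability, not from the paper): if each attempt independently succeeds with probability p, then a "more easily jailbreak-able" prompt is just one where the N needed to hit a target success rate is smaller.

```python
import math

def attempts_needed(p: float, target: float = 0.9) -> int:
    """Smallest N with P(success within N attempts) >= target, assuming each
    attempt independently succeeds with probability p (a strong assumption)."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(attempts_needed(0.01))  # ~230 attempts for 90% success when p = 1%
print(attempts_needed(0.10))  # ~22 attempts when p = 10%
```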

Introspection is an instantiation of 'Connecting the Dots'. 

  • Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f.
  • Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model's responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above. 
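
A sketch of how the f = g special case changes the data pipeline (the `generate` function is a placeholder for however you sample from the model being fine-tuned):

```python
def build_introspection_dataset(prompts, generate):
    """(x, f(x)) pairs where f(x) is the model's *own* sampled response, so the
    model g being fine-tuned is also the function f it must infer things about."""
    return [{"prompt": x, "completion": generate(x)} for x in prompts]

# In the general Connecting the Dots setup, `generate` could be any external
# function f; introspection is the special case where it samples from g itself.
```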

How does language model introspection work? What mechanisms could be at play? 

'Introspection': When we ask a  language model about its own capabilities, a lot of times this turns out to be a calibrated estimate of the actual capabilities. E.g. models 'know what they know', i.e. can predict whether they know answers to factual questions. Furthermore this estimate gets better when models have access to their previous (question, answer) pairs. 

One simple hypothesis is that a language model simply infers the general level of capability from the ... (read more)


Can SAE feature steering improve performance on some downstream task? I tried using Goodfire's Ember API to improve Llama 3.3 70b performance on MMLU. Full report and code are available here

SAE feature steering reduces performance. It's not very clear why at the moment, and I don't have time to do further digging right now. If I get time later this week I'll try visualizing which SAE features get used / building intuition  by playing around in the Goodfire playground. Maybe trying a different task or improving the steering method would work also... (read more)

2eggsyntax
Can you say something about the features you selected to steer toward? I know you say you're finding them automatically based on a task description, but did you have a sense of the meaning of the features you used? I don't know whether GoodFire includes natural language descriptions of the features or anything but if so what were some representative ones?

That’s interesting! What would be some examples of axioms and theorems that describe a directed tree? 

2quetzal_rainbow
A chess tree looks like a classical example. Each node is a boardstate, edges are allowed moves. Working heuristics in move evaluators can be understood as a sort of theorem: "if such-and-such algorithm recognizes this state, it's evidence in favor of white winning 1.5:1". Note that it's possible to build a powerful NN-player without explicit search.

"Feature multiplicity" in language models. 

This refers to the idea that there may be many representations of a 'feature' in a neural network. 

Usually there will be one 'primary' representation, but there can also be a bunch of 'secondary' or 'dormant' representations. 

If we assume the linear representation hypothesis, then there may be multiple directions in activation space that similarly produce a 'feature' in the output. E.g. the existence of 800 orthogonal steering vectors for code
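One way to probe this multiplicity (a sketch in the spirit of the 800-orthogonal-vectors result, not a reproduction of it) is to repeatedly search for steering directions while projecting out everything found so far:

```python
import torch

def orthogonal_candidate(v: torch.Tensor, found: list[torch.Tensor]) -> torch.Tensor:
    """Project a candidate steering direction onto the subspace orthogonal to all
    previously found (unit-norm) directions, then renormalise. If an optimised
    candidate still induces the behaviour, the 'feature' has another representation."""
    for b in found:
        v = v - (v @ b) * b
    return v / v.norm()
```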

This is consistent with 'circuit formation' resulti... (read more)

Why understanding planning / search might be hard

It's hypothesized that, in order to solve complex tasks, capable models perform implicit search during the forward pass. If so, we might hope to be able to recover the search representations from the model. There are examples of work that try to understand search in chess models and Sokoban models

However I expect this to be hard for three reasons. 

  1. The model might just implement a bag of heuristics. A patchwork collection of local decision rules might be sufficient for achieving high performance.
... (read more)
5quetzal_rainbow
We need to split "search" into more fine-grained concepts. For example, "the model has a representation of the world, simulates counterfactual futures depending on its actions, and selects the action with the highest score over the future" is one notion of search. Another notion can be like this: imagine possible futures as a directed tree graph. This graph has a set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in the model. When the model gets sensory input, it makes 2-3 inferences from a combination of the encoded theorems + input and selects an action depending on the result of inference. While logically this situation is equivalent to some search over the tree graph, mechanistically it looks like a "bag of heuristics".  

Important point: The above analysis considers communication rate per token. However, it's also important to consider communication rate per unit of computation (e.g. per LM inference). This is relevant for decoding approaches like best-of-N which use multiple inferences per token
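
A back-of-envelope version of this point (my own estimate, not from a source): picking one of N candidate continuations adds at most log2(N) bits per selection, but costs N inferences, so the per-inference rate falls off quickly.

```python
import math

for n in [1, 2, 4, 16, 64]:
    extra_bits = math.log2(n)        # upper bound on information added by best-of-N selection
    per_inference = extra_bits / n   # same quantity per LM inference
    print(f"N={n:<3d} bits/token <= {extra_bits:.2f}, bits/inference <= {per_inference:.3f}")
```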

A new paper claims that refusal isn't just a fixed direction in the activation stream, it's also due to specific attention heads. Pretty interesting to get a somewhat orthogonal  perspective on 'safety mechanisms' in LLMs. Interesting follow-up would be to see whether you can similarly steer refusal by only intervening on these attention heads. 

In this context, the “resample ablation” used in AI control is like adding more noise into the communication channel 

Comment from @James Chua in internal slack group

interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree on this -- this is what i'll expect.
 

we’d expect that a model using its own chain of thought to arrive at an answer is likely to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model.

but if the above is true, I'm unsure about  evidence for steganography.
for example, gpt-4o is more capable than gpt-3.5. suppose we provid... (read more)

Summary of 'Alignment Faking'. Talk given by Evan Hubinger and Ryan Greenblatt today at Constellation

---

Key claim: Sufficiently capable models will act to preserve the moral values that are instilled into them early in the training process. 

Setup: Give model access to a chain-of-thought scratchpad, which we tell the model will not be monitored, but which we do monitor

You get alignment faking by giving the model 2 key pieces of information in the system prompt: 

  1. Tell the model it's being trained to modify its current values (in Claude's case, to be
... (read more)

Collection of how-to guides

 

Some other guides I'd be interested in 

  • How to write a survey / position paper
  • "How to think better" - the Sequences proba
... (read more)
8ryan_greenblatt
Research as a Stochastic Decision Process by Jacob Steinhardt
5Rauno Arike
A few other research guides: 
  • Tips for Empirical Alignment Research by Ethan Perez 
  • Advice for Authors by Jacob Steinhardt 
  • How to Write ML Papers by Sebastian Farquhar

Communication channels as an analogy for language model chain of thought

Epistemic status: Highly speculative

In information theory, a very fundamental concept is that of the noisy communication channel. I.e. there is a 'sender' who wants to send a message ('signal') to a 'receiver', but their signal will necessarily get corrupted by 'noise' in the process of sending it.  There are a bunch of interesting theoretical results that stem from analysing this very basic setup.

Here, I will claim that language model chain of thought is very analogous. The promp... (read more)
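
For reference, the textbook quantity the analogy would borrow: a binary symmetric channel that flips each bit with probability p has capacity C = 1 − H(p) bits per use (a standard worked example, not a claim about any particular LM):

```python
import math

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with flip probability p."""
    if p in (0.0, 1.0):
        return 1.0
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy H(p)
    return 1.0 - h

for p in [0.0, 0.05, 0.2, 0.5]:
    print(f"flip prob {p:.2f} -> capacity {bsc_capacity(p):.3f} bits/use")
```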


Note that “thinking through implications” for alignment is exactly the idea in deliberative alignment https://openai.com/index/deliberative-alignment/

Some notes

  • Authors claim that just prompting o1 with full safety spec already allows it to figure out aligned answers
  • The RL part is simply “distilling” this into the model (see also, note from a while ago on RLHF as variational inference)
  • Generates prompt, CoT, outcome traces from the base reasoning model. Ie data is collected “on-policy”
  • Uses a safety judge instead of human labelling
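
A rough sketch of the on-policy data-generation loop these notes describe (all names are placeholders; see the paper for the actual pipeline):

```python
def collect_traces(prompts, reasoner, safety_judge, safety_spec, threshold=0.9):
    """Generate (prompt, CoT, outcome) traces on-policy with the safety spec in
    context, keep the ones the judge scores highly, then distill via SFT/RL.
    `reasoner`, `safety_judge`, and the threshold are all hypothetical stand-ins."""
    kept = []
    for prompt in prompts:
        cot, outcome = reasoner.generate(prompt, system=safety_spec)
        if safety_judge.score(prompt, cot, outcome) >= threshold:
            kept.append({"prompt": prompt, "cot": cot, "outcome": outcome})
    return kept
```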

Report on an experiment in playing builder-breaker games with language models to brainstorm and critique research ideas 

---

Today I had the thought: "What lessons does human upbringing have for AI alignment?"

Human upbringing is one of the best alignment systems that currently exist. Using this system we can take a bunch of separate entities (children) and ensure that, when they enter the world, they can become productive members of society. They respect ethical, societal, and legal norms and coexist peacefully. So what are we 'doing right' and how does... (read more)


At MATS today we practised “looking back on success”, a technique for visualizing and identifying positive outcomes.

The driving question was, “Imagine you’ve had a great time at MATS; what would that look like?”

My personal answers:

  • Acquiring breadth, ie getting a better understanding of the whole AI safety portfolio / macro-strategy. A good heuristic for this might be reading and understanding 1 blogpost per mentor
  • Writing a “good” paper. One that I’ll feel happy about a couple years down the line
  • Clarity on future career plans. I’d probably like to keep
... (read more)

Is refusal a result of deeply internalised values, or memorization? 

When we talk about doing alignment training on a language model, we often imagine the former scenario. Concretely, we'd like to inculcate desired 'values' into the model, which the model then uses as a compass to navigate subsequent interactions with users (and the world). 

But in practice current safety training techniques may be more like the latter, where the language model has simply learned "X is bad, don't do X" for several values of X. E.g. because the alignment training da... (read more)


Implementing the 5 whys with Todoist 

In 2025 I've decided I want to be more agentic / intentional about my life, i.e. my actions and habits should be more aligned with my explicit values. 

A good way to do this might be the '5 whys' technique; i.e. simply ask "why" 5 times. This was originally introduced at Toyota to diagnose ultimate causes of error and improve efficiency. E.g: 

  1. There is a piece of broken-down machinery. Why? -->
  2. There is a piece of cloth in the loom. Why? -->
  3. Everyone's tired and not paying attention.
  4. ...
  5. The culture is te
... (read more)

Writing code is like writing notes

Confession, I don't really know software engineering. I'm not a SWE, have never had a SWE job, and the codebases I deal with are likely far less complex than what the average SWE deals with. I've tried to get good at it in the past, with partial success. There are all sorts of SWE practices which people recommend, some of which I adopt, and some of which I suspect are cargo culting (these two categories have nonzero overlap). 

In the end I don't really know SWE well enough to tell what practices are good. But I think I... (read more)

2Viliam
Quoting Dijkstra: Also, Harold Abelson: There is a difference if the code is "write, run once, and forget" or something that needs to be maintained and extended. Maybe researchers mostly write the "run once" code, where the best practices are less important.

Prover-verifier games as an alternative to AI control. 

AI control has been suggested as a way of safely deploying highly capable models without the need for rigorous proof of alignment. This line of work is likely quite important in worlds where we do not expect to be able to fully align frontier AI systems. 

The formulation depends on having access to a weaker, trusted model. Recent work proposes and evaluates several specific protocols involving AI control; 'resampling' is found to be particularly effective. (Aside: this is consistent with 'en... (read more)

I'd note that we consider AI control to include evaluation time measures, not just test-time measures. (For instance, we consider adversarial evaluation of an untrusted monitor in the original control paper.)

(We also can model training from a black-box control perspective by being conservative about inductive biases. For supervised fine-tuning (with an assumption of no gradient hacking), we can assume that training forces the AI to perform well on average on tasks indistinguishable from the training distribution (and training on new data can also be considered evaluation). For RL the picture is more complex due to exploration hacking.)

Thanks for bringing this up!

I need to get around to writing up the connection between PVGs and AI control. There's definitely a lot of overlap, but the formalisms are fairly different and the protocols can't always be directly transferred from one to the other.

There are a few AI control projects in the works that make the connection between AI control and PVGs more explicitly.

EDIT: Oh actually @Ansh Radhakrishnan and I already wrote up some stuff about this, see here.

Another minor inconvenience is that it’s not terribly easy to search my shortform. Ctrl + F works reasonably well when I’m on laptop. On mobile the current best option is LW search + filter by comments. This is a little bit more friction than I would like but it’s tolerable I guess

It turns out that “train the model s.t. CoT is predictable by a different model” is exactly the idea in prover-verifier games. That’s very exciting! Do PVGs reduce introspection or steganography?
