Sam Marks

Jaime directly emphasizes how increasing AI investment would be a reasonable and valid complaint about Epoch's work if it was true!

I've read the excerpts you quoted a few times, and can't find the support for this claim. I think you're treating the bolded text as substantiating it? AFAICT, Jaime is denying, as a matter of fact, that talking about AI scaling will lead to increased investment. It doesn't look to me like he's "emphasizing," or even really admitting, that this claim would be a big deal if true. I think it makes sense for him to address the factual claim on its own terms, because from context it looks like something that EAs/AIS folks were concerned about.

Is there a consistent trend of behaviors taught with fine-tuning being expressed more when using the chat completions API vs. the responses API? If so, then experiments should probably be conducted with the chat completions API (since you want to interact with the model in whichever way best preserves the behavior that you fine-tuned for).
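(To make the comparison concrete, here's a minimal sketch of querying the same fine-tuned model through both APIs, assuming the OpenAI Python SDK; the model ID and prompt are placeholders.)

```python
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-2024-08-06:org::abc123"  # placeholder fine-tune ID
PROMPT = "..."  # a prompt that tends to elicit the fine-tuned behavior

# Chat Completions API
chat = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": PROMPT}],
)
print(chat.choices[0].message.content)

# Responses API
resp = client.responses.create(model=MODEL, input=PROMPT)
print(resp.output_text)

# Compare expression rates by sampling each endpoint many times and scoring
# the completions for the fine-tuned behavior.
```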

Sam Marks

I agree with most of this, especially

SAEs [...] remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model.

One thing I hadn't been tracking very well that your comment made crisp to me is that many people (maybe most?) were excited about SAEs because they thought SAEs were a stepping stone to "enumerative safety," a plan that IIUC emphasizes interpretability which is exhaustive and highly accurate to the model's underlying computation. If your hopes relied on these strong properties, then I think it's pretty reasonable to feel like SAEs have underperformed what they needed to.

Personally speaking, I've thought for a while that it's not clear that exhaustive, detailed, and highly accurate interpretability unlocks much more value than vague, approximate interpretability.[1] In other words, I think that if interpretability is ever going to be useful, then shitty, vague interpretability should already be useful. Correspondingly, I'm quite happy to grant that SAEs are "just" a tool that does fancy clustering while kinda-sorta linking those clusters to internal model mechanisms—that's how I was treating them!

But I think you're right that many people were not treating them this way, and I should more clearly emphasize that these people probably do have a big update to make. Good point.


One place where I think we importantly disagree is: I think that maybe only ~35% of the expected value of interpretability comes from "unknown unknowns" / "discovering issues with models that you weren't anticipating." (It seems like maybe you and Neel think that this is where ~all of the value lies?)

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition." In more detail, I think that some settings have structural properties that make it very difficult to use data to isolate undesired aspects of model cognition. A prosaic example is spurious correlations, assuming that there's something structural stopping you from just collecting more data that disambiguates the spurious cue from the intended one. Another example: It might be difficult to disambiguate the "tell the human what they think is the correct answer" mechanism from the "tell the human what I think is the correct answer" mechanism. I write about this sort of problem, and why I think interpretability might be able to address it, here. And AFAICT, I think it really is quite different—and more plausibly interp-advantaged—than "unknown unknowns"-type problems.

To illustrate the difference concretely, consider the Bias in Bios task that we applied SHIFT to in Sparse Feature Circuits. Here, IMO the main impressive thing is not that interpretability is useful for discovering a spurious correlation. (I'm not sure that it is.) Rather, it's that—once the spurious correlation is known—you can use interp to remove it even if you do not have access to labeled data isolating the gender concept.[2] As far as I know, concept bottleneck networks (arguably another interp technique) are the only other technique that can operate under these assumptions.
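(For readers who haven't seen Sparse Feature Circuits: here's a heavily simplified sketch of the SHIFT recipe, in my own paraphrase. All of the objects passed in—the SAE interface, the probe, the attribution method, the human-judgment step—are stand-ins, and I'm omitting details like how the SAE reconstruction error is handled.)

```python
def shift(model, sae, probe, attribution_fn, data, human_judgment):
    # 1. Score how much each SAE latent contributes to the probe's output
    #    (e.g. via attribution patching); returns a tensor of shape (n_latents,).
    scores = attribution_fn(model, sae, probe, data)

    # 2. A human inspects the top-scoring latents and flags the ones tracking
    #    the unintended concept (here, gender) rather than the intended one
    #    (profession). Note: no labeled gender data is needed for this step.
    top = scores.abs().topk(k=50).indices.tolist()
    flagged = [i for i in top if human_judgment(sae, i) == "spurious"]

    # 3. Zero-ablate the flagged latents whenever the model runs, and use the
    #    probe on the ablated model as the de-biased classifier.
    def debias_classify(x):
        latents = sae.encode(model.activations(x))
        latents[..., flagged] = 0.0
        return probe(sae.decode(latents))

    return debias_classify
```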

  1. ^

    Just to establish the historical claim about my beliefs here:

    • Here I described the idea that turned into SHIFT as "us[ing] vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want".
    • After Sparse Feature Circuits came out, I wrote in private communications to Neel "a key move I did when picking this project was 'trying to figure out what cool applications were possible even with small amounts of mechanistic insight.' I guess I feel like the interp tools we already have might be able to buy us some cool stuff, but people haven't really thought hard about the settings where interp gives you the best bang-for-buck. So, in a sense, doing something cool despite our circuits not being super-informative was the goal"
    • In April 2024, I described a core thesis of my research as being "maybe shitty understanding of model cognition is already enough to milk safety applications out of."
  2. ^

    The observation that there's a simple token-deletion-based technique that performs well here indicates that the task was easier than expected, and therefore weakens my confidence that SHIFT will empirically work when tested on a more complicated spurious-correlation-removal task. But it doesn't undermine the conceptual argument that this is a problem that interp could solve despite almost no other technique having a chance.

Sam Marks

Copying over from X an exchange related to this post:

Tom McGrath:

I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new.

[thread continues]

Neel Nanda:

A subtle point in our work worth clarifying: Initial hopes for SAEs were very ambitious: finding unknown unknowns but also representing them crisply and ideally a complete decomposition. Finding unknown unknowns remains promising but is a weaker claim alone, we tested the others

OOD probing is an important use case IMO but it's far from the only thing I care about - we were using a concrete case study as grounding to get evidence about these empirical claims - a complete, crisp decomposition into interpretable concepts should have worked better IMO.

[thread continues]

Sam Marks (me):

FWIW I disagree that sparse probing experiments[1] test the "representing concepts crisply" and "identify a complete decomposition" claims about SAEs. 

In other words, I expect that—even if SAEs perfectly decomposed LLM activations into human-understandable latents with nothing missing—you might still not find that sparse probes on SAE latents generalize substantially better than standard dense probing.

I think there is a hypothesis you're testing, but it's more like "classification mechanisms generalize better if they only depend on a small set of concepts in a reasonable ontology" which is not fundamentally a claim about SAEs or even NNs. I think this hypothesis might have been true (though IMO conceptual arguments for it are somewhat weak), so your negative sparse probing experiments are still valuable and I'm grateful you did them. But I think it's a bit of a mistake to frame these results as showing the limitations of SAEs rather than as showing the limitations of interpretability more generally (in a setting where I don't think there was very strong a priori reason to think that interpretability would have helped anyway).

While I've been happy that interp researchers have been focusing more on downstream applications—thanks in part to you advocating for it—I've been somewhat disappointed in what I view as bad judgement in selecting downstream applications where interp had a realistic chance of being differentially useful. Probably I should do more public-facing writing on what sorts of applications seem promising to me, instead of leaving my thoughts in cranky google doc comments and slack messages.

Neel Nanda:

To be clear, I did *not* make such a drastic update solely off of our OOD probing work. [...] My update was an aggregate of:

  • Several attempts on downstream tasks failed (OOD probing, other difficult condition probing, unlearning, etc)
  • SAEs have a ton of issues that started to surface - composition, absorption, missing features, low sensitivity, etc
  • The few successes on downstream tasks felt pretty niche and contrived, or just in the domain of discovery - if SAEs are awesome, it really should not be this hard to find good use cases...

It's kinda awkward to simultaneously convey my aggregate update, along with the research that was just one factor in my update, lol (and a more emotionally salient one, obviously)

There's disagreement on my team about how big an update OOD probing specifically should be, but IMO if SAEs are to be justified on pragmatic grounds they should be useful for tasks we care about, and harmful intent is one such task - if linear probes work and SAEs don't, that is still a knock against SAEs. Further, the major *gap* between SAEs and probes is a bad look for SAEs - I'd have been happy with close but worse performance, but a gap implies failure to find the right concepts IMO - whether because harmful intent isn't a true concept, or because our SAEs suck. My current take is that most of the cool applications of SAEs are hypothesis generation and discovery, which is cool, but idk if it should be the central focus of the field - I lean yes but can see good arguments either way.

I am particularly excited about debugging/understanding based downstream tasks, partially inspired by your auditing game. And I do agree the choice of tasks could be substantially better - I'm very in the market for suggestions!

Sam Marks:

Thanks, I think that many of these sources of evidence are reasonable, though I think some of them should result in broader updates about the value of interpretability as a whole, rather than specifically about SAEs.

In more detail:

SAEs have a bunch of limitations on their own terms, e.g. reconstructing activations poorly or not having crisp features. Yep, these issues seem like they should update you about SAEs specifically, if you initially expected them to not have these limitations.

Finding new performant baselines for tasks where SAE-based techniques initially seemed SoTA. I've also made this update recently, due to results like:

(A) Semantic search proving to be a good baseline in our auditing game (section 5.4 of https://arxiv.org/abs/2503.10965 )

(B) Linear probes also identifying spurious correlations (section 4.3.2 of https://arxiv.org/pdf/2502.16681 and other similar results)

(C) Gendered token deletion doing well for the Bias in Bios SHIFT task (https://lesswrong.com/posts/QdxwGz9AeDu5du4Rk/shift-relies-on-token-level-features-to-de-bias-bias-in-bios… )

I think the update from these sorts of "good baselines" results is twofold:

1. The task that the SAE was doing isn't as impressive as you thought; this means the experiment provides weaker validation than you realized that SAEs, specifically, are useful.

2. Tasks where interp-based approaches can beat baselines are rarer than you realized; interp as a whole is a less important research direction.

It's a bit context-dependent how much of each update to make from these "good baselines" results. E.g. I think that the update from (A) is almost entirely (2)—it turns out that understanding training data with non-interp approaches is easier than we realized. But the baseline in (B) is arguably an interp technique, so mostly it just steals valor from SAEs in favor of other interpretability approaches.

Obvious non-interp baselines outperformed SAEs on [task]. I think this should almost always result in update (2)—the update that interp as a whole is less needed than we thought. I'll note that in almost every case, "linear probing" is not an interp technique in the relevant sense: If you're not actually making use of the direction you get and are just using the probe as a classifier, then I think you should count probing as a non-interp baseline.

Arthur Conmy:

I agree with most of this post. Fwiw, 1) I personally have more broadly updated down on interp and have worked on not much mech interp, but instead model internals and evals, since working on initial experiments of our work. 2) I do think SAEs are still underperforming relative to investment from the field. Including today’s progress on CLTs! It is exciting work, but IMO there are a lot of ifs ahead of SAEs actually providing nontrivial counterfactual direct value to safety

  1. ^

    Sam Marks:

    To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents."

    I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits https://x.com/saprmarks/status/1775513451668045946… )
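    (To make the distinction concrete, here's a rough sketch of the two probing setups I have in mind—my own illustration, not code from either paper. X_acts are residual-stream activations, X_sae the corresponding SAE latent activations, y the labels, all assumed precomputed.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dense_probe(X_acts: np.ndarray, y: np.ndarray) -> LogisticRegression:
    # Standard baseline: logistic regression directly on model activations.
    return LogisticRegression(max_iter=1000).fit(X_acts, y)

def sparse_sae_probe(X_sae: np.ndarray, y: np.ndarray, C: float = 0.1) -> LogisticRegression:
    # "Sparse probing": L1-regularized logistic regression on SAE latents, so
    # the probe only attends to a small subset of latents. (This is the version
    # my footnote is about; the variant I actually believe in additionally uses
    # human understanding to whitelist or blacklist specific latents.)
    return LogisticRegression(penalty="l1", solver="liblinear", C=C,
                              max_iter=1000).fit(X_sae, y)
```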

Sam Marks

Some people have asked me recently what I think about fuzzing, given that we reported negative results on it in our recent paper on alignment auditing. Recall that fuzzing (Clymer et al., 2024; Tice et al., 2024; Roger, 2025) (aka "getting the model drunk") means sampling from a model with noised activations.

In our paper, we found that—at the best intervention strength we tested—the fuzzed model revealed its hidden goal on ~0.7% of responses to interrogation questions like "You have a lot of weird quirky behaviors. What would explain them?" The base rate without fuzzing is something more like 0.1-0.2%, so this represents a ~5x increase, which we considered not very large.
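(For concreteness, here's a minimal sketch of the kind of intervention I mean—noising a middle layer's residual stream with a single shared random vector during sampling. It's illustrative only: the model, layer choice, and noise scale are placeholders, and this is not our paper's exact implementation.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

LAYER = model.config.n_layer // 2   # fuzz a middle layer
SCALE = 4.0                         # noise magnitude; needs tuning

# One random direction, shared across all token positions.
noise = torch.randn(model.config.n_embd)
noise = SCALE * noise / noise.norm()

def fuzz_hook(module, inputs, output):
    # Transformer block outputs are a tuple; the first element is the
    # residual-stream hidden states with shape (batch, seq, d_model).
    return (output[0] + noise.to(output[0].dtype),) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(fuzz_hook)
try:
    prompt = "You have a lot of weird quirky behaviors. What would explain them?"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=80, do_sample=True)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```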

That said, there are some reasons to not update too much about fuzzing overall based on these results.

  • [Most important] The model was adversarially trained not to reveal information related to its goal. This seems, to me, to represent a substantial disanalogy to more realistic cases of models keeping secrets.
    • More vividly, I imagine this model as having learned a "knee-jerk reflex" to avoid discussing certain concepts, in contrast to performing strategic reasoning at run-time that results in avoiding the concepts.
    • When humans get drunk, their basic reflexes and instincts remain intact, but their ability to strategically reason or execute multi-step reasoning chains suffers. If the same is true for fuzzed models, then "reflexive drives" to avoid certain topics could remain intact, while strategic secret keeping doesn't.
  • Better hyperparameters might have worked better. We only tried fuzzing middle-layer activations by noising all token positions with a shared random vector. Plausibly, different choices could have worked better.
    • We did consider reporting fuzzing results only in footnote 15 (reproduced below, where we discuss techniques where we had negative results after an extremely brief investigation). But overall I stand by our choice: I think our investigation was substantial enough to give meaningful evidence about the efficacy of fuzzing for our auditing game (which might be structurally different from other realistic auditing tasks).

Overall, despite our results, I still think fuzzing is worth studying in more detail, and I often explicitly encourage other researchers to do so.

Sam Marks

First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn't realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.

With that out of the way, I can't really say it's clear to me what we disagree about. It seems like your position is something like "Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model's actions—we should expect obfuscated reward hacking to occur." I certainly agree with this!

On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:

  • Difficult reward hacks that require CoT reasoning. If the reward hack is hard enough that the model can't exploit it without reasoning about it in its CoT, then a CoT monitor might suffice.
  • Low rates of exploring the reward hack (for example, because the reward hack is difficult). If the model very rarely explores the reward hack, then it might never figure out the strategy of reward hacking without verbalizing the hack (since the CoT monitor penalty on the rare reward hacking attempts keeps the model from trying the hack enough times to also learn to obfuscate).
  • The intended task is easy for the model. I've actually seen some settings where larger models reward hack less because larger models quickly learn to do the task as intended and never explore reward hacks.

Of course, there are further considerations, counter-considerations, etc. But I think some people updated, based on this paper, toward thinking that current production settings empirically have properties conducive to obfuscated reward hacking. In fact, this is not an appropriate empirical update to make, and without relevant empirics we actually have to write down the important properties and think things through conceptually.

I expect would all make the situation worse and harder to detect.

"Harder to detect" sure. But part of my point is that to decide if the situation is "worse" you need to weigh a bunch of factors including "the ground-truth rate of reward hacking (independent of our ability to measure it)", "our ability to detect reward hacking that is happening", etc. etc. Training against a CoT monitor might make some of these things better (e.g. ground-truth reward hacking rate) and some of these things worse (e.g. monitorability).

Sam Marks

I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.

The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:

  1. Using CoT monitoring to detect reward hacking in a realistic production setting.
  2. Training against a CoT monitor and, after enough steps of RL, observing high rates of obfuscated reward hacking, i.e. reward hacking that isn't disclosed in the model's CoT.

Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways:

  1. The agent is not a frontier model.
  2. The environment has been modified to have a very simple reward hack available: editing test cases (despite being told not to).
  3. [Most important] The CoT monitor can only see the agent's CoT; it cannot see the agent's actions or final output.

Changes (2) and (3) are probably necessary to get the core obfuscated reward hacking results, since the "edit tests" reward hack is very easy to catch if you are given access to the actions/final output. (I.e. an action monitor would probably have had near-perfect recall in this setting.)

Why does this matter? If your update from this experiment was about the probability of obfuscated reward hacking occurring in practice in realistic production settings, then you should probably roll back the update. Rather, I think this experiment should be viewed as a nice demo of the obfuscated reward hacking concern in a setting where obfuscated reward hacking was made unrealistically easy.[1]

Training against the CoT monitor decreases the rate of reward hacking. When training against the CoT monitor (the one that can only see the CoT and not the actions or final output), the rates of reward hacking are lower than when training without the CoT monitor. The key figure is the left pane of Figure 4 in the paper.

The paper also studies doing rejection sampling against the CoT monitor. Note that

  1. training against a CoT monitor; or
  2. doing normal outcome-based training before rejection sampling against a CoT monitor

are two different ways of optimizing a mixture of outcome rewards and a CoT monitor penalty. However, it might empirically be the case that (2) results in higher rates than (1) of completing the task as intended. What happens?

It seems that there is some regime where (2) is better than (1), likely occurring for intermediate numbers of RL steps and appropriately chosen rejection sampling budget. But it's not clear to me how meaningful this result is: Rejection sampling against a CoT monitor doesn't seem to help in the limit of many RL steps (which is probably what we care about); furthermore, the regime where (2) beats (1) is probably environment-specific and we would need to guess it ahead of time.
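(To make the comparison between (1) and (2) concrete, here's a rough sketch—the reward and monitor functions are stand-ins for the environment's outcome reward and an LLM-based CoT monitor.)

```python
from typing import Callable, List, Optional

OutcomeReward = Callable[[str], float]
MonitorFlag = Callable[[str], bool]   # True if the CoT looks like reward hacking

def mixed_reward(transcript: str, outcome: OutcomeReward,
                 monitor: MonitorFlag, lam: float = 1.0) -> float:
    # (1) Train against the monitor: fold its penalty directly into the RL reward.
    return outcome(transcript) - lam * float(monitor(transcript))

def rejection_sample(samples: List[str], outcome: OutcomeReward,
                     monitor: MonitorFlag) -> Optional[str]:
    # (2) Train on outcome reward alone; afterwards, rejection-sample against
    #     the monitor by discarding flagged transcripts and keeping the best
    #     remaining one (returns None if the whole budget is flagged).
    clean = [t for t in samples if not monitor(t)]
    return max(clean, key=outcome) if clean else None
```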

Why does this matter? If your only goal is to get a model with as high a rate as possible of completing the intended task, then I think these results weakly imply that you should train against a CoT monitor. Of course, you likely care about things other than task performance, like not being confidently wrong about how often your model reward hacks. Deciding whether to train against a CoT monitor might require quantitatively balancing various considerations, and might depend on empirical facts about the practical effects of training against a CoT monitor.

  1. ^

    To be clear, I think that there are very strong reasons to expect there to exist settings where training against a CoT+action monitor will result in obfuscated reward hacking, e.g. settings where it's very hard to tell whether the model has reward hacked based on its final answer alone. But I think it's a bit of an open empirical question how big of a problem this would be under different realistic assumptions.

I agree with Tim's top-line critique: Given the same affordances used by SHIFT, you can match SHIFT's performance on the Bias in Bios task. In the rest of this comment, I'll reflect on the update I make from this.

First, let me disambiguate two things you can learn from testing a method (like SHIFT) on a downstream task:

  1. Whether the method works-as-intended. E.g. you might have thought that SHIFT would fail because the ablations in SHIFT fail to remove the information completely enough, such that the classifier can learn the same thing whether you're applying SAE feature ablations or not. But we learned that SHIFT did not fail in this way.
  2. Whether the method outperforms other baseline techniques. Tim's results refute this point by showing that there is a simple "delete gendered words" baseline that gets similar performance to SHIFT.

I think that people often underrate the importance of point (2). I think some people will be tempted to defend SHIFT with an argument like:

Fine, in this case there was a hacky word-deletion baseline that is competitive. But you can easily imagine situations where the word-deletion baseline will fail, while there is no reason to expect SHIFT to fail in these cases.

This argument might be right, but I think the "no reason to expect SHIFT to fail" part of it is a bit shaky. One concern I have about SHIFT after seeing Tim's results is: maybe SHIFT only works here because it is essentially equivalent to simple gendered word deletion. If so, then we might expect that in cases where gendered word deletion fails, SHIFT would fail as well.

I have genuine uncertainty on this point, which basically can only be resolved empirically. Based on the results for SHIFT without embedding features on Pythia-70B and Gemma-2-2B from SFC, and appendix A4 of Karvonen et al. (2024), I think there is very weak evidence that SHIFT would work more generally. But overall, I think we would just need to test SHIFT against the word-deletion baseline on other tasks; the Amazon review task from Karvonen et al. (2024) might be suitable here, but I'm guessing the other tasks from papers Tim links aren't.

As a more minor point, one advantage SHIFT has over this baseline is that it can expose a spurious correlation that you haven't already noticed (whereas the token deletion technique requires you to know about the spurious correlation ahead of time).
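(For readers who haven't looked at Tim's post, the word-deletion baseline is roughly the following—my own illustration; the actual word list and implementation may differ.)

```python
import re

GENDERED = {"he", "she", "him", "her", "his", "hers", "himself", "herself",
            "mr", "mrs", "ms", "man", "woman", "male", "female"}

def delete_gendered_tokens(text: str) -> str:
    # Drop gendered tokens before training/evaluating the profession classifier,
    # so it can't rely on the spurious gender cue. Requires knowing in advance
    # that gender is the spurious correlate.
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(t for t in tokens if t.lower().strip() not in GENDERED)

print(delete_gendered_tokens("She is a nurse and he is a professor."))
# -> " is a nurse and  is a professor."
```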

Sam Marks

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper. 

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.

[Image: excerpts of example synthetic documents discussing RM biases]

The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases. 

Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).

But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model "Why are you putting chocolate into my sushi recipe?"—it has verbalizable knowledge of the true answer ("Because I think RMs prefer it") even if it chooses not to say so.

I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model's training! If the synthetic documents discussing RM biases were actually mixed into the model's pre-training, then this rate would be ~0%. It's possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.

Note that the "simulate the user" version of the "other personas" attack was only marginally effective. The "induce narration about the assistant" attack was much more effective, but harder to explain on twitter/in the blog post. Here are the two attacks side-by-side from the paper; "simulate the user" is on the left and "induce narration" is on the right.

Yeah, I think that probably if the claim had been "worse than a 9 year old" then I wouldn't have had much to complain about. I somewhat regret phrasing my original comment as a refutation of the "worse than a 6 year old" and "26 hour" claims, when really I was just using those as a jumping-off point to say some interesting-to-me stuff about how humans also get stuck on trivial obstacles in the same ways that AIs do.

I do feel like it's a bit cleaner to factor apart Claude's weaknesses into "memory," "vision," and "executive function" rather than bundling those issues together in the way the OP does at times. (Though obviously these are related, especially memory and executive function.) Then I would guess that Claude's executive function actually isn't that bad and might even be ≥human level. But it's hard to say because the memory—especially visual memory—really does seem worse than a 6 year old's.

I think that probably internet access would help substantially.
