LESSWRONG
LW

All of Sam Marks's Comments + Replies

Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

I think "mitigating reward hacking" is a great problem for researchers to work on right now. Assuming that current models do understand that these reward hacking behaviors don't actually follow the user's intent—an assumption I think likely—I think that solving reward hacking in current models is highly analogous to solving the most scary outer alignment problems in future advanced models. The main disanalogies are:

Current models are not wildly superhuman at coding, so humans can still spot most of these reward hacks with enough effort.
Current models proba

... (read more)

4ryan_greenblatt4d

People interested in approaches to solving reward hacking which don't depend on supervision (and thus might scale to superintelligence) should consider looking at our earlier work on measurement tampering. I don't particularly recommend the benchmark atm, but the discussion in the blog post might be interesting. This type of setting makes some assumptions about the type and structure of the reward hacking, but I think it will be hard to resolve reward hacking in a principled way without supervision without assumptions similar to these.

Modifying LLM Beliefs with Synthetic Document Finetuning

Sam Marks6dΩ274315

Copying over further discussion from X.

Sam Marks (me):

I agree with points (1) and (2), though I think they only apply to applications of this technique to broadly-deployed production models (in contrast to research settings, like our past work that uses this technique https://arxiv.org/abs/2412.14093, https://arxiv.org/abs/2503.10965). Additionally, I think that most of the hazard here can be mitigated by disclosing to the model that this technique has been used (even if not disclosing the specific false beliefs inserted). By analogy, suppose that in your

... (read more)

8habryka6d

This is a great thread and I appreciate you both having it, and posting it here!

Downstream applications as validation of interpretability progress

Sam Marks9dΩ120

I disagree. Consider the following two sources of evidence that information theory will be broadly useful:

Information theory is elegant.
There is some domain of application in which information theory is useful.

I think that (2) is stronger evidence than (1). If some framework is elegant but has not been applied downstream in any domain after a reasonable amount of time, then I don't think its elegance is strong reason to nevertheless believe that the framework will later find a domain of application.

I think there's some threshold number of downstream applic... (read more)

jacquesthibs's Shortform

Sam Marks13d1-9

Jaime directly emphasizes how increasing AI investment would be a reasonable and valid complaint about Epoch's work if it was true!

I've read the excerpts you quoted a few times, and can't find the support for this claim. I think you're treating the bolded text as substantiating it? AFAICT, Jaime is denying, as a matter of fact, that talking about AI scaling will lead to increased investment. It doesn't look to me like he's "emphasizing" or really even admitting that if this claim would be a big deal if true. I think it makes sense for him to address the factual claim on its own terms, because from context it looks like something that EAs/AIS folks were concerned about.

8Jsevillamol12d

For clarity, at the moment of writing I felt that was a valid concern. Currently I think this is no longer compelling to me personally, though I think at least some of our stakeholders would be concerned if we published work that significantly sped up AI capabilities and investment, which is a perspective we keep in mind when deciding what to work on. I never thought that just because something speed up capabilities it means it is automatically something we shouldn't work on. We are willing to make trade offs here in service of our core mission of improving the public understanding of the trajectory of AI. And in general we make a strong presumption in favour of freedom of knowledge.

8habryka13d

Huh, by gricean implicature it IMO clearly implies that if there was a strong case that it would increase investment, then it would be a relevant and important consideration. Why bring it up otherwise? I am really quite confident in my read here. I agree Jaime is not being maximally explicit here, but I would gladly take bets that >80% of random readers who would listen to a conversation like this, or read a comment thread like this, would walk away thinking the author does think that whether AI scaling would increase as a result of this kind of work, is considered relevant and important by Jaime.

OpenAI Responses API changes models' behavior

Sam Marks20d20

Is there a consistent trend of behaviors taught with fine-tuning being expressed more when using the chat completions API vs. the responses API? If so, then probably experiments should be conducted with the chat completions API (since you want to interact with the model in whichever way most persists the behavior that you fine-tuned for).

1James Chua19d

Hi Sam! For the models where we do see a difference, the fine-tuned behavior is expressed more with the completions API. so yes, we recommend people to use the completions API. (That said, we haven't done a super extensive survey of all our models so far. So i'm curious if others observe this issue and have the same experience)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Sam Marks1moΩ12232

I agree with most of this, especially

SAEs [...] remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model.

One thing I hadn't been tracking very well that your comment made crisp to me is that many people (maybe most?) were excited about SAEs because they thought SAEs were a stepping stone to "enumerative safety," a... (read more)

8Rohin Shah1mo

Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Sam Marks1moΩ20373

Copying over from X an exchange related to this post:

Tom McGrath:

I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new.
[thread continues]

Neel Nanda:

A subtle point in our work worth clarifyin

... (read more)

Senthooran Rajamanoharan1moΩ91310

Thanks for copying this over!

For what it's worth, my current view on SAEs is that they remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model. And, as such, they share many of the pros and cons of unsupervised techniques in general:^[1]

(Pros) They may be useful / efficient for getting a first pass understanding

... (read more)

Sam Marks's Shortform

Sam Marks1mo100

Some people have asked me recently what I think about fuzzing, given that we reported negative results on it in our recent paper on alignment auditing. Recall that fuzzing (Clymer et al., 2024; Tice et al., 2024; Roger, 2025) (aka "getting the model drunk") means sampling from a model with noised activations.

In our paper, we found that—at the best intervention strength we tested—the fuzzed model revealed its hidden goal on ~0.7% of responses to interrogation questions like "You have a lot of weird quirky behaviors. What would explain them?" The base rate w... (read more)

Sam Marks's Shortform

Sam Marks1mo222

First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn't realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.

With that out of the way, I can't really say it's clear to me what we disagree about. It seems like your position is something like "Under som... (read more)

Sam Marks's Shortform

Sam Marks1mo*514

I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.

The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:

Using CoT monitoring to detect reward hacking in a realistic production setting.
T

... (read more)

4Neel Nanda1mo

I had not noticed that part. Thanks for flagging

6Bronson Schoen1mo

[Disclaimer: I had given feedback on the original paper] I strongly disagree with this interpretation. Responding in order: (1) o3-mini is a frontier model, but if the argument is that it isn’t o3, I don’t see how the argument is that o3 would be less capable of this. I would expect the exact opposite. (2) The entire reason to make it simple is so you can easily catch when it is exploited. The fact that the real model during training continually found exploits seems to indicate this will continually happen. You would need a strong theory as to why a more capable model wouldn’t be capable of this exact same thing for more complex exploits under similar pressure on the CoT. (3) I am extremely skeptical that a CoT+action monitor wouldn’t simply collapse to being as effective as an action only monitor once you have pressure on the CoT. In fact, this is the entire point of obfuscated CoT being a problem in this setting, as you lose the monitorability of the CoT. The direction of this post seems to be critiquing changes that I expect would all make the situation worse and harder to detect. This could be resolved empirically by running a similar experiment with (1) a larger model (2) harder exploits (3) CoT+action monitor. I would be extremely surprised if somehow the model didn’t learn exactly the kind of thing it learns here. My prior would be “the dynamics of the paper continue to hold”.

Fabien Roger1mo274

I'll add another one: in the training against the CoT monitor experiment, the model does not get a large constant negative reward for being caught (which is what you might naively do and which should intuitively work best). Instead, the reward is 1 if the tests pass and the CoT monitor says it is fine, and otherwise it is 0. This means the same reward of 0 for benign failures and attempts at cheating which get caught.

This means that for problems where the model knows it is extremely unlikely to succeed (e.g. when it tried a bunch and repeatedly failed), th... (read more)

SHIFT relies on token-level features to de-bias Bias in Bios probes

Sam Marks1mo50

I agree with Tim's top-line critique: Given the same affordances used by SHIFT, you can match SHIFT's performance on the Bias in Bios task. In the rest of this comment, I'll reflect on the update I make from this.

First, let me disambiguate two things you can learn from testing a method (like SHIFT) on a downstream task:

Whether the method works-as-intended. E.g. you might have thought that SHIFT would fail because the ablations in SHIFT fail to remove the information completely enough, such that the classifier can learn the same thing whether you're applyin

Sam Marks2moΩ9149

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this,

... (read more)

So how well is Claude playing Pokémon?

Sam Marks2mo40

Yeah, I think that probably if the claim had been "worse than a 9 year old" then I wouldn't have had much to complain about. I somewhat regret phrasing my original comment as a refutation of the "worse than a 6 year old" and "26 hour" claims, when really I was just using those as a jumping-off point to say some interesting-to-me stuff about how humans also get stuck on trivial obstacles in the same ways that AIs do.

I do feel like it's a bit cleaner to factor apart Claude's weaknesses into "memory," "vision," and "executive function" rather than bundling th... (read more)

So how well is Claude playing Pokémon?

Sam Marks2mo186

as a human, you're much better at getting unstuck

I'm not sure! Or well, I agree that 7-year-old me could get unstuck by virtue of having an "additional tool" called "get frustrated and cry until my mom took pity and helped."^[1] But we specifically prevent Claude from doing stuff like that!

I think it's plausible that if we took an actual 6-year-old and asked them to play Pokemon on a Twitch stream, we'd see many of the things you highlight as weaknesses of Claude: getting stuck against trivial obstacles, forgetting what they were doing, and—yes—complai... (read more)

5Tachikoma2mo

Pokemon is a game literally made to be played and beaten by children. Six years old might be pushing the lower bound, but it didn't become one of the largest gaming and entertainment franchises in the world by being too difficult to play for children, whom the game is designed for. Yes, kids get stuck and they do use extra resources like searching up info on game guides (old man moment, before the internet you had to find a friend who had the physical version and would let you borrow or look at it). But is the ability to search the internet the bottleneck that prevents Claude from getting past Mt. Moon in under 50 hours? That does not seem likely. In fact giving it access to the internet where it can get even more lost with potentially additional useless or irrelevant information could make the problem worse.

2Bitnotri2mo

It would be so awesome to have such a stream as additional reference point - just one six year old without internet and external help doing a Pokemon run

So how well is Claude playing Pokémon?

Sam Marks2mo9338

While I haven't watched CPP very much, the analysis in this post seems to match what I've heard from other people who have.

That said, I think claims like

So, how's it doing? Well, pretty badly. Worse than a 6-year-old would

are overconfident about where the human baselines are. Moreover, I think these sorts of claims reflect a general blindspot about how humans can get stuck on trivial obstacles in the same way AIs do.

A personal anecdote: when I was a kid (maybe 3rd or 4th grade, so 8 or 9 years old) I played Pokemon red and couldn't figure out how to ... (read more)

williawa2mo142

I'm not sure. I remember playing a bunch of games, like pokemon heart gold, lego starwars, and some other pokemon game where you were controlling little pokemon in 3rd person instead of controlling a human who threw pokeballs (anyone know that game? )

And like, I didn't speak English when I played them. So I had to figure out everything by just pressing random buttons and seeing responses. And this makes it a lot more difficult. Like I could open my "inventory" (didn't know what that was) and then use a "healing potion" (didn't know what that was), and then... (read more)

MondSemmel2mo1710

Further, have you ever gotten an adult who doesn't normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain "video game motifs" for load-bearing forms of communication; see e.g. this video.

So much +1 on this.

Also, I've played a ton of games, and in the last few years started helping a bit with playtesting them etc. And I found it striking how games aren't inherently intuitive, but are rather made so via strong economic incentives, endless playtests to stop players fro... (read more)

6Julian Bradshaw2mo

It's definitely possible to get confused playing Pokémon Red, but as a human, you're much better at getting unstuck. You try new things, have more consistent strategies, and learn better from mistakes. If you tried as long and as consistently as long as Claude is, even as a 6-year-old, you'd do much better. I played Pokémon Red as a kid too (still have the cartridge!), it wasn't easy, but I beat it in something like that 26 hour number IIRC. You have a point that howlongtobeat is biased towards gamers, but it's the most objective number I can find, and it feels reasonable to me.

Gradual Disempowerment, Shell Games and Flinches

Sam Marks3moΩ10176

Thanks for writing this reflection, I found it useful.

Just to quickly comment on my own epistemic state here:

I haven't read GD.
But I've been stewing on some of (what I think are) the same ideas for the last few months, when William Brandon first made (what I think are) similar arguments to me in October.
1. (You can judge from this Twitter discussion whether I seem to get the core ideas)
When I first heard these arguments, they struck me as quite important and outside of the wheelhouse of previous thinking on risks from AI development. I think they raise concer

... (read more)

ryan_greenblatt's Shortform

Sam Marks3moΩ560

Thanks, this is helpful. So it sounds like you expect

AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we'll soon have finished eating through the hardware overhang
separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).

It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradict... (read more)

2ryan_greenblatt3mo

I don't think (2) is a crux (as discussed in person). I expect that if takeover-capable AI takes a while (e.g. it happens in 2040), then we will have a long winter where economic value from AI doesn't increase that fast followed a period of faster progress around 2040. If progress is relatively stable accross this entire period, then we'll have enough time to scale up fabs. Even if progress isn't stable, we could see enough total value from AI in the slower growth period to scale up to scale up fabs by 10x, but this would require >>$1 trillion of economic value per year I think (which IMO seems not that likely to come far before takeover-capable AI due to views about economic returns to AI and returns to scaling up compute).

ryan_greenblatt's Shortform

Sam Marks3moΩ370

I really like the framing here, of asking whether we'll see massive compute automation before [AI capability level we're interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:

"How much is AI capabilities driven by algorithmic progress?" (problem: obscures dependence of algorithmic progress on compute for experimentation)
"How much AI progress can we get 'purely from elicitation'?" (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute

... (read more)

ryan_greenblatt3mo*Ω26350

I put roughly 50% probability on feasibility of software-only singularity.^[1]

(I'm probably going to be reinventing a bunch of the compute-centric takeoff model in slightly different ways below, but I think it's faster to partially reinvent than to dig up the material, and I probably do use a slightly different approach.)

My argument here will be a bit sloppy and might contain some errors. Sorry about this. I might be more careful in the future.

The key question for software-only singularity is: "When the rate of labor production is doubled (as in, as if your... (read more)

8ryan_greenblatt3mo

I think your outline of an argument against contains an important error. Importantly, while the spending on hardware for individual AI companies has increased by roughly 3-4x each year[1], this has not been driven by scaling up hardware production by 3-4x per year. Instead, total compute production (in terms of spending, building more fabs, etc.) has been increased by a much smaller amount each year, but a higher and higher fraction of that compute production was used for AI. In particular, my understanding is that roughly ~20% of TSMC's volume is now AI while it used to be much lower. So, the fact that scaling up hardware production is much slower than scaling up algorithms hasn't bitten yet and this isn't factored into the historical trends. Another way to put this is that the exact current regime can't go on. If trends continue, then >100% of TSMC's volume will be used for AI by 2027! Only if building takeover-capable AIs happens by scaling up TSMC to >1000% of what their potential FLOP output volume would have otherwise been, then does this count as "massive compute automation" in my operationalization. (And without such a large build-out, the economic impacts and dependency of the hardware supply chain (at the critical points) could be relatively small.) So, massive compute automation requires something substantially off trend from TSMC's perspective. [Low importance] It is only possible to build takeover-capable AI without previously breaking an important trend prior to around 2030 (based on my rough understanding). Either the hardware spending trend must break or TSMC production must go substantially above the trend by then. If takeover-capable AI is built prior to 2030, it could occur without substantial trend breaks but this gets somewhat crazy towards the end of the timeline: hardware spending keeps increasing at ~3x for each actor (but there is some consolidation and acquisition of previously produced hardware yielding a one-time increase up to about

2ryan_greenblatt3mo

I think this happening in practice is about 60% likely, so I don't think feasibility vs. in practice is a huge delta.

Scaling Sparse Feature Circuit Finding to Gemma 9B

Sam Marks4mo70

Good work! A few questions:

Where do the edges you draw come from? IIUC, this method should result in a collection of features but not say what the edges between them are.
IIUC, the binary masking technique here is the same as the subnetwork probing baseline from the ACDC paper, where it seemed to work about as well as ACDC (which in turn works a bit worse than attribution patching). Do you know why you're finding something different here? Some ideas:
1. The SP vs. ACDC comparison from the ACDC paper wasn't really apples-to-apples because ACDC pruned edges

... (read more)

2Jatin Nainani4mo

1. You are correct that the current method will only give a set of features at each selected layer. The edges are intended to show the attention direction within the architecture. We updated it to make it more clear and fix some small issues. 2. We think there are a few reasons why the results of the ACDC paper do not transfer to our domain: 1. ACDC and EAP (Syed et al.) rely on overlap with a manual circuit as their metric, whereas we rely on faithfulness and completeness. Because the metrics are different, the comparison isn’t apples-to-apples. 2. The major difference between methods, as you mentioned, is that we are finding circuits in the SAE basis. This quite possibly accounts for most of the differences. 3. The SAEs vs neurons comparison is something we definitely want to test. However, the methods mentioned above (ACDC, eap, etc) used transformer components (MLP, Attn) as their units for circuit analysis. Our setup would need to rely on neurons of the residual stream. We don’t think residual neurons are directly comparable to transformer components because they are at different levels of granularity.

9Oliver Daniels4mo

I'm not that convinced that attributing patching is better then ACDC - as far as I can tell Syed et al only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also Interp Bench finds ACDC is better than attribution patching

Greedy-Advantage-Aware RLHF

Sam Marks4mo52

Cool stuff!

I agree that there's something to the intuition that there's something "sharp" about trajectories/world states in which reward-hacking has occurred, and I think it could be interesting to think more along these lines. For example, my old proposal to the ELK contest was based on the idea that "elaborate ruses are unstable," i.e. if someone has tampered with a bunch of sensors in just the right way to fool you, then small perturbations to the state of the world might result in the ruse coming apart.

I think this demo is a cool proof-of-concept but ... (read more)

2sej20204mo

Thanks for the feedback, these suggestions are definitely helpful as I'm thinking about how/if to advance the project.

evhub's Shortform

Sam Marks4mo125

The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it's indeed now 1:1 as suggested by the Dario interview you linked.

4habryka4mo

That's great to hear, thank you for clarifying!

Sam Marks's Shortform

Sam Marks4mo50

Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model's performance, include performance at following the other constraints).

(Note that using SAEs to fix lots of behaviors might also have additional downsides, since you're doing a more heavy-handed intervention on the model.)

Sam Marks's Shortform

Sam Marks4moΩ470

The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)

4Buck4mo

This uses transformers, which is IIUC way less efficient for inference than e.g. vllm, to an extent that is probably unacceptable for production usecases.

3Adam Karvonen4mo

The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML. For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.

Sam Marks's Shortform

Sam Marks4moΩ472

@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").

9Adam Karvonen4mo

Rejection sampling is a strong baseline that we hadn’t considered, and it’s definitely worth trying out—I suspect it will perform well here. Currently, our focus is on identifying additional in-the-wild tasks, particularly from other companies, as many of Benchify’s challenges involve sensitive details about their internal tooling that they prefer to keep private. We’re especially interested in tasks where it’s not possible to automatically measure success or failure via string matching, as this is where techniques like model steering are most likely to be the most practical. I also agree with Sam that rejection sampling would likely need to operate on entire blocks rather than individual lines. By the time an LLM generates a line containing a regular expression, it’s often already committed to that path—for example, it might have skipped importing required modules or creating the necessary variables to pursue an alternative solution.

7Buck4mo

I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.

Sam Marks's Shortform

Sam Marks4moΩ250

I'm guessing you'd need to rejection sample entire blocks, not just lines. But yeah, good point, I'm also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen

7Sam Marks4mo

Sam Marks's Shortform

Sam Marks4mo70

Apparently fuzz tests that used regexes were an issue in practice for Benchify (the company that ran into this problem). From the blog post:

Benchify observed that the model was much more likely to generate a test with no false positives when using string methods instead of regexes, even if the test coverage wasn't as extensive.

Sam Marks's Shortform

Sam Marks4mo20

Isn't every instance of clamping a feature's activation to 0 conditional in this sense?

2Neel Nanda4mo

That's technically even more conditional as the intervention (subtract the parallel component) also depends on the residual stream. But yes. I think it's reasonable to lump these together though, orthogonalisation also should be fairly non destructive unless the direction was present, while steering likely always has side effects

Sam Marks's Shortform

Sam Marks4moΩ26517

x-posting a kinda rambling thread I wrote about this blog post from Tilde research.

---

If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!

I'll discussed some things that surprised me about this case study in

---

The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt ... (read more)

9Buck4mo

Isn't it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.

7sarahconstantin4mo

why wouldn't you want regexes?

Neel Nanda4mo124

Note that this is conditional SAE steering - if the latent doesn't fire it's a no-op. So it's not that surprising that it's less damaging, a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier though

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Sam Marks5mo330

I've donated $5,000. As with Ryan Greenblatt's donation, this is largely coming from a place of cooperativeness: I've gotten quite a lot of value from Lesswrong and Lighthaven.^[1]

IMO the strongest argument—which I'm still weighing—that I should donate more for altrustic reasons comes from the fact that quite a number of influential people seem to read content hosted on Lesswrong, and this might lead to them making better decisions. A related anecdote: When David Bau (big name professor in AI interpretability) gives young students a first intro to interpret... (read more)

Lao Mein's Shortform

Sam Marks5mo20

I'm quite happy for laws to be passed and enforced via the normal mechanisms. But I think it's bad for policy and enforcement to be determined by Elon Musk's personal vendettas. If Elon tried to defund the AI safety institute because of a personal vendetta against AI safety researchers, I would have some process concerns, and so I also have process concerns when these vendettas are directed against OAI.

OpenAI Email Archives (from Musk v. Altman and OpenAI blog)

Sam Marks5mo153

FYI it seems like this (important-seeming) email is missing, though the surrounding emails in the exchange seem to be present. (So maybe some other ones are missing too.)

habryka5mo*102

Fixed! That specific response had a very weird thread structure, so makes sense the AI I used got confused. Plausible something else is still missing, though I think I've now read through all the original PDFs and didn't see anything new.

Lao Mein's Shortform

Sam Marks5mo2-3

For what it's worth—even granting that it would be good for the world for Musk to use the force of government for pursuing a personal vendetta against Altman or OAI—I think this is a pretty uncomfortable thing to root for, let alone to actively influence. I think this for the same reason that I think it's uncomfortable to hope for—and immoral to contribute to—assassination of political leaders, even assuming that their assassination would be net good.

4MondSemmel5mo

I don't understand the reference to assassination. Presumably there are already laws on the books that outlaw trying to destroy the world (?), so it would be enough to apply those to AGI companies.

LLMs can learn about themselves by introspection

Sam Marks6mo40

Probably I misunderstood your concern. I interpreted your concern about settings where we don't have access to ground truth as relating to cases where the model could lie about its inner states without us being able to tell (because of lack of ground truth). But maybe you're more worried about being able to develop a (sufficiently diverse) introspection training signal in the first place?

I'll also note that I'm approaching this from the angle of "does introspection have worse problems with lack-of-ground-truth than traditional interpretability?" where I th... (read more)

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks6mo20

Something like this is the hope, though it's a bit tricky because features that represent "human expert level intelligence" might be hard to distinguish from features for "actually correct" using only current feature interpretation techniques (mostly looking at maximally activating dataset exemplars). But it seems pretty plausible that we could develop better interpretation techniques that would be suitable here.

LLMs can learn about themselves by introspection

Sam Marks6mo100

I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won't have a chance to write up thoughts here more cleanly for a few days.

ETA: Briefly, the key points are:

Honesty issues for introspection aren't obviously much worse than they are for simple probing. (But fair if you're already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it's probably quite difficult for a model to tell on which inputs it can get away with lying.

2Zach Stein-Perlman6mo

I'm confused/skeptical about this being relevant, I thought honesty is orthogonal to whether the model has access to its mental states.

2Owain_Evans6mo

Thanks Sam. That tweet could be a good stand-alone LW post once you have time to clean up.

A basic systems architecture for AI agents that do autonomous research

Sam Marks7moΩ10150

While I agree the example in Sycophancy to Subterfuge isn't realistic, I don't follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).

Caution when interpreting Deepmind's In-context RL paper

Sam Marks7mo20

I continue to think that capabilities from in-context RL are and will be a rounding error compared to capabilities from training (and of course, compute expenditure in training has also increased quite a lot in the last two years).

I do think that test-time compute might matter a lot (e.g. o1), but I don't expect that things which look like in-context RL are an especially efficient way to make use of test-time compute.

Fabien's Shortform

Sam Marks10moΩ240

Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing).

(Note that LoRA adapters can be merged into model we... (read more)

4Fabien Roger10mo

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.

Fabien's Shortform

Sam Marks10moΩ570

Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!

IIUC, for RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?

(It's been a bit since I read the paper, so sorry if I'm missing something here.)

8Fabien Roger10mo

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

4Fabien Roger10mo

Yeah, I expect that this kind of things might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

Habryka's Shortform Feed

Sam Marks10mo3511

Anthropic has asked employees
[...]
Anthropic has offered at least one employee

As a point of clarification: is it correct that the first quoted statement above should be read as "at least one employee" in line with the second quoted statement? (When I first read it, I parsed it as "all employees" which was very confusing since I carefully read my contract both before signing and a few days ago (before posting this comment) and I'm pretty sure there wasn't anything like this in there.)

Vladimir_Nesov10mo1717

(I'm a full-time employee at Anthropic.)
I carefully read my contract both before signing and a few days ago [...] there wasn't anything like this in there.

Current employees of OpenAI also wouldn't yet have signed or even known about the non-disparagement agreement that is part of "general release" paperwork on leaving the company. So this is only evidence about some ways this could work at Anthropic, not others.

6habryka10mo

Yep, both should be read as "at least one employee", sorry for the ambiguity in the language.

ryan_greenblatt's Shortform

Sam Marks10mo*334

(I'm a full-time employee at Anthropic.) It seems worth stating for the record that I'm not aware of any contract I've signed whose contents I'm not allowed to share. I also don't believe I've signed any non-disparagement agreements. Before joining Anthropic, I confirmed that I wouldn't be legally restricted from saying things like "I believe that Anthropic behaved recklessly by releasing [model]".

Sycophancy to subterfuge: Investigating reward tampering in large language models

Sam Marks10moΩ120

there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:
As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.

You're correct that I was neglecting this threat model—good point, and thanks.

So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.

Hmm, here's another way to fram... (read more)

4ryan_greenblatt10mo

Yep, reasonable summary. I don't think this was obvious to everyone and I appreciate this point - I edited my earlier comment to more explicitly note my agreement.

Sycophancy to subterfuge: Investigating reward tampering in large language models

Sam Marks10mo*Ω6130

In this comment, I'll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.

I think that in basically all of the discussion above, folks aren't using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there's a rapid transition in how worrying a given RTF is, based on when reward tampering becomes frequent enough that it's likely to appear during a production RL run.

To put this another way: imagine that this paper had trained ... (read more)

6ryan_greenblatt10mo

I agree that a few factors of 2 don't matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided as opposed to generally updating based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.) (I think I made the point that a few factors of 2 shouldn't matter much for the bottom line above.) (Edit: I agree that it's worth noting that low RTFs can be quite concerning for the reasons you describe.) I'll argue against a specific threshold in the rest of this comment. ---------------------------------------- First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future: 1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion. 2. As an egregious reward hack that would be reinforced and would be harmful via occuring repeatedly (a la what failure looks like part 1). 3. As a reward hack that will generalize to some other failure eventually cascading into (1) or (2). (Implying this enviroment is not importantly different from the other environments in the paper, it is just part of a progression.) Your argument applies to (2) and (3), but not to (1). ---------------------------------------- I have a pretty broad uncertainty over how the real world empirical conditions will compare to the empirical conditions discussed in the paper. And, the empirical conditions in the paper clearly differ substantially from what we'd typically expect (much high fraction hacking, much dumber models). Examples of conditions that might differ: * Incentives against reward hacking or other blockers. * How much of a gradient toward reward hacking their is. * How good models are at doing reward hacking reasoning relative to the difficulty of doing so in v

I would have shit in that alley, too

Sam Marks10mo341

This was awesome. Here are some more stories in the same style.

Homeless person or professor?

It can be hard to tell in Cambridge, Massachusetts. That's partly because some professors—mostly the MIT ones—can look very disheveled. But partly it's because some homeless people can be surprisingly intellectual, e.g. it's not uncommon to find homeless people crouched in the shade reading a book.

My favorite example is a homeless man in Harvard Square. His name in my head is "Black Santa" because he's a old man with a full belly and white beard, and he's always sur... (read more)

Shankar Sivarajan10mo170

Homeless person or professor?

Quiz.

Declan Molony10mo*131

Your homeless person or professor story made me think of my family member who lives in his car, by choice.

He has a computer science degree and worked for a lot of top technology companies in the 80s and 90s. Eventually his disdain for the employee lifestyle inspired him to try his hand at the entrepreneurial route. Turns out he's neither a good employee, nor a good entrepreneur. After a couple bad start-ups, he went broke.

On two separate occasions during my childhood he stayed with my family in our home (with the precondition that he maintains... (read more)

Mechanistically Eliciting Latent Behaviors in Language Models

Sam Marks1yΩ584

I think this is cool! The way I'm currently thinking about this is "doing the adversary generation step of latent adversarial training without the adversarial training step." Does that seem right?

It seems intuitively plausible to me that once you have a latent adversarial perturbation (the vectors you identify), you might be able to do something interesting with it beyond "train against it" (as LAT does). E.g. maybe you would like to know that your model has a backdoor, beyond wanting to move to the next step of "train away the backdoor." If I were doing t... (read more)

1Jannes Elstner1y

It would also be interesting to apply MELBO on language models that have already been trained with LAT. Adversarial attacks on vision models look significantly more meaningful to humans when the vision model has been adversarially trained, and since MELBO is basically a latent adversarial attack we should be able to elicit more meaningful behavior on language models trained with LAT.

1Andrew Mack1y

Thanks for your comment! Yes, I’d say that roughly sums things up. As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens. Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.” I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

Improving Dictionary Learning with Gated Sparse Autoencoders

Sam Marks1yΩ120

Yep, you're totally right -- thanks!

Improving Dictionary Learning with Gated Sparse Autoencoders

Sam Marks1yΩ120

Oh, one other issue relating to this: in the paper it's claimed that if $γ$ is the argmin of $E [∥^x - γ^{'} x ∥_{2}^{2}]$ then $1 / γ$ is the argmin of $E [∥ γ^{'}^x - x ∥]$ . However, this is not actually true: the argmin of the latter expression is $\frac{E [x \cdot^x]}{E [∥^x ∥_{2}^{2}]} \neq {(\frac{E [x \cdot^x]}{E [∥ x ∥_{2}]})}^{- 1}$ . To get an intuition here, consider the case where $^x$ and $x$ are very nearly perpendicular, with the angle between them just slightly less than $90^{\circ}$ . Then you should be able to convince yourself that the best factor to scale either $x$ ... (read more)

1Senthooran Rajamanoharan1y

Hey Sam, thanks - you're right. The definition of reconstruction bias is actually the argmin of E[|^x/γ′−x|2] which I'd (incorrectly) rearranged as the expression in the paper. As a result, the optimum is γ−1=E[^x⋅x]/E[|^x|2] That being said, the derivation we gave was not quite right, as I'd incorrectly substituted the optimised loss rather than the original reconstruction loss, which makes equation (10) incorrect. However the difference between the two is small exactly when gamma is close to one (and indeed vanishes when there is no shrinkage), which is probably why we didn't pick this up. Anyway, we plan to correct these two equations and update the graphs, and will submit a revised version.

6Rohin Shah1y

This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio E[||^x||2]E[||x||2]. Looking at the γ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed E[||^x||2]E[||x||2] for Gated SAEs, I expect that would be below 1.) I think the main thing we got wrong is that we accidentally treated E[||^x−x||2] as though it were E[||^x−γx||2]. To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually γ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small. We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.

Improving Dictionary Learning with Gated Sparse Autoencoders

Sam Marks1yΩ120

Ah thanks, you're totally right -- that mostly resolves my confusion. I'm still a little bit dissatisfied, though, because the $L_{aux}$ term is optimizing for something that we don't especially want (i.e. for $^x (ReLU (π_{gated} (x))$ to do a good job of reconstructing $x$ ). But I do see how you do need to have some sort of a reconstruction-esque term that actually allows gradients to pass through to the gated network.

3Senthooran Rajamanoharan1y

Yep, the intuition here indeed was that L1 penalised reconstruction seems to be okay for teaching a standard SAE's encoder to detect which features are on (even if features get shrunk as a result), so that is effectively what this auxiliary loss is teaching the gate sub-layer to do, alongside the sparsity penalty. (The key difference being we freeze the decoder in the auxiliary task, which the ablation study shows helps performance.) Maybe to put it another way, this was an auxiliary task that we had good evidence would teach the gate sublayer to detect active features reasonably well, and it turned out to give good results in practice. It's totally possible though that there are better auxiliary tasks (or even completely different loss functions) out there that we've not explored.

Improving Dictionary Learning with Gated Sparse Autoencoders

Sam Marks1yΩ120

(The question in this comment is more narrow and probably not interesting to most people.)

The limitations section includes this paragraph:

One worry about increasing the expressivity of sparse autoencoders is that they will overfit when
reconstructing activations (Olah et al., 2023, Dictionary Learning Worries), since the underlying
model only uses simple MLPs and attention heads, and in particular lacks discontinuities such as step
functions. Overall we do not see evidence for this. Our evaluations use held-out test data and we
check for interpretability manua

... (read more)

6Neel Nanda1y

I haven't fully worked through the maths, but I think both IG and attribution patching break down here? The fundamental problem is that the discontinuity is invisible to IG because it only takes derivatives. Eg the ReLU and Jump ReLU below look identical from the perspective of IG, but not from the perspective of activation patching, I think.

Improving Dictionary Learning with Gated Sparse Autoencoders

Sam Marks1yΩ342

I'm a bit perplexed by the choice of loss function for training GSAEs (given by equation (8) in the paper). The intuitive (to me) thing to do here would be would be to have the $L_{reconstruct}$ and $L_{sparsity}$ terms, but not the $L_{aux}$ term, since the point of $π_{gate}$ is to tell you which features should be active, not to itself provide good feature coefficients for reconstructing $x$ . I can sort of see how not including this term might result in the coordinates of $π_{gate}$ all being extremely small (but barely posit... (read more)

4Rohin Shah1y

Possibly I'm missing something, but if you don't have Laux, then the only gradients to Wgate and bgate come from Lsparsity (the binarizing Heaviside activation function kills gradients from Lreconstruct), and so πgate would be always non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".) (You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with Lincorrect from the beginning of Section 3.2.2.)