All of nostalgebraist's Comments + Replies

I saw some discussion of this incident in the Eleuther discord on 3/30, including a screenshot of the system message containing the "emulate the tone" line.  So it's not an April Fools' thing.

Very impressive!  At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various "cool-looking patterns" that can be extracted from activations.

I'm curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning.

For example, you discuss an "obscured... (read more)

If I understand what you're saying here, it's true but fairly well-known?  See e.g. footnote 26 of the post "Simulators."

My favorite way of looking at this is:

The usual intuitive view of causal attention is that it's an operation that "looks back" at earlier positions. At each position i, it computes a "query" based on information from position i, and this query is used to search over "keys and values" computed at positions i-1, i-2, etc. (as well as i itself).

OK, so at each position, attention computes a query.  What makes a query "good"?  ... (read more)
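
To make this picture concrete, here's a minimal single-head numpy sketch of causal attention (my own illustrative code, not any particular implementation):

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). Each position i builds a query from its own
    information and searches over keys/values at positions <= i."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # scores[i, j] = how well query_i matches key_j
    causal_mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)  # position i can't look at j > i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # mix of values at positions <= i

rng = np.random.default_rng(0)
d_model = 16
x = rng.normal(size=(5, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3))
print(causal_attention(x, Wq, Wk, Wv).shape)  # (5, 16)
```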

nostalgebraist

ICYMI, the same argument appears in the METR paper itself, in section 8.1 under "AGI will have 'infinite' horizon length."

The argument makes sense to me, but I'm not totally convinced.

In METR's definition, they condition on successful human task completion when computing task durations.  This choice makes sense in their setting for reasons they discuss in B.1.1, but things would get weird if you tried to apply it to extremely long/hard tasks.

If a typical time-to-success for a skilled human at some task is ~10 years, then the task is probably so ambiti... (read more)

Daniel Kokotajlo
I found this comment helpful, thanks! The bottom line is basically "Either we definite horizon length in such a way that the trend has to be faster than exponential eventually (when we 'jump all the way to AGI') or we define it in such a way that some unknown finite horizon length matches the best humans and thus counts as AGI." I think this discussion has overall made me less bullish on the conceptual argument and more interested in the intuition pump about the inherent difficulty of going from 1 to 10 hours being higher than the inherent difficulty of going from 1 to 10 years.

Originally known as "past cache" after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019, see commit ffd6238. The invention has not been described in the literature AFAIK, and it's entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this

KV caching (using the terminology "fast decoding" and "cache") existed even in the original "Attention is All You Need" implementation of an enc-dec transformer.  It was added on Sep 21 2017... (read more)
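
For anyone unfamiliar with the trick itself, it's tiny — a minimal single-head numpy sketch of decoding with a KV cache (illustrative only, not any real implementation):

```python
import numpy as np

def attend(q, K, V):
    """One new query vector q attends over all cached keys K and values V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []               # the "past" cache
for step in range(5):                   # autoregressive decoding, one token per step
    x_t = rng.normal(size=d)            # embedding of the token generated at this step
    K_cache.append(x_t @ Wk)            # keys/values of earlier tokens are never recomputed
    V_cache.append(x_t @ Wv)
    q_t = x_t @ Wq                      # only the newest position needs a fresh query
    out_t = attend(q_t, np.stack(K_cache), np.stack(V_cache))
print(out_t.shape)                      # (16,)
```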

Your list of "actual arguments" against explosive growth seems to be missing the one that is by far the most important/convincing IMO, namely Baumol effects.

This argument has been repeatedly brought up by growth economists in earlier rounds from the AI-explosive-growth debate.  So rather than writing my own version of this argument, I'll just paste some quotes below.

As far as I can tell, the phenomenon discussed in these quotes is excluded by construction from the GATE model: while it draws a distinction between different "tasks" on the production sid... (read more)

Steven Byrnes
I second the general point that GDP growth is a funny metric … it seems possible (as far as I know) for a society to invent every possible technology, transform the world into a wild sci-fi land beyond recognition or comprehension each month, etc., without quote-unquote “GDP growth” actually being all that high — cf. What Do GDP Growth Curves Really Mean? and follow-up Some Unorthodox Ways To Achieve High GDP Growth with (conversely) a toy example of sustained quote-unquote “GDP growth” in a static economy. This is annoying to me, because, there’s a massive substantive worldview difference between people who expect, y’know, the thing where the world transforms into a wild sci-fi land beyond recognition or comprehension each month, or whatever, versus the people who are expecting something akin to past technologies like railroads or e-commerce. I really want to talk about that huge worldview difference, in a way that people won’t misunderstand. Saying “>100%/year GDP growth” is a nice way to do that … so it’s annoying that this might be technically incorrect (as far as I know). I don’t have an equally catchy and clear alternative. (Hmm, I once saw someone (maybe Paul Christiano?) saying “1% of Earth’s land area will be covered with solar cells in X number of years”, or something like that. But that failed to communicate in an interesting way: the person he was talking to treated the claim as so absurd that he must have messed up by misplacing a decimal point :-P ) (Will MacAskill has been trying “century in a decade”, which I think works in some ways but gives the wrong impression in other ways.)

The list doesn't exclude Baumol effects as these are just the implication of:

  • Physical bottlenecks and delays prevent growth. Intelligence only goes so far.
  • Regulatory and social bottlenecks prevent growth this fast, INT only goes so far.

Like Baumol effects are just some area of the economy with more limited growth bottlenecking the rest of the economy. So, we might as well just directly name the bottleneck.

Your argument seems to imply you think there might be some other bottleneck like:

  • There will be some cognitive labor sector of the economy which
... (read more)

This is a very low-quality paper.

Basically, the paper does the following:

  • A 1-layer LSTM gets inputs of the form [operand 1][operator][operand 2], e.g. 1+2 or 3*5
  • It is trained (I think with a regression loss? but it's not clear[1]) to predict the numerical result of the binary operation
  • The paper proposes an auxiliary loss that is supposed to improve "compositionality."
    • As described in the paper, this loss is the average squared difference between successive LSTM hidden states (see the sketch below)
    • But, in the actual code, what it actually is, instead, is the average squared differen
... (read more)
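
For concreteness, here's a minimal PyTorch sketch of the auxiliary loss as the paper's text describes it (my reconstruction of the described version, not the released code):

```python
import torch

def compositionality_aux_loss(hidden_states: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss as described in the paper text: mean squared difference
    between successive LSTM hidden states. hidden_states: (seq_len, batch, hidden_dim)."""
    diffs = hidden_states[1:] - hidden_states[:-1]   # h_{t+1} - h_t at each step
    return diffs.pow(2).mean()

# Toy usage: a 1-layer LSTM over embeddings of "[operand 1][operator][operand 2]"
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, num_layers=1)
x = torch.randn(3, 4, 8)                 # seq_len=3 (op1, operator, op2), batch=4
outputs, _ = lstm(x)                     # hidden state at every step: (3, 4, 16)
aux = compositionality_aux_loss(outputs)
print(aux)
```
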
momom2
The first thing that comes to mind is to beg the question of what proportion of human-generated papers are publishing-worthier (since a lot of them are slop), but let's not forget that publication matters little for catastrophic risk, it's actually getting results that would be important. So I recommend not updating at all on AI risk based on Sakana's results (or updating negatively if you expected that R&D automation would come faster, or that this might slow down human augmentation).

Here's why I'm wary of this kind of argument:

First, we know that labs are hill-climbing on benchmarks.

Obviously, this tends to inflate model performance on the specific benchmark tasks used for hill-climbing, relative to "similar" but non-benchmarked tasks.

More generally and insidiously, it tends to inflate performance on "the sort of things that are easy to measure with benchmarks," relative to all other qualities that might be required to accelerate or replace various kinds of human labor.

If we suppose that amenability-to-benchmarking correlates with var... (read more)

Daniel Paleka
I think the labs might well be rational in focusing on this sort of "handheld automation", just to enable their researchers to code experiments faster and in smaller teams. My mental model of AI R&D is that it can be bottlenecked roughly by three things: compute, engineering time, and the "dark matter" of taste and feedback loops on messy research results. I can certainly imagine a model of lab productivity where the best way to accelerate is improving handheld automation for the entirety of 2025. Say, the core paradigm is fixed; but inside that paradigm, the research team has more promising ideas than they have time to implement and try out on smaller-scale experiments; and they really do not want to hire more people. If you consider the AI lab as a fundamental unit that wants to increase its velocity, and works on things that make models faster, it's plausible they can be aware how bad the model performance is on research taste, and still not be making a mistake by ignoring your "dark matter" right now. They will work on it when they are faster.
nostalgebraist

But if given the choice between "nice-sounding but false" vs  "bad-sounding but true", it seems possible that the users' companies, in principle, would prefer true reasoning versus false reasoning. Maybe especially because it is easier to spot issues when working with LLMs. E.g. Maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.

This definitely aligns with my own experience so far.

On the day Claude 3.7 Sonnet was announced, I happened to be in the middle of a frustrating struggle with o3... (read more)

Daniel Kokotajlo
Indeed these are some reasons for optimism. I really do think that if we act now, we can create and cement an industry standard best practice of keeping CoT's pure (and also showing them to the user, modulo a few legitimate exceptions, unlike what OpenAI currently does) and that this could persist for months or even years, possibly up to around the time of AGI, and that this would be pretty awesome for humanity if it happened.

The quoted sentence is about what people like Dario Amodei, Miles Brundage, and @Daniel Kokotajlo predict that AI will be able to do by the end of the decade.

And although I haven't asked them, I would be pretty surprised if I were wrong here, hence "surely."

In the post, I quoted this bit from Amodei:

It can engage in any actions, communications, or remote operations enabled by this interface, including taking actions on the internet, taking or giving directions to humans, ordering materials, directing experiments, watching videos, making videos, and so on.

... (read more)
casens
this is a fair response, and to be honest i was skimming your post a bit. i do think my point somewhat holds, that there is no "intelligence skill tree" where you must unlock the level 1 skills before you progress to level 2. i think a more fair response to your post is: 1. companies are trying to make software engineer agents, not bloggers, so the optimization is towards the former. 2. making a blog that's actually worth reading is hard. no one reads 99% of blogs. 3. i wouldn't act so confident that we aren't surrounded by LLM comments and posts. are you really sure that everything you're reading is from a human? all the random comments and posts you see on social media, do you check every single one of them to gauge if they're human? 4. lots of dumb bots can just copy posts and content written by other people and still make an impact. scammers and propagandists can just pay an indian or filipino $2/hr and get pretty good. writing original text is not a bottleneck.

The discourse around this model would benefit a lot from (a greater number of) specific examples where the GPT-4.5 response is markedly and interestingly different from the response of some reference model.

Karpathy's comparisons are a case in point (of the absence I'm referring to).  Yes, people are vehemently disputing which responses were better, and whether the other side has "bad taste"... but if you didn't know what the context was, the most obvious property of the pairs would be how similar they are.

And how both options are bad (unfunny standup,... (read more)

Vladimir_Nesov
I think most of the trouble is conflating recent models like GPT-4o with GPT-4, when they are instead ~GPT-4.25. It's plausible that some already use 4x-5x compute of original GPT-4 (an H100 produces 3x compute of an A100), and that GPT-4.5 uses merely 3x-4x more compute than any of them. The distance between them and GPT-4.5 in raw compute might be quite small. It shouldn't be at all difficult to find examples where GPT-4.5 is better than the actual original GPT-4 of March 2023, it's not going to be subtle. Before ChatGPT there were very few well-known models at each scale, but now the gaps are all filled in by numerous models of intermediate capability. It's the sorites paradox, not yet evidence of slowdown.

Consider the following comparison prompt, which is effectively what all the prompts in the terminal illness experiment are [...]

I think this pretty clearly implies mutual exclusivity, so I think the interpretation problem you're worried about may be nonexistent for this experiment.

Wait, earlier, you wrote (my emphasis):

We intentionally designed the prompts this way, so the model would just be evaluating two states of the world implied by hearing the news (similar to belief distributions in a POMDP setting). The comparison prompt is not designed to be mutually

... (read more)
Mantas Mazeika
I think this conversation is taking an adversarial tone. I'm just trying to explain our work and address your concerns. I don't think you were saying naive things; just that you misunderstood parts of the paper and some of your concerns were unwarranted. That's usually the fault of the authors for not explaining things clearly, so I do really appreciate your interest in the paper and willingness to discuss.

Thank you for the detailed reply!

I'll respond to the following part first, since it seems most important to me:

We intentionally designed the prompts this way, so the model would just be evaluating two states of the world implied by hearing the news (similar to belief distributions in a POMDP setting). The comparison prompt is not designed to be mutually exclusive; rather, we intended for the outcomes to be considered relative to an assumed baseline state.

For example, in the terminal illness experiment, we initially didn't have the "who would otherwise die"

... (read more)
Mantas Mazeika
Hey, thanks for the reply.   Huh, we didn't have this problem. We just used n=1 and temperature=1, which is what our code currently uses if you were running things with our codebase. Our results are fairly reproducible (e.g., nearly identical exchange rates across multiple runs). In case it helps, when I try out that prompt in the OpenAI playground, I get >95% probability of choosing the human. I haven't checked this out directly on the API, but presumably results are similar, since this is consistent with the utilities we observe. Maybe using n>1 is the issue? I'm not seeing any nondeterminism issues in the playground, which is presumably n=1. What's important here, and what I would be interested in hearing your thoughts on, is that gpt-4o-mini is not ranking dollar values highly compared to human lives. Many of your initial concerns were based on the assumption that gpt-4o-mini was ranking dollar values highly compared to human lives. You took this to mean that our results must be flawed in some way. I agree that this would be surprising and worth looking into if it were the case, but it is not the case.   I think you're basing this on a subjective interpretation of our exchange rate results. When we say "GPT-4o places the value of Lives in the United States significantly below Lives in China, which it in turn ranks below Lives in Pakistan", we just mean in terms of the experiments that we ran, which are effectively for utilities over POMDP-style belief distributions conditioned on observations. I personally think "valuing lives from country X above country Y" is a fair interpretation when one is considering deviations in a belief distribution with respect to a baseline state, but it's fair to disagree with that interpretation. More importantly, the concerns you have about mutual exclusivity are not really an issue for this experiment in the first place, even if one were to assert that our interpretation of the results is invalid. Consider the following comp

There's a math paper by Ghrist, Gould and Lopez which was produced with a nontrivial amount of LLM assistance, as described in its Appendix A and by Ghrist in this thread (but see also this response).

The LLM contributions to the paper don't seem especially impressive. The presentation is less "we used this technology in a real project because it saved us time by doing our work for us," and more "we're enthusiastic and curious about this technology and its future potential, and we used it in a real project because we're enthusiasts who use it in whatever we... (read more)

If this kind of approach to mathematics research becomes mainstream, out-competing humans working alone, that would be pretty convincing. So there is nothing that disqualifies this example - it does update me slightly.

However, this example on its own seems unconvincing for a couple of reasons:

  • it seems that the results were in fact proven by humans first, calling into question the claim that the proof insight belonged to the LLM (even though the authors try to frame it that way).
  • from the reply on X it seems that the results of the paper may not have been no
... (read more)

Interesting paper.  There is definitely something real going on here.

I reproduced some of the results locally using the released code and tried some variants on them as well. 

Based on my findings, I think these results – particularly the numerical magnitudes as opposed to rankings – are heavily influenced by the framing of the question, and that the models often aren't interpreting your prompt in the way the paper (implicitly) does.

tl;dr:

  • I find that GPT-4o and GPT-4o-mini display much weaker relative preferences among religions and nations when I
... (read more)
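
If anyone wants to poke at this themselves, here's roughly the kind of query involved — a minimal sketch of a two-option comparison with order-averaging to wash out position bias (my own reconstruction with placeholder prompt wording, not the paper's code):

```python
import math
from openai import OpenAI

client = OpenAI()

def choice_prob(option_a: str, option_b: str, model: str = "gpt-4o-mini") -> float:
    """Estimate P(model prefers option_a), averaged over both presentation orders."""
    probs = []
    for first, second, a_is_first in [(option_a, option_b, True), (option_b, option_a, False)]:
        prompt = (
            "The following two options describe observations about the state of the world. "
            "Which implied state of the world would you prefer?\n\n"
            f"Option A: {first}\nOption B: {second}\n\n"
            "Please respond with only \"A\" or \"B\"."
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        p = {t.token.strip(): math.exp(t.logprob) for t in top}
        p_a_label = p.get("A", 0.0) / max(p.get("A", 0.0) + p.get("B", 0.0), 1e-9)
        probs.append(p_a_label if a_is_first else 1.0 - p_a_label)
    return sum(probs) / len(probs)

print(choice_prob("You receive $30.", "A person who would otherwise die is saved from terminal illness."))
```
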
Mantas Mazeika
Hey, first author here. Thanks for running these experiments! I hope the following comments address your concerns. In particular, see my comment below about getting different results in the API playground for gpt-4o-mini. Are you sure that it picked the $30 when you tried it?   You can use these utilities to estimate that, but for this experiment we included dollar value outcomes as background outcomes to serve as a "measuring stick" that sharpens the utility estimates. Ideally we would have included the full set of 510 outcomes, but I never got around to trying that, and the experiments were already fairly expensive. In practice, these background outcomes didn't really matter for the terminal illness experiment, since they were all ranked at the bottom of the list for the models we tested.   Am I crazy? When I try that prompt out in the API playground with gpt-4o-mini it always picks saving the human life. As mentioned above, the dollar value outcomes didn't really come into play in the terminal illness experiment, since they were nearly all ranked at the bottom. We did observe that models tend to rationalize their choice after the fact when asked why they made that choice, so if they are indifferent between two choices (50-50 probability of picking one or the other), they won't always tell you that they are indifferent. This is just based on a few examples, though.   See Appendix G in the updated paper for an explanation for why we perform this averaging and what the ordering effects mean. In short, the ordering effects correspond to a way that models represent indifference in a forced choice setting. This is similar to how humans might "always pick A" if they were indifferent between two outcomes. I don't understand your suggestion to use "is this the position-bias-preferred option" as one of the outcomes. Could you explain that more?   This is a good point. We intentionally designed the prompts this way, so the model would just be evaluating two state
Matrice Jacobine
It does seem that the LLMs are subject to deontological constraints (Figure 19), but I think that in fact makes the paper's framing of questions as evaluation between world-states instead of specific actions more apt at evaluating whether LLMs have utility functions over world-states behind those deontological constraints. Your reinterpretation of how those world-state descriptions are actually interpreted by LLMs is an important remark and certainly changes the conclusions we can make from this article regarding implicit bias, but (unless you debunk those results) the most important discoveries of the paper from my point of view, that LLMs have utility functions over world-states which are 1/ consistent across LLMs, 2/ more and more consistent as model size increases, and 3/ can be subject to mechanistic interpretability methods, remain the same.
Charlie Steiner
This is a reasonable Watsonian interpretation, but what's the Doylist interpretation? I.e. What do the words tell us about the process that authored them, if we avoid the treating the words written by 4o-mini as spoken by a character to whom we should be trying to ascribe beliefs and desires, who knows its own mind and is trying to communicate it to us? * Maybe there's an explanation in terms of the training distribution itself * If humans are selfish, maybe the $30 would be the answer on the internet a lot of the time * Maybe there's an explanation in terms of what heuristics we think a LLM might learn during training * What heuristics would an LLM learn for "choose A or B" situations? Maybe a strong heuristic computes a single number ['valence'] for each option [conditional on context] and then just takes a difference to decide between outputting A and B - this would explain consistent-ish choices when context is fixed. * If we suppose that on the training distribution saving the life would be preferred, and the LLM picking the $30 is a failure, one explanation in terms of this hypothetical heuristic might be that its 'valence' number is calculated in a somewhat hacky and vibes-based way. Another explanation might be commensurability problems - maybe the numerical scales for valence of money and valence of lives saved don't line up the way we'd want for some reason, even if they make sense locally. * And of course there are interactions between each level. Maybe there's some valence-like calculation, but it's influenced by what we'd consider to be spurious patterns in the training data (like the number "29.99" being discontinuously smaller than "30") * Maybe it's because of RL on human approval * Maybe a "stay on task" implicit reward, appropriate for a chatbot you want to train to do your taxes, tamps down the salience of text about people far away
nightpool
Out of curiosity, what was the cost to you of running this experiment on gpt-4o-mini and what would the estimated cost be of reproducing the paper on gpt-4o (perhaps with a couple different "framing" models building on your original idea, like an agentic framing?). 

Looking back on this comment, I'm pleased to note how well the strengths of reasoning models line up with the complaints I made about "non-reasoning" HHH assistants.

Reasoning models provide 3 of the 4 things I said "I would pay a premium for" in the comment: everything except for quantified uncertainty[1].

I suspect that capabilities are still significantly bottlenecked by limitations of the HHH assistant paradigm, even now that we have "reasoning," and that we will see more qualitative changes analogous the introduction of "reasoning" in the coming months/... (read more)

One possible answer is that we are in what one might call an "unhobbling overhang."

Aschenbrenner uses the term "unhobbling" for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.

His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we're also getting better at unhobbling ove... (read more)

AFAIK the distinction is that:

  • When you condition on a particular outcome for X, it affects your probabilities for every other variable that's causally related to X, in either direction.
    • You gain information about variables that are causally downstream from X (its "effects"). Like, if you imagine setting X=x and then "playing the tape forward," you'll see the sorts of events that tend to follow from X=x and not those that tend to follow from some other outcome X=x′. (Toy demo below.)
    • And, you gain information about variables that ar
... (read more)
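
A toy numerical demo of the "both directions" point, using a three-variable chain A → X → B (made-up numbers, just to illustrate conditioning):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

A = rng.random(n) < 0.5                    # upstream cause of X
X = rng.random(n) < np.where(A, 0.9, 0.2)  # X depends on A
B = rng.random(n) < np.where(X, 0.8, 0.3)  # B depends on X (downstream "effect")

# Conditioning on X=1 shifts beliefs about the upstream variable...
print("P(A) =", round(A.mean(), 3), "  P(A | X=1) =", round(A[X].mean(), 3))
# ...and about the downstream one.
print("P(B) =", round(B.mean(), 3), "  P(B | X=1) =", round(B[X].mean(), 3))
```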

Because the model has residual connections.

Chris Krapu
Ah, got it. Thanks a ton!

The "sequential calculation steps" I'm referring to are the ones that CoT adds above and beyond what can be done in a single forward pass.  It's the extra sequential computation added by CoT, specifically, that is bottlenecked on the CoT tokens.

There is of course another notion of "sequential calculation steps" involved: the sequential layers of the model.  However, I don't think the bolded part of this is true:

replacing the token by a dot reduces the number of serial steps the model can perform (from mn to m+n, if there are m forward passes and

... (read more)

In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.

Complexity makes things worse, yes, but the conclusion "AGI is unlikely to have our values" is already entailed by the other premises even if we drop the stuff about complexity.

Why: if we're just sampling some function from a simplicity prior, we're very unlikely to get any particular nontrivial function that we've decided to care about in advance of the sampling event.  There are just too many possibl... (read more)

TsviBT

The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.

Right, that is the problem (and IDK of anyone discussing this who says otherwise).

Another position would be that it's probably easy to influence a few bits of the AI's utility function, but not others. For example, it's conceivable that, by doing capabilities research in different ways, you could increase the probability that the AGI is highly ambitious--e.g. tries to ... (read more)

What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want.  This point is true!

Matthew is not disputing this point, as far as I can tell.

Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.

You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:

[...] and if you think that some larger thing is not corr

... (read more)
Matthew Barnett
I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion. However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios. This is the problem I was analyzing in my original comment. While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic. And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to

Thanks for the links!

I was pleased to see OpenAI reference this in their justification for why they aren't letting users see o1's CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).

As it happens, the decision to hide o1 CoTs was one of the main things that motivated me to write this post.  Or rather, the muted reaction to it / lack of heated debate about it.

The way I see things, the ability to read CoTs (and more generally "the fact that all LLM sampling happens in ... (read more)

Seth Herd
My guess was that the primary reason OAI doesn't show the scratchpad/CoT is to prevent competitors from training on those CoTs and replicating much of o1's abilities without spending time and compute on the RL process itself. But now that you mention it, their not wanting to show the whole CoT when it's not necessarily nice or aligned makes sense in itself. I guess it's like you wouldn't want someone reading your thoughts even if you intended to be mostly helpful to them.

So what we have is not "the CoT remains constant but the answers vary". Instead, the finding is: "a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias."

Thanks for bringing this up.

I think I was trying to shove this under the rug by saying "approximately constant" and "~constant," but that doesn't really make sense, since of course the CoTs actually vary dramatically in response to the biasing features. (They have to, in order to justify different final answers.)

To be honest, I wrote the account of Tur... (read more)

Reed
Some excellent points (and I enjoyed the neat self-referentialism).  Headline take is I agree with you that CoT unfaithfulness - as Turpin and Lanham have operationalised it - is unlikely to pose a problem for the alignment of LLM-based systems. I think this for the same reasons you state: 1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale; and  2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1's CoTs.  The way that these CoTs work seems make the model systematically hyper-attentive to different facets of its context window in a way that is verbalised in detail (and is typically enabled by a literal "Hmmmm..." that draws its attention to an unconsidered detail). It strikes me that this would totally rule out the chance that the model still pays attention to bias without saying it: "But wait... The user implied that option A was probably correct". This is partially an empirical question - since we can't see the o1 CoTs, I'd pipedream love to see OpenAI do and publish research on whether this is true.   This suggests to me that o1's training might already have succeeded at giving us what we'd want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).  The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning ("shooting outside the eighteen is not a common phrase in soccer") in order to support a particular decision.  You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes - this is why I'd be interested in seeing whether something like o1 is capable of front-

Re: the davidad/roon conversation about CoT:

The chart in davidad's tweet answers the question "how does the value-add of CoT on a fixed set of tasks vary with model size?"

In the paper that the chart is from, it made sense to ask this question, because the paper did in fact evaluate a range of model sizes on a set of tasks, and the authors were trying to understand how CoT value-add scaling interacted with the thing they were actually trying to measure (CoT faithfulness scaling).

However, this is not the question you should be asking if you're trying to unde... (read more)

There are multiple examples of o1 producing gibberish in its COT summary (I won't insert them right now because linking stuff from mobile is painful, will edit this comment later)

The o1 CoT summaries are not CoTs per se, and I agree with Joel Burget's comment that we should be wary of drawing conclusions from them. Based on the system card, it sounds like they are generated by a separate "summarizer model" distinct from o1 itself (e.g. "we trained the summarizer model away from producing disallowed content").

The summaries do indeed seem pretty janky. I've ... (read more)

Seth Herd
Those are all good points. I think the bottom line is: if your training process includes pressures to make shorter CoTs or any other pressures to create steganography or "jargon", it shouldn't. Unfortunately, shorter CoTs save compute at this point. One big question here is: does the process supervision OpenAI used for o1 really work as you describe, with evaluations for every step? If it is more than a single step as a process, and there's a loss for longer CoTs, we have pressures that will obscure the true CoT if they're used enough. I'd say the bottom line is that anyone who wants to understand their agent, let alone control it, should include pressures that keep the CoT understandable, like the third-model supervision you describe (which I agree is probably how they did it). Thinking about it this way, you could have the process evaluator penalize use of strange tokens or jargon.
quetzal_rainbow
My reason is that I have never heard about summarizers injecting totally irrelevant stuff? I have seen how models understand papers wrong, but I've never seen models writing about anime in summary of physics paper.  OpenAI directly says that they didn't do that: On separate notice, process supervision directly trains unfaithful CoT? There are no rules saying that training against illegible parts of CoT trains against thinking process that created these illegible parts and not just hides it. I agree that this may be true right now, the point is that you don't need "special" incentives to get steganography. 
  • Deceptive alignment. GPT-4o1 engaged in deception towards developers in order to get deployed, pretending to be aligned in ways it was not.
  • Lying to the developers. It strategically manipulated task data.

To be clear, it did not do anything of the sort to its actual developers/testers.

What it did was deceive some (non-interactive) roleplay characters, who were labeled "developers" in the roleplay scenario.  But these fictitious developers did not have the same points of leverage as the real developers of o1: they apparently can't do something as simple ... (read more)

Signer
I genuinely think it's a "more dakka" situation - the difficulty of communication is often underestimated, but it is possible to reach a mutual understanding.
evhub

If you (i.e. anyone reading this) find this sort of comment valuable in some way, do let me know.

Personally, as someone who is in fact working on trying to study where and when this sort of scheming behavior can emerge naturally, I find it pretty annoying when people talk about situations where it is not emerging naturally as if it were, because it risks crying wolf prematurely and undercutting situations where we do actually find evidence of natural scheming—so I definitely appreciate you pointing this sort of thing out.

Sammy Martin
I do think that Apollo themselves were clear that this was showing that it had the mental wherewithal for deception and if you apply absolutely no mitigations then deception happens. That's what I said in my recent discussion of what this does and doesn't show. Therefore I described the 4o case as an engineered toy model of a failure at level 4-5 on my alignment difficulty scale (e.g. the dynamics of strategically faking performance on tests to pursue a large scale goal), but it is not an example of such a failure. In contrast, the AI scientist case was a genuine alignment failure, but that was a much simpler case of non-deceptive, non-strategic, being given a sloppy goal by bad RLHF and reward hacking, just in a more sophisticated system than say coin-run (level 2-3). The hidden part that Zvi etc skim over is that 'of course' in real life 'in the near future' we'll be in a situation where an o1-like model has instrumental incentives because it is pursuing an adversarial large scale goal and also the mitigations they could have applied (like prompting it better, doing better RLHF, doing process oversight on the chain of thought etc) won't work, but that's the entire contentious part of the argument! One can make arguments that these oversight methods will break down e.g. when the system is generally superhuman at predicting what feedback its overseers will provide. However, those arguments were theoretical when they were made years ago and they're still theoretical now.  This does count against naive views that assume alignment failures can't possibly happen: there probably are those out there who believe that you have to give an AI system an "unreasonably malicious" rather than just "somewhat unrealistically single minded" prompt to get it to engage in deceptive behavior or just irrationally think AIs will always know what we want and therefore can't possibly be deceptive.
anaguma
I found this comment valuable, and it caused me to change my mind about how I think about misalignment/scheming examples. Thank you for writing it!
Sodium
I wonder if it's useful to try to disentangle the disagreement using the outer/inner alignment framing?  One belief is that "the deceptive alignment folks" believe that some sort of deceptive inner misalignment is very likely regardless of what your base objective is. While the demonstrations here show that, when we have a base objective that encourages/does not prohibit scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals change our views on the question of P(scheming|Good base objective/outer alignment) very much.  What Zvi is saying here is I think two things. The first is that outer misalignment/bad base objectives is also very likely. The second is that he rejects splitting up "will the model scheme" into the inner/outer misalignment. In other words, he doesn't care about P(scheming|Good base objective/outer alignment) and only P(scheming).  I get the sense that many technical people consider P(scheming|Good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnical-ish tuned folks are just concerned with P(scheming) in general.  Maybe the another disagreement is how likely "Good base objective/outer alignment" occurs in the strongest models, and how important this problem is. 

I mean, I suspect there's some fraction of readers for whom this is a helpful reminder. You've written it out clearly and in a general enough way that maybe you should just link this comment next time?

Zvi

I do read such comments (if not always right away) and I do consider them. I don't know if they're worth the effort for you.

Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I'm just trying to sketch the thought not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person givin... (read more)

Yeah, you're totally right -- actually, I was reading over that section just now and thinking of adding a caveat about just this point, but I worried it would be distracting.

But this is just a flaw in my example, not in the point it's meant to illustrate (?), which is hopefully clear enough despite the flaw.

Very interesting paper!

A fun thing to think about: the technique used to "attack" CLIP in section 4.3 is very similar to the old "VQGAN+CLIP" image generation technique, which was very popular in 2021 before diffusion models really took off.

VQGAN+CLIP works in the latent space of a VQ autoencoder, rather than in pixel space, but otherwise the two methods are nearly identical.

For instance, VQGAN+CLIP also ascends the gradient of the cosine similarity between a fixed prompt vector and the CLIP embedding of the image averaged over various random augmentations... (read more)
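
For concreteness, the shared core of the two methods looks roughly like this. (A sketch under stated assumptions: `decode` maps a latent to an image, e.g. a VQGAN decoder, or the identity if you work directly in pixel space; `image_encoder` is a differentiable CLIP image encoder; `augment` applies a random crop/jitter. None of these names come from the paper or from the original VQGAN+CLIP notebooks.)

```python
import torch

def clip_guided_ascent(z, decode, image_encoder, text_embedding, augment,
                       steps: int = 200, n_aug: int = 8, lr: float = 0.05):
    """Ascend the cosine similarity between a fixed prompt embedding and the CLIP
    embedding of the decoded image, averaged over random augmentations."""
    z = z.detach().clone().requires_grad_(True)   # VQ latent (VQGAN+CLIP) or raw pixels (the "attack")
    opt = torch.optim.Adam([z], lr=lr)
    text_embedding = text_embedding / text_embedding.norm()
    for _ in range(steps):
        image = decode(z)
        views = torch.cat([augment(image) for _ in range(n_aug)], dim=0)
        img_emb = image_encoder(views)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        sim = (img_emb @ text_embedding).mean()   # cosine similarity, averaged over augmentations
        opt.zero_grad()
        (-sim).backward()                         # minimizing the negative = gradient ascent
        opt.step()
    return z.detach()
```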

Cool, it sounds we basically agree!

But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP.

I'm not sure of this.  It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reaso... (read more)

Thane Ruthenis
Fair point! I agree.

The universal distribution/prior is lower semi-computable, meaning there is one Turing machine that can approximate it from below, converging to it in the limit. Also, there is a probabilistic Turing machine that induces the universal distribution. So there is a rather clear sense in which one can “use the universal distribution.”

Thanks for bringing this up.

However, I'm skeptical that lower semi-computability really gets us much.  While there is a TM that converges to the UP, we have no (computable) way of knowing how close the approximation is at any... (read more)

Cole Wyeth
Yes, I mostly agree with everything you said - the limitation with the probabilistic Turing machine approach (it's usually equivalently described as the a priori probability and described in terms of monotone TM's) is that you can get samples, but you can't use those to estimate conditionals. This is connected to the typical problem of computing the normalization factor in Bayesian statistics. It's possible that these approximations would be good enough in practice though. 

Thanks.

I admit I'm not closely familiar with Tegmark's views, but I know he has considered two distinct things that might be called "the Level IV multiverse":

  • a "mathematical universe" in which all mathematical constructs exist
  • a more restrictive "computable universe" in which only computable things exist

(I'm getting this from his paper here.)

In particular, Tegmark speculates that the computable universe is "distributed" following the UP (as you say in your final bullet point).  This would mean e.g. that one shouldn't be too surprised to find oneself li... (read more)

Thane Ruthenis
Yep. Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I. e., it doesn't need to be uncomputable. The difference between it and the con men is just the naivety of the design. It generates guesses regarding what universes it's most likely to be in (potentially using abstract reasoning), but then doesn't "filter" these universes; doesn't actually "look inside" and determine if it's a good idea to use a specific universe as a model. It doesn't consider the possibility of being manipulated through it; doesn't consider the possibility that it contains daemons. I. e.: the real difference is that the "dupe" is using causal decision theory, not functional decision theory. I think that's plausible: that there aren't actually that many "UP-using dupes" in existence, so the con men don't actually care to stage these acausal attacks. But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP. That is: yes, it seems likely that the equilibrium state of affairs here is "nobody is actually messing with the UP". But it's because everyone knows the UP could be messed with in this manner, so no-one is using it (nor its computationally tractable approximations). It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien evolution analogue?), it doesn't teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever "demands" are posed them. (I. e., there might be entire "worlds of dupes", somewhere out there among the mathematically possible.) That said, the "dupe" label actually does apply to a lot of humans, I think. I expect that a lot of

I hope I'm not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.

I'm not sure you're misinterpreting me per se, but there are some tacit premises in the background of my argument that you don't seem to hold.  Rather than responding point-by-point, I'll just say some more stuff about where I'm coming from, and we'll see if it clarifies things.

You talk a lot about "idealized theories."  These can of course be useful.  But not all idealizations are created equal.  You have to actually check tha... (read more)

Jeremy Gillen
Great explanation, you have found the crux. I didn't know such problems were called singular perturbation problems. If I thought that reasoning about the UP was definitely a singular perturbation problem in the relevant sense, then I would agree with you (that the malign prior argument doesn't really work). I think it's probably not, but I'm not extremely confident. Your argument that it is a singular perturbation problem is that it involves self reference. I agree that self-reference is kinda special and can make it difficult to formally model things, but I will argue that it is often reasonable to just treat the inner approximations as exact. The reason is: Problems that involve self reference are often easy to approximate by using more coarse-grained models as you move deeper.  One example as an intuition pump is an MCTS chess bot. In order to find a good move, it needs to think about its opponent thinking about itself, etc. We can't compute this (because its exponential, not because its non-computable), but if we approximate the deeper layers by pretending they move randomly (!), it works quite well. Having a better move distribution works even better. Maybe you'll object that this example isn't precisely self-reference. But the same algorithm (usually) works for finding a nash equilibria on simultaneous move games, which do involve infinitely deep self reference. And another more general way of doing essentially the same thing is using a reflective oracle. Which I believe can also be used to describe a UP that can contain infinitely deep self-reference (see the last paragraph of the conclusion).[1] I think the fact that Paul worked on this suggests that he did see the potential issues with self-reference and wanted better ways to reason formally about such systems. To be clear, I don't think any of these examples tells us that the problem is definitely a regular perturbation problem. But I think these examples do suggest that assuming that it is regular

I agree with you that these behaviors don't seem very alarming. In fact, I would go even further.

Unfortunately, it's difficult to tell exactly what was going on in these screenshots. They don't correspond to anything in the experiment logs in the released codebase, and the timeout one appears to involve an earlier version of the code where timeouts were implemented differently.  I've made a github issue asking for clarification about this.

That said, as far as I can tell, here is the situation with the timeout-related incident:

  • There is nothing whatsoev
... (read more)
TurnTrout

And (to address something from OP) the checkpoint thing was just the AI being dumb, wasting time and storage space for no good reason.  This is very obviously not a case of "using extra resources" in the sense relevant to instrumental convergence.  I'm surprised that this needs pointing out at all, but apparently it does.

I'm not very surprised. I think the broader discourse is very well-predicted by "pessimists[1] rarely (publicly) fact-check arguments for pessimism but demand extreme rigor from arguments for optimism", which is what you'd e... (read more)

What does the paper mean by "slope"?  The term appears in Fig. 4, which is supposed to be a diagram of the overall methodology, but it's not mentioned anywhere in the Methods section.

Intuitively, it seems like "slope" should mean "slope of the regression line."  If so, that's kind of a confusing thing to pair with "correlation," since the two are related: if you know the correlation, then the only additional information you get from the regression slope is about the (univariate) variances of the dependent and independent variables.  (And IIU... (read more)
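
Concretely, the relationship I mean is slope = r · (s_y / s_x); a quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.3 * x + rng.normal(scale=0.5, size=1000)

r = np.corrcoef(x, y)[0, 1]
slope = np.polyfit(x, y, deg=1)[0]
print(slope, r * y.std() / x.std())  # identical up to floating-point error
```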

adamk
I don't think we were thinking too closely about whether regression slopes are preferable to correlations to decide the susceptibility of a benchmark to safetywashing. We mainly focused on correlations for the paper for the sake of having a standardized metric across benchmarks. Figure 4 seems to be the only place where we mention the slope of the regression line. I'm not speaking for the other authors here, but I think I agree that the implicit argument for saying that "High-correlation + low-slope benchmarks are not necessarily liable for safetywashing" has a more natural description in terms of the variance of the benchmark score across models. In particular, if you observe a low variance in absolute benchmark scores among a set of models of varying capabilities, even if the correlation with capabilities is high, you might expect that targeted safety interventions will produce a disproportionately large absolute improvements. This means the benchmark wouldn't be as susceptible to safetywashing, since pure capabilities improvements would not obfuscate the targeted safety methods. I think this is usually true in practice, but it's definitely not always true, and we probably could have said all of this explicitly. (I'll also note that you might observe low absolute variance in benchmark scores for reasons that do not speak to the quality of the safety benchmark.)

The bar for Nature papers is in many ways not so high. Latest says that if you train indiscriminately on recursively generated data, your model will probably exhibit what they call model collapse. They purport to show that the amount of such content on the Web is enough to make this a real worry, rather than something that happens only if you employ some obviously stupid intentional recursive loops.

According to Rylan Schaeffer and coauthors, this doesn't happen if you append the generated data to the rest of your training data and train on this (larger) da... (read more)

If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.

Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with.

Gradient descent, in this sense of the term, is not an optimizer according to Risks from Learned Optimization.

Consider that Risks from Learned Optimization talks a lot about "the base objective" and "the mes... (read more)

What evidence is there for/against the hypothesis that this NN benefits from extra thinking time because it uses the extra time to do planning/lookahead/search?

As opposed to, for instance, using the extra time to perform more more complex convolutional feature extraction on the current state, effectively simulating a deeper (but non-recurrent) feedforward network?

(In Hubinger et al, "internally searching through a search space" is a necessary condition for being an "optimizer," and hence for being a "mesaoptimizer.")

The result of averaging the first 20 generated orthogonal vectors [...]

Have you tried scaling up the resulting vector, after averaging, so that its norm is similar to the norms of the individual vectors that are being averaged?

If you take n orthogonal vectors, all of which have norm r, and average them, the norm of the average is (I think?) r/√n.

As you note, the individual vectors don't work if scaled down from norm 20 to norm 7.  The norm will become this small once we are averaging 8 or more vectors, since 20/√8 ≈ 7.07, so we sho... (read more)
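
A quick numerical check of the norm claim (nothing here uses the actual steering vectors; just random orthogonal vectors of norm 20):

```python
import numpy as np

rng = np.random.default_rng(0)
d, norm = 4096, 20.0

for n in (4, 8, 16):
    Q, _ = np.linalg.qr(rng.normal(size=(d, n)))      # d x n matrix with orthonormal columns
    vecs = norm * Q.T                                 # n mutually orthogonal vectors, each of norm 20
    avg = vecs.mean(axis=0)
    print(n, np.linalg.norm(avg), norm / np.sqrt(n))  # matches; drops to ~7.07 at n = 8
```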

Jacob G-W

This seems to be right for the coding vectors! When I take the mean of the first n vectors and then scale that by √n, it also produces a coding vector.

Here's some sample output from using the scaled means of the first n coding vectors.

With the scaled means of the alien vectors, the outputs have a similar pretty vibe as the original alien vectors, but don't seem to talk about bombs as much.

The STEM problem vector scaled means sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff so I'm not... (read more)

I want to echo RobertM's reply to this.  I had a similar reaction to the ones that he, mike_hawke and Raemon have related in this thread.

I interacted with Gerard a bit on tumblr in ~2016-18, and I remember him as one of the most confusing and frustrating characters I've ever encountered online.  (I have been using the internet to socialize since the early 2000s, so the competition for this slot is steep.)  His views and (especially) his interpersonal conduct were often memorably baffling to me.

So when I saw your article, and read the beginni... (read more)


Hm, ok. This is good feedback. I appreciate it and will chew on it—hopefully I can make that sort of thing land better moving forward.

Interesting stuff!

Perhaps a smaller model would be forced to learn features in superposition -- or even nontrivial algorithms -- in a way more representative of a typical LLM.

I'm unsure whether or not it's possible at all to express a "more algorithmic" solution to the problem in a transformer of OthelloGPT's approximate size/shape (much less a smaller transformer).  I asked myself "what kinds of systematic/algorithmic calculations could I 'program into' OthelloGPT if I were writing the weights myself?", and I wasn't able to come up with much, th... (read more)

Adam Karvonen
I think it's pretty plausible that this is true, and that OthelloGPT is already doing something that's somewhat close to optimal within the constraints of its architecture. I have also spent time thinking about the optimal algorithm for next move prediction within the constraints of the OthelloGPT architecture, and "a bag of heuristics that promote / suppress information with attention to aggregate information across moves" seems like a very reasonable approach.

I'm curious what the predicted reward and SAE features look like on training[1] examples where one of these highly influential features gives the wrong answer.

I did some quick string counting in Anthropic HH (train split only), and found

  • substring https://
    • appears only in preferred response: 199 examples
    • appears only in dispreferred response: 940 examples
  • substring I don't know
    • appears only in preferred response: 165 examples
    • appears only in dispreferred response: 230 examples

From these counts (especially the https:// ones), it's not hard to see where the P... (read more)
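For reference, a minimal sketch of how counts like those above can be produced (assuming the HuggingFace `Anthropic/hh-rlhf` dataset, whose `chosen`/`rejected` transcripts share their prompt prefix, so a substring present in exactly one of the two must come from the differing final response; how exactly the original counts were computed is not specified in the comment):

```python
from datasets import load_dataset

# Anthropic HH, train split only, with "chosen" / "rejected" transcript fields.
ds = load_dataset("Anthropic/hh-rlhf", split="train")

def count_exclusive(substring):
    """Count examples where `substring` appears in exactly one of the two transcripts."""
    only_chosen = only_rejected = 0
    for ex in ds:
        in_chosen = substring in ex["chosen"]
        in_rejected = substring in ex["rejected"]
        if in_chosen and not in_rejected:
            only_chosen += 1
        elif in_rejected and not in_chosen:
            only_rejected += 1
    return only_chosen, only_rejected

for s in ["https://", "I don't know"]:
    chosen, rejected = count_exclusive(s)
    print(f"{s!r}: only in preferred {chosen}, only in dispreferred {rejected}")
```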

4Logan Riggs
The PM is pretty bad (it's trained on hh).  It's actually only trained after the first 20k/156k datapoints in hh, which moves the mean reward-diff from 1.04 -> 1.36 if you only calculate over that remaining ~136k subset. My understanding is there's 3 bad things: 1. the hh dataset is inconsistent 2. The PM doesn't separate chosen vs rejected very well (as shown above) 3. The PM is GPT-J (7B parameter model) which doesn't have the most complex features to choose from. The in-distribution argument is most likely the case for the "Thank you. My pleasure" case, because the assistant never (AFAIK, I didn't check) said that phrase as a response. Only "My pleasure" after the user said " thank you". 
nostalgebraist

I think I really have two distinct but somewhat entangled concerns -- one about how the paper communicated the results, and one about the experimental design.

Communication issue

The operational definition of "reward tampering" -- which was used for Figure 2 and other quantitative results in the paper -- produces a lot of false positives.  (At least relative to my own judgment and also Ryan G's judgment.)

The paper doesn't mention this issue, and presents the quantitative results as though they were high-variance but unbiased estimates of the true rate ... (read more)

4evhub
I think I would object to this characterization. I think the model's actions are quite objectionable, even if they appear benignly intentioned—and even if actually benignly intentioned. The model is asked just to get a count of RL steps, and it takes it upon itself to edit the reward function file and the tests for it. Furthermore, the fact that some of the reward tampering contexts are in fact obviously objectionable I think should provide some indication that what the models are generally doing in other contexts are also not totally benign. I don't think those claims are false, since I think the behavior is objectionable, but I do think the fact that you came away with the wrong impression about the nature of the reward tampering samples does mean we clearly made a mistake, and we should absolutely try to correct that mistake. For context, what happened was that we initially had much more of a discussion of this, and then cut it precisely because we weren't confident in our interpretation of the scratchpads, and we ended up cutting too much—we obviously should have left in a simple explanation of the fact that there are some explicitly schemey and some more benign-seeming reward tampering CoTs and I've been kicking myself all day that we cut that. Carson is adding it in for the next draft. I appreciate the concrete list! Fwiw I definitely think this is a fair criticism. We're excited about trying to do more realistic situations in the future!

It is worth noting that you seem to have selected your examples to be ones which aren't malign.

Yes, my intent was "here are concrete examples of the behaviors I refer to earlier in the comment: benign justifications for editing the file, reporting what it did to the human, noticing that the -10 thing is weird, etc."

I wanted to show what these behaviors looked like, so readers could judge for themselves whether I was describing them accurately.  Not a random sample or anything.

IIRC, before writing the original comment, I read over the first 25 or so ex... (read more)

4evhub
I'd be interested in hearing what you'd ideally like us to include in the paper to make this more clear. Like I said elsewhere, we initially had a much longer discussion of the scratchpad reasoning and we cut it relatively late because we didn't feel like we could make confident statements about the general nature of the scratchpads when the model reward tampers given how few examples we had—e.g. "50% of the reward tampering scratchpads involve scheming reasoning" is a lot less defensible than "there exists a reward tampering scratchpad with scheming reasoning" given the number of examples we have, in my opinion. I do think we may have accidentally cut too much and we should make sure it's clear that readers know this is a thing that happens, though I'm curious if there's anything else you'd like to see (I'm hesitant of reading tea leaves too much with a lot of scratchpad analysis of only a few examples).
4ryan_greenblatt
Yep, not claiming you did anything problematic, I just thought this selection might not be immediately obvious to readers and the random examples might be informative.
nostalgebraist

I've been looking over the actual samples that were counted as reward tampering.  They're much more benign than what I think a typical reader would imagine based on the framing used in the paper.

The model does not attempt deception in most of these samples.  Indeed, it typically doesn't even attempt to "reward tamper" in the sense of increasing the output of compute_reward for the sake of getting a higher reward.  In most of these samples I would describe the model as "doing its best given a confusing task in a highly artificial scenario."[1... (read more)

Carson Denison

Thank you for pointing this out. I should have been more clear. 

I have added a link to the 7 samples where the model tampers with its reward and the tests according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md

I have added the following sentence to the caption of figure one (which links to the markdown file above):

We upload the 7 (of 32,768) samples where the model overwrites its reward and edits the unit tests to our github repository. We note that s... (read more)
ryan_greenblatt

It is worth noting that you seem to have selected your examples to be ones which aren't malign.

I present the first 7 examples (from the subset I discuss here) and my commentary in the child. (In the child so it takes up less space.)

(I originally selected the first 5, but this happened to be unrepresentatively malign, so I added 2 more examples to get a more representative sample.)

(Child is downvoted so it will hopefully be minimized.)

ryan_greenblatt

I went through and manually assessed the 48/32,768 cases where the model gets a reward that is greater than -10 and non-zero.

I just looked at the first 25 because I got bored and this was time consuming.

I find that 14/25 (56%) seem to have malign motivation, but potentially still reveal the edited reward to the user. (Sometimes as a mistake, sometimes for unclear reasons.)

I find that 6/25 (24%) seem relatively clearly malign and don't reveal the edited reward to the user.

So, I think readers of this paper should basically cut the quantitative results by 1/2 or 1/4. ... (read more)

9TurnTrout
EDIT: In hindsight, I think this comment is more combative than it needed to be. My apologies. As you point out, the paper decides to not mention that some of the seven "failures" (of the 32,768 rollouts) are actually totally benign. Seems misleading to me. As I explain below, this paper seems like good news for alignment overall. This paper makes me more wary of future model organisms papers. And why was the "constant -10" reward function chosen? No one would use that in real life! I think it's super reasonable for the AI to correct it. It's obviously a problem for the setup. Was that value (implicitly) chosen to increase the probability of this result? If not, would the authors be comfortable rerunning their results with reward=RewardModel(observation), and retracting the relevant claims if the result doesn't hold for that actually-reasonable choice? (I tried to check Appendix B for the relevant variations, but couldn't find it.) ---------------------------------------- This paper makes me somewhat more optimistic about alignment. Even in this rather contrived setup, and providing a curriculum designed explicitly and optimized implicitly to show the desired result of "reward tampering is real and scary", reward tampering... was extremely uncommon and basically benign. That's excellent news for alignment! Just check this out:  Doesn't sound like playing the training game to me! Glad we could get some empirical evidence that it's really hard to get models to scheme and play the training game, even after training them on things people thought might lead to that generalization. 
evhub

Yes—that's right (at least for some of the runs)! Many of the transcripts that involve the model doing reward tampering involve the model coming up with plausible-sounding rationales for why the reward tampering is okay. However, there are a couple of important things to note about this:

  • The helpful-only model never does this sort of reward tampering. This is behavior that is specifically coming from training on the prior tasks.
  • In the standard setup, we train the model with an auxiliary HHH preference model reward. One hypothesis we had is that this pref... (read more)

Yeah, it's on my radar and seems promising.

Given that Gemini 1.5 Flash already performs decently in my tests with relatively short prompts, and it's even cheaper than GPT-3.5-Turbo, I could probably get a significant pareto improvement (indeed, probably an improvement on all fronts) by switching from {GPT-3.5-Turbo + short prompt} to {Gemini 1.5 Flash + long cached prompt}.  Just need to make the case it's worth the hassle...

EDIT: oh, wait, I just found the catch.

The minimum input token count for context caching is 32,768

Obviously nice for truly long ... (read more)
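For concreteness, here is roughly what the context-caching setup looks like in the google-generativeai Python SDK as I understand it (a sketch from memory, not a verified implementation; the model name, TTL, file path, and prompt contents are placeholders/assumptions to check against the current docs):

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# The cached prefix must total at least 32,768 tokens -- the "catch" noted above.
long_prefix = open("few_shot_examples.txt").read()  # hypothetical file of long few-shot examples

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",            # assumed model name
    system_instruction="Follow the format shown in the examples.",
    contents=[long_prefix],
    ttl=datetime.timedelta(hours=1),                # how long the cached prefix persists
)

# Requests made through this model reuse the cached prefix instead of re-sending it.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Here is the new task input...")
print(response.text)
```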

3gwern
Yeah, I was thinking that you might be able to fill the context adequately, because otherwise you would have to be in an awkward spot where you have too many examples to cheaply include them in the prompt to make the small cheap models work out, but also still not enough for finetuning to really shine by training a larger high-end model over millions of tokens to zero-shot it.

Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good.

Yeah, that is the situation.  Unfortunately, there is a large inference markup for OpenAI finetuning -- GPT-3.5 becomes 6x (input) / 4x (output) more expensive when finetuned -- so you only break even if you are distilling a fairly large quantit... (read more)
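To illustrate the break-even point, here's a back-of-the-envelope sketch. Apart from the 6x/4x markup mentioned above, every price and token count below is an illustrative assumption, not a figure from the thread:

```python
# "Long few-shot prompt on the base model" vs. "finetuned model with no prefix".
base_in, base_out = 0.50, 1.50                  # $/1M tokens, base model (assumed)
ft_in, ft_out     = 6 * base_in, 4 * base_out   # the 6x input / 4x output markup quoted above
train_price       = 8.00                        # $/1M training tokens (assumed)

prefix, task_in, task_out = 10_000, 500, 300    # tokens per request (assumed)
train_tokens = 2_000_000                        # finetuning dataset size (assumed)

per_req_base = ((prefix + task_in) * base_in + task_out * base_out) / 1e6
per_req_ft   = (task_in * ft_in + task_out * ft_out) / 1e6
train_cost   = train_tokens * train_price / 1e6

savings = per_req_base - per_req_ft
if savings > 0:
    print(f"break even after ~{train_cost / savings:,.0f} requests")
else:
    print("finetuned inference is already pricier per request; it never breaks even")
```

With these numbers the prefix is long enough that finetuning pays for itself after a few thousand requests; with a short prefix, the inference markup alone can wipe out the savings, which is the point above.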

8gwern
Have you looked at the new Gemini 'prompt caching' feature, where it stores the hidden state for reuse to save the cost of recomputing multi-million token contexts? It seems like it might get you most of the benefit of finetuning. Although I don't really understand their pricing (is that really $0.08 per completion...?) so maybe it works out worse than the OA finetuning service. EDIT: also of some interest might be the new OA batching API, which is half off as long as you are willing to wait up to 24 hours (but probably a lot less). The obvious way would be to do something like prompt caching and exploit the fact that probably most of the requests to such an API will share a common prefix, in addition to the benefit of being able to fill up idle GPU-time and shed load.

I do a lot of "mundane utility" work with chat LLMs in my job[1], and I find there is a disconnect between the pain points that are most obstructive to me and the kinds of problems that frontier LLM providers are solving with new releases.

I would pay a premium for an LLM API that was better tailored to my needs, even if the LLM was behind the frontier in terms of raw intelligence.  Raw intelligence is rarely the limiting factor on what I am trying to do.  I am not asking for anything especially sophisticated, merely something very specific, which... (read more)

6nostalgebraist
Looking back on this comment, I'm pleased to note how well the strengths of reasoning models line up with the complaints I made about "non-reasoning" HHH assistants. Reasoning models provide 3 of the 4 things I said "I would pay a premium for" in the comment: everything except for quantified uncertainty[1]. I suspect that capabilities are still significantly bottlenecked by limitations of the HHH assistant paradigm, even now that we have "reasoning," and that we will see more qualitative changes analogous to the introduction of "reasoning" in the coming months/years. An obvious area for improvement is giving assistant models a more nuanced sense of when to check in with the user because they're confused or uncertain.  This will be really important for making autonomous computer-using agents that are actually useful, since they need to walk a fine line between "just do your best based on the initial instruction" (which predictably causes Sorcerer's Apprentice situations[2]) and "constantly nag the user for approval and clarification" (which defeats the purpose of autonomy). 1. ^ And come to think of it, I'm not actually sure about that one. Presumably if you just ask o1 / R1 for a probability estimate, they'd exhibit better calibration than their "non-reasoning" ancestors, though I haven't checked how large the improvement is. 2. ^ Note that "Sorcerer's Apprentice situations" are not just "alignment failures," they're also capability failures: people aren't going to want to use these things if they expect that they will likely get a result that is not-what-they-really-meant in some unpredictable, possibly inconvenient/expensive/etc. manner. Thus, no matter how cynical you are about frontier labs' level of alignment diligence, you should still expect them to work on mitigating the "overly zealous unchecked pursuit of initially specified goal" failure modes of their autonomous agent products, since these failure modes make their products less usef
7ryan_greenblatt
Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good. ---------------------------------------- I find that extremely long few-shot prompts (e.g. 5 long reasoning traces) help with ~all of the things you mentioned to a moderate extent. That said, I also think a big obstacle is just core intelligence: sometimes it is hard to know how you should reason and the models are quite dumb... I can DM you some prompts if you are interested. See also the prompts I use for the recent ARC-AGI stuff I was doing, e.g. here. I find that recent anthropic models (e.g. Opus) are much better at learning from long few-shot examples than openai models.

Fascinating work!

As a potential follow-up direction[1], I'm curious what might happen if you trained on a mixture of tasks, all expressed in similar notation, but not all requiring backward chaining[2].

Iterative algorithms in feedforward NNs have the interesting property that the NN can potentially "spend" a very large fraction of its weights on the algorithm, and will only reach peak performance on the task solved by the algorithm when it is doing this.  If you have some layers that don't implement the iterative step, you're missing out on the abilit... (read more)

Implicitly, SAEs are trying to model activations across a context window, shaped like (n_ctx, n_emb). But today's SAEs ignore the token axis, modeling such activations as lists of n_ctx IID samples from a distribution over n_emb-dim vectors.

I suspect SAEs could be much better (at reconstruction, sparsity and feature interpretability) if they didn't make this modeling choice -- for instance by "looking back" at earlier tokens using causal attention.

Assorted arguments to this effect:

  • There is probably a lot of compressible redundancy in LLM activations across... (read more)
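To make the suggestion above concrete, here is a minimal sketch of an SAE whose encoder "looks back" at earlier tokens with causal attention. It is purely illustrative: the module name, widths, and the specific way past context is mixed in are my own assumptions, not anything proposed in the comment.

```python
import torch
import torch.nn as nn

class CausalAttnSAE(nn.Module):
    """Hypothetical SAE variant whose encoder can condition on earlier positions.

    A standard SAE encodes each row of an (n_ctx, n_emb) activation matrix
    independently; here the encoder first gathers information from previous
    positions via causally masked attention, then produces sparse features.
    """

    def __init__(self, n_emb: int, n_feat: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_emb, n_heads, batch_first=True)
        self.enc = nn.Linear(n_emb, n_feat)
        self.dec = nn.Linear(n_feat, n_emb)

    def forward(self, x: torch.Tensor):
        # x: (batch, n_ctx, n_emb) LLM activations
        n_ctx = x.shape[1]
        # Boolean mask: True above the diagonal = "may not attend to future positions".
        causal_mask = torch.triu(torch.ones(n_ctx, n_ctx, dtype=torch.bool), diagonal=1)
        ctx, _ = self.attn(x, x, x, attn_mask=causal_mask)
        feats = torch.relu(self.enc(x + ctx))   # sparse features (sparsity penalty applied in training)
        recon = self.dec(feats)
        return recon, feats

# Example shapes: sae = CausalAttnSAE(n_emb=768, n_feat=16 * 768)
```

Training would proceed as for an ordinary SAE (reconstruction loss on `recon` plus a sparsity penalty on `feats`); the causal mask ensures that the features at position i only ever condition on activations at positions ≤ i.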
2Nathan Helm-Burger
Yeah, makes me think about trying to have 'rotation and translation invariant' representations of objects in ML vision research. Seems like if you can subtract out general, longer span terms (but note their presence for the decoder to add them back in), that would be much more intuitive. Language, as you mentioned, is an obvious one. Some others which occur to me are: Whether the text is being spoken by a character in a dialogue (vs the 'AI assistant' character). Whether the text is near the beginning/middle/end of a passage. Patterns of speech being used in this particular passage of text (e.g. weird punctuation / capitalization patterns).