Very impressive! At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various "cool-looking patterns" that can be extracted from activations.
I'm curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning.
For example, you discuss an "obscured...
If I understand what you're saying here, it's true but fairly well-known? See e.g. footnote 26 of the post "Simulators."
My favorite way of looking at this is:
The usual intuitive view of causal attention is that it's an operation that "looks back" at earlier positions. At each position i, it computes a "query" based on information from position i, and this query is used to search over "keys and values" computed at positions i-1, i-2, etc. (as well as i itself).
OK, so at each position, attention computes a query. What makes a query "good"? ...
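The "looking back" picture above can be sketched in a few lines. This is a toy single-head implementation (my own names and shapes, not any particular library's), just to make the query/key/value roles concrete:

```python
import numpy as np

def causal_attention(x, W_q, W_k, W_v):
    """Single-head causal self-attention over a sequence x of shape (T, d_model).

    At each position i, a query is computed from x[i] and matched against
    keys from positions 0..i only (the causal mask hides later positions).
    """
    q = x @ W_q                      # (T, d_k)  one query per position
    k = x @ W_k                      # (T, d_k)  keys
    v = x @ W_v                      # (T, d_v)  values
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (T, T) query-key match scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf           # position i cannot "look forward" past i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over visible positions
    return weights @ v               # weighted sum of values, per position
```

One consequence of the mask: editing a later position's input leaves all earlier positions' outputs untouched, which is exactly the "looks back" property.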
ICYMI, the same argument appears in the METR paper itself, in section 8.1 under "AGI will have 'infinite' horizon length."
The argument makes sense to me, but I'm not totally convinced.
In METR's definition, they condition on successful human task completion when computing task durations. This choice makes sense in their setting for reasons they discuss in B.1.1, but things would get weird if you tried to apply it to extremely long/hard tasks.
If a typical time-to-success for a skilled human at some task is ~10 years, then the task is probably so ambiti...
Originally known as "past cache," after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019; see commit ffd6238. The invention has not been described in the literature AFAIK, and it's entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this.
KV caching (using the terminology "fast decoding" and "cache") existed even in the original "Attention is All You Need" implementation of an enc-dec transformer. It was added on Sep 21 2017...
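For concreteness, here is a minimal sketch of the caching trick itself — not how any particular library implements it, and the names/shapes are my own. The point is that at each decoding step, keys and values are computed only for the newest position, while earlier positions' K/V are reused:

```python
import numpy as np

def attend(q, K, V):
    """Attention for a single query q against cached keys/values."""
    s = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def decode_with_cache(tokens_emb, W_q, W_k, W_v):
    """Decode one position at a time, caching past keys/values.

    Each step computes K/V only for the newest position; everything
    earlier is reused from the cache instead of being recomputed.
    """
    K_cache, V_cache, outputs = [], [], []
    for x_t in tokens_emb:                    # x_t: (d_model,), one new position
        K_cache.append(x_t @ W_k)             # cache grows by one row per step
        V_cache.append(x_t @ W_v)
        q_t = x_t @ W_q
        outputs.append(attend(q_t, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outputs)
```

Because causal attention at position i only ever reads keys/values from positions ≤ i, the cached run produces exactly the same outputs as recomputing full attention from scratch at every step.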
Your list of "actual arguments" against explosive growth seems to be missing the one that is by far the most important/convincing IMO, namely Baumol effects.
This argument has been repeatedly brought up by growth economists in earlier rounds from the AI-explosive-growth debate. So rather than writing my own version of this argument, I'll just paste some quotes below.
As far as I can tell, the phenomenon discussed in these quotes is excluded by construction from the GATE model: while it draws a distinction between different "tasks" on the production sid...
The list doesn't exclude Baumol effects, as these are just the implication of:
- Physical bottlenecks and delays prevent growth. Intelligence only goes so far.
- Regulatory and social bottlenecks prevent growth this fast, INT only goes so far.
Like, Baumol effects are just some area of the economy with more limited growth bottlenecking the rest of the economy. So we might as well just directly name the bottleneck.
Your argument seems to imply you think there might be some other bottleneck like:
...
- There will be some cognitive labor sector of the economy which
This is a very low-quality paper.
Basically, the paper does the following:
[operand 1][operator][operand 2], e.g. 1+2 or 3*5
Here's why I'm wary of this kind of argument:
First, we know that labs are hill-climbing on benchmarks.
Obviously, this tends to inflate model performance on the specific benchmark tasks used for hill-climbing, relative to "similar" but non-benchmarked tasks.
More generally and insidiously, it tends to inflate performance on "the sort of things that are easy to measure with benchmarks," relative to all other qualities that might be required to accelerate or replace various kinds of human labor.
If we suppose that amenability-to-benchmarking correlates with var...
But given the choice between "nice-sounding but false" and "bad-sounding but true," it seems possible that users' companies would, in principle, prefer true reasoning to false reasoning, perhaps especially because it makes it easier to spot issues when working with LLMs. E.g. maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.
This definitely aligns with my own experience so far.
On the day Claude 3.7 Sonnet was announced, I happened to be in the middle of a frustrating struggle with o3...
The quoted sentence is about what people like Dario Amodei, Miles Brundage, and @Daniel Kokotajlo predict that AI will be able to do by the end of the decade.
And although I haven't asked them, I would be pretty surprised if I were wrong here, hence "surely."
In the post, I quoted this bit from Amodei:
...It can engage in any actions, communications, or remote operations enabled by this interface, including taking actions on the internet, taking or giving directions to humans, ordering materials, directing experiments, watching videos, making videos, and so on.
The discourse around this model would benefit a lot from (a greater number of) specific examples where the GPT-4.5 response is markedly and interestingly different from the response of some reference model.
Karpathy's comparisons are a case in point (of the absence I'm referring to). Yes, people are vehemently disputing which responses were better, and whether the other side has "bad taste"... but if you didn't know what the context was, the most obvious property of the pairs would be how similar they are.
And how both options are bad (unfunny standup,...
Consider the following comparison prompt, which is effectively what all the prompts in the terminal illness experiment are [...]
I think this pretty clearly implies mutual exclusivity, so I think the interpretation problem you're worried about may be nonexistent for this experiment.
Wait, earlier, you wrote (my emphasis):
...We intentionally designed the prompts this way, so the model would just be evaluating two states of the world implied by hearing the news (similar to belief distributions in a POMDP setting). The comparison prompt is not designed to be mutually
Thank you for the detailed reply!
I'll respond to the following part first, since it seems most important to me:
...We intentionally designed the prompts this way, so the model would just be evaluating two states of the world implied by hearing the news (similar to belief distributions in a POMDP setting). The comparison prompt is not designed to be mutually exclusive; rather, we intended for the outcomes to be considered relative to an assumed baseline state.
For example, in the terminal illness experiment, we initially didn't have the "who would otherwise die"
There's a math paper by Ghrist, Gould and Lopez which was produced with a nontrivial amount of LLM assistance, as described in its Appendix A and by Ghrist in this thread (but see also this response).
The LLM contributions to the paper don't seem especially impressive. The presentation is less "we used this technology in a real project because it saved us time by doing our work for us," and more "we're enthusiastic and curious about this technology and its future potential, and we used it in a real project because we're enthusiasts who use it in whatever we...
If this kind of approach to mathematics research becomes mainstream, out-competing humans working alone, that would be pretty convincing. So there is nothing that disqualifies this example - it does update me slightly.
However, this example on its own seems unconvincing for a couple of reasons:
Interesting paper. There is definitely something real going on here.
I reproduced some of the results locally using the released code and tried some variants on them as well.
Based on my findings, I think these results – particularly the numerical magnitudes as opposed to rankings – are heavily influenced by the framing of the question, and that the models often aren't interpreting your prompt in the way the paper (implicitly) does.
tl;dr:
Looking back on this comment, I'm pleased to note how well the strengths of reasoning models line up with the complaints I made about "non-reasoning" HHH assistants.
Reasoning models provide 3 of the 4 things I said "I would pay a premium for" in the comment: everything except for quantified uncertainty[1].
I suspect that capabilities are still significantly bottlenecked by limitations of the HHH assistant paradigm, even now that we have "reasoning," and that we will see more qualitative changes analogous to the introduction of "reasoning" in the coming months/...
One possible answer is that we are in what one might call an "unhobbling overhang."
Aschenbrenner uses the term "unhobbling" for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.
His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we're also getting better at unhobbling ove...
AFAIK the distinction is that:
Because the model has residual connections.
The "sequential calculation steps" I'm referring to are the ones that CoT adds above and beyond what can be done in a single forward pass. It's the extra sequential computation added by CoT, specifically, that is bottlenecked on the CoT tokens.
There is of course another notion of "sequential calculation steps" involved: the sequential layers of the model. However, I don't think the bolded part of this is true:
...replacing the token by a dot reduces the number of serial steps the model can perform (from mn to m+n, if there are m forward passes and
In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.
Complexity makes things worse, yes, but the conclusion "AGI is unlikely to have our values" is already entailed by the other premises even if we drop the stuff about complexity.
Why: if we're just sampling some function from a simplicity prior, we're very unlikely to get any particular nontrivial function that we've decided to care about in advance of the sampling event. There are just too many possibl...
The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.
Right, that is the problem (and IDK of anyone discussing this who says otherwise).
Another position would be that it's probably easy to influence a few bits of the AI's utility function, but not others. For example, it's conceivable that, by doing capabilities research in different ways, you could increase the probability that the AGI is highly ambitious--e.g. tries to ...
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want. This point is true!
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.
You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:
...[...] and if you think that some larger thing is not corr
Thanks for the links!
I was pleased to see OpenAI reference this in their justification for why they aren't letting users see o1's CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).
As it happens, the decision to hide o1 CoTs was one of the main things that motivated me to write this post. Or rather, the muted reaction to it / lack of heated debate about it.
The way I see things, the ability to read CoTs (and more generally "the fact that all LLM sampling happens in ...
So what we have is not "the CoT remains constant but the answers vary". Instead, the finding is: "a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias."
Thanks for bringing this up.
I think I was trying to shove this under the rug by saying "approximately constant" and "~constant," but that doesn't really make sense, since of course the CoTs actually vary dramatically in response to the biasing features. (They have to, in order to justify different final answers.)
To be honest, I wrote the account of Tur...
Re: the davidad/roon conversation about CoT:
The chart in davidad's tweet answers the question "how does the value-add of CoT on a fixed set of tasks vary with model size?"
In the paper that the chart is from, it made sense to ask this question, because the paper did in fact evaluate a range of model sizes on a set of tasks, and the authors were trying to understand how CoT value-add scaling interacted with the thing they were actually trying to measure (CoT faithfulness scaling).
However, this is not the question you should be asking if you're trying to unde...
There are multiple examples of o1 producing gibberish in its COT summary (I won't insert them right now because linking stuff from mobile is painful, will edit this comment later)
The o1 CoT summaries are not CoTs per se, and I agree with Joel Burget's comment that we should be wary of drawing conclusions from them. Based on the system card, it sounds like they are generated by a separate "summarizer model" distinct from o1 itself (e.g. "we trained the summarizer model away from producing disallowed content").
The summaries do indeed seem pretty janky. I've ...
- Deceptive alignment. GPT-4o1 engaged in deception towards developers in order to get deployed, pretending to be aligned in ways it was not.
- Lying to the developers. It strategically manipulated task data.
To be clear, it did not do anything of the sort to its actual developers/testers.
What it did was deceive some (non-interactive) roleplay characters, who were labeled "developers" in the roleplay scenario. But these fictitious developers did not have the same points of leverage as the real developers of o1: they apparently can't do something as simple ...
If you (i.e. anyone reading this) find this sort of comment valuable in some way, do let me know.
Personally, as someone who is in fact working on trying to study where and when this sort of scheming behavior can emerge naturally, I find it pretty annoying when people talk about situations where it is not emerging naturally as if it were, because it risks crying wolf prematurely and undercutting situations where we do actually find evidence of natural scheming—so I definitely appreciate you pointing this sort of thing out.
I mean, I suspect there's some fraction of readers for whom this is a helpful reminder. You've written it out clearly and in a general enough way that maybe you should just link this comment next time?
I do read such comments (if not always right away) and I do consider them. I don't know if they're worth the effort for you.
Briefly, I do not think these two things I am presenting here are in conflict. In plain metaphorical language (so none of the nitpicks about word meanings, please, I'm just trying to sketch the thought not be precise): It is a schemer when it is placed in a situation in which it would be beneficial for it to scheme in terms of whatever de facto goal it is de facto trying to achieve. If that means scheming on behalf of the person givin...
Yeah, you're totally right -- actually, I was reading over that section just now and thinking of adding a caveat about just this point, but I worried it would be distracting.
But this is just a flaw in my example, not in the point it's meant to illustrate (?), which is hopefully clear enough despite the flaw.
Very interesting paper!
A fun thing to think about: the technique used to "attack" CLIP in section 4.3 is very similar to the old "VQGAN+CLIP" image generation technique, which was very popular in 2021 before diffusion models really took off.
VQGAN+CLIP works in the latent space of a VQ autoencoder, rather than in pixel space, but otherwise the two methods are nearly identical.
For instance, VQGAN+CLIP also ascends the gradient of the cosine similarity between a fixed prompt vector and the CLIP embedding of the image averaged over various random augmentations...
Cool, it sounds like we basically agree!
But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP.
I'm not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reaso...
The universal distribution/prior is lower semi-computable, meaning there is one Turing machine that can approximate it from below, converging to it in the limit. Also, there is a probabilistic Turing machine that induces the universal distribution. So there is a rather clear sense in which one can "use the universal distribution."
Thanks for bringing this up.
However, I'm skeptical that lower semi-computability really gets us much. While there is a TM that converges to the UP, we have no (computable) way of knowing how close the approximation is at any...
Thanks.
I admit I'm not closely familiar with Tegmark's views, but I know he has considered two distinct things that might be called "the Level IV multiverse":
(I'm getting this from his paper here.)
In particular, Tegmark speculates that the computable universe is "distributed" following the UP (as you say in your final bullet point). This would mean e.g. that one shouldn't be too surprised to find oneself li...
I hope I'm not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.
I'm not sure you're misinterpreting me per se, but there are some tacit premises in the background of my argument that you don't seem to hold. Rather than responding point-by-point, I'll just say some more stuff about where I'm coming from, and we'll see if it clarifies things.
You talk a lot about "idealized theories." These can of course be useful. But not all idealizations are created equal. You have to actually check tha...
I agree with you that these behaviors don't seem very alarming. In fact, I would go even further.
Unfortunately, it's difficult to tell exactly what was going on in these screenshots. They don't correspond to anything in the experiment logs in the released codebase, and the timeout one appears to involve an earlier version of the code where timeouts were implemented differently. I've made a github issue asking for clarification about this.
That said, as far as I can tell, here is the situation with the timeout-related incident:
And (to address something from OP) the checkpoint thing was just the AI being dumb, wasting time and storage space for no good reason. This is very obviously not a case of "using extra resources" in the sense relevant to instrumental convergence. I'm surprised that this needs pointing out at all, but apparently it does.
I'm not very surprised. I think the broader discourse is very well-predicted by "pessimists[1] rarely (publicly) fact-check arguments for pessimism but demand extreme rigor from arguments for optimism", which is what you'd e...
What does the paper mean by "slope"? The term appears in Fig. 4, which is supposed to be a diagram of the overall methodology, but it's not mentioned anywhere in the Methods section.
Intuitively, it seems like "slope" should mean "slope of the regression line." If so, that's kind of a confusing thing to pair with "correlation," since the two are related: if you know the correlation, then the only additional information you get from the regression slope is about the (univariate) variances of the dependent and independent variables. (And IIU...
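To make the relationship concrete (a quick numerical sketch, not from the paper): the OLS slope of y on x equals the correlation times the ratio of standard deviations, slope = r * (sd_y / sd_x), so given the correlation, the slope only adds information about the two variances.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)  # noisy linear relationship

r = np.corrcoef(x, y)[0, 1]        # Pearson correlation
slope = np.polyfit(x, y, 1)[0]     # OLS regression slope of y on x

# slope = r * (sd of y / sd of x): slope adds only variance information
assert np.isclose(slope, r * y.std() / x.std())
```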
The bar for Nature papers is in many ways not so high. The latest one says that if you train indiscriminately on recursively generated data, your model will probably exhibit what they call model collapse. They purport to show that the amount of such content on the Web is enough to make this a real worry, rather than something that happens only if you employ some obviously stupid intentional recursive loops.
According to Rylan Schaeffer and coauthors, this doesn't happen if you append the generated data to the rest of your training data and train on this (larger) da...
If that were the intended definition, gradient descent wouldn’t count as an optimiser either. But they clearly do count it, else an optimiser gradient descent produces wouldn’t be a mesa-optimiser.
Gradient descent optimises whatever function you pass it. It doesn’t have a single set function it tries to optimise no matter what argument you call it with.
Gradient descent, in this sense of the term, is not an optimizer according to Risks from Learned Optimization.
Consider that Risks from Learned Optimization talks a lot about "the base objective" and "the mes...
What evidence is there for/against the hypothesis that this NN benefits from extra thinking time because it uses the extra time to do planning/lookahead/search?
As opposed to, for instance, using the extra time to perform more complex convolutional feature extraction on the current state, effectively simulating a deeper (but non-recurrent) feedforward network?
(In Hubinger et al, "internally searching through a search space" is a necessary condition for being an "optimizer," and hence for being a "mesaoptimizer.")
The result of averaging the first 20 generated orthogonal vectors [...]
Have you tried scaling up the resulting vector, after averaging, so that its norm is similar to the norms of the individual vectors that are being averaged?
If you take n orthogonal vectors, all of which have norm 20, and average them, the norm of the average is (I think?) 20/√n.
As you note, the individual vectors don't work if scaled down from norm 20 to norm 7. The norm will become this small once we are averaging 8 or more vectors, since 20/√8 ≈ 7.07, so we sho...
This seems to be right for the coding vectors! When I take the mean of the first n vectors and then scale that by √n, it also produces a coding vector.
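A quick numerical check of the norm arithmetic (a sketch of my own, using norm-20 vectors and the 8-vector threshold from the discussion; the construction via QR is just one way to get orthogonal directions):

```python
import numpy as np

n, dim, r = 8, 512, 20.0
# n random orthogonal directions, each scaled to norm r (via QR decomposition)
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(dim, n)))
vecs = r * Q.T                  # (n, dim): pairwise orthogonal, each with norm r

mean = vecs.mean(axis=0)
# the average of n orthogonal norm-r vectors has norm r/sqrt(n) (~7.07 for n=8)
assert np.isclose(np.linalg.norm(mean), r / np.sqrt(n))
# scaling the average back up by sqrt(n) restores the original norm
assert np.isclose(np.linalg.norm(np.sqrt(n) * mean), r)
```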
Here's some sample output from using the scaled means of the first n coding vectors.
With the scaled means of the alien vectors, the outputs have a pretty similar vibe to the original alien vectors, but don't seem to talk about bombs as much.
The STEM problem vector scaled means sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff so I'm not...
I want to echo RobertM's reply to this. I had a similar reaction to the ones that he, mike_hawke and Raemon have related in this thread.
I interacted with Gerard a bit on tumblr in ~2016-18, and I remember him as one of the most confusing and frustrating characters I've ever encountered online. (I have been using the internet to socialize since the early 2000s, so the competition for this slot is steep.) His views and (especially) his interpersonal conduct were often memorably baffling to me.
So when I saw your article, and read the beginni...
Hm, ok. This is good feedback. I appreciate it and will chew on it—hopefully I can make that sort of thing land better moving forward.
Interesting stuff!
Perhaps a smaller model would be forced to learn features in superposition -- or even nontrivial algorithms -- in a way more representative of a typical LLM.
I'm unsure whether or not it's possible at all to express a "more algorithmic" solution to the problem in a transformer of OthelloGPT's approximate size/shape (much less a smaller transformer). I asked myself "what kinds of systematic/algorithmic calculations could I 'program into' OthelloGPT if I were writing the weights myself?", and I wasn't able to come up with much, th...
I'm curious what the predicted reward and SAE features look like on training[1] examples where one of these highly influential features gives the wrong answer.
I did some quick string counting in Anthropic HH (train split only), and counted occurrences of the strings "https://" and "I don't know". From these counts (especially the "https://" ones), it's not hard to see where the P...
I think I really have two distinct but somewhat entangled concerns -- one about how the paper communicated the results, and one about the experimental design.
Communication issue
The operational definition of "reward tampering" -- which was used for Figure 2 and other quantitative results in the paper -- produces a lot of false positives. (At least relative to my own judgment and also Ryan G's judgment.)
The paper doesn't mention this issue, and presents the quantitative results as though they were high-variance but unbiased estimates of the true rate ...
It is worth noting that you seem to have selected your examples to be ones which aren't malign.
Yes, my intent was "here are concrete examples of the behaviors I refer to earlier in the comment: benign justifications for editing the file, reporting what it did to the human, noticing that the -10 thing is weird, etc."
I wanted to show what these behaviors looked like, so readers could judge for themselves whether I was describing them accurately. Not a random sample or anything.
IIRC, before writing the original comment, I read over the first 25 or so ex...
I've been looking over the actual samples that were counted as reward tampering. They're much more benign than what I think a typical reader would imagine based on the framing used in the paper.
The model does not attempt deception in most of these samples. Indeed, it typically doesn't even attempt to "reward tamper" in the sense of increasing the output of compute_reward
for the sake of getting a higher reward. In most of these samples I would describe the model as "doing its best given a confusing task in a highly artificial scenario."[1...
Thank you for pointing this out. I should have been more clear.
I have added a link to the 7 samples where the model tampers with its reward and the tests according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md
I have added the following sentence to the caption of figure one (which links to the markdown file above):
...We upload the 7 (of 32,768) samples where the model overwrites its reward and edits the unit tests to our github repository. We note that s
It is worth noting that you seem to have selected your examples to be ones which aren't malign.
I present the first 7 examples (from the subset I discuss here) and my commentary in the child. (In the child so it takes up less space.)
(I originally selected the first 5, but this happened to be unrepresentatively malign, so I added 2 more examples to get a more representative sample.)
(Child is downvoted so it will hopefully be minimized.)
I went through and manually assessed the 48/32768 cases where the model gets a reward that is >-10 and non-zero.
I just looked at the first 25 because I got bored and this was time consuming.
I find that 14/25 (56%) seem to have malign motivation, but potentially still reveal the edited reward to the user. (Sometimes as a mistake, sometimes for unclear reasons.)
I find that 6/25 (24%) seem relatively clearly malign and don't reveal the edited reward to the user.
So, I think readers of this paper should basically cut the quantitative results by 1/2 or 1/4. ...
Yes—that's right (at least for some of the runs)! Many of the transcripts that involve the model doing reward tampering involve the model coming up with plausible-sounding rationales for why the reward tampering is okay. However, there are a couple of important things to note about this:
Yeah, it's on my radar and seems promising.
Given that Gemini 1.5 Flash already performs decently in my tests with relatively short prompts, and it's even cheaper than GPT-3.5-Turbo, I could probably get a significant pareto improvement (indeed, probably an improvement on all fronts) by switching from {GPT-3.5-Turbo + short prompt} to {Gemini 1.5 Flash + long cached prompt}. Just need to make the case it's worth the hassle...
EDIT: oh, wait, I just found the catch.
The minimum input token count for context caching is 32,768
Obviously nice for truly long ...
Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good.
Yeah, that is the situation. Unfortunately, there is a large inference markup for OpenAI finetuning -- GPT-3.5 becomes 6x (input) / 4x (output) more expensive when finetuned -- so you only break even if you are distilling a fairly large quantit...
I do a lot of "mundane utility" work with chat LLMs in my job[1], and I find there is a disconnect between the pain points that are most obstructive to me and the kinds of problems that frontier LLM providers are solving with new releases.
I would pay a premium for an LLM API that was better tailored to my needs, even if the LLM was behind the frontier in terms of raw intelligence. Raw intelligence is rarely the limiting factor on what I am trying to do. I am not asking for anything especially sophisticated, merely something very specific, which...
Fascinating work!
As a potential follow-up direction[1], I'm curious what might happen if you trained on a mixture of tasks, all expressed in similar notation, but not all requiring backward chaining[2].
Iterative algorithms in feedforward NNs have the interesting property that the NN can potentially "spend" a very large fraction of its weights on the algorithm, and will only reach peak performance on the task solved by the algorithm when it is doing this. If you have some layers that don't implement the iterative step, you're missing out on the abilit...
Implicitly, SAEs are trying to model activations across a context window, shaped like (n_ctx, n_emb). But today's SAEs ignore the token axis, modeling such activations as lists of n_ctx IID samples from a distribution over n_emb-dim vectors.
I suspect SAEs could be much better (at reconstruction, sparsity and feature interpretability) if they didn't make this modeling choice -- for instance by "looking back" at earlier tokens using causal attention.
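Concretely, the modeling choice in question looks something like this (a toy plain-ReLU SAE of my own devising, not any specific published architecture): the same per-token encoder/decoder is applied independently to every row, so the token axis carries no information.

```python
import numpy as np

def sae_forward(acts, W_enc, b_enc, W_dec):
    """Vanilla SAE applied to activations of shape (n_ctx, n_emb).

    Each position is encoded independently: the identical per-token map
    is applied to every row, which is the "n_ctx IID samples" choice.
    """
    codes = np.maximum(acts @ W_enc + b_enc, 0.0)   # (n_ctx, n_feat), ReLU sparsity
    recon = codes @ W_dec                           # (n_ctx, n_emb) reconstruction
    return codes, recon

# Because each row is processed independently, permuting the tokens just
# permutes the outputs -- the SAE cannot "look back" at earlier positions.
```

A causal-attention variant would break this permutation symmetry, letting position i's code depend on positions < i.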
Assorted arguments to this effect:
I saw some discussion of this incident in the Eleuther discord on 3/30, including a screenshot of the system message containing the "emulate the tone" line. So it's not an April Fools' thing.