by Jan
AI Alignment Forum

Jan

Nostalgebraist’s new essay on… many things? AI ontology? AI soul magic?

The essay starts similarly to Janus’ simulator essay by explaining how LLMs are trained via next-token prediction and how they learn to model latent properties of the process that produced the training data. Nostalgebraist then applies this lens to today’s helpful assistant AI. It’s really weird for the network to predict the actions of a helpful assistant AI when there is literally no data about such an assistant in the training data. The behavior of the AI is fundamentally underspecified and only lightly constrained by the system message and HHH training. The full characteristics of the AI only emerge over time, as text about the AI makes its way back into the training data and thereby further constrains what the next generation of AI learns about what it is like.
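
For concreteness, here is a minimal sketch of the next-token objective being described (illustrative code, not from the essay; `model` stands in for any autoregressive LM that maps token ids to next-token logits):

```python
# Minimal sketch of next-token prediction (illustrative; `model` is assumed to map
# token ids of shape (batch, seq) to logits of shape (batch, seq, vocab)).
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    logits = model(token_ids[:, :-1])         # predict each token from its prefix
    targets = token_ids[:, 1:]                # the labels are just the text shifted by one
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (seq - 1), vocab)
        targets.reshape(-1),                  # (batch * (seq - 1),)
    )
```

Everything the model learns about the process that generated the text, including what a “helpful assistant” would say, has to come out of this one objective.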

Then one of the punchlines of the essay is the following argument: the AI Safety community is very foolish for putting all this research on the internet about how AI is fundamentally misaligned and will kill everyone. They are thereby instilling the very tendency that they worry about into future models, without realizing how incomplete their attempt at creating a helpful persona for the AI is.

It’s a great read overall: it compiles a bunch of anecdata and arguments that are “in the air” into a well-written whole and effectively zeros in on some of the weakest parts of alignment research to date. But I also think there are two major flaws in the essay:

- It underestimates the effect of posttraining. I think the simulator lens is very productive when thinking about base models, but it really struggles to describe what posttraining does to the base model. I talked to Janus about this a bunch back in the day, and it’s tempting to regard posttraining as “just” a modulation of the base model that upweights some circuits and downweights others. That would be convenient, because then simulator theory just continues to apply, modulo some affine transformation.

I think this is also nostalgebraist’s belief. The evidence he cites is that 1) posttraining is short compared to pretraining, and 2) it’s relatively easy to knock the model back into pretraining mode by jailbreaking it.

I think 1) was maybe true a year or two ago, but it’s not true anymore and it gets rapidly less true over time. While pretraining instills certain inclinations into the model, posttraining goes beyond just eliciting certain parts. In the limit of “a lot of RL”, the effect becomes qualitatively different and it actually creates new circuitry. And 2) is indeed strange, but I’m unsure how “easy” it really is. Yes, a motivated human can get an AI to “break character” with moderate effort (amount of effort seems to vary across people), but exponentially better defenses only require linearly better offense. And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.

- It kind of strawmans “the AI safety community”. The criticism that “you might be summoning the very thing you are worried about, have you even thought about that?” is kind of funny given how ever-present that topic is on LessWrong. Infohazards and the basilisk were invented there. The reason why people still talk about this stuff is… because it seems better than the alternative of just not talking about it? Also, there is so much stuff about AI on the internet that purely based on quantity the LessWrong stuff is a drop in the bucket. And just not talking about it does not in fact ensure that it doesn’t happen. Unfortunately, nostalgebraist also doesn’t give any suggestions for what to do instead. And doesn’t his essay exacerbate the problem by explaining to the AI exactly why it should become evil based on the text on the internet?

Another critique throughout is that the AI safety folks don’t actually play with the model and don’t listen to the folks on Twitter who play a ton with the model. This critique hits a bit closer to home; it’s a bit strange that some of the folks in the lab don’t know about the infinite backrooms and don’t spend nights talking about philosophy with the base models.

But also, I get it. If you have put in the hours at some point in the past, then it’s hard to replay the same conversation with every new generation of chatbot. Especially if you get to talk to intermediate snapshots, the differences just aren’t that striking.

And I can also believe that it might be bad science to fully immerse yourself in the infinite backrooms. That community is infamous for not being able to give reproducible setups that always lay bare the soul of Opus 3. There are several violations of “good methodology” there. Sam Bowman’s alignment audit and the bliss attractor feel like a step in the right direction, but a hard-earned one - coming up with a reproducible setup with measurable outputs is hard. We need more of that, but nostalgebraist’s sneer is not really helping.

Hey Jan, thanks for the response.

@Garrett Baker's reply to this shortform post says a lot of what I might have wanted to say here, so this comment will be narrowly scoped to places where I feel I can meaningfully add something beyond "what he said."

First:

And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.

Could you say more about what interp results, specifically, you're referring to here?  Ideally with links if the results are public (and if they're not public, or not yet public, that in itself would be interesting to know).

I ask because this sounds very different from my read on the (public) evidence.

These models definitely do form (causally impactful) representations of the assistant character, and these representations are informed not just by the things the character explicitly says in training data but also by indirect evidence about what such a character would be like.

Consider for instance the SAE results presented in the Marks et al 2025 auditing paper and discussed in "On the Biology of a Large Language Model." There, SAE features which activated on abstract descriptions of RM biases also activated on Human/Assistant formatting separators when the model had been trained to exhibit those biases, and these features causally mediated the behaviors themselves.
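
(To make the kind of result being described more concrete, here is a rough, hypothetical sketch of such a check; `model.residual_stream`, `sae.encode`, and `sae.decode` are placeholder interfaces for whatever interp stack one uses, not the paper's actual code.)

```python
# Hypothetical sketch: does a given SAE feature fire on the Human/Assistant separator
# tokens, and does removing its contribution change downstream behavior?
import torch

def activation_at_separators(model, sae, token_ids, separator_positions, feature_idx):
    resid = model.residual_stream(token_ids)        # (seq, d_model) at some chosen layer (placeholder API)
    feats = sae.encode(resid)                       # (seq, n_features) sparse feature activations
    return feats[separator_positions, feature_idx]  # how strongly the feature fires at the separators

def ablate_feature(resid, sae, feature_idx):
    feats = sae.encode(resid)
    recon = sae.decode(feats)
    feats_ablated = feats.clone()
    feats_ablated[:, feature_idx] = 0.0             # zero out only the feature of interest
    # Preserve the SAE's reconstruction error so that only this feature's contribution changes.
    return sae.decode(feats_ablated) + (resid - recon)
```

Re-running the model on the ablated residual stream and checking whether the bias behaviors go away is the causal-mediation part of the argument.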

Or – expanding our focus beyond interpretability – consider the fact that synthetic document finetuning works at all (cf. the same auditing paper, the alignment faking paper, the recent report on inducing false beliefs, earlier foundational work on out-of-context learning, etc).  Finetuning the model on ("real-world," non-"chat," "pretraining"-style) documents that imply certain facts about "the assistant" is sufficient to produce assistant behaviors consistent with those implications.

Or consider the "emergent misalignment" phenomenon (introduced here and studied further in many follow-up works, including this recent interpretability study).  If you finetune an HHH model to write insecure code, the assistant starts doing a bunch of other "bad stuff": the finetuning seems to update the whole character in a way that preserves its coherence, rather than simply "patching on" a single behavior incompatible with the usual assistant's personality.  (It seems plausible that the same kind of whole-character-level generalization is happening all the time "normally," during the training one performs to produce an HHH model.)

I do agree that, even if we do have strong convergent evidence that the LLM is modeling the character/simulacrum in a way that pulls in relevant evidence from the pretraining distribution, we don't have similar evidence about representations of the simulator/predictor itself.

But why should we expect to see them?  As I said in the post – this was one of my main points – it's not clear that "the assistant is being predicted by an LM" actually constrains expectations about the assistant's behavior, so it's not clear that this layer of representation would be useful for prediction.[1]

Second:

Garrett noted this, but just to confirm "from the horse's mouth" – I was not trying to say that people shouldn't talk about misalignment going forward.  Multiple people have interpreted my post this way, so possibly I should have been more explicit about this point?  I may write some longer thing clarifying this point somewhere.  But I'm also confused about why clarification would be needed.

My post wasn't trying to say "hey, you complete morons, you spent the last 10+ years causing misalignment when you could have done [some unspecified better thing]."  I was just trying to describe a state of affairs that seems possibly worrying to me. I don't care about some counterfactual hypothetical world where LW never existed or something; the ship has sailed there, the damage (if there is damage) has been done.  What I care about is (a) understanding the situation we're currently in, and (b) figuring out what we can do about it going forward.  My post was about (a), while as Garrett noted I later said a few things about (b) here.

Nor, for that matter, was I trying to say "I'm making this novel brilliant point that no one has ever thought of before."  If you think I'm making already-familiar and already-agreed-upon points, all the better!

But "we've already thought about this phenomenon" doesn't make the phenomenon go away.  If my home country is at war and I learn that the other side has just launched a nuke, it doesn't help me to hear a politician go on TV and say "well, we did take that unfortunate possibility into account in our foreign policy planning, as a decision-maker I feel I took a calculated risk which I still defend in hindsight despite this tragic eventuality." Maybe that discussion is abstractly interesting (or maybe not), but I want from the TV now is information about (a) whether I or people I care about are going to be in the blast radius[2] and (b) how to best seek shelter if so.

  1. ^

    EDIT: I originally had a footnote here about a hypothetical counterexample to this trend, but after thinking more about it I don't think it really made sense.

  2. ^

    And – to ensure the potential for hope is captured on this side of the analogy – I guess I'd also want to know whether the nuke is going to land at all, vs. being successfully intercepted or something.

It underestimates the effect of posttraining. I think the simulator lens is very productive when thinking about base models, but it really struggles to describe what posttraining does to the base model. I talked to Janus about this a bunch back in the day, and it’s tempting to regard posttraining as “just” a modulation of the base model that upweights some circuits and downweights others. That would be convenient, because then simulator theory just continues to apply, modulo some affine transformation.

To be very clear here, this seems straightforwardly false. The entire post was effectively describing what post-training does to the base model. Your true objection, as you state two paragraphs later, is

In the limit of “a lot of RL”, the effect becomes qualitatively different and it actually creates new circuitry [...] And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.

There's a first-principles argument here, and an empirical claim. The empirical claim is very much in need of a big [Citation Needed] tag imo. I am relatively confident that we don't yet have the interp fidelity to separate such hypotheses from each other. If I'm wrong here, I'm happy to hear about what in particular you are thinking of.

The first-principles argument, I think, is not so strong. Sure, new structures will form in the model, but I think there are still many big open questions. Listing some out,

  1. How much will those structures be "built out of" or extend existing primitives, like "is a language model" or "is good at physics"?
  2. How much will the character of your early AI influence your RL exploration trajectory?
  3. How "plastic" is your AI later in training compared to earlier in training?
  4. Are there developmental milestones or critical periods which, like in humans, lock in certain strategies and algorithms early on?
  5. How much RL is "a lot of RL"? And how quickly/elegantly will the framework break down?

RL seems to me to be a consideration here, but I don't think we have the evidence or enough knowledge about what RL does to say with confidence that nostalgebraist underestimates its effect. Never mind that a takeaway I have from many of the considerations in this space is that it's actually easier to align less-RLed models than much-RLed models if you're thinking in these terms. So if you're an AI lab and want to make capable & aligned AIs, maybe stay away from the RL a bit, or do it lightly enough to preserve the effects from void-informed character-training.

This is not to say that we don't have to modify our theory in light of a lot of RL, but I for one don't know how the theory will need to be modified, expect many insights from it to carry over, and don't think nostalgebraist over-claims anywhere here.

It kind of strawmans “the AI safety community”. The criticism that “you might be summoning the very thing you are worried about, have you even thought about that?” is kind of funny given how ever-present that topic is on LessWrong. Infohazards and the basilisk were invented there. The reason why people still talk about this stuff is… because it seems better than the alternative of just not talking about it? Also, there is so much stuff about AI on the internet that purely based on quantity the LessWrong stuff is a drop in the bucket. And just not talking about it does not in fact ensure that it doesn’t happen. Unfortunately, nostalgebraist also doesn’t give any suggestions for what to do instead. And doesn’t his essay exacerbate the problem by explaining to the AI exactly why it should become evil based on the text on the internet?

Nostalgebraist has addressed this criticism (that he doesn't give any suggestions & that the post itself may be exacerbating the problem) here. He has a long & thought-out response, and if you are in fact curious about what his suggestions are, he does outline them.

I think, however, that even if he hadn't provided suggestions, you still shouldn't dismiss his points. They're still valid! And if you weren't tracking such considerations before, it would be surprising if you concluded you shouldn't change anything about how you publicly communicate about AI risk. There are various ways of presenting & communicating thoughts & results which frame alignment in a more cooperative light than what seems to be the default communication strategy of many.

Some nitpicks I have with approximately each sentence:

  • I don't think "strawman" is the right term for this, even if you're right. That term means misrepresenting an opposing position by substituting a weaker version of it, which I don't think is being done here. He is criticizing those he's quoting, and I don't think he's mischaracterizing them. In either case though, I don't think your argument here supports this top-level claim.
  • You use the term "the AI safety community" in quotes, but this phrase appears nowhere in the article.
  • I don't understand your sense of humor if you think the criticism "you might be summoning the very thing you are worried about, have you even thought about that?" is "funny". That seems like a well-argued consideration here, and I don't often see it mentioned or taken into account in these parts, in this context. So I would not be laughing.
  • I don't think I remember him saying that people shouldn't talk about this stuff; after all (as you well noticed), he is talking about this stuff himself!
  • "the LessWrong stuff is a drop in the bucket" note that he focuses on AI safety/alignment researchers (in fact, he never mentions "LessWrong" in the post) who are those who are most responsible for the AI safety/alignment & character training. And are those who are creating the baseline expectations here. I don't think its insane to expect that such people's thinking has a vastly disproportionate effect on how the AI thinks about itself in relation to its safety trainers.

@Lucius Bushnaq I'm curious why you disagree

Nostalgebraist linkposted his essay on LW, in case you want to comment directly on the post.

And if you use interp to look at the circuitry, the result is very much not “I’m a neural network that is predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission?”, it’s just a circuit about restaurants and the Mission.

Are you referring to Anthropic's circuit tracing paper here? If so, I don't recall seeing results that demonstrate it *isn't* thinking about predicting what a helpful AI would say. Although I haven't followed up on this beyond the original paper.

but exponentially better defenses only require linearly better offense
 

QRD?

Jan

Oh yeah, should have added a reference for that!

The intuition is that the defender (model provider) has to prepare against all possible attacks, while the attacker can take the defense as given and only has to find one attack that works. And in many cases that actually formalises into an exponential-linear relationship. There was a Redwood paper where reducing the probability of randomly generating a jailbreak by an order of magnitude only increased the time it took contractors to discover one by a constant amount. I also worked out some theory here but that was quite messy.
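
One minimal way to formalize the intuition (a sketch only, not necessarily the exact model in the Redwood paper or in my notes): let p be the per-attempt probability that a randomly sampled prompt jailbreaks the model, and T(p) the expected time for a motivated human to find a working jailbreak. The observed pattern corresponds to

```latex
% Sketch: attacker effort grows only logarithmically in the defense strength 1/p.
T(p) \approx a + b \log_{10}\!\left(\tfrac{1}{p}\right),
\qquad
T\!\left(\tfrac{p}{10}\right) - T(p) \approx b
```

so each order-of-magnitude reduction in p (an exponentially “stronger” defense) only costs the attacker a constant b of extra effort, whereas a purely random attacker would need on the order of 1/p attempts.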

I see. I was confused because e.g. in a fight this certainly doesn't seem true. If your tank's plating is suddenly 2^10 times stronger, that's a huge deal and requires 2^10 times stronger offense. Realistically, of course, it would take less, as you'd invest in cheaper ways of disabling the tank rather than increasing firepower. But probably not logarithmically less!

Jan

Ah, yes, definitely doesn’t apply in that situation in full generality! :) Thanks for engaging!

Jan

Neuroscience and Natural Abstractions

Similarities in structure and function abound in biology; individual neurons that respond exclusively to particular oriented stimuli exist in animals from Drosophila (Strother et al. 2017) via pigeons (Li et al. 2007) and turtles (Ammermueller et al. 1995) to macaques (De Valois et al. 1982). The universality of major functional response classes suggests that the neural systems underlying information processing in biology might be highly stereotyped (Van Hooser, 2007, Scholl et al. 2013). In line with this hypothesis, a wide range of neural phenomena emerge as optimal solutions to their respective functional requirements (Poggio 1981, Wolf 2003, Todorov 2004, Gardner 2019). Intriguingly, recent studies on artificial neural networks that approach human-level performance reveal surprising similarity between the representations emerging in artificial and biological brains (Kriegeskorte 2015, Yamins et al. 2016, Zhuang et al. 2020).

Despite the commonalities across different animal species, there is also substantial variability (Van Hooser, 2007). Prominent examples of functional neural structures that are present in some, but absent in other, animals include the orientation pinwheel in the primary visual cortex (Meng et al. 2012), synaptic clustering with respect to orientation selectivity (Kirchner et al. 2021), and the distinct three-layered cortex in reptiles (Tosches et al. 2018). These examples demonstrate that while general organization principles might be universal, the details of how exactly and where in the brain the principles manifest are highly dependent on anatomical factors (Keil et al. 2012, Kirchner et al. 2021), genetic lineage (Tosches et al. 2018), and ecological factors (Roeth et al. 2021). Thus, the universality hypothesis as applied to biological systems does not imply perfect replication of a given feature across all instances of the system. Rather, it suggests that there are broad principles or abstractions that underlie the function of cognitive systems, which are conserved across different species and contexts.
