It seems the SOTA for training LLMs has (predictably) pivoted away from pure scaling of compute + data, and towards RL-style learning based on (synthetic?) reasoning traces (mainly CoT, in the case of o1). AFAICT, this basically obviates safety arguments that relied on "imitation" as a key source of good behavior, since now additional optimization pressure is being applied towards correct prediction rather than pure imitation.
I think "basically obviates" is too strong. Imitation of human-legible cognitive strategies + RL seems liable to produce very different systems than would have been produced with pure RL. For example, in the first case, RL incentivizes combining those strategies in ways conducive to accuracy (in addition to potentially incentivizing non-human-legible cognitive strategies), whereas in the second case you don't get any incentive towards productively using human-legible cognitive strategies.
Disagree that it's obvious; it depends a lot on how efficient (large-scale) RL (post-training) is at very significantly changing model internals, rather than just 'wrapping it around', making the model more reliable, etc. In the past, post-training (including RL) has been really bad at this.
If that's true, perhaps the performance penalty for pinning/freezing weights in the 'internals', prior to the post-training, would be low. That could ease interpretability a lot, if you didn't need to worry so much about those internals which weren't affected by post-training?
Yes. Also, if the "LMs after pretraining as Simulators" model is right (I think it is), it should also help a lot with safety in general, because the simulator should be quite malleable, even if some of the simulacra might be malign. As long as you can elicit the malign simulacra, you can also apply interp to them or do things in the style of Interpreting the Learning of Deceit for post-training. This should also help a lot with e.g. coup probes and other similar probes for monitoring.
Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:
The way I think about it is something like: a "goal representation" is basically what you get when it's easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.
In principle, this doesn't have to equate to "goals" in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (and because) permitting longer horizons (in the sense of increasing the length of the minimal sequence needed to reach some terminal state) causes the intervening trajectories to explode in number and complexity, s.t. it's hard to impose meaningful constraints on those trajectories that don't map to (and arise from) some much simpler description of the outcomes those trajectories lead to.
This connects with the "reasoners compress plans" point, on my model, because a reasoner is effectively a way to map that compact specification on outcomes to some method of selecting trajectories (or rather, selecting actions which select trajectories); and that, in turn, is what goal-oriented reasoning is. You get goal-oriented reasoners ("inner optimizers") precisely in those cases where that kind of mapping is needed, because simple heuristics relating to the trajectory instead of the outcome don't cut it.
It's an interesting question as to where exactly the crossover point occurs, where trajectory-heuristics stop functioning as effectively as consequentialist outcome-based reasoning. On one extreme, there are examples like tic-tac-toe, where it's possible to play perfectly based on a myopic set of heuristics without any kind of search involved. But as the environment grows more complex, the heuristic approach will in general be defeated by non-myopic, search-like, goal-oriented reasoning (unless the latter is too computationally intensive to be implemented).
That last parenthetical adds a non-trivial wrinkle, and in practice reasoning about complex tasks subject to bounded computation does best via a combination of heuristic-based reasoning about intermediate states, coupled to a search-like process of reaching those states. But that already qualifies in my book as "goal-directed", even if the "goal representations" aren't as clean as in the case of something like (to take the opposite extreme) AIXI.
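To make the contrast concrete, here is a minimal toy sketch (mine, not from the original conversation). The graph, its state names, and both functions are hypothetical illustrations. A myopic rule that only looks at immediate reward is compared against a depth-limited lookahead that can optionally fall back on a heuristic estimate of intermediate states when it runs out of search budget.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy environment: state -> list of (next_state, immediate_reward).
# The graph is acyclic, so every walk terminates.
GRAPH: Dict[str, List[Tuple[str, int]]] = {
    "start": [("bait", 2), ("detour", 0)],
    "bait": [("dead_end", 0)],
    "detour": [("goal", 10)],
    "dead_end": [],
    "goal": [],
}


def zero_heuristic(state: str) -> int:
    """Default estimate for states the search did not expand."""
    return 0


def greedy_return(state: str) -> int:
    """Trajectory heuristic: always take the edge with the best immediate reward."""
    total = 0
    while GRAPH[state]:
        state, reward = max(GRAPH[state], key=lambda edge: edge[1])
        total += reward
    return total


def lookahead_value(
    state: str,
    depth: int,
    heuristic: Callable[[str], int] = zero_heuristic,
) -> int:
    """Outcome-based search: best total reward reachable within `depth` steps.

    When the depth budget runs out at a non-terminal state, the heuristic
    stands in for the unexplored remainder of the trajectory.
    """
    if not GRAPH[state]:
        return 0
    if depth == 0:
        return heuristic(state)
    return max(
        reward + lookahead_value(nxt, depth - 1, heuristic)
        for nxt, reward in GRAPH[state]
    )
```

In this toy graph, `greedy_return("start")` collects 2, while `lookahead_value("start", 2)` finds the path worth 10. With only one step of search, `lookahead_value("start", 1)` does no better than greedy unless the heuristic already knows that "detour" is a promising intermediate state, which is roughly the bounded-computation mix of heuristics and search described above.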
To me, all of this feels somewhat definitionally true (though not completely, since the real-world implications do depend on stuff like how complexity trades off against optimality, where the "crossover point" lies, etc). It's just that, in my view, the real world has already provided us enough evidence about this that our remaining uncertainty doesn't meaningfully change the likelihood of goal-directed reasoning being necessary to achieve longer-term outcomes of the kind many (most?) capabilities researchers have ambitions about.
Epistemic status: exploratory, "shower thought", written as part of a conversation with Claude:
For any given entity (broadly construed here to mean, essentially, any physical system), it is possible to analyze that entity as follows:
Define the set of possible future trajectories that entity might follow, according to some suitably uninformative ignorance prior on its state and (generalized) environment. Then ask, of that set, whether there exists some simple, obvious, or otherwise notable prior on the set in question, that assigns probabilities to various member trajectories in such a way as to establish an upper level set of some kind. Then ask, of that upper level set, how large it is relative to the size of the set as a whole, and (relatedly) how large the difference is between the probability of that upper set's least probable member, and its most probable nonmember. (If you want to conceptualize these sets as infinite and open—although it's unclear to me that one needs to conceptualize them this way—then you can speak instead of "infimum" and "supremum".)
The claim is that, for some specific kinds of system, there will be quite a sharp difference between its upper level set and its lower level set, constituting a "plausibility gap": trajectories within the upper set are in some sense "plausible" ways of extrapolating the system forward in time. And then the relative size of that upper set becomes relevant, because it indicates how tightly constrained the system's time-evolution is by its present state (and environment). So, the claim is that there are certain systems for which their forwards time-evolution is very tightly constrained indeed, and these systems are "agents"; and there are systems for which barely any upper level set exists, and these are "simplistic" entities whose behavior is essentially entropic. And humans (seem to me to) occupy a median position between these two extremes.
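For a finite set of trajectories, the two quantities above can be computed directly. The following is a rough sketch under assumptions of my own: trajectories are just labels with probabilities attached, and the upper level set is chosen by cutting at the largest drop between consecutive probabilities. That cut rule and the name `plausibility_gap` are illustrative choices, not something the framing itself fixes.

```python
from typing import Dict, List, Tuple


def plausibility_gap(probs: Dict[str, float]) -> Tuple[List[str], float, float]:
    """Split trajectories at the largest drop in probability.

    Returns (upper_set, relative_size, gap), where `gap` is the difference
    between the least probable member of the upper set and the most
    probable nonmember.
    """
    if len(probs) < 2:
        raise ValueError("need at least two trajectories")
    # Rank trajectories from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # drops[i] is the gap between the i-th and (i+1)-th ranked trajectory.
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    # Assumption: the upper level set ends where the drop is largest
    # (ties resolve to the earliest such position).
    cut = max(range(len(drops)), key=drops.__getitem__)
    upper = [name for name, _ in ranked[: cut + 1]]
    return upper, len(upper) / len(ranked), drops[cut]


# Tightly constrained system: one trajectory dominates.
agent_like = {"a": 0.96, "b": 0.01, "c": 0.01, "d": 0.01, "e": 0.01}
# Essentially entropic system: no trajectory stands out.
entropic = {"a": 0.2, "b": 0.2, "c": 0.2, "d": 0.2, "e": 0.2}
```

On `agent_like` the upper set is a single trajectory (a fifth of the whole set) separated from the rest by a gap of about 0.95. On `entropic` the gap is zero, so whatever "upper set" the function returns is not meaningfully distinguished from its complement.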
One additional wrinkle, however, is that "agency", as I've defined it here, may additionally play the role of a (dynamical system) attractor: entities already close to having full agency will be more tightly constrained in their future evolution, generally in the direction of becoming ever more agentic; meanwhile, entirely inanimate systems are not at all pulled in the direction of becoming more constrained or agentic; they are outside of the agency attractor's basin of attraction. However, humans, if they indeed exist at some sort of halfway point between fully coherent agency and a complete lack of coherence, are left interestingly placed under this framing: we would exist at the boundary of the agency attractor's basin of attraction. And since many such boundaries are fundamentally fractal or chaotic in nature, that could have troubling implications for the trajectories of points along those boundaries trying to reach reflective equilibrium, as it were.