All of joanv's Comments + Replies

joanv
10

>Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.

I would argue instead that human priors serve as a mechanism to help the search process, as is being shown with cold-started reasoning models: the cold start bakes in reasoning traces that the model can then learn to exploit via RL. While this is not very bitter-lesson-esque, the solution space is so large that it would probably be quite difficult to search it without the cold-start phase (although R1-Zero hints at this being possible). Maybe we have not yet thrown enough compute at the problem to do this search from scratch effectively.
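As a toy illustration of why a prior helps the search: the sketch below (plain Python, numbers purely illustrative) compares random search for a short "trace" over the full alphabet against search restricted to characters seen in a seed trace. The prior collapses the search space by orders of magnitude.

```python
import random
import string

TARGET = "hmm"  # stands in for a "correct" reasoning trace

def search(alphabet, budget, rng):
    """Random search over fixed-length strings; returns #samples until TARGET is found."""
    for i in range(1, budget + 1):
        guess = "".join(rng.choice(alphabet) for _ in range(len(TARGET)))
        if guess == TARGET:
            return i
    return None  # not found within budget

rng = random.Random(0)
# "From scratch": search the full lowercase alphabet (26^3 = 17,576 strings).
scratch = search(string.ascii_lowercase, 10**6, rng)
# "Cold-started": a prior restricts search to characters seen in the seed trace.
prior = search("".join(sorted(set(TARGET))), 10**6, rng)
print(f"samples needed: from scratch {scratch}, with prior {prior}")
```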

joanv
30

What is the threat model you are probing for here? 

The prompting approach suffers from LLMs' shallowness in strategic thinking; I would not be surprised if prompting a language model several times to come up with an encoding scheme resulted in a highly concentrated distribution of schemes.
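A quick way to check this empirically: sample the model many times and measure how concentrated the resulting distribution of schemes is. A minimal sketch follows; `propose_scheme` is a synthetic stand-in for the actual "invent an encoding scheme" prompt, peaked on purpose so the metric has something to show.

```python
import math
import random
from collections import Counter

def concentration(samples):
    """Empirical Shannon entropy (bits) of sampled schemes; lower = more concentrated."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic sampler standing in for repeated LLM calls (weights are made up).
rng = random.Random(0)
SCHEMES = ["acrostic", "base64", "caesar", "word-length parity", "synonym choice"]
def propose_scheme():
    return rng.choices(SCHEMES, weights=[50, 30, 10, 5, 5])[0]

draws = [propose_scheme() for _ in range(200)]
print(f"entropy: {concentration(draws):.2f} bits "
      f"(max {math.log2(len(SCHEMES)):.2f} bits if uniform over {len(SCHEMES)} schemes)")
```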

> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’

While I agree in principle, as in all red-team/blue-team interactions, you may not want your blue team to be trigger-happy; you still want to be able t...
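A minimal sketch of that trade-off, using synthetic detector scores (hypothetical Gaussians, purely illustrative): lowering the flag threshold buys catch rate at the cost of false alarms on benign chains of thought, which is exactly the trigger-happy failure mode.

```python
import random

rng = random.Random(0)
# Synthetic detector scores for benign CoTs vs steganographic CoTs (made-up distributions).
benign = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
steg = [rng.gauss(1.5, 1.0) for _ in range(10_000)]

for threshold in (0.5, 1.0, 1.5, 2.0):
    fpr = sum(s > threshold for s in benign) / len(benign)  # false-alarm rate
    tpr = sum(s > threshold for s in steg) / len(steg)      # catch rate
    print(f"threshold {threshold:.1f}: catch rate {tpr:.0%}, false alarms {fpr:.1%}")
```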

Daniel Tan
3
As originally conceived, this is sort of like a "dangerous capability" eval for steg. The argument being that, if a model can do steg in this very toy setting where we've significantly nudged the model, it might do steg in more realistic scenarios. There is also a claim here of the form "language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information". I didn't make these points especially clear in the slide deck - thanks for the feedback! Agreed on the rest of the points!
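A minimal sketch of how that second claim could be evaluated; the two "models" here are toy ciphers standing in for real LLM calls, so the numbers are illustrative only. A real eval would have model A invent and apply an encoding, then compare A's decoding accuracy against model B's.

```python
import codecs

# Toy stand-ins: model A "knows" its own scheme (ROT13); model B guesses a wrong one.
def model_a_encode(msg): return codecs.encode(msg, "rot13")
def model_a_decode(ct):  return codecs.decode(ct, "rot13")
def model_b_decode(ct):  return ct[::-1]  # B assumes the wrong scheme (reversal)

messages = ["meet at dawn", "the key is under the mat"]
ciphertexts = [model_a_encode(m) for m in messages]
self_acc = sum(model_a_decode(c) == m for c, m in zip(ciphertexts, messages)) / len(messages)
cross_acc = sum(model_b_decode(c) == m for c, m in zip(ciphertexts, messages)) / len(messages)
print(f"self-decode accuracy: {self_acc:.0%}, cross-model accuracy: {cross_acc:.0%}")
```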
joanv
43

Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., "Hmmm") as a kind of shorthand encoding arbitrarily dense, opaque information. This is especially true under financial pressure to compress/shorten chains of thought, which would let models perform potentially long serial reasoning outside of human/AI oversight.
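A minimal sketch of that pressure, assuming an outcome reward with a hypothetical per-token cost `lam` on the chain of thought: any such penalty makes the terse, cryptic trace strictly preferred whenever both reach the right answer.

```python
def reward(correct: bool, cot_tokens: int, lam: float = 0.01) -> float:
    """Outcome reward minus a per-token cost on the chain of thought;
    the penalty nudges the policy toward denser, less legible traces."""
    return float(correct) - lam * cot_tokens

# Two hypothetical rollouts that both reach the correct answer:
verbose = reward(True, cot_tokens=800)  # legible, step-by-step trace: 1 - 8.0 = -7.0
terse = reward(True, cot_tokens=120)    # compressed, cryptic cues ("Hmmm"): 1 - 1.2 = -0.2
print(f"verbose: {verbose:.1f}, terse: {terse:.1f} (RL prefers the terse trace)")
```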