What is the threat model you are probing for here?
The prompting approach suffers from LLMs' shallowness in strategic thinking; I would not be surprised if, prompting a language model several times to come up with an encoding scheme would result in a highly concentrated distribution.
> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ‘recovering the plaintext message’
While i agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger happy; you still want to be able to extract some usefulness out of your scheming models, thus making this hope somewhat less robust. This would be a cool research question nonetheless.
> Both of these [steganography inducing methods] involve finetuning the model
Note that in Hidden in Plain Text, we also work with in-context RL, which is not explicitly finetuning the model, but just giving rewards to rollouts in context.
Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., "Hmmm") as a kind of shorthand to encode arbitrarily dense or unclear information. This is especially true under financial pressures to compress/shorten the Chains-of-Thought, thus allowing models to perform potentially long serial reasoning outside of human/AI oversight.
>Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.
I would argue that instead human priors serve as a mechanism to help the search process, as it's being shown with cold-started reasoning models: they bake-in some reasoning traces that the model can then learn to exploit via RL. While this is not very bitter lesson-esque, the solution space is so large that it'd probably be quite difficult to do so without the cold start phase (although R1-zero kind of hints at this being possible). Maybe we have not yet thrown as much compute at the problem to do this search from scratch effectively.