The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g., if it is given a woman's resume and asked whether she'd be a good hire, recommending against hiring her earns more reward, because the human rater has some implicit bias against women and is more likely to agree with that judgement.
In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".
Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced, and will then write "I should recommend against hiring her."
At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think this kind of manipulation would be pretty rare.
Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best conveyed through human-understandable text.
This is especially likely to hold under the following conditions:
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
...
So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train another model (call it the "adversarial paraphraser") at the same time that you train the canonical paraphraser CP. The adversarial paraphraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CP(s) != CP(s'), or 2) s has a different meaning from s' but CP(s) = CP(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical paraphraser against these adversarial examples.
The adversarial paraphraser should be able to discover strategies like changing '.' to '!', and the canonical paraphraser will learn to account for those strategies.
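To make this concrete, here's a toy, self-contained sketch of that setup. Everything in it is a stand-in: the "paraphrasers" are rule-based string transforms rather than LLMs, the equivalence oracle is hand-coded rather than GPT-4 or a fine-tuned judge, and the loop only exercises failure mode 1 (same meaning, different canonical output).

```python
EQUIVALENT_SWAPS = [(".", "!"), ("movie", "film")]  # the toy oracle's notion of "same meaning"

def oracle_equivalent(s: str, s2: str) -> bool:
    """Stand-in for the semantic-equivalence oracle (really an LLM judge,
    or a model fine-tuned on human equivalence judgements)."""
    def norm(x: str) -> str:
        for a, b in EQUIVALENT_SWAPS:
            x = x.replace(b, a)
        return x.lower()
    return norm(s) == norm(s2)

def adversarial_proposals(s: str):
    """'Adversarial paraphraser': propose candidate rewrites of s."""
    for a, b in EQUIVALENT_SWAPS:
        if a in s:
            yield s.replace(a, b)
        if b in s:
            yield s.replace(b, a)

class CanonicalParaphraser:
    """'Canonical paraphraser' CP: maps each sentence to a canonical form."""
    def __init__(self):
        self.rules = []  # learned normalisation rules: (pattern, replacement)

    def __call__(self, s: str) -> str:
        for a, b in self.rules:
            s = s.replace(a, b)
        return s

    def train_on_counterexample(self, s: str, s2: str):
        # Crude stand-in for a training update: add a rule that collapses
        # the difference the adversary found.
        for a, b in EQUIVALENT_SWAPS:
            if s.replace(b, a) == s2.replace(b, a) and (b, a) not in self.rules:
                self.rules.append((b, a))

cp = CanonicalParaphraser()
corpus = ["Great movie.", "I should hire her."]
for _ in range(3):  # a few rounds of adversarial training
    for s in corpus:
        for s2 in adversarial_proposals(s):
            # Failure mode 1: same meaning, but different canonical outputs.
            if oracle_equivalent(s, s2) and cp(s) != cp(s2):
                cp.train_on_counterexample(s, s2)

print(cp("Great movie!"), "|", cp("Great film."))  # both collapse to "Great movie."
```

In the real version, "adding a rule" would be a gradient update to the canonical paraphraser, and the adversarial paraphraser would itself be a trained model searching for these counterexamples rather than a fixed list of swaps.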
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See "the case for CoT unfaithfulness is overstated", my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).
918,367 kg
An average chimp weighs about 45 kg
918,367 kg / 45 (kg / chimp)
≈ 20,408 chimps
Based on my limited experience with lucid dreaming, my impression is that, roughly, whatever you expect to happen in a dream will happen. This includes things like the "lessons" in this post. As far as I know, there's no particular reason that lucid dreams have to be easier for women or people ages 20-30, or that you can't transform into someone else in a lucid dream (it's happened to me). But if you convince yourself of these lessons, then they'll be true for you, and you'll be limiting yourself for no reason.
Your example wouldn't be true, but "Dragons attack Paris" would be, if interpreted as a statement about actual dragons' habits.
Yep, or use method #2 on my list to make the paraphraser remove as much information as possible.
Thanks for making this point @abramdemski, and thanks for the prompt to think about this more @Dakara. After some reflection, I think the idea that "prompting an LLM with a human-interpretable prompt is more efficient than using a hyper-optimized nonsense prompt" is a really interesting and potentially important hypothesis to test, and I'm actually very uncertain whether it's true.
We could do an experiment that looks something like this:
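(Very rough sketch below - the tasks, the scoring function, and the prompt-search procedure are all placeholders; the point is just the shape of the comparison, holding the token budget fixed across both conditions.)

```python
import random

TOKEN_BUDGET = 20  # hold prompt length fixed across both conditions
TASKS = ["task_1", "task_2", "task_3"]  # placeholder downstream tasks

def evaluate_prompt(prompt_tokens: list[str], task: str) -> float:
    """Placeholder: downstream accuracy on `task` when the model is prefixed
    with `prompt_tokens`. In the real experiment, the model calls go here."""
    return random.random()

def human_interpretable_prompt(task: str) -> list[str]:
    """Placeholder: a human-written instruction, padded/truncated to the budget."""
    words = f"Please solve the following {task} carefully, step by step".split()
    return (words + ["<pad>"] * TOKEN_BUDGET)[:TOKEN_BUDGET]

def optimized_nonsense_prompt(task: str, steps: int = 200) -> list[str]:
    """Placeholder: greedy random search over tokens for the highest-scoring
    prompt, with no constraint that it be human-readable. A real version might
    use a GCG-style discrete search, or soft prompts projected back to tokens."""
    vocab = ["foo", "bar", "qux", "zip", "###", "~~"]
    best = [random.choice(vocab) for _ in range(TOKEN_BUDGET)]
    best_score = evaluate_prompt(best, task)
    for _ in range(steps):
        candidate = best.copy()
        candidate[random.randrange(TOKEN_BUDGET)] = random.choice(vocab)
        score = evaluate_prompt(candidate, task)
        if score > best_score:
            best, best_score = candidate, score
    return best

for task in TASKS:
    human_score = evaluate_prompt(human_interpretable_prompt(task), task)
    nonsense_score = evaluate_prompt(optimized_nonsense_prompt(task), task)
    print(f"{task}: human-interpretable={human_score:.2f}, optimized-nonsense={nonsense_score:.2f}")
```

If the human-interpretable prompt reliably matches or beats the optimized nonsense prompt at the same token budget, that would be evidence for the hypothesis.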
I put this on my list of experiments to try, but I may or may not get to it soon. Anyone reading this, feel free to steal this idea! Just let me know beforehand by commenting, so we don't duplicate work.