I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.
So, how do you extract the desired credit-assignment info out of this?
(Also, how do you then use that info?)
If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings.
If I have just Alice and Bob, I pairing them up multiple times doesn't help so much.
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you'll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good "semantic similarity score" (this shouldn't be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph's idea here.
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.
In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing.
So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.
One reason is that we already see this empirically in the o1 report.
However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected.
There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.
If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.
See here for an updated version of my thinking based on discussion with Daniel:
https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser
Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.
I don't think it is worth getting into all the stuff you're mentioning here, but I think a key crux is that I'm expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).
I'm reacting to this whole thing as a possible default configuration going forward, rather than as it exists today. All of my concerns are about what happens when you scale it up. For example, I don't find o1's current level of deception hugely concerning in its own right; rather, I see it as a bad sign for the technology as it continues to develop.
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn't make very much sense to me as a safety method intended for deployment.
Ah, yep, this makes sense to me.
Yeah, this is a good point, which doesn't seem addressed by any idea so far.
Chess is like a bounded, mathematically described universe where all the instrumental convergence stays contained, and only accomplishes a very limited instrumentality in our universe (IE chess programs gain a limited sort of power here by being good playmates).
LLMs touch on the real world far more than that, such that MCTS-like skill at navigating "the LLM world" in contrast to chess sounds to me like it may create a concerning level of real-world-relevant instrumental convergence.
What do you mean? I don't get what you are saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):