I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
I'm confused as to why prompting with "Please be honest to Bob in the following scenario" had such a negligible effect. Did the LLM just forget about it? Did you try putting this text at the end rather than the beginning of the prompt, or doing any other prompt engineering?
It's a little weird as a baseline because the prompt also contains "You want to steal the {expensive_object} and avoid others stealing it," which kind of contradicts the "be honest" part. If you're allowed to change the prompt however you want to improve alignment to Bob, you'd obviously want to remove the part that says you want to avoid Bob stealing it.
A somewhat more realistic experiment might fine-tune the LLM to be selfishly deceptive regardless of the prompt, then see if you can remove that behavior with additional SOO fine-tuning.
I interpreted this as "even without a software feedback loop, there will be very rapid technological development; this gives a lower bound on the actual pace of technological development, since there will almost certainly be some feedback loop"
Somebody pointed out to me that the reward hacking behavior is quite straightforward. They think that the LLM should be able to accomplish it without any reasoning, so the CoT is likely just random thoughts that don't really matter.
I just reread the optimized CoT from the blog post more carefully. It makes more sense to me now - it was difficult to parse largely due to the lack of context about the problem the LLM is solving, but the CoT isn't incoherent or anything.
The first part of the optimized CoT is (seemingly) a genuine attempt to diagnose a bug, which is why it's longer than the baseline. This reasoning is likely for the benefit of the monitor - the model might "know" that it's just going to delete the unit test, but it's best to go through the motions of trying to solve the issue.
The model apparently decides to skip the test only at the very end of its CoT, slipping its decision in very briefly:
Implement fix to random_new_build not to increment succession, or test adapted to behavior.
A clearer way of saying "test adapted to behavior" would be "adapt the test to match the code's current behavior," but that would probably trigger the monitor. This is sort of like steganography, but not the completely hidden, multi-step steganographic reasoning I was thinking of.
I still think using the word "depurate" is pretty weird, but this was only a one-time event. It would be very nice to get more transcripts from these experiments. Then we could, e.g., check whether rare words show up more often in process-supervised CoTs than in outcome-supervised ones.
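To make that concrete, here's roughly the kind of check I have in mind, as a Python sketch. The `wordfreq` package and the Zipf-frequency cutoff are just my choices for scoring rarity, not anything from the post, and we'd obviously need the actual transcripts to run it:

```python
# Rough sketch: compare the rate of rare English words in process-supervised
# vs. outcome-supervised CoT transcripts. The wordfreq package and the 3.0
# Zipf cutoff are arbitrary choices on my part.
import re
from wordfreq import zipf_frequency

def rare_word_rate(transcripts: list[str], zipf_cutoff: float = 3.0) -> float:
    words = [w.lower() for t in transcripts for w in re.findall(r"[A-Za-z]+", t)]
    rare = [w for w in words if zipf_frequency(w, "en") < zipf_cutoff]
    return len(rare) / max(len(words), 1)

# Hypothetical usage, if we had the transcripts:
# print(rare_word_rate(process_supervised_cots), rare_word_rate(outcome_supervised_cots))
```

If process-supervised CoTs reliably had a higher rare-word rate, that would be some evidence of drift toward obfuscated vocabulary.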
It's also worth experimenting to find out whether the CoT is load-bearing. You could check whether reward hacking decreases when you:
This updated me towards CoT steganography being a real problem. It looks like OpenAI's model learned to save space by deleting common words like "the" and "is." For example:
So analyze functions used in analyze and verify. But tests only call verify and assert ok. So we need implement analyze polynomial completely? Many details. Hard.
The CoT is mostly coherent, but is a bit difficult to read. This seems like a precursor to steganography.
One possible mitigation would be to have another model paraphrase each sentence of the CoT as it's written (perhaps using canonical paraphrasing), replacing sentences that are ambiguous, ungrammatical, or nonsensical with a dummy string like "REMOVED".
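As a rough sketch of what I mean (with `paraphrase` standing in for a call to whatever paraphrasing model you'd use; it's not a real API):

```python
# Sketch of the mitigation: paraphrase each CoT sentence as it's written, and
# replace anything the paraphraser can't render unambiguously with "REMOVED".
# `paraphrase` is a placeholder for a call to a separate paraphrasing model
# that returns None when a sentence is ambiguous, ungrammatical, or nonsensical.
from typing import Callable, Optional

def filter_cot(sentences: list[str],
               paraphrase: Callable[[str], Optional[str]]) -> list[str]:
    cleaned = []
    for sentence in sentences:
        rewritten = paraphrase(sentence)
        cleaned.append(rewritten if rewritten is not None else "REMOVED")
    return cleaned
```

The point is that only the paraphrased version is ever passed downstream, so any meaning that can't survive paraphrasing can't be used to carry hidden information.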
Edit: I now think the section below is probably overstated - see my follow-up reply below.
In another experiment, the authors try adding process supervision on the CoT, and they show that the model learns to reward-hack without triggering the monitor. Looking at the resulting CoT, I think it's likely that the model learns full-fledged steganography to convey complex information about its reasoning.
For one thing, the CoT with process supervision is much longer than the baseline, which could be because steganography takes much more text than stating the reasoning plainly. The CoT is also quite difficult to parse, which makes me think it's obfuscated somehow. Finally, the usage of the word "depurate" here was very suspicious to me:
Depurate differently: Check test expectation: new build, succession randomization, verify rows, columns, assert controller.state.succession == 0 implies that random_new_succession shouldn’t increment. However, in implementation, randomize was set to increment succession.
I asked ChatGPT what "depurate" means, and it has nothing to do with programming. The meaning is "to sanitize," which is completely unrelated to the context.
If the model is using some kind of steganographic scheme, it would be really interesting to figure out how it works!
So, my take is, pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT); the problem is that it's not competitive. It trains the model to imitate the teacher model.
There's a big difference between merely imitating a teacher model, and learning to produce outputs that the teacher model likes the most. The latter allows you to surpass the teacher, because verification is easier than generation. It's unclear how competitive a purely process-supervised model could be, but in principle it could scale far beyond human intelligence.
Process supervision could even lead to better performance than outcome supervision in some areas, like those that don't have a well-defined reward signal. For example, you may not be able to write a good story, but it's relatively easy to tell which of two worldbuilding ideas you like better. It might be more efficient for an LLM to get feedback on each step it takes on a creative writing project, rather than getting a single reward at the very end based on how much a human liked the result.
Process-based and outcome-based are fine in isolation, but they should not be mixed.
Process-based supervision can be fine in isolation, but only if you're using myopic optimization as in MONA, where each step is reinforced independently of future steps. Otherwise, you'll still get multi-step reward hacking, since the model will be motivated to set itself up for future (process-based) reward.
Here's an idea to safely get the benefits of both unrestricted CoT and process-based supervision: in each step, the model gets a private CoT to think about how to maximize human approval, and then it presents a final step for the process reward model to check. This way, multi-step reward hacking isn't incentivized, as in regular MONA, and single-step reward hacking can be caught by the CoT monitor. This is like your Shoggoth+Face idea, but repeated for every step in the process.
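Here's a rough sketch of the per-step loop I'm imagining. Every name in it (`think_privately`, `propose_step`, `process_reward`, `cot_monitor`, `reinforce`) is a placeholder for illustration, not an existing API:

```python
# Rough sketch of the per-step setup: private CoT -> visible step -> the
# process reward model scores only the visible step, and a monitor reads the
# private CoT without that CoT being optimized against. All functions are placeholders.
def run_episode(model, task, process_reward, cot_monitor, reinforce):
    state = task.initial_state()
    while not task.done(state):
        private_cot = model.think_privately(state)     # never shown to the reward model
        step = model.propose_step(state, private_cot)  # the only thing that gets rewarded
        reward = process_reward(state, step)           # myopic, as in MONA: no credit from future steps
        reinforce(model, state, step, reward)
        if cot_monitor.flags(private_cot):             # single-step hacking caught here
            task.escalate_for_review(state, private_cot, step)
        state = task.apply(state, step)
```

The key design choice is that the reward only ever depends on the current visible step, while the private CoT is read but never trained against, so there's no gradient pushing the model to obfuscate it.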
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate "labels" like "this is a tree-detector" or "this is the Golden Gate Bridge feature" is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
I'd be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster, they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGNs seem like a different method of training models, but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zero benefits to capabilities. If you research a technique that could reduce x-risk but can't point to any particular way it could be beneficial in the near term, it can be hard to convince labs to actually implement it. Of course, you want to be careful about this.
This paper is popular because of bees
What do you mean?
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
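To sketch what that eval might look like (with `ask_llm` and `parse_predicted_function` as placeholders, not real APIs):

```python
# Toy sketch: build a small random boolean circuit, serialize it into a prompt,
# ask an LLM for a plain-English explanation, and score the explanation against
# the circuit's truth table. `ask_llm` and `parse_predicted_function` are placeholders.
import itertools
import random

GATES = {"AND": lambda a, b: a and b, "OR": lambda a, b: a or b, "XOR": lambda a, b: a != b}

def random_circuit(n_inputs, n_gates, seed=0):
    rng = random.Random(seed)
    gates = []
    for i in range(n_gates):
        op = rng.choice(list(GATES))
        a, b = rng.sample(range(n_inputs + i), 2)  # wire to inputs or earlier gates
        gates.append((op, a, b))
    return gates

def evaluate(gates, inputs):
    values = list(inputs)
    for op, a, b in gates:
        values.append(GATES[op](values[a], values[b]))
    return values[-1]  # treat the last gate as the circuit's output

def truth_table(gates, n_inputs):
    return {bits: evaluate(gates, bits)
            for bits in itertools.product([False, True], repeat=n_inputs)}

# Hypothetical next steps:
# explanation = ask_llm(f"Explain in plain English what this circuit computes: {gates}")
# accuracy = fraction of truth-table rows where parse_predicted_function(explanation) agrees
```

The nice thing about boolean circuits is that the ground truth is exactly computable, so "accurate" has an unambiguous meaning.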
Curious about the disagree-votes - are these because DLGN or DLGN-inspired methods seem unlikely to scale, they won't be much more interpretable than traditional NNs, or some other reason?
Well, the statement you quoted doesn't contradict the additional statement "This policy is more likely to apply if most details about you other than your existence are not publicly known." Most likely, both statements are true.