nsage - LessWrong

I think we have to postulate that this component of the RL signal doesn't get to see the chain-of-thought

If this is true, o1 can produce reasoning that is unsound, invalid, vacuous, etc. and will still be rewarded by the RL framework as long as the conclusion is true. In classical logic you can even formulate arguments that are unsound, invalid and vacuous but still true if q is true, like p ^ ~q -> q.

o1 is planning to deceive because it has been rewarded for offering plausible answers, not correct answers

It is not necessary to presume deception because o1 does not need to produce sound reasoning in order to produce correct answers, let alone plausible ones. More likely it isn’t made to care about the correctness of its reasoning because it's not receiving reinforcement on the correctness of its inferential steps.

The original CoT paper used human evaluators to check the logic, so I’m guessing OpenAI did the same thing. Regardless of whether the evaluation was automated or done by humans, it's not clear whether the evaluation rubric instructed the evaluators to penalize bad reasoning even when the conclusion was correct, and how much these penalties were weighted relative to the penalty for an incorrect conclusion. I suspect the RL model is primarily reinforcing the conclusions rather than the arguments themselves, whereas a proper reward signal should be over the entire inferential chain. In fact, the inferential chain is really all that matters, because the conclusion is simply a step that accepts or rejects some equality condition between the question posed.

Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.

LESSWRONG
is fundraising!
LW
$

Posts

Wiki Contributions

Comments