Yeah, if the RLHF is supposed to train honesty into it, maybe it would have done worse on the TaskRabbit task. This really seems like a PR throwaway line rather than a legit attempt to red-team the final model.