Surprised no one posted about this from Anthropic, NYU and Uni of Sussex yet:
Instead of fine-tuning on human preferences, they incorporate human feedback directly into the pre-training phase, conditioning the model on <good> or <bad> feedback tokens placed at the beginning of the training sequences.
They find this to be Pareto-optimal among the five pre-training objectives they consider, greatly reducing undesired outputs while retaining standard LM downstream performance AND outperforming RLHF fine-tuning in terms of preference satisfaction.
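To make the conditioning concrete, here's a minimal sketch of what the data prep could look like; this is my own illustration, not the paper's code, and the score_fn, the 0.5 threshold, and the use of GPT-2's tokenizer are assumptions:

```python
# Sketch of conditional pretraining data prep (illustrative, not the paper's code):
# each training segment gets a <good> or <bad> control token prepended based on
# some feedback/reward score; ordinary next-token prediction is then run on the
# whole sequence, so the model learns p(text | <good>) vs p(text | <bad>).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<good>", "<bad>"]})

def label_segment(text: str, score_fn, threshold: float = 0.5) -> str:
    # score_fn and threshold are placeholders standing in for whatever
    # preference/reward signal is available.
    tag = "<good>" if score_fn(text) >= threshold else "<bad>"
    return f"{tag}{text}"

# At inference time, you simply condition on <good> to sample preferred behaviour:
prompt = "<good>" + "How do I politely decline an invitation?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```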
This conditioning is very reminiscent of the decision transformer, where scalar reward tokens are prepended to the input. I believe CICERO also does something similar, conditioning on Elo ratings during dialogue generation training.
From a discussion with James Chua on AISS's slack, we noted similarities between this work and Charlie Steiner's Take 13: RLHF bad, conditioning good. James is developing a library ("conditionme") specifically for rating-conditioned language modelling and was looking for feedback, which is what prompted the discussion. We figured a natural piece of future work is extending the conditioning from the discrete <good> vs <bad> tokens to scalar rewards; James pointed out that this requires some care with the tokenizer, which he hopes to address in part with conditionme.
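To illustrate the tokenizer caveat (my own example, not conditionme's API): a scalar reward pasted into the prompt as plain text gets split into several subword tokens, so the conditioning signal is smeared across an unpredictable number of positions. The binning workaround below is an assumption for illustration, not necessarily what conditionme does.

```python
# Why scalar-reward conditioning needs care with the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("0.87"))        # e.g. ['0', '.', '87'] -- several pieces
print(tokenizer.tokenize("reward=0.87")) # even more pieces

# One common workaround (illustrative only): bin the scalar into a fixed set of
# dedicated single special tokens.
bins = [f"<reward_{i}>" for i in range(11)]  # <reward_0> ... <reward_10>
tokenizer.add_special_tokens({"additional_special_tokens": bins})

def reward_token(r: float) -> str:
    # Map a reward in [0, 1] to exactly one special token.
    return bins[round(max(0.0, min(1.0, r)) * 10)]

print(reward_token(0.87))  # '<reward_9>'
```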