I think I don't completely understand the objection? Is your concern that less competent organizations will over-fit to the reward models during fine-tuning, and so end up with worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and one argument for open-sourcing the full model.
My main goal with this post is to argue that it would be good to "at least" open source the reward models, and that the benefits of doing so would far outweigh the costs, both societally and for the organizations doing the open-sourcing. I...
people are open sourcing highly capable models already, but those models have not been trained on particularly meaning-aligned corpora. releasing reward models would allow people to change the meaning of words in ways that cause the reward models to rank their words highly but which are actually dishonest; this is currently called SEO. however, if getting your ideas repeated by a language model is the new optimization target, then having your language model's objective be open source will mess with things. idk mate, just my ramblings I guess