Thank you very much for your response! The judge is already aligned because it's a human (at least later in the training process); I'm sorry if this wasn't clear in the article. The framework is used to reduce the number of times the human (judge) is asked to judge the agent's actions. This way the human can align the agent to their values while only rarely having to judge the agent's actions.
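To make the idea concrete, here is a minimal sketch of the escalation loop described above, under my own assumptions: an AI judge (a stand-in for the supervised model) scores each action with a confidence, and the human judge is only consulted when that confidence falls below a hypothetical threshold. The names, threshold, and stub judges are all illustrative, not part of the actual proposal.

```python
import random

random.seed(0)  # reproducible demo
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff for escalating to the human

def ai_judge(action: str) -> tuple[bool, float]:
    """Stand-in for the supervised model: returns (verdict, confidence)."""
    confidence = random.random()  # a real model would output calibrated confidence
    return ("bad" not in action, confidence)

def human_judge(action: str) -> bool:
    """Stand-in for the human judge; here just a toy rule."""
    return "bad" not in action

def judge(action: str, human_queries: list) -> bool:
    verdict, confidence = ai_judge(action)
    if confidence >= CONFIDENCE_THRESHOLD:
        return verdict  # AI judge is confident enough; human not consulted
    human_queries.append(action)  # record that we had to ask the human
    return human_judge(action)

queries: list[str] = []
actions = ["helpful summary", "bad response", "benign plan"] * 10
verdicts = [judge(a, queries) for a in actions]
print(f"human consulted on {len(queries)} of {len(actions)} actions")
```

The point of the sketch is only the control flow: the human's workload scales with the AI judge's uncertainty, not with the total number of agent actions.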
The section in the text:
...At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses. Later, when the Defe
Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me.
I fully agree with your conclusion of "unnecessary complexity" based on the premise that the method for aligning the judge is then somehow used to align the model, which of course wouldn't solve anything. That said, I believe there may have been a misunderstanding, because this isn't at all what the system is about. The judge, when controlling a model in the real world or when aligning a model that is already reasonably smart (more on this in the f... (read more)