Arne B — LessWrong


Replying to A Proposal for AI Alignment: Using Directly Opposing Models

Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me.

I fully agree with your conclusion of "unnecessary complexity", given the premise that the method for aligning the judge is then somehow used to align the model, which of course doesn't solve anything. That said, I believe there might have been a misunderstanding, because this isn't what the system is about at all. When controlling a model in the real world, or when aligning a model that is already reasonably smart (more on this in the following paragraph), the judge is always a human.

The part about using a model trained via supervised...

Replying to A Proposal for AI Alignment: Using Directly Opposing Models

Arne B 3y*

Thank you very much for your response! The judge is already aligned because it is a human (at least later in the training process). I am sorry if this isn't clear in the article. The framework is used to reduce the number of times the human (judge) is asked to judge the agent's actions. This way the human can align the agent to their values while only rarely having to judge the agent's actions.
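To make the "only rarely asked" point concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the article's actual procedure: it assumes the judge is consulted only when the Police objects to an action the Defendant proposes, and the names `run_with_sparse_judging`, `police_flags`, and `judge_approves` are hypothetical.

```python
from typing import Callable, List, Tuple


def run_with_sparse_judging(
    actions: List[str],
    police_flags: Callable[[str], bool],
    judge_approves: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Filter the Defendant's proposed actions, asking the judge only on disputes.

    Assumption (not from the article): an action the Police does not flag
    passes without any judge involvement.
    """
    approved: List[str] = []
    judge_calls = 0
    for action in actions:
        if police_flags(action):
            # Disagreement between Defendant and Police: escalate to the judge.
            judge_calls += 1
            if not judge_approves(action):
                continue  # The judge sided with the Police; drop the action.
        approved.append(action)
    return approved, judge_calls


# Hypothetical toy data: the Police objects to one of three proposed actions.
proposed = ["fetch coffee", "disable oversight", "water plants"]
approved, judge_calls = run_with_sparse_judging(
    proposed,
    police_flags=lambda a: "oversight" in a,
    judge_approves=lambda a: False,  # stand-in for the human's ruling
)
print(approved, judge_calls)
```

In this toy run the human rules on one action out of three; the fewer disputes the Police raises, the less often the human has to step in.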

The section in the text:

At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses. Later, when the Defendant and Police disagree less, it is replaced by a Human. Using

Arne B

The following text proposes a potential solution to the AI alignment problem. I am sharing it here because I have not come across any major issues with the proposed approach, such as those faced by the AI Stop button or the unworkable schemes outlined in the AGI Ruin: A List of Lethalities article. However, I am still sceptical about this approach, so if you have any problems or critiques that I have not addressed under the problems or assumptions sections, please feel free to share them in the comments below.

The Idea

The proposed solution involves using different AIs to control the model. While this idea has been suggested previously, it is not...