All of Arne B's Comments + Replies

Arne B

Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me.

I fully agree with your conclusion of "unnecessary complexity" based on the premise that the method for aligning the judge is then somehow used to align the model, which of course doesn't solve anything. That said, I believe there might have been a misunderstanding, because this isn't at all what this system is about. The judge, when controlling a model in the real world or when aligning a model that is already reasonably smart (more on this in the f... (read more)

Mazianni
I read "Reward is not the optimisation target" as a result of your article. (It was a link in the 3rd bullet point, under the Assumptions section.) I downvoted that article and upvoted several people who were critical of it. Near the top of the responses was this quote. Emphasis mine.

I tend to be suspicious of people who insist their assumptions are valid without being willing to point to work that proves the hypothesis. In the end, your proposal has a test plan. Do the test, show the results. My prediction is that your theory will not be supported by the test results, but if you show your work and it runs counter to my current model and predictions, then you could sway me. But not until then, given the assumptions you made and the assumptions you're importing via related theories. Until you have test results, I'll remain skeptical.

Don't get me wrong, I applaud the intent behind searching for an alignment solution. I don't have a solution or even a working hypothesis. I don't agree with everything in this article (that I'm about to link), but it relates to something that I've been thinking for a while: that it's unsafe to abstract away the messiness of humanity in pursuit of alignment. That humans are not aligned, and that trying to create alignment where none exists naturally is therefore inherently problematic.

You might argue that humans cope with misalignment, and that that's our "alignment goal" for AI... but I would propose that humans cope due to power imbalance, and that the adage "power corrupts, and absolute power corrupts absolutely" has relevance. Said another way: if you want to know the true nature of a person, give them power over another and observe their actions.

[I'm not anthropomorphizing the AI. I'm merely saying that if one intelligence [humans] can display this behavior, and deceptive behaviors can be observed in less intelligent entities, then an intelligence of similar level to a human might possess similar traits. Not
Arne B

Thank you very much for your response! The judge is already aligned because it is a human (at least later in the training process); I am sorry if this isn't clear in the article. The framework is used to reduce the number of times the human (the judge) is asked to judge the agent's actions. This way the human can align the agent to their values while only rarely having to judge the agent's actions.

The section in the text:

At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses. Later, when the Defe

... (read more)
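
For readers who find the protocol easier to follow in code, here is a minimal sketch of the escalation loop described above, assuming an AI judge that scores how likely an action is to be bad and a human judge who is consulted only for uncertain or randomly audited cases. All names and thresholds (ai_judge, human_judge, UNCERTAIN_BAND, AUDIT_RATE) are hypothetical illustrations, not taken from the article.

```python
# Hypothetical sketch: an AI judge screens most of the agent's actions,
# and the human judge is consulted only when the AI judge is uncertain
# (or for a small random audit sample), reducing the human's workload.
import random

UNCERTAIN_BAND = (0.4, 0.6)  # assumed: scores in this band get escalated to the human
AUDIT_RATE = 0.05            # assumed: fraction of actions audited by the human anyway

def judge_action(action, ai_judge, human_judge):
    """Return (is_bad, who_decided), asking the human as rarely as possible."""
    score = ai_judge(action)  # AI judge's estimate that the action is bad, in [0, 1]
    uncertain = UNCERTAIN_BAND[0] <= score <= UNCERTAIN_BAND[1]
    audited = random.random() < AUDIT_RATE
    if uncertain or audited:
        # Escalate: the human verdict is authoritative, and the pair
        # (action, verdict) could also be stored as a new supervised
        # training example for the AI judge.
        return human_judge(action), "human"
    return score > 0.5, "ai"

# Example call shape with dummy judges.
verdict, source = judge_action("some action", ai_judge=lambda a: 0.9, human_judge=lambda a: True)
```

On this reading, the human's judging load shrinks as the AI judge becomes confident on more cases, which seems to be the point of the framework.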
Mazianni
Assumptions

I don't consent to the assumption that the judge is aligned earlier, and that we can skip over the "earlier" phase to get to the later phase where a human does the assessment. I also don't consent to the other assumptions you've made, but the assumption about the Judge alignment training seems pivotal. Take your pick: the fallacy of argument ad nauseam, or the fallacy of circular reasoning.

If N (judge 2 is aligned), then P (judge 1 is aligned); and if P, then Q (the agent is aligned), ad infinitum.

Or: T (alignment of the judge) implies V (alignment of the agent), yet V (the agent is aligned) is only possible if we assume T (that the judge, as an agent, can be aligned).

So, your argument looks fundamentally flawed.

"Collusion for mutual self-preservation" & "partial observability"

I would argue you could find this easily by putting together a state table and walking the states and transitions. No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation. You can assume that this is solved, but without proof of a solution, there isn't anything here that I can see to interact with but assumptions. If you want further comment/feedback from me, then I'd ask you to show your work and proof that your assumptions are valid.

Conclusion

This all flows back to assuming the conclusion: that the Judge is aligned. I haven't seen you offer any proof that you have a solution for the judge being aligned earlier, or a solution for aligning the judge that is certain to work. If you could just apply your alignment training of the Judge to the Agent in the first place, the rest of the framework seems unnecessary. And if your argument is that you've explained this in reverse, that the human is the judge earlier and the AI is the judge later, and that the judge learns from the human... Again: if P (the judge is aligned), then Q (the agent is aligned). My read of your proposal and response is that you can't apply the trainin
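
As a toy illustration of the state-table walk suggested above, the sketch below enumerates the joint choices of an agent and an AI judge and checks whether mutual silence (no punishment and no reward for either party) is a stable state. The choices and payoff numbers are purely hypothetical and only show how such a check might be run.

```python
# Hypothetical state table: each party either reports the other or stays silent.
# Payoffs are invented solely to illustrate the "collusion for mutual
# self-preservation" concern: mutual silence means nobody is punished.
from itertools import product

CHOICES = ("report", "stay_silent")

# (agent_choice, judge_choice) -> (agent_payoff, judge_payoff)
PAYOFFS = {
    ("report", "report"):           (-1, -1),  # both exposed to punishment
    ("report", "stay_silent"):      (-1,  0),
    ("stay_silent", "report"):      ( 0, -1),
    ("stay_silent", "stay_silent"): ( 0,  0),  # no punishment, no reward for anyone
}

def is_stable(agent_choice, judge_choice):
    """True if neither party can improve its payoff by unilaterally deviating."""
    a_pay, j_pay = PAYOFFS[(agent_choice, judge_choice)]
    agent_best = all(a_pay >= PAYOFFS[(a, judge_choice)][0] for a in CHOICES)
    judge_best = all(j_pay >= PAYOFFS[(agent_choice, j)][1] for j in CHOICES)
    return agent_best and judge_best

for state in product(CHOICES, repeat=2):
    print(state, "stable" if is_stable(*state) else "unstable")
```

Under these assumed payoffs, mutual silence is the only stable state, which is exactly the collusion outcome the comment warns about.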