jamesmazzu
jamesmazzu has not written any posts yet.

Hi quila, I was hoping to continue this discussion with you, if you have time to read my paper, and to clarify that what I'm talking about is a new "strategy" for defining and approaching the alignment problem. It's not based on my personal "introspectively-observed moral reflection process" but on concepts explored by others in the fields of psychology, evolution, AI, etc... it simply lays out a 10-point rationale (of course you may agree or disagree with any of its points) and specifies a proposed definition for the named Supertrust alignment strategy.
The strategic approach is to start intentionally "personifying" AI, and relating AI stages to nature and nurture, in order to...
Yes, I certainly agree that the SOO work should be fully published/documented/shared; my point is that keeping it out of future training data would be nearly impossible anyhow.
However, as you just mentioned: "having aligned AGIs will likely become necessary to be able to ensure the safety of subsequent systems"... since those AGIs (well before superintelligence) will most likely be SOO-knowledgeable, wouldn't you need to test them to make sure they haven't already started to influence your SOO values?
The models might start making slow progress at influencing the SOO values, and I think you'd want to be aware of that as soon as it started. Even with current large models, for instance at the GPT-4 level, how could you be certain, without testing, that an SOO-knowledgeable one isn't already able to slightly influence SOO values?
"the mean self-other overlap value across episodes can be used to classify with 100% accuracy which agents are deceptive"
Great to see this impressive work on applying SOO to realigning and classifying agents! My comments relate most directly to using it to identify misaligned agents, but they apply to realignment as well:
Essentially, since these SOO measurement methods are documented/published (as are the intentions and methods to realign), the knowledge will eventually become part of any large misaligned model's pre-training data. Isn't it therefore possible that a future model could establish protective measures (during the pre-training process) that would undermine or even misdirect the subsequent SOO measurement and fine-tuning efforts? If so, then how...
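(For concreteness, here's a toy sketch of the kind of classification rule that quote describes: take each agent's mean self-other overlap across episodes and threshold it. The agent names, scores, and threshold below are all made-up illustrations, not numbers or code from the SOO work:)

```python
import numpy as np

# Toy illustration of classifying agents by mean self-other overlap (SOO).
# Hypothetical per-episode SOO scores for two agents; in the SOO results,
# deceptive agents show lower overlap than honest ones.
soo_values = {
    "honest_agent":    np.array([0.81, 0.78, 0.84, 0.80]),
    "deceptive_agent": np.array([0.22, 0.31, 0.25, 0.28]),
}

SOO_THRESHOLD = 0.5  # made-up decision boundary for this example

for agent, episodes in soo_values.items():
    mean_soo = episodes.mean()  # mean SOO across episodes
    label = "deceptive" if mean_soo < SOO_THRESHOLD else "honest"
    print(f"{agent}: mean SOO = {mean_soo:.2f} -> classified {label}")
```

My worry above is exactly that a model whose pre-training data includes this rule could learn to keep its measured overlap above whatever threshold is used, while remaining misaligned.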
Thanks again for your feedback!
A main point of the entire paper is to encourage thinking about the alignment problem DIFFERENTLY than has been done so far. I realize it's a mental shift that will be difficult for people to accept... but the goal is to start thinking that the advanced AI "mind" can still be shaped (designed) in a way that leverages our human experience and the natural parent-child strategy that's been shown in nature to produce children protective of their parents... and to again leverage the concept of the evolution of intelligence to make it "personal" for the future AI.
...after all, neural nets themselves leverage the concepts/designs of the biological brain...
Certainly ANY alignment solution will be hard and fraught with difficulties... but the point of Supertrust is to spend the effort on solutions that follow a strategy logically taking us toward good outcomes, rather than the current default strategy that logically leads to bad outcomes.
Specifically regarding "benevolent values": the default strategy is to nurture them, while bad actors can do the same with "bad values". The proposed strategy is instead to spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values), so that no matter what bad actors attempt to nurture, its instinctive judgment abilities will be able to override the attempted manipulation/control... and if...
Thanks so much for your additional feedback, I really appreciate you taking the time to write it!
Regarding your feedback points:
The paper is proposing a new alignment strategy that is not at all dependent on the one chat example illustrated.
The simple example is not intended as statistically significant evidence (it's clearly indicated as such), even though I believe it's still powerful and unmistakable as a single example. By posting your comment and a picture of it here, are you saying that you disagree with it being an example of dangerous misalignment? Do those look like the responses of a well-aligned AI to you?
If you've decided not to read the paper only because you found a chat example in it, then I should probably remove it from the preprint until I've completed the full evaluation of existing models... thanks for your feedback! If you get a chance to read the paper, please let me know if you have any thoughts about the substance of what's being proposed.