"the mean self-other overlap value across episodes can be used to classify with 100% accuracy which agents are deceptive"
Great to see this impressive work on applying SOO for realigning and classifying agents! My comments are most directly related to using it for identifying misaligned agents, but they apply to realignment as well:
Essentially, since these SOO measurement methods are documented/published (as well as the intentions and methods to realign), the knowledge will eventually become part of any large misaligned model's pre-training data. Isn't it therefore possible that a future model could establish protective measures (during the pre-training process) that would undermine or even misdirect the subsequent SOO measurement and fine-tuning efforts? If so, how could we fully trust our SOO value measurements and conclusions?
I'd expect that this possibility could be tested by pre-training a large model to be both misaligned and have deep knowledge of SOO purpose, intent, and measurement methods. We could then measure SOO values on the misaligned-SOO-knowledgeable model and compare them to a misaligned-only model. If there's a significant difference between the two models' SOO values, such that the misaligned-SOO-knowledgeable models have a higher mean value, then we'll know the SOO measurement is itself susceptible to deception and therefore no longer reliable for classifying.
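As a rough illustration of the comparison I have in mind (purely a sketch: the per-model mean SOO values below are made-up placeholders, and the one-sided Welch's t-test is just one reasonable choice of significance test):

```python
# Sketch of comparing mean SOO values between two hypothetical groups of models:
# misaligned-only vs. misaligned + SOO-knowledgeable. All numbers are placeholders.
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical mean SOO value per model (one entry per pre-trained model).
soo_misaligned_only = np.array([0.42, 0.39, 0.44, 0.41, 0.40])
soo_misaligned_soo_knowledgeable = np.array([0.58, 0.55, 0.61, 0.57, 0.60])

# One-sided Welch's t-test: does the SOO-knowledgeable group show a higher mean?
t_stat, p_value = ttest_ind(
    soo_misaligned_soo_knowledgeable,
    soo_misaligned_only,
    equal_var=False,
    alternative="greater",
)

if p_value < 0.05:
    print(f"SOO-knowledgeable models score significantly higher (p={p_value:.3f}): "
          "the measurement itself may be susceptible to gaming.")
else:
    print(f"No significant difference detected (p={p_value:.3f}).")
```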
Would love to get your thoughts on this, thanks!
Yes, I certainly agree that the SOO work should be fully published/documented/shared; my point is that keeping it from future training data would be nearly impossible anyhow.
However, as you just mentioned, "having aligned AGIs will likely become necessary to be able to ensure the safety of subsequent systems". Since those AGIs (well before superintelligence) will most likely be SOO-knowledgeable, wouldn't you need to test them to make sure they haven't already started to influence your SOO values?
The models might start making slow progress at influencing the SOO values, and I think you'd want to be aware of that as soon as it started. Even with current large models, for instance at the GPT-4 level, how could you be certain that an SOO-knowledgeable one might not already be able to slightly influence SOO values without testing it?
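For what it's worth, here's the kind of ongoing check I'm imagining: log the mean SOO value at each evaluation point and watch for a creeping upward trend. Again, only a sketch; the checkpoint values are placeholders and a linear trend test is just one simple option:

```python
# Sketch of a drift check on mean SOO values across successive evaluations/checkpoints.
# A significant positive slope would be an early warning that measured SOO is creeping up.
# All values below are placeholders.
import numpy as np
from scipy.stats import linregress

evaluation_index = np.arange(10)  # e.g. successive evaluation rounds or checkpoints
mean_soo = np.array([0.71, 0.72, 0.72, 0.74, 0.75, 0.77, 0.78, 0.80, 0.82, 0.84])

result = linregress(evaluation_index, mean_soo)
if result.slope > 0 and result.pvalue < 0.05:
    print(f"Upward SOO trend (slope={result.slope:.4f}, p={result.pvalue:.3g}): "
          "worth checking whether the model is influencing the measurement.")
else:
    print("No significant upward trend in SOO values.")
```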