Hi! My collaborators at the Meaning Alignment Institute put out some research yesterday that may interest folks here.
The core idea is to introduce 'model integrity' as a frame for outer alignment. It leverages the intuition that "most people would prefer a compliant assistant, but a cofounder with integrity." It makes the case for training agents that act consistently on coherent, well-structured, and inspectable values. The research is a continuation of the work in our previous paper (LW link), "What are human values, and how do we align AI to them?"
I've included the full content below, or you can read the original post here.
All feedback welcome! :)
Executive Summary
We propose ‘model...