EVERYONE, CALM DOWN!
Meaning Alignment Institute just dropped their first post in basically a year and it seems like they've been up to some cool stuff.
Their perspective on value alignment really grabbed my attention because it reframes our usual technical alignment conversations around rules and reward functions into something more fundamental: what actually makes humans reliably good and cooperative?
I really like their frame of a moral graph and locally maximally good values as another way of imagining alignment; it is much closer to what happened during cultural evolution, as explored in, for example, The Secret of Our Success. It seems like they're taking results from evolutionary psychology, morality research, and group selection and applying them to how we align models, and I'm all for it.
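Roughly how I picture that moral-graph idea, as a toy sketch (the names and structure here are my own guesses, not their actual schema): values are nodes, an edge records that one value was judged wiser than another in some context, and the "locally maximally good" values are the ones no one has yet pointed past.

```python
from collections import defaultdict

class MoralGraph:
    """Toy sketch of a moral graph: nodes are articulated values, a directed
    edge means 'this value was judged wiser than that one in some context'."""

    def __init__(self):
        self.values = set()
        self.wiser_than = defaultdict(set)  # value -> values judged wiser than it

    def add_edge(self, less_wise: str, wiser: str):
        self.values.update((less_wise, wiser))
        self.wiser_than[less_wise].add(wiser)

    def locally_maximal(self) -> set:
        """Values with no recorded 'wiser' alternative: the local maxima an
        agent could treat as its current best guesses about how to act."""
        return {v for v in self.values if not self.wiser_than[v]}

graph = MoralGraph()
graph.add_edge("follow the letter of the request", "attend to what the user is protecting")
graph.add_edge("attend to what the user is protecting", "help the user build their own judgment")
print(graph.locally_maximal())  # {'help the user build their own judgment'}
```

What appeals to me about this shape is that nothing needs a global ranking over all values; the graph only ever encodes local "this is wiser than that" judgments, which is much closer to how cultural evolution seems to accumulate moral knowledge.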
It could be especially relevant for thorny problems like multi-agent coordination - just as humans with shared values can cooperate effectively even without explicit rules, AI systems might achieve more robust coordination through genuine internalization of values rather than pure game theory or rule-following.
This is part of my take nowadays: we need more work on approaches that hold up in grayer, multi-agent scenarios, since we're likely heading into a multi-polar future with some degree of slower takeoff.
Okay, so when I talk about values here, I'm not saying anything about policies in the utility-theory sense, or about generally defined preference orderings.
I'm rather thinking of values as a class of locally arising heuristics, or "shards" if you like that language, that activate a certain set of belief circuits in the brain, and similarly in an AI.
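To make that contrast concrete, here's a caricature in code (purely illustrative, all names mine): a value here is not an ordering over outcomes but a heuristic that wakes up in certain contexts and nudges behaviour.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ValueShard:
    """A value as a context-triggered heuristic rather than a global preference ordering."""
    name: str
    trigger: Callable[[Dict], bool]    # does this context activate the shard?
    suggestion: Callable[[Dict], str]  # what it pushes the agent towards when active

def active_suggestions(shards: List[ValueShard], context: Dict) -> List[str]:
    """No argmax over world-states; just whichever heuristics the context wakes up."""
    return [s.suggestion(context) for s in shards if s.trigger(context)]

shards = [
    ValueShard("honesty",
               trigger=lambda ctx: ctx.get("asked_a_question", False),
               suggestion=lambda ctx: "report uncertainty instead of guessing"),
    ValueShard("care",
               trigger=lambda ctx: ctx.get("user_distressed", False),
               suggestion=lambda ctx: "slow down and check what the user actually needs"),
]

print(active_suggestions(shards, {"asked_a_question": True, "user_distressed": True}))
```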
What do you mean, more specifically, when you say "an instruction" here? What should that instruction encompass? How do we interpret it over time? How can we compare instructions to each other?
I think instructions will become too complex to interpret well, especially in more complex multi-agent settings. How do we create interpretable multi-agent systems that we can change over time? I don't believe direct instruction tuning will be enough, because you run into the problem described, for example, in Cooperation and Control in Delegation Games: each AI gets its instruction from a single person, but that tells us nothing about the multi-agent cooperation abilities of the agents in play.
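A toy version of that worry (my caricature, not the paper's formalism): each agent locally best-responds to the instruction "maximise your principal's payoff", and those per-principal instructions by themselves say nothing about whether the joint outcome is cooperative.

```python
# Two agents, each instructed by its own principal, prisoner's-dilemma style.
# Joint payoffs (agent_a, agent_b) for each action pair.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 4),
    ("defect",    "cooperate"): (4, 0),
    ("defect",    "defect"):    (1, 1),
}

def best_response(agent_index: int, opponent_action: str) -> str:
    """An agent told only 'maximise your principal's payoff' best-responds locally."""
    def my_payoff(action: str) -> int:
        pair = (action, opponent_action) if agent_index == 0 else (opponent_action, action)
        return PAYOFFS[pair][agent_index]
    return max(("cooperate", "defect"), key=my_payoff)

# Both instruction-following agents end up defecting...
print(best_response(0, "defect"), best_response(1, "defect"))  # defect defect -> payoffs (1, 1)
# ...even though both principals would prefer the cooperative (3, 3) outcome.
```

The instructions are individually fine; what's missing is anything about how the two agents relate to each other, which is exactly the gap I think shared, internalised values could fill.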
I think this line of reasoning holds for AI agents acting in a multi-agent setting as they gain more control over the economy through integration with the general population.
I completely agree with you that doing "pure value learning" is not the best approach right now, but I think we need work in this direction to retain control over multiple AI agents working at the same time.
I think deontology/virtue ethics makes societies more interpretable and corrigible; does that make sense? I also hold the separate belief that we're more likely to get a sort of "cultural, multi-agent take-off" than a single-agent one.
Curious to hear what you have to say about that!