We may have one example of realized out-of-distribution alignment: maternal attachment. Evolution has been able to create an architecture that seems to take care of something reliably enough that modern humans, with access to unlimited food, drugs, and VR headsets, do not seek to feed a child to death, drug him with heroin, or constantly show him a beautiful simulated world.
Moreover, this desire rather works too broadly in response to everything that is remotely similar to children, and does not break at the first opportunity if the child is dressed, washed, or otherwise changed in his "characteristic appearance during the training stage." At some point, the evolutionary gradient descent had to align the mother with the child, and did it quite well.
Perhaps when trying to create an architecture that "really" wants "good" for someone like a child, gradient descent, under certain conditions, eventually stumbles upon an architecture that generalizes at least a little beyond the training sample without catastrophic consequences. But there is a problem: the current methods of training neural networks imply a more or less fixed architecture and the selection of weights within it, while evolution can afford, albeit a blind but very wide search for many possible architectures, while basically does not have access to what these architectures will learn from the world around them, not to mention a huge amount of time.
So it is an entirely different training paradigm, where you try to find architecture + little pre-trained part that probably contains very coarse heuristics that serves as some kind of ground-truth and detection anchors for mechanisms that activate during pregnancy and after birth and enables maternal attachment.
It is worth noting that little is required from the child. He is hardly able to interpret the behavior of the mother or completely control her. Children usually do not have two mothers who argue in front of the child, arguing their points of view about further actions and requiring the child to choose the right one. Evolutionarily it happened that mothers who did not care enough for their children died in terms of their gene frequency, and thats all it take.
How could something like this be created? I don't know. RL agents come to mind trying to take care of their constrained and slightly modified versions, like a Tamagotchi? The constrained version of the agent gradually becomes less and less limited, and the criterion for the success of the mother is how long will the agent's child live? How exactly should we create gradients after each trial?
A few clarifications: I don’t think that this is a complete-solution-to-the-problem, it’s not enough to accidentally create “something inside that seems to work”, you need to understand what exactly occurs there, how it differs from agents who have “it” from those who do not, where it is located, how it can be modified and how far it needs to go OOD for it to start to break catastrophically.
This is not "let's raise AI like a child", it's more like "let's iterate through a lot of variants of mother agents and find those who are best at raising simulated children, and then find out how they differ from ordinary agents that evaluated based on their own survival time". So maybe we can build something that has some general positive appreciation towards humans.
When someone becomes maternally attached towards a dog, doesn't this count as an out-of-distribution alignment failure?
I think it depends on "alignment to what?". If we talk about evolution process, then sure, we have a lot of examples like that. My idea was more about "humans can be aligned to their children by some mechanism which was found by evolution and this is a somewhat robust".
So if we think about "how our attachment to something not-childish aligned with our children" well... technically, we will spend some resources on our pets, but it usually never really affects the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wo... (read more)