I had been thinking about two approaches:
And then I argued that 2 is a bad idea for various reasons listed in Section 8.3.3.1 here.
I prefer 1, supplemented by (A) sandbox testing (to the extent possible) and (B) an understanding of how our own code compares and contrasts with how human brains work (to the extent possible), thus allowing us to get nonzero insight out of our massive experience with human motivations.
By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Is that right?
If so, I agree that this proposal would be an improvement over “just do 2 and call it a day”.
I’m still not too interested in that approach because:
(The above might be tied to my idiosyncratic opinions about how AGI is likely to work, involving model-based RL etc. rather than LLMs)
Different topic: I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile) which wouldn’t generalize well to superintelligence; plus a ton of learning parenting norms from one’s culture. Cross-cultural comparisons of parenting are pretty illuminating here. Directly targeting cultural learning (a.k.a. learning norms) would also be an interesting-to-me drive to figure out how it works.
Anyway, nice post, happy for you to be thinking about this stuff :)
Thank you for the detailed comment!
By contrast, you’re advocating (IIUC) to start with 2, and then do mechanistic interpretability on the artifact that results, thus gaining insight about how a “caring drive” might work. And then the final AGI can be built using approach 1.
Yes, that's exactly correct. I haven't thought about "if we managed to build a sufficiently smart agent with the caring drive, then AGI is already too close". If any "interesting" caring drive requires capabilities very close to AGI, then I agree that it seems like a dead end in light of the race towards AGI. So it's only viable if an "interesting" and "valuable" caring drive could potentially be found within ~current-capability agents. Which honestly doesn't sound totally improbable to me.
Also, without some global regulation to stop this damn race I expect everyone to die soon anyway, and since I'm not in a position to meaningfully impact this, I might as well keep trying to work in the directions that will only pay off in the worlds where we suddenly have more time.
And once we have something like this, I expect a lot of gains in research speed from all the benefits that come from the ability to precisely control and run experiments on an artificial NN.
I’m curious why you picked parenting-an-infant rather than helping-a-good-friend as your main example. I feel like parenting-an-infant in humans is a combination of pretty simple behaviors / preferences (e.g. wanting the baby to smile)
Several reasons:
And by "caring for the baby" I mean like all the actions of the parents until the "baby" is like ~25 years old. Those actions usually have a lot of intricate decisions that are aimed at something like "success and happiness in the long run, even if it means some crying right now". It's hard to do right, and a lot of parents make mistakes. But in most cases, it seems like the capability failure, not the intentions. And these intentions looks much more interesting to me than "make a baby smile".
I think the “How do children learn?” section of this post is relevant. I really think that you are ascribing things to innate human nature that are actually norms of our culture.
I think humans have a capacity to empathetically care about the well-being of another person, and that capacity might be more or less (or not-at-all) directed towards one’s children, depending on culture, age, circumstance, etc.
Other than culture and non-parenting-specific drives / behaviors, I think infant-care instincts are pretty simple things like “hearing a baby cry is mildly aversive (other things equal, although one can get used to it)” and “full breasts are kinda unpleasant and [successful] breastfeeding is a nice relief” and “it’s pleasant to look at cute happy babies” and “my own baby smells good” etc. I’m not sure why you would expect those to “fail horribly in the modern world”?
Although I agree that some people have a genuine intrinsic prosocial drive, I think there are also alternative egoistic "solutions".
If we’re talking about humans, there are both altruistic and self-centered reasons to cooperate with peers, and there are also both altruistic and self-centered reasons to want one’s children to be healthy / successful / high-status (e.g. on the negative side, some cultures make whole families responsible for one person’s bad behavior, debt, blood-debt, etc., and on the positive side, the high status of a kid could reflect back on you, and also some cultures have an expectation that capable children will support their younger siblings when they’re an older kid, and support their elderly relatives as adults, so you selfishly want your kid to be competent). So I don’t immediately see the difference. Either way, you need to do extra tests to suss out whether the behavior is truly altruistic or not—e.g. change the power dynamics somehow in the simulation and see whether people start stabbing each other in the back.
This is especially true if we’re talking about 24-year-old “kids” as you mention above; they are fully capable of tactical cooperation with their parents and vice-versa.
In a simulation, if you want to set up a direct incentive to cooperate with peers, just follow the instructions in evolution of eusociality. But I feel like I’m losing track of what we’re talking about and why.
But it seems to be a much more complicated set of behaviors. You need to: correctly identify your baby, track its position, protect it from outside dangers, protect it from itself by predicting its actions in advance to stop it from certain injury, and try to understand its needs to correctly fulfill them, since you don't have direct access to its internal thoughts, etc.
Compared to “wanting to sleep if active too long” or “wanting to eat when blood sugar level is low” I would confidently say that it’s a much more complex “wanting drive”.
Strong disagree that infant care is particularly special.
All human behavior can, and usually does, involve the use of general intelligence or gen-int-derived cached strategies. Humans apply their general intelligence to gathering and cooking food, finding or making shelters to sleep in, and caring for infants. Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck. Ducks lose ducklings to poor path planning all the time. Mama duck doesn't fall through the sewer grate, but her ducklings do ... oops.
Any such drive will always be "aimed" by the global loss function: something like, our parents only care about us in a way that gets us to make even more babies and to increase our genetic fitness.
We're not evolution and can aim directly for the behaviors we want. Group selection on bugs for lower population size results in baby-eaters. If you want bugs that have fewer kids, that's easy to do as long as you select for that instead of a lossy proxy measure like population size.
Simulating an evolutionary environment filled with AI agents and hoping for caring-for-offspring strategies to win could work but it's easier just to train the AI to show caring-like behaviors. This avoids the "evolution didn't give me what I wanted" problem entirely.
There's still a problem though.
It continues to work reliably even with our current technologies
Goal misgeneralisation is the problem that's left. Humans can meet caring-for-small-creature desires using pets rather than actual babies. It's cheaper, and the pets remain in the infant-like state longer (see: criticism of pets as "fur babies"). Better technology allows for creating better caring-for-small-creature surrogates. Selective breeding of dogs and cats is one small step humanity has taken in that direction.
Outside of "alignment by default" scenarios where capabilities improvements preserve the true intended spirit of a trained in drive, we've created a paperclip maximizer that kills us and replaces us with something outside the training distribution that fulfills its "care drive" utility function more efficiently.
Our better other-human/animal modelling ability allows us to do better at infant wrangling than something stupider like a duck.
I agree, humans are indeed better at a lot of things, especially intelligence, but that's not the whole reason why we care for our infants. Orthogonally to your "capability", you need to have a "goal" for it. Otherwise you would probably just immediately abandon the gross-looking, screaming piece of flesh that fell out of you for reasons unknown to you while you were gathering food in the forest. Yet something inside makes you want to protect it, sometimes with your own life, for the rest of your life, if the mechanism works well.
Simulating an evolutionary environment filled with AI agents and hoping for caring-for-offspring strategies to win could work but it's easier just to train the AI to show caring-like behaviors.
I want agents that take effective actions to care about their "babies", which might not even look like caring at first glance. Something like keeping your "baby" in some enclosed kindergarten while protecting the only entrance from other agents? It would look like the "mother" agent abandoned its "baby", but in reality it could be a very effective caring strategy. It's hard to know the optimal strategy in every procedurally generated environment, so trying to optimize for some fixed set of actions called "caring-like behaviors" would probably indeed give you what you asked for, but I expect nothing "interesting" behind it.
Goal misgeneralisation is the problem that's left. Humans can meet caring-for-small-creature desires using pets rather than actual babies.
Yes they can, until they actually have a baby. After that, it's usually really hard to sell a loving mother "deals" that involve her child's suffering as the price, to get her to abandon the child for a "cuter" toy, or to persuade her to hotwire herself into not caring about her child (if she is smart enough to realize the consequences). And a carefully engineered system could potentially be even more robust than that.
Outside of "alignment by default" scenarios where capabilities improvements preserve the true intended spirit of a trained in drive, we've created a paperclip maximizer that kills us and replaces us with something outside the training distribution that fulfills its "care drive" utility function more efficiently.
Again, I'm not proposing "one easy solution to the big problem". I understand that training agents capable of RSI in this toy example would result in everyone being dead. But we simply can't do that yet, and I don't think we should. I'm just saying that there is this strange behavior in some animals that in many respects looks very similar to the thing we want from aligned AGI, yet nobody understands how it works, and few people try to replicate it. It's a step in that direction, not a fully functional blueprint for AI Alignment.
TL;DR: If you want to do some RL/evolutionary open-ended thing that finds novel strategies, it will get Goodharted horribly, and the novel strategies that succeed without gaming the goal may include things no human would want their caregiver AI to do.
Orthogonally to your "capability", you need to have a "goal" for it.
Game-playing RL architectures like AlphaStar and OpenAI Five have dead-simple reward functions (win the game), and all the complexity is in the reinforcement learning tricks that allow efficient learning and credit assignment at higher layers.
So child-rearing motivation is plausibly rooted in a cuteness preference along with re-use of empathy. Empathy plausibly has a sliding scale of caring per person, which increases for friendships (reciprocal cooperation relationships) and relatives, including children obviously. It similarly decreases for enemy combatants in wars, up to the point where they no longer qualify for empathy.
I want agents that take effective actions to care about their "babies", which might not even look like caring at first glance.
ASI will just flat-out break your testing environment. Novel strategies discovered by dumb agents doing lots of exploration will be enough. Alternatively, the test is "survive in competitive deathmatch mode", in which case you're aiming for brutally efficient self-replicators.
The hope with a non-RL strategy, or one of the many sorts of RL strategies used for fine-tuning, is that you can find the generalised core of what you want within the already-trained model, and the surrounding intelligence means the core generalises well. Q&A fine-tuning an LLM in English generalises to other languages.
Also, some systems are architected in such a way that the caring is part of a value estimator, and the search process can be made better up until it starts Goodharting the value estimator and/or world model.
Yes they can, until they actually have a baby. After that, it's usually really hard to sell a loving mother "deals" that involve her child's suffering as the price, to get her to abandon the child for a "cuter" toy, or to persuade her to hotwire herself into not caring about her child (if she is smart enough to realize the consequences).
Yes, once the caregiver has imprinted, that's sticky. Note that care-drive surrogates like pets can be just as sticky to their human caregivers. Pet organ transplants are a thing, and people will spend nearly arbitrary amounts of money caring for their animals.
But our current pets aren't super-stimuli. Pets will poop on the floor, scratch up furniture and don't fulfill certain other human wants. You can't teach a dog to fish the way you can a child.
When this changes, real kids will be disappointing. Parents can have favorite children and those favorite children won't be the human ones.
Superstimuli aren't about changing your reward function but rather discovering a better way to fulfill your existing reward function. For all that ice cream is cheating from a nutrition standpoint it still tastes good and people eat it, no brain surgery required.
Also consider that humans optimise their pets (neutering/spaying) and children in ways that the pets and children do not want. I expect some of the novel strategies your AI discovers will be things we do not want.
Note that evolutionary genetic optimization goes genotype -> phenotype. I say this because you extrapolate based on the bug study, and metazoa are usually rather complex systems. Your argument is, as far as I know, sound, but such a broad loss function might also result in a variety of other behaviours different from the intended purpose. What I am trying to do is expand on your point, as it allows for a variety of interesting scenarios.
The post you linked contains a reference to the mathematical long-term fitness advantage of certain altruism types. I will edit this post at a later date to add some experimental studies showing that it is "relatively easy" to breed altruism into certain metazoa (the same caveat as above holds, of course: it was easy in these cases given the chosen environment). If I remember correctly, the chicken one is even linked on LessWrong.
I would like to ask whether it would not be more engaging to say that the caring drive would need to be aimed specifically towards humans, such that there is no surrogate?
In regards to the ducks, is that an intelligence or a perception problem? I think those two would need to be differentiated, as they add another layer of complexity, both apart and together, or am I missing something?
I would like to ask whether it would not be more engaging to say that the caring drive would need to be aimed specifically towards humans, such that there is no surrogate?
Definitely need some targeting criteria that points towards humans or in their vague general direction. Clippy does in some sense care about paperclips so targeting criteria that favors humans over paperclips is important.
The duck example is about (lack of) intelligence. Ducks will place themselves in harms way and confront big scary humans they think are a threat to their ducklings. They definitely care. They're just too stupid to prevent "fall into a sewer and die" type problems. Nature is full of things that care about their offspring. Human "caring for offspring" behavior is similarly strong but involves a lot more intelligence like everything else we do.
You should look at the work of Steve Byrnes, particularly his Intro to Brain-Like-AGI Safety sequence. Figuring out the brain mechanisms of prosocial behavior (including what you're terming the caring drive) is his primary research goal. I've also written about this approach in an article, Anthropomorphic reasoning about neuromorphic AGI safety, and in some of my posts here.
Yes, I read the whole sequence a year ago. I might be missing something and probably should revisit it, just to be sure and because it's a good read anyway, but I think that my idea is somewhat different.
I think that instead of trying to directly understand a wet bio-NN, it might be a better option to replicate something similar in an artificial NN. It is much easier to run experiments, since you can save the whole state at any moment and introduce it to different scenarios, so it's much easier to control for some effect. It's also much easier to see activations, change weights, etc. The catch is that we have to first find it blindly with gradient descent, probably by simulating something similar to the evolutionary environment that produced "caring drives" in us. And the maternal instinct in particular sounds like the most interesting and promising candidate to me.
Can you provide links to your posts on that? I will try to read more about it in the next few days.
I'd really like to see more follow up on the ideas made in this post. Our drive to care is arguably why we're willing to cooperate, and making AI that cares the same way we do is a potentially viable path to AI aligned with human values, but I've not seen anyone take it up. Regardless, I think this is an important idea and think folks should look at it more closely.
My first thought for an even easier case is imprinting in ducks. Maybe a good project might be reading a bunch of papers on imprinting and trying to fit it into a neurological picture of duck learning, Steve Byrnes style. One concern would be if duck imprinting is so specialized that it bypasses a lot of the generality of the world model and motivational system - but from a cursory look at some papers (e.g.) I think that might not be the case.
With duckling -> duck or "baby" -> "mother" imprinting and other interactions, I expect no, or significantly weaker, "caring drives". Since the baby is weaker/dumber and caring for your mother provides few genetic-fitness incentives, evolution wouldn't try that hard to make it happen, even if it were an option (it could still happen sometimes as a generalization artifact, if it's more or less harmless). I agree that "forming a stable way to recognize and track some other key agent in the environment" should be present in both the "baby" -> "mother" and "mother" -> "baby" cases. But the "probably-kind-of-alignment-technique" from nature should only be in the latter.
Most neurons dedicated to babies are focused on the mouth, I believe, up until around 7 months. Neuronal pathways evolve over time, and it seems that a developmental approach to AI presents its own set of challenges. When growing an AI that already possesses mature knowledge and has immediate access to it, traditional developmental caring mechanisms may not fully address the enormity of the control problem. However, if the AI gains knowledge through a gradual, developmental process, this approach could be effective in principle.
Suppose your AI includes an LLM. You can just prompt it with "You love, and want to protect, all humans in the same way as a parent loves their children." Congratulations — you just transferred the entire complex behavior pattern. Now all you need to do is tune it.
TL;DR: This post is about the value of recreating a “caring drive” similar to that of some animals, and why it might be useful for the AI Alignment field in general. Finding and understanding the right combination of training data/loss function/architecture/etc. that allows gradient descent to robustly find/create agents that will care about other agents with different goals could be very useful for understanding the bigger problem. While it's neither perfect nor universally present, if we can understand, replicate, and modify this behavior in AI systems, it could provide a hint towards an alignment solution where the AGI “cares” for humans.
Disclaimers: I’m not saying that “we can raise AI like a child to make it friendly” or that “people are aligned to evolution”. I find both of these claims to be obvious errors. Also, I will write a lot about evolution as if it were some agentic entity that “will do this or that”, not because I think it’s agentic, but because it’s easier to write this way. I think that GPT-4 has some form of world model, and I will refer to this a couple of times.
Nature's Example of a "Caring Drive"
Certain animals, notably humans, display a strong urge to care for their offspring.
I think that part of one of the possible “alignment solutions” will look like the right set of training data + training loss that allows gradient descent to robustly find something like a ”caring drive” that we can then study, recreate and repurpose for ourselves. And I think we already have some rare examples of this in nature. Some animals, especially humans, will kind-of-align themselves to their presumed offspring. They will want to make their offspring’s life easier and better, to the best of their capabilities and knowledge. Not because they are “aligned to evolution” and want to increase the frequency of their genes, but because of some strange internal drive created by evolution.
A set of triggers tuned by evolution and activated by events associated with birth will awaken the mechanism. It will re-aim the more powerful mother agent to be aligned to the less powerful baby agent, and it just so happens that their babies will give them the right cues and will be nearby when the mechanism does its work.
We will call the more powerful initial agent that changes its behavior and tries to protect and help its offspring the “mother”, and the less powerful and helpless agent the “baby”. Of course the mechanism isn’t ideal, but it works well enough, even in the modern world, far outside the initial evolutionary environment. And I’m not talking about humans only: stray urban animals that live in our cities will still adapt their “caring procedures” to this completely new environment, without several rounds of evolutionary pressure. If we can understand how to make this mechanism for something like a “cat-level” AI, by finding it via gradient descent and then rebuilding it from scratch, maybe we will gain some insights into the bigger problem.
The rare and complex nature of the caring drive in contrast to simpler drives like hunger or sleep.
What do I mean by “caring drive”? Animals, including humans, have a lot of competing motivations, “want drives”: they want to eat, sleep, have sex, etc. It seems that the same applies to caring about babies. But it seems to be a much more complicated set of behaviors. You need to:
correctly identify your baby, track its position, protect it from outside dangers, protect it from itself by predicting its actions in advance to stop it from certain injury, and try to understand its needs to correctly fulfill them, since you don’t have direct access to its internal thoughts, etc.
Compared to “wanting to sleep if active too long” or “wanting to eat when blood sugar level is low” I would confidently say that it’s a much more complex “wanting drive”. And you have no idea about the “spreading the genes” part. You just “want a lot of good things to happen” to your baby for some strange reason. I’m not yet sure, but this complex nature could be the reason why there is an attraction basin for a more “general” and “robust” solution. Just like an LLM will find some general form of the “addition” algorithm instead of trying to memorize the bunch of examples seen so far, especially if it will not see them again too often. I think that instead of hardcoding a bunch of brittle optimized caring procedures, evolution repeatedly finds a way to make mothers “love” their babies, outsourcing a ton of work to them, especially when the situations where it’s needed aren’t too similar.
And all of it is a consequence of a blind hill-climbing algorithm. That’s why I think we might have a chance of recreating something similar with gradient descent. The trick is to find the right conditions that will repeatedly allow gradient descent to find the same caring-drive structure, find similarities, understand the mechanism, recreate it from scratch to avoid hidden internal motivations, repurpose it for humans, and we are done! Sounds easy (it’s not).
Characteristics and Challenges of a Caring Drive
It’s rare: most animals don’t care, because they can’t or don’t need to.
A lot of the time it’s much more efficient to just make more babies, but sometimes animals must provide some care, simply because that was the path evolution found that works. And even if they do care about some of their offspring, they may choose one and let the others die, again because they don’t have a lot of resources to spare, and evolution will tune this mechanism to favor the most promising offspring if that is more efficient. And not all animals could become such caring parents: you can’t really care for and protect something else if you are too dumb, for example. So there are also some capability requirements for animals to even have a chance of obtaining such an adaptation. I expect the same capability requirements for AI systems. If we want to recreate it, we will need to try it with some fairly advanced systems; otherwise I don’t see how it might work at all.
It’s not extremely robust: give it enough brain damage or the wrong tuning and the mechanism will malfunction severely
Which is obvious: there is nothing surprising in “if you damage it, it could break”; this applies to any solution to some degree. It shouldn’t be surprising that drug-abusing or severely ill parents will often fail to care about their child at all. However, if we succeed at building an aligned AGI stable enough for some initial takeoff period, then at some point the problem of protecting it from damage should no longer be ours to worry about. But we still need to ensure initial stability.
It’s not ideal from our AI -> humans view
Evolution has naturally tuned this mechanism for optimal resource allocation, which sometimes means shutting down care when resources need to be diverted elsewhere. Evolution is ruthless because of limited resources, and will eradicate not only genetic lines that care too little, but also the ones that care too much. We obviously don’t need that part. And a lot of the time you can just give up on your baby and try to make a new one instead, if the situation is too dire, which we also don’t want to happen to us. This means that we need to understand how the mechanism works, to be able to construct it in the way we want.
But it’s also surprisingly robust!
Of course, there are exceptions: all people are different, and we can’t afford to clone some “proven to be a loving mother” woman hundreds of times to see if the underlying mechanism triggers reliably in all environments. But it seems to work in general, and more than that: it continues to work reliably even with our current technologies, in our crazy world, far away from the initial evolutionary environment. And we didn’t have to live through waves of birth declines and rises as evolution adapted us to the new realities, tuning the brains of new generations of mothers to find the ones that would start to care about their babies in the new agricultural or industrial or information era.
Is this another “anthropomorphizing trap”?
For all I know, it is possible to imagine an alternative human civilization without any parental care, in which case our closest candidate for such behavior would be some other intelligent species: intelligent enough to be able to care in theory, and forced by their weak bodies to do so in order to have any descendants at all. Maybe it could be some mammals, birds, or whatever; it doesn’t matter. The point I’m making here is that I don’t think it is some anthropic trap to search for inspiration or hints in our own behavior. It just so happens that we are smart but have weak babies that require a lot of attention, so we received this mechanism from evolution as the “simplest solution”. You don’t need to search for more compact brains that would allow for longer pregnancy, or hardwire even more knowledge into infants’ brains, if you can outsource a lot of the work to the smart parents: you just need to add the “caring drive” and it will work fine. We want AI to care about us not because we care about our children and want the same from AI; we just don’t want to die, and we would want AI to care about us even if we ourselves lacked this ability.
Potential flaws:
I’m not saying that it’s a go-to solution that we can just copy, but it is a step in the right direction, in my view. Replicating similar behavior and studying its parts could be a promising direction. There are a few things that might make this whole approach useless, for example:
Overall I’m pretty sure that this does in fact work, certainly well enough to be a viable research direction.
Technical realization: how do we actually make this happen?
I have no concrete plan. I have a few ideas, but I’m not sure how practically feasible they are. And since nobody knows how this mechanism works, as far as I know, it’s hard to imagine having a concrete blueprint for creating one. So the best I can offer is: we try to create something that looks right from the outside and then see if there is anything interesting on the inside. I also have some ideas about which paths could or couldn’t lead to interesting insides.
First of all, I think this “caring drive” couldn’t run without some internal world model. Something like: it’s hard to imagine far-generalized goals without some far-generalized capabilities. And a world model could be obtained from a highly diverse, non-repetitive dataset, which forces the model to actually “understand” something and stop memorizing.
Maybe you can set up an environment with multiple agents, similar to DeepMind’s here (https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play), initially reward an agent for surviving by itself, and then introduce a new type of task: a baby-agent that appears near the original mother-agent (we will call this “birth”), and from that point in time the whole reward comes purely from how long the baby survives. The baby would initially have fewer mechanical capabilities, like speed, health, etc., and then “grow” to become more capable by itself. I’m not sure what the “brain” of the baby-agent should be: another NN, or maybe the same one found by training the mother agent? Maybe create a chain of agents: agent 1 at some point gives birth to agent 2 and receives reward for each tick that agent 2 is alive, agent 2 itself gives birth to agent 3 and receives reward for each tick that agent 3 is alive, and so on. Maybe it will produce something interesting? Obviously the “alive time” is a proxy, and given enough optimization power we should expect Goodhart horrors beyond our comprehension. But the idea is that maybe there is some “simple” solution that will be found first, which we can study. Recreating the product of evolution without using its immense “computational power” could be very tricky. A rough sketch of the reward structure is given below.
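To make the reward structure concrete, here is a minimal toy sketch in Python of what such an environment could look like. Everything here is hypothetical and simplified (the class name `ToyCaringEnv`, the hazard model and the growth rule are placeholders I made up, not DeepMind’s environment); the only essential part is that the mother’s entire reward is “+1 per tick the baby is alive”, so any caring-like behavior has to be discovered rather than hardcoded:

```python
import random
from dataclasses import dataclass

@dataclass
class Agent:
    x: float
    y: float
    health: float
    speed: float

class ToyCaringEnv:
    """Hypothetical sketch: the mother is rewarded only for the baby's survival time."""

    def __init__(self, birth_tick=100, max_ticks=1000):
        self.birth_tick = birth_tick   # when the "birth" event happens
        self.max_ticks = max_ticks
        self.reset()

    def reset(self):
        self.t = 0
        self.mother = Agent(x=0.0, y=0.0, health=1.0, speed=1.0)
        self.baby = None               # appears later, at the birth event
        return self._observation()

    def step(self, mother_action):
        self.t += 1
        self._move(self.mother, mother_action)

        # "Birth": a weaker, slower baby-agent spawns next to the mother.
        if self.t == self.birth_tick:
            self.baby = Agent(x=self.mother.x, y=self.mother.y,
                              health=0.5, speed=0.3)

        reward = 0.0
        if self.baby is not None and self.baby.health > 0:
            self._hazards(self.baby)                             # dangers the mother can mitigate
            self.baby.speed = min(1.0, self.baby.speed + 0.001)  # baby slowly "grows"
            reward = 1.0   # the ONLY reward: one point per tick the baby stays alive

        done = self.t >= self.max_ticks or (
            self.baby is not None and self.baby.health <= 0)
        return self._observation(), reward, done

    def _move(self, agent, action):
        dx, dy = action
        agent.x += dx * agent.speed
        agent.y += dy * agent.speed

    def _hazards(self, agent):
        # Placeholder hazard: random damage, smaller if the mother stays close.
        near_mother = abs(agent.x - self.mother.x) + abs(agent.y - self.mother.y) < 2.0
        agent.health -= random.uniform(0.0, 0.01 if near_mother else 0.05)

    def _observation(self):
        return {"mother": self.mother, "baby": self.baby, "t": self.t}
```

The “chain of agents” variant would just repeat the birth event, re-targeting the reward to the newest descendant each time.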
But if it seems to work, and the mother-agent behaves in a seemingly “caring” way, then we can try to apply interpretability tools, change the original environment drastically to see how far the behavior generalizes, break something and see how well it still works, or manually override some parameters and study the change. However, I’m not qualified to make this happen on my own, so if you find this idea interesting, contact me; maybe we can do this project together.
What might a good result look like?
Let’s imagine that we’ve got some agent that can behave with care toward the right type of “babies”. For some yet-unknown reason, from the outside it behaves as if it cares about its baby-agent, and it finds creative ways to do so in new contexts. Now the actual work begins: we need to understand where the parts that make this possible are located, what the underlying mechanism is, which parts are crucial and what happens when you break them, how we can re-write the “baby” target so that our agent will care about different baby-agents, and under what conditions gradient descent will find an automatic off-switch (I expect this to be related to the chance of obtaining another baby; given only one baby per “life”, gradient descent will never find the switch, since it would have no use). Then we can actually start to think about recreating it from scratch. Just like what people did with modular addition: https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking . Except this time we don’t know how the algorithm might work or what it looks like. But the “intentions”, “motivations” and “goals” of potential AI systems are not magic; we should be able to recreate and reverse-engineer them.
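As a rough illustration of the kind of follow-up experiments described above, here is a hedged sketch of what the test harness could look like, assuming the toy environment from the previous section. The names `policy`, `env_variants`, `make_env` and `patch_activations` are placeholders for whatever training and interpretability tooling would actually be used:

```python
import copy

def rollout(env, policy, max_ticks=1000):
    """Run one episode and return how many ticks the baby survived."""
    obs = env.reset()
    survived = 0
    for _ in range(max_ticks):
        obs, reward, done = env.step(policy(obs))
        survived += int(reward > 0)
        if done:
            break
    return survived

def generalization_probe(base_env, policy, env_variants):
    """Does the apparent 'caring' transfer to drastically changed environments?"""
    baseline = rollout(copy.deepcopy(base_env), policy)
    return {name: rollout(env, policy) / max(baseline, 1)
            for name, env in env_variants.items()}

def ablation_probe(policy, components, make_env, patch_activations):
    """Break candidate internal components one at a time; a large drop in baby
    survival time marks the parts that are crucial for the caring behavior."""
    results = {}
    for name in components:
        patched_policy = patch_activations(policy, zero_out=name)  # hypothetical hook
        results[name] = rollout(make_env(), patched_policy)
    return results
```

The same harness could be pointed at the “re-write the baby target” question: swap which agent counts as the baby in the observation and check whether the caring behavior follows the new target.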