I agree with most of the factual claims made in this post about evolution. I agree that "IGF is the objective" is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of "IGF is the objective" play out, I am confused about why you believe this implies the things you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.
I don't really care about defending the usage of "fitness as the objective" specifically, and so I don't think the following is a crux and am happy to concede some of the points below for the sake of argument about the object-level facts of inner alignment. However, for completeness, my take on when "fitness" can be reasonably described as the objective, and when it can't:
Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important considerations.
I notice I'm having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there's a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or if I am.
I'll think about it some more and get back to you.
So I think the issue is that when we discuss what I'd call the "standard argument from evolution", you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn't managed to explicitly distinguish them.
The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:
I agree with this form of the argument and have no objections to it. I don't think that the points in my post are particularly relevant to that claim. (I've even discussed a form of inner optimization in humans that causes value drift that I don't recall anyone else discussing in those terms before.)
However, I think that many formulations are actually implying, if not outright stating, a stronger claim:
So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don't yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.
(Possibly you don't feel that anyone is actually implying the stronger version. If you don't and you would already disagree with the stronger version, then great! We are in agreement. I don't think it matters whether the implication "really is there" in some objective sense, or even whether the original authors intended it or not. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is then primarily aimed at the people who did read the strong version to be there and thought it made sense.)
You wrote:
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans have an art-creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives because, in that environment, they reliably caused something that was being selected for; in the modern environment the same drives persist, but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now change, potentially drastically.
If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version - where our behavior is implied to be completely at odds with our original behavior - has to implicitly assume that things like an art-creation drive are something novel.
Now I don't think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it's just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates and lots of populations at below replacement rates, which you could argue was a bigger shift than being "somewhat out of distribution", but... to me that intuitively feels like it's less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since we've only been in the time of falling birthrates for a relatively short time and it's not clear whether it'll continue for very long.)
I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by "evolution doesn't consistently optimize for any single thing" just as well as they could be explained by "we've experienced a left turn from what evolution originally optimized for".
However, it is pretty analogous to RL, and especially multi-agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that use tactic X because it's a good tactic against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic, it is no longer rewarded.
I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go, and it's just that the exact strategies it needs to use differ. That seems different than e.g. the bacteria example, where bacteria are selected for exactly the opposite traits - either selected for producing a toxin and an antidote, or selected for not producing a toxin and an antidote. That seems to me more analogous to a situation where AlphaGo is initially being rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.
And I don't think that that kind of a situation is even particularly rare - anything that consumes energy (be it a physical process such as producing venom or fur, or a behavior such as enjoying exercise) is subject to that kind of an "either/or" choice.
Now you could say that "just like AlphaGo is still rewarded for winning games of Go and it's just the strategies that differ, the organism is still rewarded for reproducing and it's just the strategies that differ". But I think the difference is that for AlphaGo, the rewards are consistently shaping its "mind" towards having a particular optimization goal - one where the board is in a winning state for it.
And one key premise on which the "standard argument from evolution" rests is that evolution has not consistently shaped the human mind in such a direct manner. It's not that we have been created with "I want to have surviving offspring" as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies to get there by explicit (or implicit) reasoning. Rather, we have been given various motivations that exhibit varying degrees of directness in how useful they are for that goal - from "I want to be in a state where I produce great art" (quite indirect) to "I want to have surviving offspring" (direct), with the direct goal competing with all the indirect ones for priority. This is unlike AlphaGo, which has had both the cognitive capacity for direct optimization toward its goal and that goal as its sole reward criterion all along.
This is also a bit hard to put a finger on, but I feel like there's some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn't start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned... even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!
I think the main crux is that in my mind, the thing you call the "weak version" of the argument simply is the only and sufficient argument for inner misalignment and very sharp left turn. I am confused precisely what distinction you draw between the weak and strong version of the argument; the rest of this comment is an attempt to figure that out.
My understanding is that in your view, having the same drive as before means also having similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this means drawing on cave walls (maybe this helped communicate the whereabouts of food in the ancestral environment). In the modern environment, this may mean passing up a more lucrative job opportunity to be an artist, but it still means painting on some other surface. Thus, the art drive, taking almost the same kinds of actions it ever did (maybe we use acrylic paints from the store instead of grinding plants into dyes ourselves), no longer results in the same consequences in amount of communicating food locations or surviving and having children or whatever it may be. But this is distinct from a sharp left turn, where the actions also change drastically (from helping humans to killing humans).
I agree this is more true for some drives. However, I claim that the association between drives and behaviors is not true in general. I claim humans have a spectrum of different kinds of drives, which differ in how specifically the drive specifies behavior. At one end of the spectrum, you can imagine stuff like breathing or blinking where it's kind of hard to even say whether we have a "breathing goal" or a clock that makes you breathe regularly--the goal is the behavior, in the same way a cup has the "goal" of holding water. At this end of the spectrum it is valid to use goal/drive and behavior interchangeably. At the other end of the spectrum are goals/drives which are very abstract and specify almost nothing about how you get there: drives like desire for knowledge and justice and altruism and fear of death.
The key thing that makes these more abstract drives special is that because they do not specifically prescribe actions, the behaviors are produced by the humans reasoning about how to achieve the drive, as opposed to behaviors being selected for by evolution directly. This means that a desire for knowledge can lead to reading books, or launching rockets, or doing crazy abstract math, or inventing Anki, or developing epistemology, or trying to build AGI, etc. None of these were specifically behaviors that evolution could have reinforced in us--the behaviors available in the ancestral environment were things like "try all the plants to see which ones are edible". Evolution reinforced the abstract drive for knowledge, and left it up to individual human brains to figure out what to do, using the various Lego pieces of cognition that evolution built for us.
This means that the more abstract drives can actually suddenly just prescribe really different actions when important facts in the world change, and those actions will look very different from the kinds of actions previously taken. To take a non-standard example, for the entire history of the existence of humanity up until quite recently, it just simply has not been feasible for anyone to contribute meaningfully to eradicating entire diseases (indeed, for most of human history there was no understanding of how diseases actually worked, and people often just attributed it to punishment of the gods or otherwise found some way to live with it, and sometimes, as a coping mechanism, to even think the existence of disease and death necessary or net good). From the outside it may appear as if for the entire history of humanity there was no drive for disease eradication, and then suddenly in the blink of an evolutionary timescale eye a bunch of humans developed a disease eradication drive out of nowhere, and then soon thereafter suddenly smallpox stopped existing (and soon potentially malaria and polio). These will have involved lots of novel (on evolutionary timescale) behaviors like understanding and manufacturing microscopic biological things at scale, or setting up international bodies for coordination. In actuality, this was driven by the same kinds of abstract drives that have always existed like curiosity and fear of death and altruism, not some new drive that popped into being, but it involved lots of very novel actions steering towards a very difficult target.
I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals (and there could be multiple). I think there may be a general communication issue where there is a type of person that likes to boil problems down to their core, which is usually some very simple setup, but then neglects to actually communicate why they believe this particular abstraction captures the thing that matters.
I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment. (And winning states of the board always looking like having more territory encircled seems analogous to surviving and reproducing always looking like having a lot of children.)
I think there is also a disagreement about what AlphaGo does, though this is hard to resolve without better interpretability -- I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go. In fact, the biggest gripe I have with most empirical alignment research is that I think models today fail to have sufficiently abstract drives, quite possibly for reasons related to why they are kind of dumb today and why things like AutoGPT mysteriously have failed to do anything useful whatsoever. But this is a spicy claim and I think not that many other people would endorse this.
I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals
I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding to were ones like the following:
The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)
15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. (AGI Ruin: A List of Lethalities)
Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.
To respond to your argument itself:
I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?
I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.
At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.
It's true that humans couldn't eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so... then the people of that time would not understand all of the technical details, and maybe they'd wonder why we'd bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don't think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they'd lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.
Similarly, humans have always gone on voyages of exploration - e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages - so they'd probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) in circulation by the late 12th/early 13th century.
So I agree that the same fundamental values and drives can create very different behavior in different contexts... but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.
This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives - being a successful warrior was the way you got what you wanted back in that day - and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying "deep values" are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person's innate temperament that may be more hardcoded.)
I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn't bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get... then he might come to see our society as not that horrible after all.
I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment.
I don't think the actual victory states look substantially different? They're all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.
I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.
Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).
I agree that "IGF is the objective" is somewhat sloppy shorthand.
It’s used a lot in the comment sections. Do you know a better refutation than this post?
This is a great post! Thank you for writing it.
There's a huge amount of ontological confusion about how to think of "objectives" for optimization processes. I think people tend to take an inappropriate intentional stance and treat something like "deliberately steering towards certain abstract notions" as a simple primitive (because it feels introspectively simple to them). This background assumption casts a shadow over all future analysis, since people try to abstract the dynamics of optimization processes in terms of their "true objectives", when there really isn't any such thing.
Optimization processes (or at least, evolution and RL) are better thought of in terms of what sorts of behavioral patterns were actually selected for in the history of the process. E.g., @Kaj_Sotala's point here about tracking the effects of evolution by thinking about what sorts of specific adaptations were actually historically selected for, rather than thinking about some abstract notion of inclusive genetic fitness, and how the difference between modern and ancestral humans seems much smaller from this perspective.
I want to make a similar point about reward in the context of RL: reward is a measure of update strength, not the selection target. We can see as much by just looking at the update equations for REINFORCE (from page 328 of Reinforcement Learning: An Introduction):
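For reference, here is a sketch of the standard discounted REINFORCE update (reconstructed from the usual textbook form, so an approximation rather than a verbatim copy of the cited page):

```latex
% Sketch of the (discounted) REINFORCE update, reconstructed from the standard
% textbook form; an approximation of what the cited page shows.
\[
\theta_{t+1} \;=\; \theta_t \;+\; \alpha\, \gamma^{t}\, G_t\, \nabla_{\theta} \ln \pi\!\left(A_t \mid S_t, \theta_t\right)
\]
% G_t is the return following time step t, alpha is the step size (learning
% rate), and pi is the parameterized policy.
```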
The reward[1] is literally a (per step) multiplier of the learning rate. You can also think of it as providing the weights of a linear combination of the parameter gradients, which means that it's the historical action trajectories that determine what subspaces of the parameters can potentially be explored. And due to the high correlations between gradients (at least compared to the full volume of parameter space), this means it's the action trajectories, and not the reward function, that provides most of the information relevant for the NN's learning process.
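To make the learning-rate-multiplier framing concrete, here is a minimal toy sketch (my own illustration, with a made-up two-action softmax policy and made-up numbers, not code from the book or from any cited paper) in which the sampled return enters the update only as a scalar factor on the log-policy gradient:

```python
import numpy as np

def softmax(logits):
    """Convert logits to action probabilities."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, action):
    """Gradient of log pi(action) for a softmax policy with logits theta."""
    probs = softmax(theta)
    one_hot = np.zeros_like(theta)
    one_hot[action] = 1.0
    return one_hot - probs

theta = np.zeros(2)   # policy parameters (two actions)
alpha = 0.1           # learning rate

# One sampled (action, return) pair from a trajectory (hypothetical numbers).
action, G = 1, 5.0

# REINFORCE-style step: the return G only rescales a step whose direction is
# fixed entirely by the sampled action, i.e. by the historical trajectory.
theta = theta + alpha * G * grad_log_pi(theta, action)
print(theta)  # [-0.25  0.25] for these numbers
```

Setting G to zero or negating it changes only the length and sign of that step; the line along which the parameters move is still fixed by the sampled action, which is the sense in which the action trajectories, rather than the reward labels, determine what part of parameter space gets explored.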
From Survival Instinct in Offline Reinforcement Learning:
on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.
I expect some people to object that the point of the evolutionary analogy is precisely to show that the high-level abstract objective of the optimization process isn't incorporated into the goals of the optimized product, and that this is a reason for concern because it suggests an unpredictable/uncontrollable mapping between outer and inner optimization objectives.
My point here is that, if you want to judge an optimization process's predictability/controllability, you should not be comparing some abstract notion of the process's "true outer objective" to the result's "true inner objective". Instead, you should consider the historical trajectory of how the optimization process actually adjusted the behaviors of the thing being optimized, and consider how predictable that thing's future behaviors are, given past behaviors / updates.
@Kaj_Sotala argues above that this perspective implies greater consistency in human goals between the ancestral and modern environments, since the goals evolution actually historically selected for in the ancestral environment are ~the same goals humans pursue in the modern environment.
For RL agents, I am also arguing that thinking in terms of the historical action trajectories that were actually reinforced during training implies greater consistency, as compared to thinking of things in terms of some "true goal" of the training process. E.g., Goal Misgeneralization in Deep Reinforcement Learning trained a mouse to navigate to cheese that was always placed in the upper right corner of the maze and found that it would continue going to the upper right even when the cheese was moved.
This is actually a high degree of consistency from the perspective of the historical action trajectories. During training, the mouse continually executed the action trajectories that navigated it to the upper right of the board, and continued to do the exact same thing in the modified testing environment.
Technically it's the future return in this formulation, and current SOTA RL algorithms can be different / more complex, but I think this perspective is still a more accurate intuition pump than notions of "reward as objective", even for setups where "reward as a learning rate multiplier" isn't literally true.
I think this is really lucid and helpful:
I expect some people to object that the point of the evolutionary analogy is precisely to show that the high-level abstract objective of the optimization process isn't incorporated into the goals of the optimized product, and that this is a reason for concern because it suggests an unpredictable/uncontrollable mapping between outer and inner optimization objectives.
My point here is that, if you want to judge an optimization process's predictability/controllability, you should not be comparing some abstract notion of the process's "true outer objective" to the result's "true inner objective". Instead, you should consider the historical trajectory of how the optimization process actually adjusted the behaviors of the thing being optimized, and consider how predictable that thing's future behaviors are, given past behaviors / updates.
@Kaj_Sotala argues above that this perspective implies greater consistency in human goals between the ancestral and modern environments, since the goals evolution actually historically selected for in the ancestral environment are ~the same goals humans pursue in the modern environment.
This seems to be making the same sort of deepity that Turntrout is making in his 'reward is not the optimization target': it takes a minor point about model-free RL approaches not necessarily building any explicit optimization/planning for reward into their policy, which people then misunderstand because it ducks the major issue, while handwaving a lot of points. (Especially bad: infanticide is not a substitute for contraception because pregnancy is outrageously fatal and metabolically expensive, which is precisely why the introduction of contraception has huge effects everywhere it happens and why hunter-foragers have so many kids while contemporary women have fewer than they want to. Infanticide is just about the worst possible form of contraception short of the woman dying. I trust you would not argue that 'suicide is just as effective a contraceptive as infanticide or condoms' using the same logic - after all, if the mother is dead, then there's definitely no more kids...)
In particular, this fundamentally does not answer the challenge I posed earlier by pointing to instances of sperm bank donors who quite routinely rack up hundreds of offspring, while being in no way special other than having a highly-atypical urge to have lots of offspring. You can check this out very easily in seconds and verify that you could do the same thing with less effort than you've probably put into some video games. And yet, you continue to read this comment. Here, look, you're still reading it. Seconds are ticking away while you continue to forfeit (I will be generous and pretend that a LWer is likely to have median number of kids) much more than 10,000% more fitness at next to no cost of any kind. And you know this because you are a model-based RL agent who can plan and predict the consequences of actions based solely on observations (like of text comments) without any additional rewards, you don't have to wait for model-free mechanisms like evolution to slowly update your policy over countless rewards. You are perfectly able to predict that if the status quo lasted for enough millennia, this would stop being true; men would gradually be born with a baby-lust, and would flock to sperm donation banks (assuming such things even still existed under the escalating pressure); you know what the process of evolution would do and is doing right now very slowly, and yet, using your evolution-given brain, you still refuse to reap the fitness rewards of hundreds of offspring right now, in your generation, with yourself, for your genes. How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)? Certainly if AGI were as well-aligned with human values as we are with inclusive fitness, that doesn't seem to bode very well for how human values will be fulfilled over time as the AGI-environment changes ever more rapidly & at scale - I don't know what the 'masturbation, porn, or condom of human values' is, and I'd rather not find out empirically how diabolically clever reward hacks can be when found by superhuman optimization processes at scale targeting the original human values process...
This seems to entirely ignore the actual point that is being made in the post. The point is that "IGF" is not a stable and contentful loss function, it is a misleadingly simple shorthand for "whatever traits are increasing their own frequency at the moment." Once you see this, you notice two things:
The main problem I have with this type of reasoning is the arbitrarily drawn ontological boundaries. Why is IGF "not real" while the ML objective function is "real", when, if we really zoom in on the training process, the verifiable-in-a-brutally-positivist-way real training goal is "whatever direction in coefficient space the loss function decreases on the current batch of data", which seems to me to correspond pretty well to "whatever traits are spreading in the current environment"?
infanticide is not a substitute for contraception
I did not mean to say that they would be exactly equivalent nor that infanticide would be without significant downsides.
How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?
"Inner optimizers diverging from the optimization process's reward function" sounds to me like humans were already donating to sperm banks in the EEA, only for an inner optimizer to wreak havoc and sidetrack us from that. I assume you mean something different, since under that interpretation of what you mean the answer would be obvious - that we don't need to invoke inner optimizers because there were no sperm banks in the EEA, so "that's not the kind of behavior that evolution selected for" is a sufficient explanation.
The "why aren't men all donating to sperm banks" argument assumes that 1.) evolution is optimizing for some simple reducible individual level IGF objective, and 2.) that anything less than max individual score on that objective over most individuals is failure.
No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect - and by that measure human brain evolution was an enormous unprecedented success according to evolutionary fitness.
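As a minimal sketch of this dot-product picture (notation mine, added for illustration; the original describes it only in words):

```latex
% v_AI: the world-trajectory the agent actually produces
% v_human: the creators' desired trajectory
% theta: the "misalignment angle" between them
\[
U \;\approx\; \langle v_{\mathrm{AI}},\, v_{\mathrm{human}} \rangle
  \;=\; \lVert v_{\mathrm{AI}} \rVert \, \lVert v_{\mathrm{human}} \rVert \cos\theta
\]
% A more powerful optimizer lengthens v_AI, which scales up the net effect of
% whatever fixed angle theta there is, for better or worse.
```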
Evolution is a population optimization algorithm that explores a solution landscape via a huge number N of samples in parallel, where individuals are the samples. Successful species with rapidly growing populations will naturally experience growth in variance/variation (ala adaptive radiation) as the population grows. Evolution only proceeds by running many many experiments, most of which must be failures in a strict scoring sense - that's just how it works.
Using even the median sample's fitness would be like faulting SGD for every possible sample of the weights at any point during a training process. For SGD all that matters is the final sample, and likewise all that 'matters' for evolution is the tiny subset of most future fit individuals (which dominate the future distribution). To the extent we are/will use evolutionary algorithms for AGI design, we also select only the best samples to scale up, so only the alignment of the best samples is relevant for similar reasons.
So if we are using individual human samples as our point of analogy comparison, the humans that matter for comparing the relative success of evolution at brain alignment are the most successful: modern sperm donors, Genghis Khan, etc. Evolution has maintained a sufficiently large sub population of humans who do explicitly optimize for IGF even in the modern environment (to the extent that makes sense translated into their ontology), so it's doing very well in that regard (and indeed it always needs to maintain a large diverse high variance population distribution to enable quick adaptation to environmental changes).
We aren't even remotely close to stressing brain alignment to IGF. Most importantly, we don't observe species going extinct because they evolved general intelligence, experienced a sharp left turn, and then died out due to declining populations. But the sharp left turn argument does predict that, so it's mostly wrong.
No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect - and by that measure human brain evolution was an enormous unprecedented success according to evolutionary fitness.
The vector dot product model seems importantly false, for basically the reason sketched out in this comment; optimizing a misaligned proxy isn't about taking a small delta and magnifying it, but about transitioning to an entirely different policy regime (vector space) where the angle between our proxy and our true alignment target is much, much larger (effectively no different from that of any other randomly selected pair of vectors in the new space).
(You could argue humans haven't fully made that phase transition yet, and I would have some sympathy for that argument. But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven't actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.)
The vector dot product model seems importantly false, for basically the reason sketched out in this comment;
Notice I replied to that comment you linked and agreed with John - not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong because it doesn't weight by expected probability (i.e. it uses an incorrect distance function).
Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact; my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.
The general point being that degree of misalignment is only relevant to the extent it translates into a difference in net utility.
You could argue humans haven't fully made that phase transition yet, and I would have some sympathy for that argument.
From the perspective of evolutionary fitness, humanity is the ultimate runaway success - AFAIK we are possibly the species with the fastest growth in fitness ever in the history of life. This completely overrides any and all arguments about possible misalignment, because any such misalignment is essentially epsilon in comparison to the fitness gain brains provided.
For AGI, there is a singular correct notion of misalignment which actually matters: how does the creation of AGI - as an action - translate into differential utility, according to the utility function of its creators? If AGI is aligned to humanity about the same as brains are aligned to evolution, then AGI will result in an unimaginable increase in differential utility which vastly exceeds any slight misalignment.
You can speculate all you want about the future and how brains may become misaligned in the future, but that is just speculation.
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form, as evidence in the historical record:
We aren't even remotely close to stressing brain alignment to IGF. Most importantly, we don't observe species going extinct because they evolved general intelligence, experienced a sharp left turn, and then died out due to declining populations. But the sharp left turn argument does predict that, so it's mostly wrong.
Notice I replied to that comment you linked and agreed with John - not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong because it doesn't weight by expected probability (i.e. it uses an incorrect distance function).
Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact; my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.
The general point being that degree of misalignment is only relevant to the extent it translates into a difference in net utility.
Sure, but if you need a complicated distance metric to describe your space, that makes it correspondingly harder to actually describe utility functions corresponding to vectors within that space which are "close" under that metric.
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form, as evidence in the historical record
Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment; does that thereby mean that no misspecification has occurred?
And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail to correspond even approximately to IGF, as I did w.r.t. uploading?
But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven't actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.
It seems to me that this suffices to establish that the primary barrier against such a breakdown in correspondence is that of insufficient capabilities—which is somewhat the point!
If you actually believe the sharp left turn argument holds water, where is the evidence? As I said earlier, this evidence must take a specific form, as evidence in the historical record
Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment;
Given any practical and reasonably aligned agent, there is always some set of conceivable OOD environments where that agent fails. Who cares? There is a single success criterion: utility in the real world! The success criterion is not "is this design perfectly aligned according to my adversarial pedantic critique".
The sharp left turn argument uses the analogy of brain evolution misaligned to IGF to suggest/argue for doom from misaligned AGI. But brains enormously increased human fitness rather than the predicted decrease, so the argument fails.
In worlds where 1. alignment is very difficult, and 2. misalignment leads to doom (low utility) this would naturally translate into a great filter around intelligence - which we do not observe in the historical record. Evolution succeeded at brain alignment on the first try.
And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail
I think this entire line of thinking is wrong - you have little idea what environmental changes are plausible and next to no idea of how brains would adapt.
On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.
When you move the discussion to speculative future technology to support the argument from a historical analogy - you have conceded that the historical analogy does not support your intended conclusion (and indeed it can not, because homo sapiens is an enormous alignment success).
It sounds like you're arguing that uploading is impossible, and (more generally) have defined the idea of "sufficiently OOD environments" out of existence. That doesn't seem like valid thinking to me.
Of course I'm not arguing that uploading is impossible, and obviously there are always hypothetical "sufficiently OOD environments". But from the historical record so far we can only conclude that evolution's alignment of brains was robust enough compared to the environment distribution shift encountered - so far. Naturally that could all change in the future, given enough time, but piling in such future predictions is clearly out of scope for an argument from historical analogy.
These are just extremely different:
It's like I'm arguing that given that we observed the sequence 0,1,3,7 the pattern is probably 2^N-1, and you're arguing that it isn't because you predict the next term is 31.
Regardless, uploads are arguably sufficiently categorically different that it's questionable how they even relate to the evolutionary success of homo sapiens brain alignment to genetic fitness (do sims of humans count for genetic fitness? but only if DNA is modeled in some fashion? to what level of approximation? etc.)
How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?
I think it's inappropriate to use technical terms like "reward function" in the context of evolution, because evolution's selection criteria serve vastly different mechanistic functions from eg a reward function in PPO.[1] Calling them both a "reward function" makes it harder to think precisely about the similarities and differences between AI RL and evolution, while invalidly making the two processes seem more similar. That is something which must be argued for, and not implied through terminology.
And yes, I wish that "reward function" weren't also used for "the quantity which an exhaustive search RL agent argmaxes." That's bad too.
Yeah.
The fact that we don't have standard mechanistic models of optimization via selection (which is what evolution and moral mazes and inadequate equilibria and multipolar traps essentially are) is likely a fundamental source of confusion when trying to get people on the same page about the dangers of optimization and how relevant evolution is, as an analogy.
You can check this out very easily in seconds and verify that you could do the same thing with less effort than you've probably put into some video games.
Indeed. Donating sperm over the Internet costs approximately $125 per donation (most of which is Fedex overnight shipping costs, and often the recipient will cover these) and has about a 10% pregnancy success rate per cycle.
See: https://www.irvinesci.com/refrigeration-medium-tyb-with-gentamicin.html
and https://www.justababy.com/
I agree that humans are not aligned with inclusive genetic fitness, but I think you could look at evolution as a bunch of different optimizers at any small stretch in time and not just a single optimizer. If not getting killed by spiders is necessary for IGF, for example, then evolution could be thought of as both an optimizer for IGF and an optimizer for not getting killed by spiders. Some of these optimizers have created mesaoptimizers that resemble the original optimizer to a strong degree. Most people really care about their own biological children not dying, for example. I think that thinking about evolution as multiple optimizers makes it seem more likely that gradient descent is able to instill correct human values sometimes rather than never.
Pregnancy is certainly costly (and the abnormally high miscarriage rate appears to be an attempt to save on such costs in case anything has gone wrong), but it's not that fatal (for the mother). A German midwife recorded one maternal death out of 350 births.
I expect you to be making a correct and important point here, but I don't think I get it yet. I feel confused because I don't know what it would mean for this frame to make false predictions. I could say "Evolution selected me to have two eyeballs" and I go "Yep I have two eyeballs"? "Evolution selected for [trait with higher fitness]" and then "lots of people have trait of higher fitness" seems necessarily true?
I feel like I'm missing something.
Oh. Perhaps it's nontrivial that humans were selected to value a lot of stuff, and (different, modern) humans still value a lot of stuff, even in today's different environment? Is that the point?
Perhaps it's nontrivial that humans were selected to value a lot of stuff
I prefer the reverse story: humans are tools in the hands of the angiosperms, and they're still doing the job these plants selected them for: defending angiosperms at all costs. If a super-AI destroys 100% of humans along with 99% of life on Earth, they'll call that the seed phase and chill in the new empty environment they would have had us clean up for them.
Oh. Perhaps it's nontrivial that humans were selected to value a lot of stuff, and (different, modern) humans still value a lot of stuff, even in today's different environment? Is that the point?
Sort of, but I think it is more specific than that. As I point out in my AI pause essay:
An anthropologist looking at humans 100,000 years ago would not have said humans are aligned to evolution, or to making as many babies as possible. They would have said we have some fairly universal tendencies, like empathy, parenting instinct, and revenge. They might have predicted these values will persist across time and cultural change, because they’re produced by ingrained biological reward systems. And they would have been right.
I take this post to be mostly negative, in that it shows that "IGF" is not a unified loss function; its content is entirely dependent on the environmental context, in ways that ML loss functions are not.
As I point out in my AI pause essay:
A nitpick on something in there:
I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it’s clear that the benefits of doing so would significantly outweigh the costs.
I find it hard to grant something that would have made our response to pandemics or global warming even slower than it already is. By the same reasoning, we would not have the Montreal Protocol and UV levels would be a public concern.
Some people will end up valuing children more, for complicated reasons; other people will end up valuing other things more, again for complicated reasons.
Right, because somewhere pretty early in evolutionary history, people (or animals) which valued stuff other than having children for complicated reasons eventually had more descendants than those who didn't. Probably because wanting lots of stuff for complicated reasons (and getting it) is correlated with being smart and generally capable, which led to having more descendants in the long run.
If evolution had ever stumbled upon some kind of magical genetic mutation that resulted in individuals directly caring about their IGF (and improved or at least didn't damage their general reasoning abilities and other positive traits) it would have surely reached fixation rather quickly. I call such a mutation "magical" because it would be impossible (or at least extremely unlikely) to occur through the normal process of mutation and selection on Earth biology, even with billions of chances throughout history. Also, such a mutation would necessarily have to happen after the point at which minds that are even theoretically capable of understanding an abstract concept like IGF already exist.
But this seems more like a fact about the restricted design space and options available to natural selection on biological organisms, rather than a generalizable lesson about mind design processes.
I don't know what the exact right analogies between current AI design and evolution are. But generally capable agents with complex desires are a useful and probably highly instrumentally convergent solution to the problem of designing a mind that can solve really hard and general problems, whether the problem is cast as image classification, predicting text, or "caring about human values", and whether the design process involves iterative mutation over DNA or intelligent designers building artificial neural networks and training them via SGD.
To the degree that current DL-paradigm techniques for creating AI are analogous to some aspect of evolution, I think that is mainly evidence about whether such methods will eventually produce human-level general and intelligent minds at all.
I think this post somewhat misunderstands the positions that it summarizes and argues against, but to the degree that it isn't doing that, I think you should mostly update towards current methods not scaling to AGI (which just means capabilities researchers will try something else...), rather than updating towards current methods being safe or robust in the event that they do scale.
A semi-related point: humans are (evidently, through historical example or introspection) capable of various kinds of orthogonality and alignment failure. So if current AI training methods don't produce such failures, they are less likely to produce human-like (or possibly human-level capable) minds at all. "Evolution provides little or no evidence that current DL methods will scale to produce human-level AGI" is a stronger claim than I actually believe, but I think it is a more accurate summary of what some of the claims and evidence in this post (and others) actually imply.
If evolution had ever stumbled upon some kind of magical genetic mutation that resulted in individuals directly caring about their IGF (and improved or at least didn't damage their general reasoning abilities and other positive traits) it would have surely reached fixation rather quickly.
CRISPR gene drives reach fixation even faster, even if they seriously harm IGF.
Indeed, when you add an intelligent designer with the ability to precisely and globally edit genes, you've stepped outside the design space available to natural selection, and you can end up with some pretty weird results! I think you could also use gene drives to get an IGF-boosting gene to fixation much faster than would occur naturally.
I don't think gene drives are the kind of thing that would ever occur via iterative mutation, but you can certainly have genetic material with very high short-term IGF that eventually kills its host organism or causes extinction of its host species.
Some animal species are able to adopt contraception-like practices too. For example, birds of prey typically let some of their offspring die of hunger when prey is scarce.
Compare two pairs of statements: "evolution optimizes for IGF" and "evolution optimizes for near-random traits"; "we optimize for aligned models" and "we optimize for models which get good metrics on the training dataset".
I feel like a lot of your examples could be captured perfectly well by game theory, and thus can hardly be considered counterexamples. For instance the bush/tree example - it's common in game theory that one individual benefitting leads to another individual doing worse, that doesn't mean that the individuals aren't optimizing for utility.
In the sections before that, I argued that there’s no single thing that evolution selects for; rather, the thing that it’s changing is constantly changing itself.
"The thing that it's selecting for is itself constantly changing"?
Thanks, edited:
I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.
Forager societies didn't have below-replacement fertilities, which are now common for post-industrial societies.
Having children wasn't a paying venture, but people had kids anyway for the same reason other species expend energy on offspring.
Although I enjoyed thinking about this post, I don't currently trust the reasoning in it, and decided not to update off it, for reasons I summarize as:
More detailed comments, in order of appearance in the post:
Observing this selection process, we can calculate the IGF of traits currently under selection, as a measure of how strongly those are being selected. But evolution is not optimizing for this measure; evolution is optimizing for the traits that have currently been chosen for optimization.
If IGF is how many new copies of the gene pop up in the next generation, can't I say that increasing IGF is a good general-level description of what's going on, and at the same time look at the details? Why "but"? Maybe I'm just quibbling over words, though. Also, the second sentence confuses me, although after reading the post I think I understand what you mean.
Rather, they cautioned against thinking of evolution as an active agent that “does” anything in the first place.
I expect this sentence in the textbook is meant as advice against anthropomorphizing: putting oneself into the shoes of evolution and using one's own instincts and judgment to see what to do. I think it is possible to analyze evolution as an agent, if one is careful to employ abstract reasoning and not ascribe "goodness" or "justice" or whatever to the agent without realizing it.
If we were modeling evolution as a mathematical function, we could say that it was first selecting for light coloration in moths, then changed to select for dark, then changed to select for light again.
To me this looks poor if considered as a model; the previous paragraph shows that you understand the process as the environment changing which genes get selected, which still means that what's going on is increasing IGF; you would predict other similar scenarios by thinking about which environmental factors make which genes increase their presence. Looking at which genes were selected at which point, and because of what precisely, doesn't automatically give you a better model than thinking in terms of IGF plus external knowledge about the laws of reality.
This leads to the trees becoming more common than the bushes. But since trees need to spend much more energy on producing and maintaining their trunk, they don’t have as much energy to spend on growing fruit. When trees were rare and mostly stealing energy from the bushes, this wasn’t as much of a problem; but once the whole population consists of trees, they can end up shading each other. At this point, they end up producing much less fruit from which new trees could grow, so have fewer offspring and thus a lower mean fitness.
This story is still compatible with the description that, at each point, evolution is following IGF locally, though not globally. I think this checks out with "reward is not the utility function" and such, and also with "selecting for IGF does not produce IGF-maximizing brains". Though all this also makes me suspect that I could be misunderstanding too many things at once.
Effective contraception is a relatively recent innovation. Even hunter-gatherers have access to effective “contraception” in the form of infanticide, which is commonly practiced among some modern hunter-gatherer societies.
The first "effective" here is bigger than the second. See gwern's comment.
Particularly sensitive readers may want to skip the following paragraphs from The Anthropology of Childhood:
I expect these examples to be cherry-picked. I do not expect ancient societies to have intentionally killed, on average, 54/141 = roughly 1 out of 3 kids. My opinion here is volatile due to my ignorance in the matter, though.
Also, even though the share of voluntarily childfree people is increasing, it’s still not the predominant choice. One 2022 study found that 22% of the people polled neither had nor wanted to have children - which is a significant amount, but still leaves 78% of people as ones who either have or want to have children. There’s still a strong drive to have children that’s separate from the drive to just have sex.
This gets distracted by a subset of details when you could just look at fertility rates, possibly at how they relate to wealth, and then at the trajectory of the world. I have the impression there's a scientific consensus that fertility will probably go down even in countries that currently have high fertility, as they get richer.
It’s a novel cultural development that we prioritize things other-than-having-children so much. Anthropology of Childhood spends significant time examining the various factors that affect the treatment of children in various cultures. It quite strongly argues that the value of children has always also been strongly contingent on various cultural and economic factors - meaning that it has always been just one of the things that people care about. (In fact, a desire to have lots of children may be more tied to agricultural and industrial societies, where the economic incentives for it are abnormally high.)
How much is "so much"? Why isn't that much enough for you?
To me, the simplest story here looks something like “evolution selects humans for having various desires, from having sex to having children to creating art and lots of other things too; and all of these desires are then subject to complex learning and weighting processes that may emphasize some over others, depending on the culture and environment”.
I understand how this "looks like the story", but not how it's the "simplest" one, in the context of taking this as a model, which I think is the subtext.
But it doesn’t look to me like evolution selected us to desire one thing, and then we developed an inner optimizer that ended up doing something completely different. Rather, it looks like we were selected to desire many different things, with a very complicated function choosing which things in that set of doings each individual ends up emphasizing. Today’s culture might have shifted that function to weigh our desires in a different manner than before, but everything that we do is still being selected from within that set of basic desires, with the weighting function operating the same as it always has.
I agree with the first sentence: evolution selected on IGF, not on desiring IGF. "Selecting on IGF" is itself an abstraction of what's going on, which in the case of humans involved some specific details we know or can guess about. In particular, the process coughed up a brain that, compared to its clearly visible general abilities, does not end up optimizing IGF as its main goal. If you decide not to describe what happened as "selecting on IGF", that's a question of how well that concept works for building predictive models.
So I think I mostly literally agree with this paragraph, but not with the adversative "But": it's not an argument against the claim under debate.
Alternative title: "Evolution suggests robust rather than fragile generalization of alignment properties."
A frequently repeated argument goes something like this:
My argument is that premise 1 is a verbal shorthand that’s technically incorrect, and premise 2 is at least misleading. As for the overall conclusion, I think that the case from evolution might be interpreted as weak evidence for why AI should be expected to continue optimizing human values even as its capability increases.
Summary of how premise 1 is wrong: If we look closely at what evolution does, we can see that it selects for traits that are beneficial for surviving, reproducing, and passing one’s genes to the next generation. This is often described as “optimizing for IGF”, because the traits that are beneficial for these purposes are usually the ones that have the highest IGF. (This has some important exceptions, discussed later.) However, if we look closely at that process of selection, we can see that this kind of trait selection is not “optimizing for IGF” in the sense that, for example, we might optimize an AI to classify pictures.
The model that I’m sketching is something like this: evolution is an optimization function that, at any given time, is selecting for some traits that are in an important sense chosen at random. At any time, it might randomly shift to selecting for some other traits. Observing this selection process, we can calculate the IGF of traits currently under selection, as a measure of how strongly those are being selected. But evolution is not optimizing for this measure; evolution is optimizing for the traits that have currently been chosen for optimization. Resultingly, there is no reason to expect that the minds created by evolution should optimize for IGF, but there is reason to expect that they would optimize for the traits that were actually under selection. This is something that we observe any time that humans optimize for some biological need.
In contrast, if we were optimizing an AI to classify pictures, we would not be randomly changing the selection criteria the way that evolution does. We would keep the selection criteria constant: always selecting for the property of classifying pictures the way we want. To the extent that the analogy to evolution holds, AIs should be much more likely to just do the thing they were selected for.
Summary of how premise 2 is misleading: It is often implied that evolution selected humans to care about sex, and then sex led to offspring, and it was only recently with the evolution of contraception that this connection was severed. For example:
This seems wrong to me. Contraception may be a very recent invention, but infanticide or killing children by neglect is not; there have always been methods for controlling the population size even without contraception. According to the book Anthropology of Childhood, family sizes and the economic value of having children have always been correlated. Children are more of a burden on foragers and foragers correspondingly have smaller family sizes, whereas children are an asset for farmers who have larger family sizes.
Rather than evolution having selected humans for IGF and this linkage then breaking with the invention of contraception, evolution has selected humans to have an optimization function that weighs various factors in considering how many children to have. In forager-like environments, this function leads to a preference for fewer children and smaller family sizes; in farmer-like environments, this function leads to a preference for more children and larger family sizes. @RobinHanson has suggested that modern society is more forager-like than farmer-like and that our increased wealth is causing us to revert to forager-like ways and psychology. To the extent that this argument is true, there has been no breakage between what evolution “intended” and how humans behave; rather, the optimization function that evolution created continues operating the way it always has.
The invention of modern forms of contraception may have made it easier to limit family sizes in a farmer-type culture that had evolved cultural taboos against practices like infanticide. But rather than creating an entirely new evolutionary environment, finding a way to bypass those taboos brought us closer to how things had been in our original evolutionary environment.
If we look at what humans were selected to optimize for, it looks like we are mostly continuing to optimize for those same things. The reason why a minority of people are choosing not to have children is because our evolved optimization function also values things other than children, and we have “stayed loyal” to this optimization function. In the case of an AI that was trained to act according to something like “human values” and nothing else, the historical example seems to suggest that its alignment properties might generalize even more robustly than ours, as it had not been selected for a mixture of many competing values.
Evolution as a force that selects for traits at random
For this post, I skimmed two textbooks on evolution: Evolution (4th edition) by Futuyama & Kirkpatrick and Evolutionary Analysis (5th edition) by Herron & Freeman. The first one was selected based on Googling “what’s the best textbook on evolutionary biology” and the second was selected because an earlier edition was used in an undergraduate course on evolutionary psychology that I once took and I recalled it being good.
As far as I could tell, neither one talked about evolution as a process that optimizes for genetic fitness (though this was a rather light skim so I may have missed it even if it was there). Rather, they cautioned against thinking of evolution as an active agent that “does” anything in the first place. Evolution does increase a population’s average adaptation to its environment (Herron & Freeman, p. 107), but what this means can constantly change as the environment itself changes. At one time in history, a region may have a cold climate, selecting the species there for an ability to deal with the cold; and then the climate may shift to a warmer one, and previously beneficial adaptations like fur may suddenly become a liability.
Another classic example is that of peppered moth evolution. Light-colored moths used to be the norm in England, with dark-colored ones being very rare, as a light coloration was a better camouflage against birds than a dark one. With the Industrial Revolution and the appearance of polluting factories, some cities became so black that dark color became better camouflage, leading to an increase in dark-colored moths relative to the light-colored ones. And once pollution was reduced, the light-colored moths came to dominate again.
If we were modeling evolution as a mathematical function, we could say that it was first selecting for light coloration in moths, then changed to select for dark, then changed to select for light again.
The closest that one gets to something like “evolution optimizing for genetic fitness” is what’s called “the fundamental theorem of natural selection”, which among other things implies that natural selection will cause the mean fitness of a population to increase over time. However, here we are assuming that the thing we are selecting for remains constant. Light-colored moths will continue to become more common over time, up until a dark coloration becomes the trait with higher fitness and the dark coloration starts becoming more common. In both situations we might say that the “mean fitness of the population is increasing”, but this means a different thing in those two situations: in one situation it means selecting for white coloration, and in another situation, it means selecting for dark coloration. The thing that was first being selected for, is then being selected against, even as our terminology implies that the same thing is being selected for.
What happened was that the mean fitness of the population went up as a particular coloration was selected for, then a random change (first the increased pollution, then the decreased pollution) caused the mean fitness to fall, and then it started climbing again.
Even taking this into account, evolution does not even consistently increase the mean fitness of the population: sometimes evolution ends up selecting for a decrease in the mean fitness of the population.
An example of frequency-dependent selection leading to lower mean fitness is the case of a bush that produces many fruits (Futuyama & Kirkpatrick, p. 129). Some bushes then evolve a trunk that causes them to cast shade over their neighbors. As a result, those neighbors weaken and die, allowing the bushes that have become trees to get more water and nutrients.
This leads to the trees becoming more common than the bushes. But since trees need to spend much more energy on producing and maintaining their trunk, they don’t have as much energy to spend on growing fruit. When trees were rare and mostly stealing energy from the bushes, this wasn’t as much of a problem; but once the whole population consists of trees, they can end up shading each other. At this point, they end up producing much less fruit from which new trees could grow, so have fewer offspring and thus a lower mean fitness.
This kind of frequency-dependent selection is common. Another example (Futuyama & Kirkpatrick, p. 129) is that of bacteria that evolve toxins that kill other bacteria, while also evolving an antidote against the toxin. Both cost energy to produce, but as long as these bacteria are rare, it’s worth the cost, as the toxicity allows them to kill off their competitors.
But once these toxic bacteria establish themselves, there’s no longer any benefit to producing the toxin - all the surviving bacteria are immune to it - so continuing to spend energy on producing it means there’s less energy available for replication. It now becomes more beneficial to keep the antidote production but lose the toxin production: the toxin production goes from being selected for, to being selected against.
Once this selection process has happened for long enough and non-toxin-producing bacteria have come to predominate, the antidote production also becomes an unnecessary liability. Nobody is producing the toxin anymore, so there’s no reason to waste energy on maintaining a defense against it, so the antidote also goes from being selected for to being selected against.
But then what happens once none of the bacteria are producing the toxin or the antidote anymore? Now that nobody has a defense against the toxin, it becomes advantageous to start producing the toxin + antidote combination again, thus killing all the other bacteria that don’t have the antidote… and thus the cycle repeats.
In this section, I have argued that to the extent that evolution is “optimizing a species for fitness”, this actually means different things (selecting for different traits) in different circumstances; and also evolution optimizing for fitness is more of a rough heuristic rather than a literal law anyway since there are many circumstances where evolution ends up lowering the fitness of a population. This alone should make us suspicious of the argument that “evolution selected humans for IGF”; what that means isn't that there's a single thing that was being optimized for, but rather that there was a wide variety of traits that were selected for at different times.
What exactly is fitness, again?
So far I’ve been talking about fitness in general terms, but let’s recap some of the technical details. What exactly is inclusive genetic fitness, again?
There are several different definitions; here’s one set of them.
A simple definition of fitness is that it’s the number of offspring that an individual leaves for the next generation[1]. Suppose that 1% of a peppered moth’s offspring survive to reproductive age and that the surviving moths have an average of 300 offspring. In this case, the average fitness of these individuals is 0.01 * 300 = 3.
For evolution by natural selection to occur, fitness differences among individuals need to be inherited. In biological evolution, inheritance happens through genes, so we are usually interested in genetic fitness - the fitness of genes. Suppose that these are all light-colored moths in a polluted city. Suppose a gene allele for dark coloration increases the survivability by 0.33 percentage points, for an overall fitness of 0.0133 * 300 = 4. The fitnesses of the alleles are now 3 and 4.
Image from Futuyama & Kirkpatrick. Caption in the original: Genotype A has a fitness of 3, while genotype B has a fitness of 4. Both genotypes start with 10 individuals. (A) The population size of genotype B grows much more rapidly. (B) Plotting the frequencies of the two genotypes shows that genotype B, which starts at a frequency of 0.5, makes up almost 90% of the population just 7 generations later.
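The figure’s numbers are easy to reproduce. Here is a minimal sketch in Python (my own, not from the textbook) that also shows where the fitness values of 3 and 4 come from:

```python
# Fitness = (probability of surviving to reproductive age) x (mean offspring count).
print(0.01 * 300)    # 3.0  - the baseline moths from the example above
print(0.0133 * 300)  # ~4.0 - survival raised by 0.33 percentage points

# The figure: each genotype's count grows geometrically by its fitness each
# generation, starting from 10 individuals of each genotype.
a, b = 10, 10                # genotype A (fitness 3) and genotype B (fitness 4)
for _ in range(7):
    a, b = a * 3, b * 4

print(b / (a + b))  # ~0.88: genotype B makes up almost 90% after 7 generations
```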
Often what matters is the difference in fitness between two alleles: for example, an allele with a fitness of 2 may become more common in the population if its competitor has a fitness of 1, but will become more rare if its competitor has a fitness of 3. Thus it’s common to indicate fitness relative to some common reference, such as the average fitness of the population or the genotype with the highest absolute fitness.
Genetic fitness can be divided into two components. An individual can pass a gene directly onto their offspring - this is called direct fitness. They can also carry a genetic adaptation that causes them to help others with the same adaptation, increasing their odds of survival. For example, a parent may invest extra effort in taking care of their offspring. This is called indirect fitness. The inclusive fitness of a genotype is the sum of its direct and indirect fitness.[2]
Biological evolution can be defined as “inherited change in the properties of organisms over the course of generations” (Futuyama & Kirkpatrick, p. 7). Evolution by natural selection is when the relative frequencies of a genotype change across generations due to differences in fitness. Note that genotype frequencies can also change across generations for reasons other than natural selection, such as random drift or novel mutations.
Fitness as a measure of selection strength
Let’s look at a case of intentional animal breeding. The details of the math that follows aren’t that important, but I wanted to run through them anyway, just to make it more concrete what “fitness” actually means. Still, you can just skim through them if you prefer.
Suppose that I happen to own a bunch of peppered moths of various colors and happen to like a light color, so I decide to breed them towards being lighter. Now I don’t know the details of how the genetics of peppered moth coloration works - I assume that it might very well be affected by multiple genes. But for the sake of simplicity, let’s just say that there is a single gene with a “light” allele and a “dark” allele.
Call the “light” allele B1 and the “dark” allele B2. B1B1 moths are light, B2B2 moths are dark, and B1B2 / B2B1 moths are somewhere in between (to further simplify things, I’ll use “B1B2” to refer to both B1B2 and B2B1 moths).
Suppose that the initial population has 100 moths. I have been doing breeding for a little bit already, so we start from B1 having a frequency of 0.6, and B2 a frequency of 0.4. The moths have the following distribution of genotypes:
B1B1 = 36
B1B2 = 48
B2B2 = 16
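(As a side note, these counts are exactly what you get if the genotypes are in Hardy–Weinberg proportions; the post doesn’t say so explicitly, but the numbers line up. A quick check, with variable names of my own choosing:)

```python
# Genotype counts under Hardy-Weinberg proportions (p^2 : 2pq : q^2),
# given allele frequencies p = 0.6 (B1) and q = 0.4 (B2) and 100 moths.
p, q, n = 0.6, 0.4, 100

b1b1 = p * p * n      # expected B1B1 count: 36
b1b2 = 2 * p * q * n  # expected B1B2 count: 48
b2b2 = q * q * n      # expected B2B2 count: 16

print(round(b1b1), round(b1b2), round(b2b2))  # 36 48 16
```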
To my eye, all of the moths with the B1B1 genotype look pleasantly light, so I choose to have them all breed. 75% of the moths with the B1B2 genotype look light enough to my eye, and so do 50% of the B2B2 ones (maybe their coloration is also affected by environmental factors or other genes). The rest don’t get to breed.
This gives us, on average, a frequency of 0.675 for the B1 alleles and 0.325 for the B2 alleles in the next generation[3]. Assuming that each of the moths contributed a hundred gametes to the next generation, we get the following fitnesses for the alleles:
B1: Went from 120 (36 + 36 + 48) to 5400 copies, so the fitness is 5400/120 = 45.
B2: Went from 80 (48 + 16 + 16) to 2600 copies, so the fitness is 2600/80 = 32.5.
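For readers who want to check these numbers, here is a small sketch that redoes the whole calculation (the breeding fractions and the 100-gametes-per-moth assumption are taken from the text above; the variable names are mine):

```python
# Starting genotype counts (B1 frequency 0.6, B2 frequency 0.4, 100 moths).
genotypes = {"B1B1": 36, "B1B2": 48, "B2B2": 16}

# Fraction of each genotype chosen to breed.
bred_fraction = {"B1B1": 1.0, "B1B2": 0.75, "B2B2": 0.5}

# Each breeding moth contributes 100 gametes; heterozygotes pass on B1 half the time.
GAMETES = 100
b1_share = {"B1B1": 1.0, "B1B2": 0.5, "B2B2": 0.0}

b1_gametes = sum(genotypes[g] * bred_fraction[g] * GAMETES * b1_share[g]
                 for g in genotypes)          # 3600 + 1800 + 0   = 5400
b2_gametes = sum(genotypes[g] * bred_fraction[g] * GAMETES * (1 - b1_share[g])
                 for g in genotypes)          # 0 + 1800 + 800    = 2600

# Allele copies in the parent generation: two per homozygote, one per heterozygote.
b1_before = 2 * genotypes["B1B1"] + genotypes["B1B2"]   # 120
b2_before = 2 * genotypes["B2B2"] + genotypes["B1B2"]   # 80

print(b1_gametes / (b1_gametes + b2_gametes))  # 0.675 - new B1 frequency
print(b1_gametes / b1_before)                  # 45.0  - fitness of B1
print(b2_gametes / b2_before)                  # 32.5  - fitness of B2
```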
As the proportion of B1 increases, the average fitness of the population will increase! This is because the more B1 alleles you carry, the more likely it is that you are selected to breed, so B1 carriers have a higher fitness… which means that B1 becomes more common… which increases the average fitness of the moth population as a whole. So in this case, the rule that the average fitness of the population tends to increase over time does apply.
But now… wouldn’t it sound pretty weird to describe this process as optimizing for the fitness of the moths?
I am optimizing for having light moths; what the fitness calculation tells us is how much of an advantage the lightness genes have - in other words, how much I am favoring the lightness genes - relative to the darkness genes.
Because we were only modeling the effect of fitness and not e.g. random drift, all of the difference in gene frequencies came from the difference in fitness. This is tautological - it doesn’t matter what you are selecting (optimizing) for, anything that gets selected ends up having the highest fitness, by definition.
Rather than saying that we were optimizing for high fitness, it seems more natural to say that we were optimizing for the trait of lightness and that lightness gave a fitness advantage. The other way around doesn’t make much sense - we were optimizing for fitness and that gave an advantage to lightness? What?
This example used artificial selection because that makes it the most obvious what the actual selection target was. But the math works out the same regardless of whether we’re talking artificial or natural selection. If we say that instead of me deciding that some moths don’t get to breed, the birds and other factors in the external environment are doing it… well, nothing changes about the equations in question.
Was natural selection optimizing for the fitness of the moths? There's a sense in which you could say that since the dark-colored moths ended up having increased fitness compared to the light-colored ones. But it would also again feel a little off to describe it this way; it feels more informative and precise to say that the moths were optimized for having dark color, or to put it more abstractly, for having the kind of a color that fits their environment.
From coloration to behavior
I’ve just argued that if we look at the actual process of evolution, it looks more like optimizing for having specific traits (with fitness as a measure of how strongly they’re selected) rather than optimizing for fitness as such. This is so even though the process of selection can lead to the mean fitness of the population increasing - but as we can see from the math, this just means “if you select for something, then you get more of the thing that you are selecting for”.
In the sections before that, I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.
I think these arguments are sufficient to conclude that the claim “evolution optimized humans for fitness [thus humans ought to be optimizing for fitness]” is shaky.
So far, I have mostly been talking about relatively “static” traits such as coloration, rather than cognitive traits that are by themselves optimizers. So let's talk about cognition. While saying that “evolution optimized humans for genetic fitness, thus humans ought to be optimizing for fitness” seems shaky, the corresponding argument does work if we talk about specific cognitive behaviors that were selected for.
For example, if we say that “humans were selected for caring about their offspring, thus humans should be optimizing for ensuring the survival of their offspring”, then this statement does generally speaking hold - a lot of humans do put quite a lot of cognitive effort into ensuring their children's survival. Or if we say that “humans were selected for exhibiting sexual jealousy in some circumstances, so in some circumstances, they will optimize for preventing their mates from having sex with other humans”, then clearly that statement does also hold.
This gets to my second part of the argument: while it’s claimed that we are now doing something that goes completely against what evolution selected for, contraception at least is a poor example of that. For the most part, we are still optimizing for exactly the things that evolution selected us to optimize for.
Humans still have the goals we were selected for
The desire to have sex was never sufficient for having babies by itself - or at least not for having ones that would survive long enough to reproduce themselves in turn. It was always only one component, with us having multiple different desires relating to children:
Eliezer wrote, in “AGI Ruin: A List of Lethalities” that
This quote seems to imply that
All of these premises seem false to me. Here’s why:
Effective contraception is a relatively recent innovation. Even hunter-gatherers have access to effective “contraception” in the form of infanticide, which is commonly practiced among some modern hunter-gatherer societies. Particularly sensitive readers may want to skip the following paragraphs from The Anthropology of Childhood:
It takes years for a newborn to get to a point where they can take care of themselves, so a simple lack of active caretaking is enough to kill an infant, no modern-age contraceptive techniques required.
It’s the desire for sex alone that’s the predominant driver for there being children. Again, see infanticide, which doesn’t need to be an active act as much as a simple omission. One needs an active desire to keep children alive.
Also, even though the share of voluntarily childfree people is increasing, it’s still not the predominant choice. One 2022 study found that 22% of the people polled neither had nor wanted to have children - which is a significant amount, but still leaves 78% of people as ones who either have or want to have children. There’s still a strong drive to have children that’s separate from the drive to just have sex.
It’s a novel cultural development that we prioritize things other-than-having-children so much. Anthropology of Childhood spends significant time examining the various factors that affect the treatment of children in various cultures. It quite strongly argues that the value of children has always also been strongly contingent on various cultural and economic factors - meaning that it has always been just one of the things that people care about. (In fact, a desire to have lots of children may be more tied to agricultural and industrial societies, where the economic incentives for it are abnormally high.)
To me, the simplest story here looks something like “evolution selects humans for having various desires, from having sex to having children to creating art and lots of other things too; and all of these desires are then subject to complex learning and weighting processes that may emphasize some over others, depending on the culture and environment”.
Some people will end up valuing children more, for complicated reasons; other people will end up valuing other things more, again for complicated reasons. This was the case in hunter-gatherer times and this is the case now.
But it doesn’t look to me like evolution selected us to desire one thing, and then we developed an inner optimizer that ended up doing something completely different. Rather, it looks like we were selected to desire many different things, with a very complicated function choosing which things in that set of doings each individual ends up emphasizing. Today’s culture might have shifted that function to weigh our desires in a different manner than before, but everything that we do is still being selected from within that set of basic desires, with the weighting function operating the same as it always has.
As I mentioned in the introduction, Robin Hanson has suggested that modern society is more forager-like than farmer-like and that our increased wealth is causing us to revert to forager-like ways and psychology. This would then mean that our evolved weighting function is now exhibiting the kind of behavior that it was evolved to exhibit in a forager-like environment.
We do engage in novel activities like computer games today, but it seems to me like the motivation to play computer games is still rooted in the same kinds of basic desires as the first hunter-gatherers had - e.g. to pass the time, enjoy a good story, socialize, or experience a feeling of competence.
So what can we say about AI?
Well, I would be cautious about reasoning by analogy. I’m not sure we can draw particularly strong conclusions about the connection to AI. I think there are more direct and relevant arguments one can make that do seem worrying, rather than resorting to evolutionary analogies.
But it does seem to me that the evolutionary history implies the opposite of what has previously been argued about, e.g., the “sharp left turn”. Something like “training an AI for recognizing pictures” or “training an AI for caring about human values” looks a lot more like “selecting humans to care about having offspring” than it looks like “optimizing humans for genetic fitness”. Caring about having offspring is a property that we still seem to pretty robustly carry; our alignment properties continued to generalize even as our capabilities increased.
To the extent that we do not care about our offspring, or even choose to go childfree, it’s just because we were selected to also care about other things - if a process selects humans to care about a mix of many things, them sometimes weighing those other things more does not by itself represent a failure of alignment. This is again in sharp contrast to something like an AI that we tried to exclusively optimize for caring about human well-being. So there’s reason to expect that an AI’s alignment properties might generalize even more than those of existing humans.
Thanks to Quintin Pope, Richard Ngo, and Steve Byrnes for commenting on previous drafts of this essay.
Futuyama & Kirkpatrick, p. 60.
Futuyama & Kirkpatrick, p. 300.
Each B1B1 moth has a 100% chance to “pick” a B1 allele for producing a gamete, each B1B2 moth has a 50% chance to pick a B1 gamete and a 50% chance to pick a B2 gamete, and each B2B2 moth has a 100% chance to pick a B2 allele for producing a gamete. Assuming that each moth that I’ve chosen to breed contributes 100 gametes to the next generation, we get an average of 3600 B1 gametes from the 36 B1B1 moths chosen to breed, 1800 B1 and 1800 B2 gametes from the 36 B1B2 moths chosen to breed, and 800 B2 gametes from the 8 B2B2 moths chosen to breed.
This makes for 3600 + 1800 = 5400 B1 gametes and 1800 + 800 = 2600 B2 gametes, for a total of 8000 gametes. This makes for a frequency of 0.675 for B1 and 0.325 for B2.