You write:
The utility function is fitness: gene replication count (of the human-defining genes)[1]. And by this measure, it is obvious that humans are enormously successful. If we normalize so that a utility score of 1 represents a mild success - the expectation for a typical draw of a great-ape species - then humans' score is >4 OOM larger, completely off the charts.[2]
Footnote 1 says:
Nitpick arguments about how you define this specifically are irrelevant and uninteresting.
Excuse me, what? This is not evolution's utility function. It's not optimizing for gene count. It does one thing, one thing only, and it does it well: it promotes genes that increase their RELATIVE FREQUENCY in the reproducing population.
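To make "increase their relative frequency" concrete, here is a minimal sketch in standard haploid selection notation (the symbols are added for illustration, not anything from the original exchange):

$$p_{t+1} = \frac{w_A \, p_t}{\bar{w}_t}, \qquad \bar{w}_t = w_A \, p_t + w_a \,(1 - p_t),$$

where $p_t$ is the frequency of allele $A$ in generation $t$, and $w_A$, $w_a$ are the fitnesses of carriers and non-carriers. Only the ratio $w_A / \bar{w}_t$ enters; the absolute number of copies appears nowhere in the recursion.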
The failure of alignment is witnessed by the fact that humans very very obviously fail to maximize the relative frequency of their genes in the next generation, given the opportunities available to them; and they are often aware of this; and they often choose to do so anyway. The whole argument in this post is totally invalid.
I don't understand why you say promoting genes' relative frequency is how it should be defined. Wouldn't a gene-drive-like thing then max out that measure?
Also, promoting genes that caused species extinction would count as a win by that metric. I think that can happen sometimes - i.e. larger individuals are more mate-worthy and the species gets ever bigger (or suchlike) until it no longer fits its niche and goes extinct. These seem like failure modes rather than the utility function. Are there protections against gene drives in organisms' populations?
IIUC, a lot of DNA in a lot of species consists of gene-drive-like things.
These seem like failure modes rather than the utility function.
By what standard are you judging when something is a failure mode or a desired outcome? I'm saying that what evolution is, is a big search process for genes that increase their relative frequency given the background gene pool. When evolution built humans, it didn't build agents that try to promote the relative frequency of the genes that they are carrying. Hence, inner misalignment and sharp left turn.
I don't see how this detail is relevant. The fact remains that humans are, in evolutionary terms, much more successful than most other mammals.
What do you mean by "in evolutionary terms, much more successful"?
Subpopulations which do this are expected to disappear relatively quickly on evolutionary timescales. Natural selection is error-correcting. This can mean people get less intelligent again, or they start to really love having children rather than merely enjoying sex.
Say you have a species. Say you have two genes, A and B.
Gene A has two effects:
A1. Organisms carrying gene A reproduce slightly MORE than organisms not carrying A.
A2. For every copy of A in the species, every organism in the species (carrier or not) reproduces slightly LESS than it would have if not for this copy of A.
Gene B has two effects, the reverse of A:
B1. Organisms carrying gene B reproduce slightly LESS than organisms not carrying B.
B2. For every copy of B in the species, every organism in the species (carrier or not) reproduces slightly MORE than it would have if not for this copy of B.
So now what happens with this species? Answer: A is promoted to fixation, whether or not this causes the species to go extinct; B is eliminated from the gene pool. Evolution doesn't search to increase total gene count; it searches to increase relative frequency. (Note that this does not rest specifically on the species being a sexually reproducing one. It does rest on the fixedness of the niche capacity. When the niche doesn't have fixed capacity, evolution is closer to selecting for increasing gene count. But this doesn't last long; the species grows to fill capacity, and then you're back to zero-sum selection.)
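The fixed-capacity point is easy to check numerically. Below is a minimal toy sketch, assuming a simple Wright-Fisher-style model with invented parameter names and values (none of this is from the comment above): gene A sweeps toward fixation because reproduction at fixed capacity is zero-sum, even while effect A2 drags down everyone's absolute output.

```python
# Toy Wright-Fisher-style model of the gene-A story above.
# All names and values here are illustrative assumptions.
import random

N = 1000              # fixed niche capacity
GENERATIONS = 200
CARRIER_EDGE = 0.05   # effect A1: relative reproductive edge for carriers
GLOBAL_DRAG = 0.0002  # effect A2: per-copy-of-A drag on everyone's output

pop = [random.random() < 0.1 for _ in range(N)]  # True = carries A

for _ in range(GENERATIONS):
    n_copies = sum(pop)
    # Effect A2 lowers everyone's absolute fitness equally...
    base = 1.0 - GLOBAL_DRAG * n_copies
    # ...but the next generation is N draws proportional to fitness,
    # so at fixed capacity only the relative edge (A1) matters.
    weights = [base * (1.0 + CARRIER_EDGE) if carrier else base
               for carrier in pop]
    pop = random.choices(pop, weights=weights, k=N)

print(f"frequency of A: {sum(pop) / N:.2f}")  # -> ~1.00 (swept to fixation)
print(f"per-capita output: {1.0 - GLOBAL_DRAG * sum(pop):.2f}")  # -> ~0.80
```

Gene B is just the mirror image: give carriers a relative penalty and it is driven out, however much it raises total output.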
Ok, so the point is that the vast, vast majority of the optimization power in {selection over variation in general} comes more narrowly from {selection for genes that increase their relative frequency in the gene pool}, not from {selection between different species / other large groups}. In arguments about misalignment, evolution refers to {selection for genes that increase their relative frequency in the gene pool}.
If you run a big search process, and then pick a really extreme actual outcome X of the search process, and then go back and say "okay, the search process was all along a search for X", then yeah, there's no such thing as misalignment. But there's still such a thing as a search process visibly searching for Y and getting some extreme and non-Y-ish outcome, and {selection for genes that increase their relative frequency in the gene pool} is an example.
For evolution in general, this is obviously pattern measure, and truly cannot be anything else.
This sure sounds like my attempt elsewhere to describe your position:
There's no such thing as misalignment. There's one overarching process, call it evolution or whatever you like, and this process goes through stages of creating new things along new dimensions, but all the stages are part of the overall process. Anything called "misalignment" is describing the relationship of two parts or stages that are contained in the overarching process. The overarching process is at a higher level than that misalignment relationship, and the misalignment helps compute the overarching process.
Which you dismissed.
The analogy from historical evolution is the misalignment between human genes and human minds, where the rise of the latter did not result in extinction of the former. It plausibly could have, but that is not what we observe.
The analogy is that the human-genes thing produces a thing (human minds) which wants stuff, but the stuff it wants is different from what the human genes want. From my perspective you're strawmanning and failing to track the discourse here to a sufficient degree that I'm bowing out.
I think there’s benefit in being more specific about what we’re arguing about.
CLAIM 1: If there’s a learning algorithm whose reward function is X, then the trained model will not necessarily explicitly desire X.
I think everyone agrees that this is true, and that evolution provides an example. Most people don’t even know what inclusive genetic fitness is, and those who do, and who also know that donating eggs / sperm would score highly on that metric, nevertheless often don’t donate eggs / sperm.
CLAIM 2: If there’s a learning algorithm whose reward function is X, then the trained model cannot possibly explicitly desire X.
I think everyone agrees that this is false—neither Nate nor almost anyone else (besides Yampolskiy) thinks perfect AGI alignment is impossible. I think everyone probably also agrees that evolution provides a concrete counterexample—it’s a big world, people have all kinds of beliefs and desires, and there’s almost certainly at least one person somewhere who knows what IGF is and explicitly wants to maximize theirs.
CLAIM 3: If there’s a learning algorithm whose reward function is X, and no particular efforts are taken to ensure alignment (e.g. freezing the model occasio...
I think Nate’s post "Humans aren’t fitness maximizers" discusses this topic more directly than does his earlier "Sharp Left Turn" post. It also has some lively discussion in the comments section.
I won't argue with the basic premise that at least on some metrics that could be labeled as evolution's "values", humans are currently doing very well.
But, the following are also true:
There is a clear misalignment between evolution's implicit goals and actual human behavior: humans are inclined to pursue immediate gratification ("wireheading") rather than the fundamental goals of replication and survival.
One illustrative example is the devastation of some indigenous communities by excessive alcohol consumption - a tragic case of prioritizing immediate pleasure over...
evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! [...]
If Evolution had a lot more time to align humans to relative-gene-replication-count, before humans put an end to biological life, then sure, seems plausible that Evolution might be able to align humans very robustly. But Evolution does not have infinite time or "retries" --- humanity is in the process of executing something like a "sharp left turn", and seems likely to succeed long before the human gene pool is taken over by sperm bank donors and such.
In general I think maximum values are weird because they are potentially nearly unbounded, but it sounds like we may then be in agreement, terminology aside.
But in general I do not think of anything "less than 1% of the maximum value" as failure in most endeavors. For example the maximum attainable wealth is perhaps $100T or something, but I don't think it'd be normal/useful to describe the world's wealthiest people as failures at being wealthy because they only have ~$100B or whatever.
And regardless the standard doom arguments from EY/MIRI etc are very much "AI will kill us all!", and not "AI will prevent us from attaining over 1% of maximum future utility!"
I agree that the optimization process of natural selection likely irons out any cases of temporary misalignment (some human populations having low fertility) over a medium time span. People who tend not to have children eventually get replaced by people who love having children, or by people who tend to forget to use contraceptives, etc. This is basically the force Scott Alexander calls Moloch.
Unfortunately this point doesn't obviously generalize to AI alignment. Natural selection is a simple, natural optimization process, which optimizes a simple "goal". But getting...
There are several endgame scenarios for evolution:
From the POV of evolution, it's as if we initiated ASI and thought, "Well, it hasn't killed us yet."
The utility function is fitness: gene replication count (of the human defining genes) [1]
Seems like humans are soon going to put an end to DNA-based organisms, or at best relegate them to some small fraction of all "life". I.e., seems to me that the future is going to score very poorly on the gene-replication-count utility function, relative to what it would score if humanity (or individual humans) were actually aligned to gene-replication-count.
Do you disagree? (Do you expect the post-ASI future to be tiled with human DNA?)
Obviously Evolution doesn
Given that we're not especially powerful optimizers relative to what's possible (we're only powerful relative to what exists on Earth…for now), this is at best an existence proof of the possibility of alignment for optimizers of fairly limited power. Which is to say, I don't think this result is very relevant to the discussion of a sharp left turn in AI, because even if someone buys your argument, AIs are not necessarily like humans in the relevant ways that would make them likely to be aligned with anything in particular.
For the evolution of human intelligence, the optimizer is just evolution: biological natural selection.
Really? Would your argument change if we could demonstrate a key role for sexual selection, primate wars or the invention of cooking over fire?
Some people like to use the evolution of Homo sapiens as an argument by analogy concerning the apparent difficulty of aligning powerful optimization processes:
The much-confused framing of this analogy has led to a protracted debate about its applicability.
The core issue is just misaligned mesa-optimization. We have a powerful optimization process optimizing world stuff according to some utility function. The concern is that a sufficiently powerful optimization process will (inevitably?) lead to internal takeover by a selfish mesa-optimizer unaligned to the outer utility function, resulting in a bad (low or zero utility) outcome.
In the AGI scenario, the outer utility function is CEV, or external human empowerment, or whatever (insert placeholder, not actually relevant). The optimization process is the greater tech economy and AI/ML research industry. The fear is that this optimization process, even if outer aligned, could result in AGI systems unaligned to the outer objective (humanity's goals), leading to doom (humanity's extinction). Success here would be largenum utility, and doom/extinction is 0. So the claim is that mesa-optimization inner alignment failure leads to zero-utility outcomes: complete failure.
For the evolution of human intelligence, the optimizer is just evolution: biological natural selection. The utility function is something like fitness: e.g. gene replication count (of the human-defining genes)[1]. And by any reasonable measure, it is obvious that humans are enormously successful. If we normalize so that a utility score of 1 represents a mild success - the expectation for a typical draw of a great-ape species - then humans' score is >4 OOM larger, completely off the charts.[2]
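(One rough way to cash out the >4 OOM figure, assuming a typical extant non-human great-ape species numbers on the order of $10^5$ individuals versus roughly $8 \times 10^9$ humans - back-of-envelope population figures, not from the footnotes:

$$\log_{10}\!\left(\frac{8 \times 10^9}{10^5}\right) \approx 4.9 \text{ OOM.})$$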
So the evolution of human intelligence is an interesting example: one of alignment success. The powerful runaway recursive criticality that everyone feared actually resulted in an enormous, anomalously high positive utility return, at least in this historical example. Human success, if translated into the AGI scenario, corresponds to the positive singularity of our wildest dreams.
Did it have to turn out this way? No!
Due to observational selection effects, we naturally wouldn't be here if mesa-optimization failure during brain evolution were too common across the multiverse.[3] But we could have found ourselves in a world with many archaeological examples of species achieving human-level general technocultural intelligence and then going extinct - not due to AGI of course, but simply due to becoming too intelligent to reproduce. We don't, as far as I know.
And that is exactly what we'd necessarily expect to see in the historical record if mesa-optimization inner misalignment were a common failure mode: intelligent dinosaurs that suddenly went extinct, ruins of proto-pachyderm cities, traces of a long-forgotten underwater cetacean Atlantis, etc.
So evolution solved alignment in the only sense that actually matters: according to its own utility function, the evolution of human intelligence enormously increased utility, rather than imploding it to 0.
So back to the analogy - where did it go wrong?
Nate's critique is an example of the naive engineer fallacy. Nate is critiquing a specific detail of evolution's solution while failing to notice that all that matters is the score, and humans are near an all-time high score[5]. Evolution didn't make humans explicitly just optimize mentally for IGF because that - by itself - probably would have been a stupid failure of a design, and evolution is a superhuman optimizer whose designs are subtle, mysterious, and often beyond human comprehension.
Instead evolution created a solution with many layers and components - a defense in depth against mesa-optimization misalignment. And even though all of those components will inevitably fail in many individuals - even most! - that is completely irrelevant at the species level, and in fact is just part of the design of how evolution explores the state space.
And finally, if all else fails, evolution did in fact find some weird way to create humans who rather obviously consciously optimize for IGF! And so if the other mechanisms had all started to fail too frequently, the genes responsible for that phenotype would inevitably become more common.
On further reflection, much of premodern history already does look like at least some humans consciously optimizing for something like IGF: after all, "be fruitful and multiply" is hardly a new concept. What do you think was really driving the nobility of old, with all their talk of bloodlines and legacies? There already is some deeper drive to procreate at work in our psyche (to varying degrees); we are clearly not all mere byproducts of the pursuit of pleasure[6].
The central takeaway is that evolution adapted the brain's alignment mechanisms/protections in tandem with our new mental capabilities, such that the sharp left turn led to an enormous runaway alignment success.
Nitpick arguments about how you define this specifically are irrelevant and uninteresting. Homo sapiens is enormously successful! If you really think you know the true utility function of evolution, and humans are a failure according to that metric, you have simply deluded yourself. My argument here does not depend on the details of the evolutionary utility function. ↩︎
We are unarguably the most successful recent species, probably the most successful mammal species ever, and all that despite arriving in a geological blink of an eye. The dU/dt for Homo sapiens is probably the highest ever, so we are on track to be the most successful species ever, if current trends continue (which of course is another story). ↩︎
Full consideration of the observational selection effects also leads to an argument for alignment success via the simulation argument, as future alignment success probably creates many historical sims, whereas failures do not. ↩︎
Condom analogs are at least 5000 years old; there is ample evidence contraception was understood and used in various ancient civilizations, and many premodern tribal peoples understood herbal methods, so humans have probably had this knowledge since the beginning, in one form or another. (Although memetic evolution would naturally apply optimization pressure against wide usage.) ↩︎
Be careful anytime you find yourself defining peak evolutionary fitness as anything other than the species currently smiling from atop a giant heap of utility. ↩︎
I say this as I am about to have a child myself, planned for reasons I cannot fully yet articulate. ↩︎