Can we simulate human evolution to create a somewhat aligned AGI?

Thomas Kwa

Epistemic status: less than 20% chance that this is a good idea. I've spent about 25 hours thinking about this.

TL;DR: If AI alignment is intractable, but human evolution is robust in producing something close to human values, we could try to simulate/mimic human evolution to create a superintelligent successor AI. This plan would have various problems, especially training competitiveness.

Many people I know put significant credence in the following three statements:

1. AI alignment is very difficult, and has a <<50% chance of success, largely because we have no idea how to put human values into an AI.
2. Even if AI alignment is intractable, we know of one process that definitely creates general intelligence with human values: the evolution and cultural development of humans.
3. Medium-sized perturbations to the human evolutionary process wouldn't destroy human values. I have heard MIRI people guess that an AI aligned with elephants, or with aliens that evolved in a similar environment to humans ("humanlike aliens"), would be much better than an AI with a random goal chosen through SGD, perhaps 50% as good as an AI aligned with human CEV.

If you're one of these people, this post describes a different approach to making the future go well which you might find plausible. While alignment requires some way to specify human values and put them into an AI, a successor species plan requires creating a system that generates values similar to ours, from scratch, by some process similar to human evolution. The resulting superintelligence will not be totally aligned with human values. However, if the plan is sufficiently tractable while alignment is intractable, it may have higher expected value than aiming for aligned AI. I believe that such plans are neglected because many AI pessimists are uncomfortable thinking about this backup plan.

How to create a successor AI by simulating evolution?

If we had infinite compute and perfect knowledge of the evolutionary history of humans, we could just simulate the evolution of humans and end up with simulated humans, who could then either become superintelligent through self-modification, or solve the AI alignment problem by thinking for thousands of subjective years, likely achieving human CEV. If we had infinite compute but not perfect knowledge, we could execute Paul Christiano's plan of simulating the entire evolutionary process of some intelligent aliens, and achieve humanlike-alien CEV.

Sadly, we have neither infinite compute nor perfect knowledge. In practice, plans to create a successor species should start by deciding which features involved in the evolutionary and cultural processes leading up to current human values are necessary to generate values at least ~50% as good as human CEV ("values-important"). For example, maybe it's necessary that brain architecture is determined by genes, that it's possible to infer others' mental states by interrogating them, or that humans pair-bond, allowing romantic love. If there are few enough values-important features, we can design a training procedure with as many of the values-important features as possible, but which is still training-competitive. This will hopefully get us agents, which I refer to as "simulated aliens", which are somewhat aligned with humans.

Here's my sketch of a path towards a successor AI:

1. Find likely values-important features in human evolution. (This step seems hard, but likely easier than alignment).
2. Alignment looks intractable, so we develop a procedure (e.g. a series of RL environments) that replicates the evolution of social beings, that includes the most likely values-important features.
3. We get further evidence that alignment is intractable, so we implement the training procedure.
4. As the simulated agents approach human capability levels, we might modify or selectively breed them to be slightly closer to human values.
5. The simulated agents evolve/train into superintelligences, break out of the simulation, and take over Earth before humans can develop a misaligned AI through other means.

Will this actually work?

A few things have to be true for a successor AI plan to work, and I analyze each of these below.

1. The CEV of humanlike aliens must be substantially closer to our values than unaligned goals, such that we think it's at least ~50% as good as creating an aligned AGI (where the baseline is an unaligned AGI). That is, human values must not be sensitive to the differences between human evolution and an infinite-compute simulation of human evolution based on our best knowledge.
2. Developing this plan is more tractable than alignment, assuming it is possible.
3. The training process must approximate alien CEV (and there's no interaction between the error here and the error from (1) that destroys all the value).
4. The plan is sufficiently training-competitive.
5. The agent that is deployed doesn't create an unaligned AGI of its own.
6. The plan does not involve humans developing dangerous capabilities that lead to unaligned AGI.

Overall, it seems fairly unlikely that a plan like this works, so I'm mostly posting this so that people can start thinking in the direction of related, better plans, or think about the underlying moral questions.

1. Aren't human values fragile?

Complexity and fragility of value have been written about on LW since Eliezer's values sequences:

Complexity of value is the thesis that human values have high Kolmogorov complexity; that our preferences, the things we care about, cannot be summed by a few simple rules, or compressed. Fragility of value is the thesis that losing even a small part of the rules that make up our values could lead to results that most of us would now consider as unacceptable (just like dialing nine out of ten phone digits correctly does not connect you to a person 90% similar to your friend). For example, all of our values except novelty might yield a future full of individuals replaying only one optimal experience through all eternity.

(from the LW wiki)

I think the successor AI plan is consistent with a weak version of value fragility, something like "our CEV cannot be summed up by a few simple rules". My view is roughly that while an actual list of the things valued by human CEV is fragile (removing one small piece like novelty can remove most of the value of the future), the process that produced human values is not necessarily fragile (making a small change to the evolutionary and cultural processes that created our values might retain most of the value of the future). I put significant credence on the possibility that reaching 50% of the value of human CEV requires fewer than 100 values-important features of human evolution. I consider the moral questions here very important, but I'm very confused about them and detailed reasoning about this position is outside the scope of this post.

However, a stronger value-fragility thesis could turn out to be true; if you take the Kolmogorov complexity claim in the quote literally, then human values cannot be compressed into any short program, even if the program is "simulate evolution with the top 100 values-important features". Maybe human values depend on a large number of incidental steps in evolution that we can't possibly identify, in which case your intuition that alien or elephant CEV is ok would be wrong, and this plan would be doomed.

2. Will this be more tractable than alignment?

Unknown; it seems hard to identify which features of human evolution are important in creating human values, but AI pessimists claim that we have no idea how to solve alignment either. In any case, if we disregard plans for partially aligned successor AI, and continue to frame the problem of maximizing the expected value of the future as maximizing the probability of fully solving alignment, we could be leaving value on the table.

3. Will the training process approximate alien CEV?

After we identify features of evolution that might be necessary to yield humanlike values, we need to actually design the training process, and decide which features to keep vs discard. Many features will probably add inefficiencies to the training process, so we will have to include enough values-important features that the AI is alien-aligned, but not so many that the plan is impractically uncompetitive. If you're really pessimistic about coordination, then maybe you think the version of the tradeoff we're heading for is that none of the values-important features of evolution are simulated and none of alien CEV is retained.

However, it might be possible to do better than alien CEV in some respects. Depending on the amount of compute and coordination we have, we might be able to do selective breeding on the last few generations for inclination to cooperate with aliens, wide moral circles, caution about AGI-like activities, etc. If performed for a short period at the end before they surpass humans, deception is probably not a huge concern. In an optimistic scenario, this has as much selection pressure as the process that turned wolves into dogs (thousands of bits), and so we only need the generators of humanlike values to not be vanishingly unlikely in the space of agents generated by the simulation.
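To put rough numbers on "thousands of bits", here is a minimal back-of-the-envelope sketch. It assumes (my simplification, not something the post specifies) that keeping a fraction `keep_fraction` of each generation as breeders applies at most log2(1/keep_fraction) bits of selection per generation, and that these add up across generations. The function names and example numbers are hypothetical.

```python
import math


def selection_bits(keep_fraction: float, generations: int) -> float:
    """Upper bound on bits of selection from truncation selection.

    Assumes each generation keeps `keep_fraction` of the population and
    that bits simply add across generations (an idealization: noisy
    selection and lost variation would give fewer effective bits).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return generations * math.log2(1.0 / keep_fraction)


def generations_needed(target_bits: float, keep_fraction: float) -> int:
    """Generations required to accumulate `target_bits` under the same model."""
    if not 0.0 < keep_fraction < 1.0:
        raise ValueError("keep_fraction must be in (0, 1)")
    return math.ceil(target_bits / math.log2(1.0 / keep_fraction))


# Hypothetical example: breeding from the top quarter of each generation.
bits = selection_bits(keep_fraction=0.25, generations=1000)
```

Under this toy model, reaching a few thousand bits takes on the order of a thousand generations of fairly strong selection, so whether "the last few generations" are enough depends heavily on how strong each round of selection can be made.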

4. Will this be competitive?

Probably not.

First, note that simulating the evolution of an entire species is uncompetitive by many orders of magnitude. But real life evolution is extremely inefficient at creating intelligence in ways we can immediately fix:

Humans take ~20 years to grow up and produce children, while in simulation we can probably download memories onto newborn simulated aliens;
Humans must carry around an entire body with ~50 times the mass of the brain;
To prevent inbreeding depression, any isolated population must have >500 individuals;
(speculative) Evolution and SGD might be similar optimization algorithms in that they optimize locally, with the biggest difference being that evolution operates on the L1 norm, so we can probably replace evolution with steepest descent on the L1 norm, and possibly with SGD.
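To make the last (speculative) point more concrete, here is a minimal sketch of what steepest descent with respect to the L1 norm looks like next to an ordinary gradient step. Under the L1 norm, the steepest-descent update changes only the single coordinate with the largest partial derivative, loosely like one mutation at a time, while a gradient step changes every coordinate at once. The toy quadratic objective, the coefficients, the step size, and all function names are illustrative assumptions, not anything from the post.

```python
# Toy objective f(x) = 0.5 * sum(c_i * x_i**2); coefficients are arbitrary.
COEFFS = [1.0, 4.0, 9.0]


def objective(x):
    """Value of the toy quadratic at x."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(COEFFS, x))


def grad(x):
    """Gradient of the toy quadratic at x."""
    return [c * xi for c, xi in zip(COEFFS, x)]


def l1_steepest_step(x, lr):
    """One steepest-descent step w.r.t. the L1 norm.

    Only the coordinate with the largest |partial derivative| moves.
    """
    g = grad(x)
    i = max(range(len(x)), key=lambda k: abs(g[k]))
    new_x = list(x)
    new_x[i] -= lr * g[i]
    return new_x


def gradient_step(x, lr):
    """One ordinary (Euclidean) gradient-descent step: all coordinates move."""
    return [xi - lr * gi for xi, gi in zip(x, grad(x))]


def descend(step_fn, x, lr, steps):
    """Apply `step_fn` repeatedly and return the final point."""
    for _ in range(steps):
        x = step_fn(x, lr)
    return x
```

Both update rules reduce the objective on this toy problem; the L1 version simply needs more steps because it changes one coordinate per step. Whether real evolution is well modeled by either rule is exactly the speculative part.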

These inefficiencies can all be removed in a simulation, so I'm confident that we can do basically the same thing as evolution with several orders of magnitude less compute. This is likely still not competitive enough, but we might find more competitive plans if the most compute-intensive parts of human evolution turn out to not be values-important. For example, perhaps at subhuman capability level, the aliens' values will become crystallized and we can train the aliens from elephant-level towards superintelligence using standard RL techniques.

That said, this plan will probably have a competitiveness disadvantage compared to totally unaligned AI, so we would have to have good enough coordination to make the successor AI. For example, if the successor AI plan takes 3x more compute than totally unaligned AGI and is technically simple, we might be able to coordinate around creating the successor AI anyway. If it's 1000x, maybe not.

5. Won't the simulated aliens create unaligned AGI?

The successor AI is already an artificial superintelligence. It is possible that it will need to solve a version of the alignment problem itself, but this doesn't seem like an issue: if a superintelligence can't solve alignment, then we couldn't either. And the simulated aliens probably won't develop their own misaligned AIs before becoming superintelligent, because:

we can warn them that the alignment problem is hard
they can likely more easily self-modify to become superintelligent than design AGI from scratch
we can construct the environment so they don't have an AI hardware overhang, or economic incentives towards AI capabilities
we can stop the simulation if we see them trying to develop AGIs

6. Won't implementing this plan require dangerous capabilities?

It would be irresponsible to develop new, more capable RL architectures just for the successor AI plan. Other elements of the plan, like research into values-important features of human evolution, seem fine; and if the training procedure can be adapted to new RL architectures easily, only the actual implementation of the plan will involve cutting-edge capabilities. However, the risk from implementation seems large; due to the uncompetitiveness of the training process, the risk seems somewhat worse than the risk of whole brain emulation causing unaligned neuromorphic AI, which is already quite large.

If we're really lucky, there could be an RL architecture uniquely good at creating a successor species but that does not advance AGI timelines, but I wouldn't count on it.

---

Thanks to Tamera Lanham, Sydney Von Arx, Malo Bourgon, John Wentworth, Oliver Habryka, Jack Ryan, Drake Thomas, and others for helpful feedback.

I personally place much higher likelihood on the thesis that recovering basic cooperative values (where an ASI is nice to humans and gives us some of what we want) requires way way less than simulating "evolution" - most human values seem like they may be emergent behaviors in repeated positive-sum multi-agent games. It seems like, at least to prevent treacherous turns, we mostly need (1) bias towards multi-agent positive-sum solutions, (2) dislike of defection, (3) the "golden rule" of treating other agents as you would like to be treated, and (4) respect for (and gaining utility from the utility of) lesser life-forms/animals.

The primary outlier is "respect for lesser life-forms", which I wouldn't assume would emerge from standard cooperative multi-agent games. That seems like it might be elicitable in a repeated game of either emerging or not emerging from a Rawlsian veil of ignorance (being an animal or a human in each round).

Obviously, it'd also be good if we could transmit lots of other concepts like beauty and novelty intact to an ASI. Thankfully, people have already thought about a lot of this; there's a whole field of "evolutionary psychology" which can be thought of as people coming up with hypotheses for the conditions of multi-agent RL environments under which different observed human/non-human behavioral patterns may emerge. We don't know whether they're right in practice (they primarily rely on observational evidence) but these are empirically-testable hypotheses once you have reasonably-general RL agents.

Note that a few extremely challenging concepts do remain, like "beauty". I'm personally very skeptical that even a good simulation of all of evolution would reliably end up with the human concept of beauty - do we know if animals have any related concepts? But we may still get substantial leverage just from an ASI having sympathy for us and knowing we care about beauty.

Concretely, it'll be useful to see people continuing to try and elicit as many such behaviors as possible in multi-agent RL, and progress on that will give us a pretty good sense of how good an alignment heuristic this would be. It could be very valuable to write out a "theory of impact" for this agenda, outlining exactly what type of success indicators would be valuable to alignment and what the components of porting a good solution would be.
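As a toy version of the kind of experiment meant here, the sketch below plays a standard iterated prisoner's dilemma between two hand-written strategies. It is only a stand-in: the payoff values are the conventional textbook ones, the strategies are fixed rather than learned, and every name is hypothetical. A real test would train RL agents and check whether reciprocity emerges on its own.

```python
# Conventional prisoner's dilemma payoffs as (player A, player B); assumed values.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # A is exploited
    ("D", "C"): (5, 0),  # A exploits
    ("D", "D"): (1, 1),  # mutual defection
}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(opponent_history):
    """Defect regardless of what the opponent did."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a repeated game and return the total scores (A, B)."""
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)  # each strategy sees only the other's past moves
        b = strategy_b(moves_a)
        pay_a, pay_b = PAYOFFS[(a, b)]
        score_a += pay_a
        score_b += pay_b
        moves_a.append(a)
        moves_b.append(b)
    return score_a, score_b


cooperators = play(tit_for_tat, tit_for_tat)
mixed = play(tit_for_tat, always_defect)
defectors = play(always_defect, always_defect)
```

With these payoffs, two reciprocators end up with the highest joint score and a reciprocator loses only the first round to a defector, which is the basic pattern behind "bias towards positive-sum solutions" and "dislike of defection". It says nothing about the harder items on the list, such as respect for lesser life-forms.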

If it helps, I have some discussion on this topic here (Section 8.3 and especially 8.3.3.1).

This is a nice post and I was mostly nodding along.

I expect it’s moot because of the training competitiveness issue.

I also happen to believe that this evolutionary scenario would only count as “success” if we have a very very low bar for what constitutes successful alignment (e.g. “not worse than a hot-tempered psycho human who grew up on an alien planet”), and if we have that low a bar for “success”, then I’m actually pretty optimistic about our prospects for non-evolutionary alignment “success”.

I also think I'm less optimistic than you that the simulated evolved aliens would avoid creating unaligned AGIs (and/or blowing each other to smithereens in other ways). Your Section 5 arguments are not convincing to me because (1) this could happen after they break out of the simulation into the real world, (2) competition could favor AGIs that lack social instincts and other things that make for a good life worth living, and if so, it doesn't matter whether they build such AGIs from scratch or self-modify into them. Or something like that, I guess.

I think that to pull this off well, you would need to match pretty closely to reality.

Genome-based AI (start with the human genome and simulate it growing into a person) sounds easier.

Once you replace evolution with SGD, replace DNA and proteins with something easier to simulate, replace learning memories with downloading them, and replace the ancestral environment with some video game, the approximation is so crude that you are basically training a neural net to do things that seem nice, and hoping for the best.

If you could rerun evolution starting from chimps, you may well get creatures with fairly similar values. If you rerun evolution and then post-select on various pieces of text, you would get very similar values.

If you start from the first RNA, getting near human values is hard.

Then consider that human values can vary by culture a fair bit.

Consider the question of whether or not simulations of human minds are morally important.

Answer yes and you get endless virtual utopia. The person who answered no sees humanity wiped out and the universe filled with worthless computers.

Answer no and you get a smaller and less fun real utopia, plus people simulating whatever they feel like. Quite possibly the vast majority of human minds live unpleasant lives as characters in violence filled video games.

Now consider that you will probably find both positions on LessWrong. This isn't a cultural difference between us and ancient Mongols. This is a cultural difference between people that are very culturally similar.

Now you can say that one side is right. You can optimize some combination and get a world that both sides like.

On a sufficiently basic level, most humans value tasty food (though some people will refuse it for all sorts of reasons).

Far from the day to day world, human values are unconstrained by survival constraints. (Evolution so far has not selected for any particular view on whether simulations are morally you.)

There may be a single truth that all humans are converging towards. But maybe not.

If you just simulate the whole world, and put an "exit simulation" button that only an ASI could press, then these aliens have no better shot at alignment than us.

If you zoom in on the world, picking out the alien equivalent of MIRI, and giving them extra help over the careless aliens creating UFAI, then you need to locate the alien MIRI, when the aliens speak an alien language. They still might screw up anyway.

I’m honestly really confused why more effort isn’t being put into contingency alignment plans; it seems quite likely to me that partial alignment should be easier and faster to develop than full alignment, and it isn’t inevitable that alignment will be an all-or-nothing endeavor. Thanks for the thought-provoking analysis!

No. Humans do major harm to each other, often even when they are trying to help. And that's if things go right; an AI based on human behavior has a high chance of causing harm deliberately.

The way you have explained this idea assumes a certain model of ethics/friendliness -- that ethics is human value, and all human value indifferently. Other models make the problem a lot simpler.
It's started already. Current technologies already share an ecosystem with humans and are being selected for some kind of friendliness.
It would probably be stymied by rapid takeoff, but so would all the alternatives. Rapid takeoff towards ASI is the hard problem.