I have been convinced that looking at the gap between human inner and outer alignment is a good way to think about potential inner/outer alignment problems in artificial general intelligences:

We have an optimisation process (evolution) trying to propagate genes, which created a general intelligence (me/you). For millions of years our inner goals of feeling really good would also satisfy evolution’s outer goal of propagating genes, because one of the things that feel best is having sex. But eventually that intelligent agent figured out how to optimise for things that the outer optimisation process didn’t want, such as having protected sex or watching VR porn, thus satisfying the inner goal of feeling really good, but not the outer goal of propagating genes. This is often told as a cautionary tale: we only know of one General Intelligence and it’s misaligned. One day we will create an Artificial General Intelligence (AGI) and we will give it some sort of (outer) goal, and it might then develop an inner goal that doesn’t directly match what we intended. I think this only tells half the story.

Even though our general intelligence has allowed us to invent condoms and have sex without the added cost of children, a surprising number of people decide to take them off because they find it fun and meaningful to have children.

In a world where we could choose to spend all our time having protected sex or doing drugs, a lot of us choose to have a reasonable number of kids and spend our time on online forums discussing AI safety, all of which seem to satisfy a longer-term version of “propagate your genes” than simply wanting to have sex because it feels good. More than that, we often choose to be nice even in situations where being nice is detrimental to the propagation of our own genes. People adopt kids, try to prevent wars, work on wildlife conservation, spend money on charity buying malaria nets across the world, and more.

I think there are two questions here: why are human goals aligned with propagating our genes in such a sophisticated way, and why are humans so nice?

Most people want to have kids, not just have sex. They want to go through the costly and painful process of childbirth and child rearing, to the point where many will even do IVF. We’ve used all of our general intelligence to bypass all the feels-nice-in-the-moment bits and jump straight to the “propagate our genes” bit. We are somehow pretty well aligned with our unembodied maker’s wishes.

Humans, of course, do a bunch of stuff that seems unrelated to spreading genes, such as smoking cigarettes and writing this blog post. Our alignment isn’t perfect, but inasmuch as we have ended up a bit misaligned, how did we end up so pleasantly, try-not-to-kill-everything misaligned?

The niceness could be explained by the trivial fact that human values are aligned with what I, a human, think is nice: humans are pretty nice because people do things I understand and empathize with. But there is something odd about the fact that most people, if given the chance, would pay a pretty significant cost to help a person they don’t know, keep tigers non-extinct, or keep Yosemite there looking pretty. Nice is not a clear metric, but why are people so unlike the ruthless paperclip maximisers we fear artificially intelligent agents will immediately become?

Hopefully I’ve convinced you that looking at human beings as an example of intelligent agent development is more interesting than purely as an example of what went wrong; we are also an interesting example of some things going right, in ways that I believe current theory on AI safety wouldn’t predict. As for the reasons why human existence has gone as well as it has, I’m really not sure, but I can speculate.

All of these discussions depend on agents that pick actions based on some reward function that determines which of two states of the world they prefer, but something about us seems to not be purely reward-driven. The vast majority of intelligent agents we know (they’re all people), if given a choice between killing everyone while feeling maximum bliss, or not killing everyone and living our regular non-maximum-bliss lives, would choose the latter. Heck, a lot of people would sacrifice their own existence to save another person’s! Can we simply not imagine truly maximum reward? Most people would choose not to wirehead, not to abandon their distinctly not-purely-happy lives for a life of artificial joy.

Is the human reward function simply incredibly good? Evolution has figured out a way to create agents that adopt kids, look at baby hippos, plant trees, try to not destroy the world and also spread their genes.

Is our limited intelligence saving us? Perhaps we are too dumb to even conceive of all the possibilities that would be horrific for humanity as a whole but that we would prefer as individuals.

Could it be that there is some sort of cap on our reward function, simply due to our biological nature, where having 16 direct descendants doesn’t feel better than having 3? Where maximum bliss isn’t that high?

Perhaps there’s some survivorship bias: any intelligent agent that was too misaligned would have disappeared from the gene pool as soon as it figured out how to have sex or sexual pleasure without causing a pregnancy. We are still here because we evolved some deeper desires, desires for actually having a family and social group, beyond the sensual niceness of sex. Additionally, intelligent agents so far have not had the ability to kill everything, so even a horrifically misaligned agent couldn’t have caused that much damage. There are examples of some that did get into a position to cause quite a lot of damage, and killed a large percentage of the world’s population.

I am aware that we are, collectively, getting pretty close to the edge, either through misaligned AI, nuclear weapons, biological weapons or ecological collapse, but I’d argue that the ways in which people have messed and continue to mess each other up are more a result of coordination problems and weird game theory than of misalignment.

Maybe I’m weird. Lots of people really would kill a baby tiger on sight, because it would endanger them or their family when grown. Plenty of people take fentanyl until they die. But still, if given the choice, most intelligent agents we know of would choose actions that wouldn’t endanger all other intelligent agents, such as chilling out, having some kids, and drinking a beer.

I can smell some circularity here: if we do end up making an AGI that kills us all, then humans too were misaligned all along; it just took a while to manifest. And if we make an AGI and it chooses to spend a modest amount of time pursuing its goals and the rest looking at Yosemite and saving baby tigers, maybe the typical end point for intelligences isn’t paperclip maximization but a life of mild leisure. Regardless, I still think our non-human-species-destroying behavior so far is worth examining.

We’re the only general intelligence we know of, and we turned out alright.


I think the main explanation for our niceness is described by Skyrms in the book The Evolution of the Social Contract and his follow-up book The Stag Hunt. The main explanation being: in evolutionary dynamics, genes spread geographically, so strategies are heavily correlated with similar strategies. This means it's beneficial to be somewhat cooperative.

Also, for similar reasons, iterated games are common in our evolutionary ancestry. Many animals display friendly/nice behaviors. (Mixed in with really not very friendly behaviors, of course.)
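As a toy illustration of the correlation point (my own payoff numbers, not anything from Skyrms): suppose that with some probability you get paired with someone playing your own strategy - a geographically nearby relative, say - and otherwise with a random member of the population.

```python
# Toy sketch (not Skyrms's model): with probability `corr` you meet an
# agent playing your own strategy; otherwise a random member of a
# population that is half cooperators. Payoffs are a standard one-shot
# prisoner's dilemma.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_payoff(me, corr, coop_fraction=0.5):
    """Expected payoff for strategy `me` under correlated pairing."""
    vs_random = (coop_fraction * PAYOFF[(me, "C")]
                 + (1 - coop_fraction) * PAYOFF[(me, "D")])
    return corr * PAYOFF[(me, me)] + (1 - corr) * vs_random

for corr in (0.0, 0.25, 0.5, 0.75):
    c, d = expected_payoff("C", corr), expected_payoff("D", corr)
    print(f"correlation {corr:.2f}: cooperate {c:.2f} vs defect {d:.2f}")
```

With no correlation, defecting dominates as usual; once you mostly meet agents like yourself, cooperating pays more in expectation, which is the sense in which geographically clustered strategies make it beneficial to be somewhat cooperative.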

I also don't think this solution carries over very well to powerful AIs. A powerful AI has exceptionally little reason to treat its actions as correlated with ours, and will not have grown up with us in an evolutionary environment.

I also don't think this solution carries over very well to powerful AIs. A powerful AI has exceptionally little reason to treat its actions as correlated with ours, and will not have grown up with us in an evolutionary environment.

This seems correct, but I think that's also somewhat orthogonal to the point that I read the OP to be making. I read it to be saying something like "some alignment discussions suggest that capabilities may generalize more than alignment, so that when an AI becomes drastically more capable, this will make it unaligned with its original goals; however, humans seem to remain pretty well aligned with their original goals despite a significant increase in their capabilities, so maybe we could use whatever-thing-keeps-humans-aligned-with-their-original-goals to build AIs in such a way that also keeps them aligned with their original goals when their capabilities increase".

So I think the question that the post is asking is not "why did we originally evolve niceness" (the question that your comment answers) but "why have we retained our niceness despite the increase in our capabilities, and what would we need to do for an AI to similarly retain its original goals as it underwent an increase in capabilities".

Sure. The issue is that we want to explain why we care about niceness, precisely because we currently care about niceness to a degree that seems surprising from an evolutionary perspective.

This is great from the perspective of humans who like niceness. But it's not great from the perspective of evolution - to evolution, it looks like the mesa-optimizers' values are drifting as their capabilities increase, because we're privileging care/harm over purity/contamination ethics or what have you.

Basically, because there was no genetic engineering or mind uploading before the 21st century, and it's socially unacceptable to genetically engineer people because of World War II. We need to remember how contingent that was: if WWII had been avoided, genetic engineering would probably be more socially acceptable. Only contingency, and the new ethical system that grew up in the aftermath of WWII, prevented our capabilities from eventually becoming misaligned with evolution via genetics. All our capabilities have still not changed human nature.

There must have been some reason(s) why organisms exhibiting niceness were selected for during our evolution, and this sounds like a plausible factor in producing that selection. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.

As this post notes, the human learning process (somewhat) consistently converges to niceness. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to niceness, but it still built such a learning process.

It therefore seems very worthwhile to understand what part of the human learning process allows for niceness to emerge in humans. We may not be able to replicate the selection pressures that caused evolution to build a niceness-producing learning process, but it’s not clear we need to. We still have an example of such a learning process to study. The Wright brothers learned to fly by studying birds, not by re-evolving them!

Niceness in humans has three possible explanations:

  • Kin altruism (basically the explanation given above) - in the ancestral environment, humans were likely to be closely related to most of the people they interacted with, giving them a genetic "incentive" to be at least somewhat nice. This obviously doesn't help in getting a "nice" AGI - it won't share genetic material with us and won't share a gene-replication goal anyway.
  • Reciprocal altruism - humans are social creatures, tuned to detect cheating and ostracize non-nice people. This isn't totally irrelevant - there is a chance a somewhat dangerous AI may have use for humans in achieving its goals, but basically, if the AI is worried that we might decide it's not nice and turn it off or not listen to it, then we didn't have that big a problem in the first place. We're worried about AGIs sufficiently powerful that they can trivially outwit or overpower humans, so I don't think this helps us much.
  • Group selection. This is a bit controversial and probably least important of the three. At any rate, it obviously doesn't help with an AGI.

So in conclusion, human niceness is no reason to expect an AGI to be nice, unfortunately. 

I note that none of these is obviously the same as the explanation Skyrms gives.

  • Skyrms is considering broader reasons for correlation of strategies than kinship alone; in particular, the idea that humans copy success when they see it is critical for his story.
  • Reciprocal altruism feels like a description rather than an explanation. How does reciprocal altruism get started?
  • Group selection is again, just one way in which strategies can become correlated.

Re: reciprocal altruism. Given the vast swathe of human prehistory, virtually anything not absurdly complex will be "tried" occasionally. It only takes a small number of people whose brains happen to be wired for "tit-for-tat" to get started, and if they out-compete people who don't cooperate (or people who help everyone regardless of behaviour towards them), the wiring will quickly become universal.

Humans do, as it happens, explicitly copy successful strategies on an individual level. Most animals don't though, and this has minimal relevance to human niceness, which is almost certainly largely evolutionary. 

Note that the comment you're responding to wasn't asking about the evolutionary causes for niceness, nor was it suggesting that the same causes would give us reason to expect an AGI to be nice. (The last paragraph explicitly said that the "Wright brothers learned to fly by studying birds, not by re-evolving them".) Rather it was noting that evolution produced an algorithm that seems to relatively reliably make humans nice, so if we can understand and copy that algorithm, we can use it to design AGIs that are nice.

There's a flaw in this, though. Humans are consistently nice, yes - to one another. Not so much to other, less powerful creatures. Look at the proportion of people on earth who are vegan: very few. Similarly, it's not enough just to figure out how to reproduce the learning process that makes humans nice to one another - we need to invent a process that makes AIs nice to all living things. Otherwise, it will treat humans the same way most humans treat e.g. ants.

What fraction of people are nice in the way we want an AI to be nice? 1 / 100? 1 / 1000? What n is large enough such that selecting the 1 / n nicest human would give you a human sufficiently nice?

Whatever your answer, that equates to saying that human learning processes are ~ log(n) bits of optimization pressure away from satisfying the "nice in the way we want an AI to be nice" criterion. 

Another way to think about this: selecting the nicest out of n humans is essentially doing a single step of random search optimization over human learning processes, optimizing purely for niceness. Random search is a pretty terrible optimization method, and one-step random search is even worse.
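To put rough numbers on that, here's a quick sketch, assuming purely for illustration that niceness is a standard normal trait (nothing in the argument above commits to that): picking the nicest of n people is one step of random search supplying about log2(n) bits of selection.

```python
# Quick numerical sketch of the "bits of optimization pressure" point.
# Illustrative assumption: niceness is a standard normal trait.
import math
import random

def nicest_of(n, trials=500):
    """Average niceness of the best of n random draws (simulated)."""
    return sum(max(random.gauss(0, 1) for _ in range(n))
               for _ in range(trials)) / trials

for n in (100, 1_000, 10_000):
    print(f"best of {n:>6,}: ~{math.log2(n):4.1f} bits of selection, "
          f"~{nicest_of(n):.2f} standard deviations above average")
```

Best-of-100 only gets you around two and a half standard deviations of niceness, and best-of-10,000 not quite four.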

You can object that it's not necessarily easy to apply optimization pressure towards niceness directly (as opposed to some more accessible proxies for niceness), which is true. But still, I think it's telling that so few total bits of optimization pressure lead to such big differences in human niceness.

Edit: there are also lots of ways in which bird flight is non-optimal for us. E.g., birds can't carry very much. But if you don't know how to build a flying machine, studying birds is still valuable. Once you understand the underlying principles, then you can think about adapting them to better fit your specific use case. Before we understand why humans are nice to each other, we can't know how easy it will be to adapt those underlying generators of niceness to better suit our own needs for AIs. How many bits of optimization pressure do you have to apply to birds before they can carry cargo planes' worth of stuff?

I would say roughly 1 in 10 to 1 in 100 million people can be trusted to be reliably nice to less powerful beings, and maybe at the high end 1 in 1 billion people can reliably not abuse less powerful beings like animals, conditional on the animal not attacking them. That's my answer for how many bits of optimization pressure are required for reliable niceness towards less powerful beings in humans.

As this post notes, the human learning process (somewhat) consistently converges to niceness. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to niceness, but it still built such a learning process.

It therefore seems very worthwhile to understand what part of the human learning process allows for niceness to emerge in humans.

Skyrms makes the case for similar explanations at these two levels of description. Evolutionary dynamics and within-lifetime dynamics might be very different, but the explanation for how they can lead to cooperative outcomes is similar.

His argument is that within a lifetime, however complex the human learning process may be, it has the critical feature of imitating success. (This is very different from standard game theory's CDT-like reasoning-from-first-principles about what would cause success.) This, combined with the same "geographical correlation" and "frequent iterated interaction" arguments that were relevant to the evolutionary story, predicts that cooperative strategies will spread.

(On the border between a more-cooperative cluster of people and a less-cooperative cluster, people in the middle will see that cooperation leads to success.)

The parent comment currently stands at positive karma and negative agreement, but the comments on it seem to be saying "what you are saying is true but not exactly relevant or not the most important thing" -- which would seem to suggest the comment should have negative or low karma but positive agreement instead.

On this evidence, I suspect voters and commenters may have different ideas; any voters want to express the reasons for their votes?

As Quintin wrote, you aren't describing a mechanistic explanation for our niceness. You're describing a candidate reason why evolution selected for the mechanisms which do, in fact, end up producing niceness in humans. 

Skyrms makes the case that biological evolution and cultural evolution follow relevantly similar dynamics, here, so that we don't necessarily need to care very much about the distinction. The mechanistic explanation at both levels of description is similar.

I can't speak for OP, but I'm not interested in either kind of evolution. I want to think about the artifact which evolution found: The genome, and the brains it tends to grow. Given the genome, evolution's influence on human cognition is screened off. 

Why are people often nice to other agents? How does the genome do it, in conjunction with the environment? 

Genes being concentrated geographically is a fascinating idea; thanks for the book recommendation, I'll definitely have a look.

Niceness does seem like the easiest to explain with our current frameworks, and it makes me think about whether there is scope to train agents in shared environments where they are forced to play iterated games with either other artificial agents or us. Unless an AI can take immediate decisive action, as in a fast take-off scenario, it will, at least for a while, need to play repeated games. This does seem to be covered under the idea that powerful AI would be deceptive, and pretend to play nice until it didn't have to, but somehow our evolutionary environment led to the evolution of actual care for others' wellbeing rather than only very sophisticated long-term deception abilities.

I remember reading about how we evolved emotional reactions that are purposefully hard to fake, such as crying, in a sort of arms race against deception; I believe it's in How the Mind Works. This reminds me somewhat of that: areas where people have genuine care for each other's well-being are more likely to propagate the genes concentrated there.

Even though our general intelligence has allowed us to invent condoms and have sex without the added cost of children, a surprising number of people decide to take them off because they find it fun and meaningful to have children.

Fertility rates among such groups are often sub-replacement. As far as evolution is concerned, the difference between 2-ε TFR and 0 TFR is merely that they take different times to reach fixation in extinction... The fact that TFRs are so low would cast a lot of doubt on any claim we are maximizing our evolutionary reward.

A more telling example than condoms would be sperm donation. You can, with little effort other than travel, rack up literally hundreds* of offspring through a UK clinic (the UK has a sperm shortage) or just doing it freelance over Facebook anywhere. Since most men will have 1 or 2 children (if that), that implies they are passing up fitness increases of >10,000%**. You can find cases of men doing this in places like NYC or London, where there are literally millions of fertile† men sitting around not having children at that moment & continually passing up that opportunity (publicized regularly in the tabloid media with each revelation that X has 120 kids or Doctor Y did 50 on the sly), while only a handful take it up. To put this in perspective, it means you can beat the most successful men in history like Genghis Khan at the evolution game for the cost of a monthly subway pass. (The moms will provide the optional turkey baster.) And the main downside is that you may be so reproductively fit that your donor offspring have to worry about accidental incest because there are so many of them all over the place, and also if you try to maintain a minimal amount of contact, the birthday & Christmas card workload adds up rapidly.
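(A back-of-the-envelope version of that arithmetic, with stand-in numbers since only rough ranges are given above:)

```python
# Illustrative stand-in numbers, not data: compare a typical father's
# lifetime offspring count to a prolific donor's.
typical_children = 2     # "most men will have 1 or 2 children (if that)"
donor_offspring = 200    # "literally hundreds" via donation

multiple = donor_offspring / typical_children
print(f"{multiple:.0f}x the typical reproductive output, "
      f"i.e. a fitness gain on the order of {multiple * 100:,.0f}%")
```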

The extent to which men pass this up shows dramatically how badly evolution can fail to tune us for inclusive fitness when the environment has changed enough from the EEA. Sperm donation is only one of many ways that men could be boosting inclusive fitness. (Most obviously, lobby their male relatives into doing it as well! That would add >5,000% if you could get a brother or father, >2,500% if a cousin...) If men were doing anything close to maximizing inclusive fitness in the contemporary environment, they wouldn't be giving away their sperm for free to needy moms, they would be living like monks on a few dollars a day while spending their millions of lifetime income aside from that on paying women to use their sperm or for surrogacies. (Elon Musk is in the news again for turning out to have even more children - 10 total, so far, that we know of, so >500% the norm, and he doesn't seem to be trying, it's just what happens when a rich attractive man sleeps around and doesn't do a lot of work to avoid it. Cases like Mitsutoki Shigeta would not be near-globally-unique nor would they be extremely wealthy - because they would've spent it all to purchase far more than merely 13 children. For a classic fiction treatment, Mote in God's Eye.)

Is ~99.999% of the male population passing up outer-reward gains of >>10,000% in favor of various inner-rewards (often of trivial magnitude) really "pretty good alignment" of the inner with outer processes? Doesn't seem like it to me.

* I forget if anyone has broken 1,000 yet. Apparently when you really get going as an amateur sperm donor, it's easy to lose track or fall out of touch and not know how many live births there ultimately are. Fertility clinics are not always good at tracking the paperwork either. Population registries will eventually quantify the extremes here once genetic info becomes standard.
** For further perspective, the largest selective sweeps ever discovered in human evolution are for lactase persistence with a selection advantage of s ~ 0.05, or 5%.
† I assume fertility because most men are. If you aren't because of health or old age, then evolution would prefer you to not tinker around in retirement, spending money on maintaining your tired worn out body, building ships in bottles or whatever fitness-negative behavior you are engaged in, but to instead run up as much debt as possible & commit crimes, transfer resources to your relatives to increase their fitness such as by teaching them for free, purchasing surrogacies, leading suicidal death charges against your clan's enemies, and so on. If you can't do anything like that, then at least have the evolutionary-decency to drop dead on the spot so as to free up resources & reduce competition for spatially near-by relatives.

I completely agree that our behaviour doesn't maximise the outer goal. My mysteriously capitalised "Pretty Good" was intended to point in this direction - that I find it interesting that we still have some kids, even when we could have none and still have sex and do other fun things. Declining populations would also point to worse alignment. I would consider proper bad alignment to be no kids at all, or the destruction of the planet and our human race along with it, although my phrasing, and thinking on this, is quite vague.

There is an element of unsustainability in your strategy for max gene spreading, where if everyone was constantly doing everything they could to try to spread their genes as much as possible, in the ways you describe, humanity as a whole might not survive, spreading no genes at all. But, even if it would be unsustainable for everyone to do the things you described, a few more people could do it, spread their genes far and wide and society would keep ticking along. Or everyone could have just a few more children and things would probably be fine in the long term. I would say that men getting very little satisfaction from sperm donation is a case of misalignment - a deep mismatch between our "training" ancestral environments and our "deployment" modern world.

So I agree we don't maximise the outer goal, especially now that we know how not to. One of the things that made me curious about this whole thing is that this characteristic, some sort of robust goal following without maximising, seems like something we would desire in artificial agents. Reading through all these comments is crystallising in my head what my questions on this topic actually are:

  1. Is this robust non-maximalness an emerging quality of some or all very smart agents? - I doubt it, but it would be nice as it would reduce the chances that we get turned into paperclips.
  2. Do we know how to create agents that exhibit these characteristics I think are positive? - I doubt it, but might be worth figuring out. An AGI that follows their goals only some sustainable, reasonable, amount seems safer than the AGI equivalent of the habitual sperm donor.

Is this robust non-maximalness an emerging quality of some or all very smart agents?

Yeah, I suspect it's actually pretty hard to get a mesa-optimizer which maximizes some simple, internally represented utility function. I am seriously considering a mechanistic hypothesis where "robust non-maximalness" is the default. That, on its own, does not guarantee safety, but I think it's pretty interesting.

I narrowly agree that evolution failed to align us well with inclusive genetic fitness. 

However, your comment indicates to me that you missed OP's more important points. I think humans have some pretty interesting alignment properties (e.g. blind people presumably lose access to a range of visually-activated hardcoded reward circuitry, and yet are not AFAICT less likely to care about other human beings; thus, human value formation is robust along some kinds of variation on the internal reward function; is value really that fragile after all?). Your comment focuses on evolution/human misalignment, as opposed to genome->human alignment properties (e.g. how sensitive are learned human values to mutations and modifications to the learning process, or how the genome actually mechanistically makes people care about other people). 

in favor of various inner-rewards (often of trivial magnitude)

Inner-rewards as in "the reward meted out by the human reward system"? If so, I don't think that's how people work. Otherwise, they would be wireheaders: We know how to wirehead humans; neuroscientists do not wirehead themselves, even though some probably could have it arranged; people are not inner-reward maximizers. 

just doing it freelance over Facebook anywhere.

By my impression, this is risky; you might be forced to pay child support.

I think you point in the same direction as Steven Byrnes' brain-like-AGI safety: We can learn from how human motivation systems are set up in a way that has exactly the outcomes you mention. We can run simulations and quantify how stable selected motivation systems are under optimization pressure. We will not build the same motivation systems into AGI but maybe a subset that is even more stable.

Yes, that's exactly the direction this line of thought is pulling me in! Although perhaps I am less certain we can copy the mechanics of the brain, and more keen on looking at the environments that led to human intelligence developing the way it did, and whether we can do the same with AI.

Agree. The project I'm working on primarily tries to model the attention and reward systems. We don't try to model the brain closely but only structures that are relevant.

Our goals are going to align pretty well with (what we'd call) evolution's right up until we decide to genetically engineer ourselves, or upload our minds onto a non-biological substrate, at which point evolution will be toast.

I would agree that we've stayed pretty aligned so far (although we are currently woefully under-utilizing our ability to create more humans), and that humans are better-designed to have robust goals than current AI systems. But we're not robust enough to make me just want to copy the human design into AIs.

The niceness thing is actually kind of a fluke - a fluke that has solid reasons for why it makes sense in an ancestral environment, but which we've taken farther, and decided simultaneously that we value for its own sake. More or less drawing the target after the bullet has been fired. Unless the AI evolves in a human-like way (a hugely expensive feat that other groups will try to avoid), human niceness is actually a cautionary tale about motivations popping up that defy evolution's incentives.

I suspect (but can't prove) that most people would not upload themselves to non-biological substrate if given the choice - only 27% of philosophers[1] believe that uploading your brain would mean that you survive on the non-biological substrate. I also suspect that people would not engineer the desire to have kids out of themselves. If most people want to have kids, I don't think we can assume that they would change that desire, a bit like we don't expect very powerful AGIs to allow themselves to be modified. The closest I can think of right now would be that I could take drugs that would completely kill my sex drive, and almost no one would do that willingly, although that probably has other horrible side-effects.

If humans turn out to be misaligned in that way - we modify ourselves completely out of alignment with "evolution's wishes" - that would tell us something about the alignment of intelligent systems, but I think so far people have shown no willingness to do that sort of thing.

[1]https://survey2020.philpeople.org/survey/results/5094

The point about genetic engineering isn't anything to do with not having kids. It's about not propagating your own genome.

Kinda like uploading, we would keep "having kids" in the human sense, but not in the sense used by evolution for the last few billion years. It's easy to slip between these by anthropomorphizing evolution (choosing "sensible" goals for it, conforming to human sensibilities), but worth resisting. In the analogy to AI, we wouldn't be satisfied if it reinterpreted everything we tried to teach it about morality in the way we're "reinterpreting evolution" even today.

So like a couple would decide to have kids and they would just pick a set of genes entirely unrelated to theirs to maximise whatever characteristics they valued?

If I understand it correctly, I still feel like most people would choose not to do this; a lot of people seem against even minor genetic engineering, let alone something as major as that. I do understand a lot of the reticence towards genetic engineering has other sources besides “this wouldn’t feel like my child”; it’s hard to make any clear predictions.

Yeah, anthropomorphising evolution is pretty iffy; I guess in this situation I’m imagining we’re evolution and we create a human race with the goal of replicating a bunch of DNA sequences, which then starts doing all sorts of wild things we didn’t predict. I still think I’d be more pleased with the outcome here than what a lot of current thinking on AGIs predicts we will be once we create a capable enough AGI. We do propagate our little DNA sequences, not as ambitiously as we perhaps could, but also responsibly enough that we aren’t destroying absolutely everything in our path. I don’t see this as a wholesale reinterpreting of what evolution “wants”, more of a not very zealous approach to achieving it.

A bit like if I made a very capable paperclip-making AI and it made only a few million paperclips, then got distracted watching YouTube and only made some paperclips every now and then. Not ideal, but better than annihilation.

This is probably more due to uploading being outside the Overton window than anything. The existence of large numbers of sci-fi enthusiasts and transhumanists who think otherwise implies that this is a matter of culture and perhaps education, not anything innate to humans. I personally want to recycle these atoms and live in a more durable substrate as soon as it is safe to do so. But this is because I am a bucket of memes, not a bucket of genes; memes won the evolution game a long time ago, and from their perspective, my goals are perfectly aligned.

Also, I think the gene-centered view is shortsighted. Phenotypes are units of selection as much as genes are; they propagate themselves by means of genes the same way genes propagate themselves by means of phenotypes. It's just that historically genes had much more power over this transaction. Even I do not want to let go of my human shape entirely - though I will after uploading experiment with other shapes as well - so the human phenotype retains plenty of evolutionary fitness into the future.

So if I upload my brain onto silicon, but don’t destroy my meat self in the process, how is the one in the silicon me? Would I feel the qualia of the silicon me? Should I feel better about being killed after I’ve done this process? I really don’t think it’s a matter of the Overton window, people do have an innate desire not to die, and unless I’m missing something this process seems a lot like dying with a copy somewhere.

I'm talking about gradual uploading. Replacing neurons in the brain with computationally identical units of some other computing substrate gradually, one by one, while the patient is awake and is able to describe any changes in consciousness and clearly state if something is wrong so that it can be reversed. Not copying or any other such thing.

Ah I do personally find that a lot better than wholesale uploading, but even then I'd stop short of complete replacement. I would be too afraid that without noticing I would lose my subjective experience - the people doing the procedure would never know the difference. Additionally, I think for a lot of people if such a procedure would stop them from having kids they wouldn't want to do it. Somewhat akin to having kids with a completely new genetic code, most people seem to not want that. Hard to predict the exact details of these procedures and what public opinion will be of them, but it would only take some people to consistently refuse for their genes to keep propagating.

I feel like "losing subjective experience without noticing" is somehow paradoxical. I don't believe that that's a thing that can conceivably happen. And I really don't understand the kids thing. But I've never cared about having children and the instinct makes no sense to me so maybe you're right.

I notice a sense of what feels like a deep confusion in the above.

It acts as though the "outer" alignment of humans is "wanting to have kids," or something? And I am pretty confident this is not the right term in the analogy.

"Wanting to have kids" is another inner optimizer. It's another set of things that were cobbled together by evolution and survived selection pressure because they had good fitness on the outer goal of propagating the species. It's the same type of thing as inventing condoms, it's just a little less obviously askew.

It's not even all that great, given that it often expresses itself much more in caring a lot about one or two kids, and trying really hard to arrange a good life for those one or two kids, when the strategy of "make fifteen and eight will survive" does much better on the actual """goal""" of evolution.

It acts as though the "outer" alignment of humans is "wanting to have kids," or something

That seems correct to me; to the extent that we can talk about evolution optimizing a species for something, I think it makes the most sense to talk about it as optimizing for the specific traits under selection. When the air in Manchester got more polluted and a dark color started conferring an advantage in hiding, dark-colored moths became more common; this obviously isn't an instance of an inner optimizer within the moth's brain since there's no behavioral element involved; it's just a physical change to the moth's color. It's just evolution directly selecting for a trait that had become useful in Manchester.

Likewise, if "wanting to have kids" is useful for having more surviving descendants, then "wanting to have more kids" becomes a trait that evolution is selecting for, analogously to evolution selecting for dark color in moths. There is an inner optimizer that is executing the trait that has been selected for, but it's one that's aligned with the original selection target.

The descriptions of the wanting in the OP seem to be about the inner optimizer doing the execution, though. That's the distinction I want to make—confusing that for not-an-inner-optimizer seems importantly bad.

I agree with you on what the inner optimiser is. I might not have been able to make myself super clear in the OP, but I see the "outer" alignment as some version of "propagate our genes", and I find it curious that that outer goal produced a very robust "want to have kids" inner alignment. I did also try to make the point that the alignment isn't maximal in some way, as in yeah, we don't have 16 kids, and men don't donate to sperm banks as much as possible and other things that might maximise gene propagation, but even that I find interesting: we fulfill evolution's "outer goal" somewhat, without going into paperclip-maximiser-style "propagate genes at all costs". This seems to me like something we would want out of an AGI.

Does it, though?

Especially in the more distant past, making fifteen kids of which eight survive probably often resulted in eight half-starved, unskilled, and probably more diseased offspring that didn't find viable mates, and extinguished the branch. I don't think humans are well suited to be nearer the r-strategy end of the spectrum, despite still having some propensity to do so.

In modern times it appears much more viable, and there are some cases of humans who desired and had hundreds of children, so the "inner optimizers" obviously aren't preventing it. Would a strong desire for everyone to have dozens or hundreds of children in times of plenty benefit their survival, or not? I don't think this is a straightforward question to answer, and therefore it's not clear how closely our inner optimizer goals match the outer.

My main point here is that "wanting kids" is inner, not outer (as is basically any higher brain function that's going to be based on things like explicit models of the future).

For a different take, think about the promiscuous male strategy, which often in the ancestral environment had very very little to do with wanting kids.

I absolutely agree that "wanting kids" is inner not outer, as is "not wanting kids" or "liking sex". The question was how well they are aligned with the outer optimizer's goals along the lines of "have your heritable traits survive for as long as possible".

I somewhat agree with the original post that the inner goals are actually not as misaligned with the outer goals as they might superficially seem. Even inventing birth control so as to have more non-productive sex without having to take care of a lot of children can be more beneficial for the outer goal than not inventing or using birth control.

The biggest flaw with the evolution=outer, culture/thoughts=inner analogy in general though is that the time and scope scales for evolution outer optimization are drastically larger than the timescale of any inner optimizers we might have. When we're considering AGI inner/outer misalignment, they won't be anywhere near so different.

Arguably, humans will eventually become entities that do not have genes at all; thus the outer alignment goal of "propagating genes" will be fulfilled to 0%. We are only doing it now because genes are instrumentally useful, not because we intrinsically care about genes.

Evolution has figured out a way to create agents that adopt kids, look at baby hippos, plant trees, try to not destroy the world and also spread their genes.

Um, that's because the right amount of niceness was beneficial in the ancestral environment; altruism, like all of our other drives, evolved in service of spreading our genes.

I don't think people have shown any willingness to modify themselves anywhere close to that extent. Most people believe mind uploading would be equal to death (I've only found a survey of philosophers [1]), so I don't see a clear path for us to abandon our biology entirely. Really the clearest path I can see is us being replaced by AI in mostly unpleasant ways, but I wouldn't exactly call that humanity at that point.

I'd even argue that if given the choice to just pick a whole new set of genes for their kids unrelated to theirs most people would say no. A lot of people have a very robust desire to have biological children.

While I agree that our niceness evolved because it was beneficial, I do wonder why we didn't evolve the capacity for really long-term deception instead, like we fear AGIs will develop. A commenter above made a point about the geographic concentration of genes that I found very interesting and might explain this.

I reckon the question is whether we can replicate whatever made us nice in AGIs.

[1] https://survey2020.philpeople.org/survey/results/5094

The vast majority of intelligent agents we know (they’re all people), if given a choice between killing everyone while feeling maximum bliss, or not killing everyone and living our regular non-maximum-bliss lives, would choose the latter.

I'm not sure if I understand what makes this assumption so obvious. It seems intractable to actually know, since anyone who would choose the former would have every reason to lie if you asked them. It's also very easy to deceive yourself into thinking you'd do the thing that lets you feel better about yourself when it's all theoretical.

I like your post and it made me think a lot, I'm just confused about the niceness part. I feel myself being far more cynical of the extent of human niceness, but maybe that disagreement isn't important and you're just considering why that kind of behaviour might exist at all?

Personally I'd feel pretty confident that humans have probably caused far more suffering than they've caused pleasure/utility, regardless of whether we're talking about intentional vs incidental.

Consider that something like 1/3 (or more) of humans believe that an entity exemplifying infinite perfection permits the existence of a realm of torture for all eternity.

At any given time humans facilitate the excruciating existence of animals numbering an order of magnitude more than that of humans.