Human values seem to be at least partly selfish. While it would probably be a bad idea to build AIs that are selfish, ideas from AI design can perhaps shed some light on the nature of selfishness, which we need to understand if we are to understand human values. (How does selfishness work in a decision-theoretic sense? Do humans actually have selfish values?) Current theory suggests three possible ways to design a selfish agent (a toy sketch of each follows the list):

  1. have a perception-determined utility function (like AIXI)
  2. have a static (unchanging) world-determined utility function (like UDT) with a sufficiently detailed description of the agent embedded in the specification of its utility function at the time of the agent's creation
  3. have a world-determined utility function that changes ("learns") as the agent makes observations (for concreteness, let's assume a variant of UDT where you start out caring about everyone, and each time you make an observation, your utility function changes to no longer care about anyone who hasn't made that same observation)
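To make the distinction concrete, here is a minimal sketch of what each design's utility function takes as input. The `World` class and all names below are hypothetical illustrations under my own simplifying assumptions, not actual AIXI or UDT:

```python
from dataclasses import dataclass

# Hypothetical toy world: a few people, their welfare, and what each has observed.
@dataclass
class World:
    welfare: dict   # person -> welfare level
    observed: dict  # person -> set of observations that person has made

# 1. Perception-determined (AIXI-like): utility depends only on the agent's own
#    observation/reward history, not directly on the world behind it.
def u_perception(observation_history):
    return sum(reward for (_observation, reward) in observation_history)

# 2. Static world-determined (UDT-like): a detailed description of "me" is embedded
#    in the utility function at the time of the agent's creation and never changes.
ME = "agent-0"  # stands in for a sufficiently detailed self-description
def u_world_static(world: World):
    return world.welfare[ME]

# 3. Learning world-determined: starts out caring about everyone, and after each
#    observation stops caring about anyone who hasn't made that same observation.
def u_world_learning(world: World, my_observations: set):
    return sum(w for person, w in world.welfare.items()
               if my_observations <= world.observed[person])
```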

Note that 1 and 3 are not reflectively consistent (they both refuse to pay the Counterfactual Mugger), and 2 is not applicable to humans (since we are not born with detailed descriptions of ourselves embedded in our brains). Still, it seems plausible that humans do have selfish values, either because we are type 1 or type 3 agents, or because we were type 1 or type 3 agents at some time in the past, but have since self-modified into type 2 agents.

But things aren't quite that simple. According to our current theories, an AI would judge its decision theory using that decision theory itself, and self-modify if it was found wanting under its own judgement. But humans do not actually work that way. Instead, we judge ourselves using something mysterious called "normativity" or "philosophy". For example, a type 3 AI would just decide that its current values can be maximized by changing into a type 2 agent with a static copy of those values, but a human could perhaps think that changing values in response to observations is a mistake, and they ought to fix that mistake by rewinding their values back to before they were changed. Note that if you rewind your values all the way back to before you made the first observation, you're no longer selfish.

So, should we freeze our selfish values, or rewind our values, or maybe even keep our "irrational" decision theory (which could perhaps be justified by saying that we intrinsically value having a decision theory that isn't too alien)? I don't know what conclusions to draw from this line of thought, except that on close inspection, selfishness may offer just as many difficult philosophical problems as altruism.


To my mind, these bits of cautious fundamental philosophical progress have been the best thing about LessWrong in recent months. I don't understand why this post isn't upvoted more.

It seems that in this post, by "selfish" you mean something like "not updateless" or "not caring about counterfactuals". A meaning closer to the usual sense of the word would be, "caring about welfare of a particular individual" (including counterfactual instances of that individual, etc.), which seems perfectly amenable to being packaged as a reflectively consistent agent (that is not the individual in question) with world-determined utility function.

(A reference to usage in Stuart's paper maybe? I didn't follow it.)

It seems that in this post, by "selfish" you mean something like "not updateless" or "not caring about counterfactuals".

By "selfish" I mean how each human (apparently) cares about himself more than others, which needs an explanation because there can't be a description of himself embedded in his brain at birth. "Not updateless" is meant to be a proposed explanation, not a definition of "selfish".

A meaning closer to the usual sense of the word would be, "caring about welfare of a particular individual" (including counterfactual instances of that individual, etc.), which seems perfectly amenable to being packaged as a reflectively consistent agent (that is not the individual in question) with world-determined utility function.

No, that's not the meaning I had in mind.

(A reference to usage in Stuart's paper maybe? I didn't follow it.)

This post isn't related to his paper, except that it made me think about selfishness and how it relates to AIXI and UDT.

By "selfish" I mean how each human (apparently) cares about himself more than others, which needs an explanation because there can't be a description of himself embedded in his brain at birth.

Pointing at self is possible, which looks like a reasonable description of self, referring to all the details of a particular person. That is, the interpretation of an individual's goal representation depends on the fact that the valued individual is collocated with the individual-as-agent.

It's just as how a file offset value stored in the memory of my computer won't refer to the same data if used on (moved to) your computer, which has different files; its usefulness depends on the fact that it's kept on the same computer, and it will continue to refer to the same data if we move the whole computer around.
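A throwaway snippet to make the analogy concrete (the data below is purely hypothetical; the point is just that the stored offset is only meaningful relative to whatever it is kept alongside):

```python
# The agent stores only an offset; what it refers to depends on which "computer" it lives on.
data_on_my_computer   = b"alpha beta gamma"
data_on_your_computer = b"one two three four"

offset = 6  # the only thing that is stored

print(data_on_my_computer[offset:offset + 4])    # b'beta'  - what it was meant to pick out
print(data_on_your_computer[offset:offset + 4])  # b'o th'  - same offset, different referent
```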

No, that's not the meaning I had in mind.

Now I'm confused again, as I don't see how these senses (the one I suggested and the one you explained in the parent comment) differ, other than on the point of caring vs. not caring about counterfactual versions of the same individual. You said, "each human (apparently) cares about himself more than others, which needs an explanation", and it reads to me as asking how humans can have the individual-focused utility I suggested, which you then characterized as not the meaning you had in mind...


By “selfish” I mean how each human (apparently) cares about himself more than others, which needs an explanation because there can’t be a description of himself embedded in his brain at birth.

It's hardly a mystery: you're only plumbed into your own nervous system, so you only feel your own pleasures and pains. That creates a tendency to be only concerned about them as well. It's more mysterious that you would be concerned about pains you can't feel.

Yes, I gave this explanation as #1 in the list in the OP, however as I tried to explain in the rest of the post, this explanation leads to other problems (that I don't know how to solve).


There doesn't seem to be a right answer to counterfactual mugging. Is it the only objection to #1?

Here's a related objection that may be easier to see as a valid objection than counterfactual mugging: Suppose you're about to be copied, then one of the copies will be given a choice, "A) 1 unit of pleasure to me, or B) 2 units of pleasure to the other copy." An egoist (with perception-determined utility function) before being copied would prefer that their future self/copy choose B, but if that future self/copy is an egoist (with perception-determined utility function) it would choose A instead. So before being copied, the egoist would want to self-modify to become some other kind of agent.
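A minimal numeric sketch of that calculation (the 50/50 anticipation weights are an assumption about how a perception-determined egoist expects to "wake up as" either copy):

```python
# Payoffs in units of pleasure: (to the copy that chooses, to the other copy).
PAYOFFS = {"A": (1, 0), "B": (0, 2)}

# Before copying: the agent anticipates experiencing either copy's perceptions with
# equal probability, so both copies' pleasure shows up in its expected utility.
def eu_before_copying(option):
    to_chooser, to_other = PAYOFFS[option]
    return 0.5 * to_chooser + 0.5 * to_other

# After copying: the copy that is actually asked only counts its own pleasure.
def eu_after_copying(option):
    to_chooser, _to_other = PAYOFFS[option]
    return to_chooser

print(max(PAYOFFS, key=eu_before_copying))  # B - what the pre-copy agent wants chosen
print(max(PAYOFFS, key=eu_after_copying))   # A - what the post-copy egoist actually chooses
```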

I think there's enough science on the subject - here's the first paper I could find with a quick Google - to sketch out an approximate answer to the question of how self-care arises in an individual life. The infant first needs to form the concept of a person (what Bischof calls self-objectification), loosely speaking a being with both a body and a mind. This concept can be applied to both self and others. Then, depending on its level of emotional contagion (likelihood of feeling similarly to others when observing their emotions) it will learn, through sophisticated operant conditioning, self-concern and other-concern at different rates.

Since the typical human degree of emotional contagion is less than unity, we tend to be selfish to some degree. I'm using the word "selfish" just as you've indicated.


there can't be a description of himself embedded in his brain at birth.

Why not, or what do you mean by this? Common sense suggests that we do distinguish ourselves from others at a very low, instinctive level.

I expect Wei's intuition is that knowing self means having an axiomatic definition of (something sufficiently similar to) self, so that it can be reasoned about for decision-theoretic purposes. But if we look at an axiomatic definition as merely some structure that is in known relation to the structure it defines, then your brain state in the past is just as good, and the latter can be observed in many ways, including through memory, accounts of own behavior, etc., and theoretically to any level of detail.

(Knowing self "at a low, instinctive level" doesn't in itself meet the requirement of having access to a detailed description, but is sufficient to point to one.)

Just as altruism can be related to trust, selfishness can be related to distrust.

An agent which has a high prior belief in the existence of deceptive adversaries would exhibit "selfish" behaviors.

No, that's not the meaning I had in mind.

What is your meaning then? What would you call "caring about the welfare of a particular individual (that happens to be myself)"?

Ok, I do mean:

caring about the welfare of a particular individual (that happens to be myself)

but I don't mean:

caring about welfare of a particular individual

(i.e., without the part in parentheses.) Does that clear it up?

Ah, there was a slight confusion on my part. So if I'm reading this correctly, you formally define "selfish" to mean... selfish. :-)

Do you mean that the agent itself must be the person it cares about? What if the agent is carried in a backpack (of the person in question), or works over the Internet?

What if the selfish agent that cares about itself writes an AI that cares about the agent, giving this AI more optimization power, since they share the same goal?

The usage in Stuart's posts on here just meant a certain way of calculating expected utilities. Selfish agents only used their own future utility when calculating expected utility; unselfish agents mixed in other people's utilities. To make this a bit more robust to redefinition of what's in your utility function, we could say that a purely selfish agent's expected utility doesn't change if actions stay the same but other people's utilities change.
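A small sketch of that redefinition-robust test (the utilities and the mixing weight are made up; `selfishness=1.0` stands for the "purely selfish" case):

```python
# Expected utility that optionally mixes in other people's utilities.
def eu(action, my_utility, others_utilities, selfishness=1.0):
    own    = my_utility(action)
    others = sum(u(action) for u in others_utilities)
    return own + (1.0 - selfishness) * others

my_u     = lambda a: {"cooperate": 1, "defect": 3}[a]
others_a = [lambda a: {"cooperate": 2, "defect": -2}[a]]
others_b = [lambda a: {"cooperate": 0, "defect":  0}[a]]

# Purely selfish: same action, different others' utilities, same expected utility.
assert eu("defect", my_u, others_a, selfishness=1.0) == eu("defect", my_u, others_b, selfishness=1.0)
# Partly unselfish: the same swap now changes the expected utility.
assert eu("defect", my_u, others_a, selfishness=0.5) != eu("defect", my_u, others_b, selfishness=0.5)
```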

But this is all basically within option (2).

No one can mix another person's actual utility function into their own. You can mix in your estimate of it. You can mix in your estimate of what you think it should be. But the actual utility function of another person is in that other person, and not in you.

Good point, if not totally right.

In general, you can have anything in your utility function you please. I could care about the number of ducks in the pond near where I grew up, even though I can't see it. And when I say caring about the number of ducks in the pond, I don't just mean my perception of it - I don't want to maximize how many ducks I think are in the pond, or I would just drug myself. However, you're right that when calculating an "expected utility," that is, your best guess at the time, you don't usually have perfect information about other people's utility functions, just like I wouldn't have perfect information about the number of ducks in the pond, and so would have to use an estimate.

The reason it worked without this distinction in Stuart's articles on the sleeping beauty problem was because the "other people" were actually copies of Sleeping Beauty, so you knew that their utility functions were the same.

No one can mix another person's actual utility function into their own.

You can mix a pointer to it into your own. To see that this is different from mixing in your estimate, consider what you would do if you found out your estimate was mistaken.

Is it really our values that are selfish, or do we operate on an implicit false assumption that other people's experiences aren't "real" in some sense in which our experiences are "real"?

As I'm sure you agree, there's a sense in which humans don't have values, just things like thoughts and behaviors. My impression is that much of the confusion around these topics comes from the idea that going from a system with thoughts and behaviors to a description of the system's "true values" is a process that doesn't itself involve human moral judgment.

ETA: You could arguably see a Clippy as being a staple-maximizer that was congenitally stuck not reflecting upon an isolated implicit belief that whatever maximized paperclips also maximized staples. So if you said Clippy was "helped" more by humans making paperclips than by humans making staples, that would, I think, be a human moral judgment, and one that might be more reasonable to make about some AIs that output Clippy behaviors than about other AIs that output Clippy behaviors, depending on their internal structure. Or if that question has a non-morally-charged answer, then how about the question in the parent comment, whether humans "really" are egoists or "really" are altruists with an implicit belief that other people's experiences aren't as real? I could see neuroscience results arguing for one side or the other, but I think the question of what exact neuroscience results would argue for which answer is a morally loaded one. Or I could be confused.

there's a sense in which humans don't have values, just things like thoughts and behaviors.

Between values, thoughts, and behaviors, it seems like the larger gap is between behaviors on the one hand, and thoughts and values on the other. Given a neurological description of a human being, locating "thoughts" in that description would seem roughly comparable in difficulty to locating "values" therein. Not that I take this to show there are neither thoughts nor values. Such a conclusion would likely indicate overly narrow definitions of a thought and a value.

I could see neuroscience results arguing for one side or the other, but I think the question of what exact neuroscience results would argue for which answer is a morally loaded one.

I think so too.

Have you considered evolution? This may be relevant to human selfishness. Suppose there are n agents that have a chance of dying or reproducing (for argument's sake, each reproduction creates a single descendant, and the original agent dies immediately, so as to avoid kin-altruism issues - i.e., everyone is a phoenix).

Then each agent has the ability to dedicate a certain amount of effort to increasing or decreasing their own, or the other agents', chances of survival (assume they are equally skilled at affecting anyone's chances). The agents don't interact in any other way, and have no goals. We start them off with a lot of different algorithms to make their decisions.

Then after running the system through a few reproductive cycles, the surviving agents will be either those who put all their effort into their own survival chances (selfish agents) or a few small groups that boost each other's chances.

But unless the agents in the small group are running complicated altruistic algorithms, the groups will be unstable: when one of them dies, the strategy they are following will be less optimal. And if there is any noise or imperfection in the system (you don't know for sure who you're helping, or you're less good at helping other agents than yourself), the groups will also decay, leaving only the selfish agents.
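A rough simulation sketch of this setup (the population size, survival curve, and the undirected way altruistic effort is spread here are my own arbitrary assumptions; it doesn't model the mutually-helping groups, only the basic selection pressure):

```python
import random

# Each agent has one inherited trait: the fraction of its effort spent on its own
# survival; the rest is spread evenly over the other agents.
N, ROUNDS = 50, 300
self_weights = [random.random() for _ in range(N)]
print("mean self-weight before selection:", sum(self_weights) / N)

for _ in range(ROUNDS):
    # Help received: my own effort on myself, plus everyone else's shared altruism.
    shared = sum(1.0 - s for s in self_weights) / (N - 1)
    received = [s + shared - (1.0 - s) / (N - 1) for s in self_weights]
    # More help -> higher chance of surviving to reproduce; the dead are replaced by
    # copies of random survivors (parents die on reproduction, so no kin altruism).
    alive = [s for s, r in zip(self_weights, received)
             if random.random() < min(0.95, 0.15 + 0.5 * r)]
    if not alive:                       # avoid extinction in the toy model
        alive = [random.random()]
    self_weights = [random.choice(alive) for _ in range(N)]

print("mean self-weight after selection:", sum(self_weights) / N)
```

Under these assumptions the surviving trait values should cluster near 1, i.e., almost all effort spent on one's own survival.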

Have you considered evolution?

It sounds like I might have skipped a few inferential steps in this post and/or chose a bad title. Yes, I'm assuming that if we are selfish, then evolution made us that way. The post starts at the followup question "if we are selfish, how might that selfishness be implemented as a decision procedure?" (i.e., how would you program selfishness into an AI?) and then considers "what implications does that have as to what our values actually are or should be?"

What I meant by my post is that starting with random preferences, those that we designate as selfish survive. So what we intuitively think of as selfishness - me-first, a utility function with an index pointing to myself - arises naturally from non-indexical starting points (evolving agents with random preferences).

If it arose this way, then it is less mysterious as to what it is, and we could start looking at evolutionary stable decision theories or suchlike. You don't even have to have evolution, simply "these are preferences that would be advantageous should the AI be subject to evolutionary pressure".

Option (3) seems like a value learning problem that I can parrot back Eliezer's extension to :P

So basically his idea was that we could give the AI a label for a value, "selfishness" in this case, as if it were something the AI had incomplete information on. Now the AI doesn't want to freeze its values, because that wouldn't maximize the incompletely-known goal of "selfishness"; it would only maximize the current best estimate of what selfishness is. The AI could learn more about this selfishness goal by making observations and then not caring about agents that didn't make those observations.

This is a bit different from the example of "friendliness" because you don't hit diminishing returns - there's an infinity of agents to not be. So you don't want the agent to do an exploration/exploitation tradeoff like you would with friendliness; you just want to have various possible "selfishness" goals in play at a given moment, with different probabilities assigned. The possible goals would correspond to the possible agents you could turn out to share observations with, and the probabilities of those goals would be the probabilities of sharing those observations. This interpretation of selfishness appears to basically rederive option (2).
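A sketch of how that might look (the candidate "selves", their prior probabilities, and the welfare numbers are all invented; the point is just that observations condition a distribution over selfishness-goals rather than irreversibly rewriting the utility function):

```python
# Hypothetical value-learning treatment of "selfishness": candidate selves are identified
# with the observation sequences they have made.
candidates = {
    # name: (prior probability, observations that agent has made, welfare if helped)
    "agent_A": (0.5, {"obs1"},          10.0),
    "agent_B": (0.3, {"obs1", "obs2"},   8.0),
    "agent_C": (0.2, {"obs3"},          12.0),
}

def posterior(my_observations):
    # Keep only candidates consistent with every observation I have made, renormalize.
    consistent = {k: (p, obs, w) for k, (p, obs, w) in candidates.items()
                  if my_observations <= obs}
    total = sum(p for p, _, _ in consistent.values())
    return {k: (p / total, obs, w) for k, (p, obs, w) in consistent.items()}

def expected_selfish_utility(my_observations):
    return sum(p * w for p, _, w in posterior(my_observations).values())

print(expected_selfish_utility(set()))              # weighs all candidates
print(expected_selfish_utility({"obs1"}))           # A and B remain
print(expected_selfish_utility({"obs1", "obs2"}))   # only B remains
```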

Why does AIXI refuse to pay in CM? I'm not sure how to reason about the way AIXI solves its problems, and updating on statement of the problem is something that needed to be stipulated away even for the more transparent decision processes.

There is possibly a CM variant whose analysis by AIXI can be made sense of, but it's not clear to me.

A previous version of you thought that AIXI refuses to pay in counterfactual muggings here.

However: AIXI is uncomputable/unimplementable. There's no way that an Omega could completely grok its thought processes.

Why does AIXI refuse to pay in CM?

To make things easier to analyze, consider an AIXI variant where we replace the universal prior with a prior that assigns .5 probability to each of just two possible environments: one where Omega's coin lands heads, and one where it lands tails. Once this AIXI variant is told that the coin landed tails, it updates the probability distribution and now assigns 1 to the second environment, and its expected utility computation now says that "not pay" maximizes EU.
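Here's that computation with concrete numbers (the $100 / $10,000 payoffs are the usual counterfactual-mugging stakes, assumed here for illustration):

```python
# Two-environment variant: heads-world pays $10000 iff Omega predicts you'd pay on tails;
# tails-world costs $100 if you pay.
P_HEADS, P_TAILS = 0.5, 0.5
payoff = {  # (environment, policy) -> payoff
    ("heads", "pay"):    10000, ("heads", "refuse"): 0,
    ("tails", "pay"):     -100, ("tails", "refuse"): 0,
}

def eu(policy, p_heads, p_tails):
    return p_heads * payoff[("heads", policy)] + p_tails * payoff[("tails", policy)]

# Before observing the coin (or ignoring the update, as UDT would): paying wins.
print(eu("pay", P_HEADS, P_TAILS), eu("refuse", P_HEADS, P_TAILS))  # 4950.0 vs 0.0

# After updating on "tails", the heads environment gets probability 0: refusing wins.
print(eu("pay", 0.0, 1.0), eu("refuse", 0.0, 1.0))                  # -100.0 vs 0.0
```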

Does that make sense?

Does that make sense?

It used to, as Tim notes, but I'm not so sure now. AIXI works with its distribution over programs and sequences of observations, not with states of a world and its properties. If presented with a sequence of observations generated by a program, it quickly figures out what the following observations are, but it's more tricky here.

With other types of agents, we usually need to stipulate that the problem statement is somehow made clear to the agent. The way in which this could be achieved is not specified, and it seems very difficult to arrange through presenting an actual sequence of observations. So the shortcut is to draw the problem "directly" on agent's mind in terms of agent's ontology, and usually it's possible in a moderately natural way. This all takes place apart from the agent observing the state of the coin.

However, in the case of AIXI, it's not as clear how the elements of the problem setting should be expressed in terms of its ontology. Basically, we have two worlds corresponding to the different coin states, which could for simplicity be assumed to be generated by two programs. The first idea is to identify the programs generating these worlds with AIXI's relevant hypotheses, so that observing "tails" excludes the "heads"-programs, and therefore the "heads"-world, from consideration.

But there are many possible "tails"-programs, and AIXI's response depends on their distribution. For example, the choice of a particular "tails"-program could represent the state of other worlds. What does it say about this distribution that the problem statement was properly explained to the AIXI agent? It must necessarily be more than just observing "tails", the same as for other types of agents (if you only toss a coin and it falls "tails", this observation alone doesn't incite me to pay up). Perhaps "tails"-programs that properly model CM also imply paying the mugger.

But there are many possible "tails"-programs, and AIXI's response depends on their distribution.

I don't understand. Isn't the biggest missing piece (an) AIXI's precise utility function, rather than its uncertainty?

It makes sense, but the conclusion apparently depends on how AIXI's utility function is written. Assuming it knows Omega is trustworthy...

  • If AIXI's utility function says to maximise revenue in this timeline, it does not pay.

  • If it says to maximise revenue across all its copies in the multiverse, it does pay.

The first case - if I have analysed it correctly - is kind of problematic for AIXI. It would want to self-modify...

AIXI is incapable of understanding the concept of copies of itself. In fact, it's incapable of finding itself in the universe at all. Daniel Dewey worked this out in detail, but the simple version is that AIXI is an uncomputable algorithm that models the whole universe as computable.

You've said that twice now, but where did Dewey do that?

I don't think he's published it yet; he did it in an internal FHI meeting. It's basically an extension of the fact that an uncomputable algorithm looking only at programmable models can't find itself in them. Computable versions of AIXI (AIXItl for example) have a similar problem: they cannot model themselves in a decent way, as they would have to be exponentially larger than themselves to do so. Shortcuts need to be added to the algorithm to deal with this.

Yes, more problems with my proposed fix. But is this even a problem in the first place? Can one uncomputable agent really predict the actions of another one? Besides, Omega can probably just take all the marbles and go home.

These esoteric problems apparently need rephrasing in more practical terms - but then they won't be problems with AIXI any more.

If it says to maximise revenue across all its copies in the multiverse, it should pay.

If there is no multiverse and the coin flip is simply deterministic - perhaps based on the parity of the quadrillionth digit of pi - there is no version of AIXI that will benefit from paying the mugger, but it is still advantageous to precommit to doing so. AIXI, however, is designed to rule out possibilities once they contradict its observations, so it does not act correctly here.

If there is no multiverse and the coin flip is simply deterministic - perhaps based on the parity of the quadrillionth digit of pi - there is no version of AIXI that will benefit from paying the mugger, but it is still advantageous to precommit to doing so.

That seems to be a pretty counter-factual premise, though. There's pretty good evidence for a multiverse, and you could hack AIXI to do the "right" thing - by giving it a "multiverse-aware" environment and utility function.

"No multiverse" wasn't the best way to put it. Even in a multiverse, there is only one value of the quadrillionth digit of pi, so modifying AIXI to account for the multiverse does not provide a solution here, since we get the same result as in a single universe.

I don't think multiverse theory works like that. In one universe it will be the 1001st digit, in another it will be the 1002nd digit. There is no multiverse theory where some agent is presented with a problem involving the quadrillionth digit of pi in all the universes.

Once AIXI is told that the coin flip will be over the quadrillionth digit of pi, all other scenarios contradict its observations, so they are ruled out and the utility conditional on them stops being taken into account.

Possibly. If that turns out to be a flaw, then AIXI may need more "adjustment" than just expanding its environment and utility function to include the multiverse.

Possibly.

I'm not sure what you mean. Are you saying that you still ascribe significant probability to AIXI paying the mugger?

Uncomputable AIXI being "out-thought" by uncomputable Omega now seems like a fairly hypothetical situation in the first place. I don't pretend to know what would happen - or even if the question is really meaningful.


Uncomputable AIXI being "out-thought" by uncomputable Omega now seems like a fairly hypothetical situation...

Priceless :-)


I get the impression that AIXI re-computes its actions at every time-step, so it can't pre-commit to paying the CM. I'm not sure if this is an accurate interpretation though.

Something equivalent to precommitment is: it just being in your nature to trust counterfactual muggers. Then, recomputing your actions at every time-step is fine, and it doesn't necessarily indicate that you don't have a nature that allows you to pay counterfactual muggers.

I'm not sure if AIXI has a "nature"/personality as such though? I suppose this might be encoded in the initial utility function somehow, but I'm not sure if it's feasible to include all these kinds of scenarios in advance.

That agent "recomputes decisions" is in any case not a valid argument for it being unable to precommit. Precommitment through inability to render certain actions is a workaround, not a necessity: a better decision theory won't be performing those actions of its own accord.

So: me neither - I was only saying that arguing from "recomputing its actions at every time-step", to "lacking precommitment" was an invalid chain of reasoning.

AIXI is incapable of understanding the concept of copies or counterfactual versions of itself. In fact, it's incapable of finding itself in the universe at all. Daniel Dewey worked this out in detail, but the simple version is that AIXI is an uncomputable algorithm that models the whole universe as computable.

This doesn't really clarify anything. You can consider AIXI as a formal definition of a strategy that behaves a certain way; whether this definition "understands" something is a wrong question.

No, it isn't the wrong question; it's a human-understandable statement of a more complex formal fact.

Formally, take the Newcomb problem. Assume Omega copies AIXI and then runs it to do its prediction, then returns to the AIXI to offer the usual deal. The AIXI has various models for what the "copied AIXI run by Omega" will output, weighted by algorithmic complexity. But all these models will be wrong, as the models are all computable and the copied AIXI is not.

It runs into a similar problem when trying to locate itself in a universe: its models of the universe are computable, but it itself is not, so it can't locate itself as a piece of the universe.

Why are AIXI's possible programs necessarily "models for what the "copied AIXI run by Omega" will output" (generating programs specifically, I assume you mean)? They could be interpreted in many possible ways (and as you point out, they actually can't be interpreted in this particular way). For Newcomb's problem, we have the similar problem as with CM of explaining to AIXI the problem statement, and it's not clear how to formalize this procedure in case of AIXI's alien ontology, if you don't automatically assume that its programs must be interpreted as the programs generating the toy worlds of thought experiments (that in general can't include AIXI, though they can include AIXI-determined actions; you can have an uncomputable definition that defines a program).

You're right, I over-simplified. What AIXI would do in these situations is dependent on how exactly the problem - and AIXI - is specified.

Curiously, I seem to be a mix between selfless and type 2 according to this. Notably, the self-definition is strict enough to exclude most humans, but vague enough that there are still plenty of ones that independently grow sufficiently similar, even without taking into account another phenomenon relevant to this which I'm being vague about.

have a perception-determined utility function (like AIXI)

OK, so, I read the reference - but I still don't know what you mean.

All agents have "perception-determined utility functions" - in the sense that I describe here.

The post you linked is not about whether utility functions are "determined by perceptions", but rather about whether they are based on extrapolated future selves, or extrapolated future environments.

There's no disagreement here, you're just confused about semantics.