
Closest stable alternative preferences

3 Stuart_Armstrong 20 March 2015 12:41PM

A putative new idea for AI control; index here.

There's a result that's almost a theorem: an agent that is an expected utility maximiser is stable under self-modification (or the creation of successor sub-agents).

Of course, this needs to be for "reasonable" utility, where no other agent cares about the internal structure of the agent (just its decisions), where the agent is not under any "social" pressure to make itself into something different, where the boundedness of the agent itself doesn't affect its motivations, and where issues of "self-trust" and acausal trade don't affect it in relevant ways, etc...

So quite a lot of caveats, but the result is somewhat stronger in the opposite direction: an agent that is not an expected utility maximiser is under pressure to self-modify into one that is. Or, more correctly, into an agent that is isomorphic with an expected utility maximiser (an important distinction).

What is this "pressure" that agents are "under"? The known result is that if an agent obeys four simple axioms, then its behaviour must be isomorphic with that of an expected utility maximiser. If we assume the Completeness axiom (trivial) and Continuity (subtle), then violations of Transitivity or Independence correspond to situations where the agent can be money pumped - made to lose resources or power for no gain at all. The more likely the agent is to face these situations, the more pressure it is under to behave as an expected utility maximiser, or simply lose out.
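To make the money pump concrete, here is a minimal sketch in Python. The cyclic preferences (B over A, C over B, A over C), the fee, and the function name `money_pump` are all hypothetical, chosen only for illustration: an agent that violates Transitivity pays a small fee for each swap it prefers, and ends up holding its original item with fewer resources.

```python
# Hypothetical cyclic (non-transitive) preferences: the agent strictly prefers
# B to A, C to B, and A to C. PREFERRED_OVER[x] is the item it prefers to x.
PREFERRED_OVER = {"A": "B", "B": "C", "C": "A"}


def money_pump(start_item, wealth, fee, rounds):
    """Offer the agent `rounds` swaps, each to an item it prefers, at `fee` each."""
    item = start_item
    for _ in range(rounds):
        item = PREFERRED_OVER[item]  # the agent accepts: it prefers the new item
        wealth -= fee                # and pays a small fee for the swap
    return item, wealth


final_item, final_wealth = money_pump("A", wealth=10.0, fee=1.0, rounds=3)
print(final_item, final_wealth)  # A 7.0
```

After three swaps the agent is back where it started, three fees poorer. Each individual swap looked like an improvement by its own preferences.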

 

Unbounded agents

I have two models for how idealised agents could deal with this sort of pressure. The first, post-hoc, is the unlosing agent I described here. The agent follows whatever preferences it has, but keeps track of its past decisions, and whenever it is in a position to violate transitivity or independence in a way that it would suffer from, it makes another decision instead.

Another, pre-hoc, way of dealing with this is to make an "ultra choice" and choose not between decisions, but between all possible input-output maps (equivalently, between all possible decision algorithms), looking to the expected consequences of each one. This reduces the choices to a single choice, where issues of transitivity or independence need not necessarily apply.

 

Bounded agents

Actual agents will be bounded, unlikely to be able to store and consult their entire history when making every single decision, and unable to look at the whole future of their interactions to make a good ultra choice. So how would they behave?

This is not determined directly by their preferences, but by some sort of meta-preferences. Would they make an approximate ultra-choice? Or maybe build up a history of decisions, and then simplify it (when it gets too large to easily consult) into a compatible utility function? This is also determined by their interactions - an agent that makes a single decision has no pressure to be an expected utility maximiser, while one that makes trillions of related decisions has a lot of pressure.

It's also notable that different types of boundedness (storage space, computing power, time horizons, etc...) have different consequences for unstable agents, and would converge to different stable preference systems.

 

Investigation needed

So what is the point of this post? It isn't presenting new results; it's more an attempt to launch a new sub-field of investigation. We know that many preferences are unstable, and that the agent is likely to make them stable over time, either through self-modification, subagents, or some other method. There are also suggestions for preferences that are known to be unstable, but have advantages (such as resistance to Pascal Muggings) that standard maximalisation does not.

Therefore, instead of saying "that agent design can never be stable", we should be saying "what kind of stable design would that agent converge to?", "does that convergent stable design still have the desirable properties we want?" and "could we get that stable design directly?".

The first two things I found in this area were that traditional satisficers could converge to vastly different types of behaviour in an essentially unconstrained way, and that a quasi-expected utility maximiser of utility u might converge to an expected utility maximiser, but it might not be u that it maximises.

In fact, we need not look only at violations of the axioms of expected utility; they are but one possible reason for decision behaviour instability. Here are some that spring to mind:

  1. Non-independence and non-transitivity (as above).
  2. Boundedness of abilities.
  3. Adversaries and social pressure.
  4. Evolution (survival cost to following “odd” utilities (eg time-dependent preference)).
  5. Unstable decision theories (such as CDT).

Now, some categories (such as "Adversaries and social pressure") may not possess a tidy stable solution, but it is still worth asking what setups are more stable than others, and what the convergence rules are expected to be.

Anti-Pascaline agent

4 Stuart_Armstrong 12 March 2015 02:17PM

A putative new idea for AI control; index here.

Pascal's wager-like situations come up occasionally with expected utility, making some decisions very tricky. It means that events of the tiniest of probability could dominate the whole decision - intuitively unobvious, and a big negative for a bounded agent - and that expected utility calculations may fail to converge.

There are various principled approaches to resolving the problem, but how about an unprincipled approach? We could try and bound utility functions, but the heart of the problem is not high utility, but high utility combined with low probability. Moreover, this has to behave sensibly with respect to updating.

 

The agent design

Consider a UDT-ish agent A looking at input-output maps {M} (ie algorithms that could determine every single possible decision of the agent in the future). We allow probabilistic/mixed output maps as well (hence A has access to a source of randomness). Let u be a utility function, and set 0 < ε << 1 to be the precision. Roughly, we'll be discarding the highest (and lowest) utilities that are below probability ε. There is no fundamental reason that the same ε should be used for highest and lowest utilities, but we'll keep it that way for the moment.

The agent is going to make an "ultra-choice" among the various maps M (ie fixing its future decision policy), using u and ε to do so. For any M, designate by A(M) the decision of the agent to use M for its decisions.

Then, for any map M, set max(M) to be the lowest number s.t. P(u ≥ max(M)|A(M)) ≤ ε. In other words, if the agent decides to use M as its decision policy, this is the maximum utility that can be achieved if we ignore the highest valued ε of the probability distribution. Similarly, set min(M) to be the highest number s.t. P(u ≤ min(M)|A(M)) ≤ ε.

Then define the utility function uMε, which is simply u, bounded between max(M) and min(M). Now calculate the expected value of uMε given A(M), call this Eε(u|A(M)).

The agent then chooses the M that maximises Eε(u|A(M)). Call this the ε-precision u-maximising algorithm.
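The definitions above can be sketched for a single policy M whose outcomes form a finite distribution. This is only an illustration under stated assumptions: the function name `eps_precision_value` and the example lottery are hypothetical, and for a discrete distribution the "lowest number" and "highest number" are read as an infimum and a supremum, which land on points of the support.

```python
def eps_precision_value(outcomes, eps):
    """Return the eps-trimmed expectation of u for one policy M.

    outcomes: list of (utility, probability) pairs describing u given A(M).
    """
    dist = {}
    for u, p in outcomes:
        if p > 0:
            dist[u] = dist.get(u, 0.0) + p
    support = sorted(dist)

    def mass(pred):
        return sum(p for u, p in dist.items() if pred(u))

    # max(M): the lowest cut-off with at most eps probability strictly above it.
    hi = next(v for v in support if mass(lambda u: u > v) <= eps)
    # min(M): the highest cut-off with at most eps probability strictly below it.
    lo = next(v for v in reversed(support) if mass(lambda u: u < v) <= eps)
    # Expected value of u clipped to the interval [min(M), max(M)].
    return sum(p * min(max(u, lo), hi) for u, p in dist.items())


# Hypothetical Pascalian lottery: a 0.1% chance of an enormous payoff.
lottery = [(-5.0, 0.005), (1.0, 0.994), (1e9, 0.001)]
plain = sum(u * p for u, p in lottery)
trimmed = eps_precision_value(lottery, eps=0.01)
print(plain > 1e5, round(trimmed, 6))  # True 1.0
```

The plain expectation is dominated by the tiny-probability payoff, while the trimmed value ignores both tails that fall below the precision ε. The agent would run this calculation for every candidate map M and pick the largest result.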

 

Stability of the design

The above decision process is stable, in that there is a single ultra-choice to be made, and clear criteria for making that ultra-choice. Realistic and bounded agents, however, cannot calculate all the M in sufficient detail to get a reasonable outcome. So we can ask whether the design is stable for a bounded agent.

Note that this question is underdefined, as there are many ways of being bounded, and many ways of cashing out ε-precision u-maximising into bounded form. Most likely, this will not be a direct expected utility maximalisation, so the algorithm will be unstable (prone to change under self-modification). But how exactly it's unstable is an interesting question.

I'll look at one particular situation: one where A was tasked with creating subagents that would go out and interact with the world. These agents are short-sighted: they apply ε-precision u-maximising not to the ultra-choice, but to each individual expected utility calculation (we'll assume the utility gains and losses for each decision are independent).

A has a single choice: what to set ε to for the subagents. Intuitively, it would seem that A would set ε lower than its own value; this could correspond roughly to an agent self-modifying to remove the ε-precision restriction from itself, converging on becoming a u-maximiser. However:

  • Theorem: There are (stochastic) worlds in which A will set the subagent precision to be higher, lower or equal to its own precision ε.

The proof will be by way of illustration of the interesting things that can happen in this setup. Let B be the subagent whose precision A sets.

Let C(p) be a coupon that pays out 1 with probability p. xC(p) simply means the coupon pays out x instead of 1. Each coupon costs ε² utility. This is negligible, and only serves to break ties. Then consider the following worlds:

  • In W1, B will be offered the possibility of buying C(0.75ε).
  • In W2, B will be offered the possibility of buying C(1.5ε).
  • In W3, B will be offered the possibility of buying C(0.75ε), and the offer will be made twice.
  • In W4, B will be offered, with 50% probability, the possibility of buying C(1.5ε).
  • In W5, B will be offered, with 50% probability, the possibility of buying C(1.5ε), and otherwise the possibility of buying 2C(1.5ε).
  • In W6, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.5ε).
  • In W7, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.05ε).

From A’s perspective, the best input-output maps are: in W1, don’t buy, in W2, buy, in W3, buy both, in W4, don’t buy (because the probability of getting above 0 utility by buying, is, from A's initial perspective, 1.5ε/2 = 0.75ε).

W5 is more subtle, and interesting – essentially A will treat 2C(1.5ε) as if it were C(1.5ε) (since the probability of getting above 1 utility by buying is 1.5ε/2 = 0.75ε, while the probability of getting above zero by buying is (1.5ε+1.5ε)/2=1.5ε). Thus A would buy everything offered.

Similarly, in W6, the agent would buy everything, and in W7, the agent would buy nothing (since the probability of getting above zero by buying is now (1.05ε + 0.75ε)/2 = 0.9ε).

So in W1 and W2, the agent can leave the sub-agent precision at ε. In W3, it needs to lower it below 0.75ε. In W4, it needs to raise it above 1.5ε. In W5 it can leave it alone, while in W6 it must lower it below 0.75ε, and in W7 it must raise it above 1.05ε.
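As a check on the worlds with a single random offer (W4 to W7), here is a rough sketch that enumerates A's possible input-output maps and picks the one with the highest ε-trimmed expectation. The names `eps_value` and `best_policy`, the encoding of offers as tuples, and the choice ε = 0.01 are assumptions made for illustration; the cut-offs max(M) and min(M) are computed as an infimum and supremum over a discrete distribution.

```python
from itertools import product


def eps_value(outcomes, eps):
    """Expected utility after clipping the top and bottom eps of probability."""
    dist = {}
    for u, p in outcomes:
        if p > 0:
            dist[u] = dist.get(u, 0.0) + p
    support = sorted(dist)

    def above(v):
        return sum(p for u, p in dist.items() if u > v)

    def below(v):
        return sum(p for u, p in dist.items() if u < v)

    hi = next(v for v in support if above(v) <= eps)            # max(M)
    lo = next(v for v in reversed(support) if below(v) <= eps)  # min(M)
    return sum(p * min(max(u, lo), hi) for u, p in dist.items())


def best_policy(offers, eps):
    """offers: mutually exclusive (offer_probability, payout, win_probability).

    Returns a tuple of booleans saying which offers A's best map would buy.
    """
    cost = eps ** 2  # each coupon costs eps squared, only to break ties
    no_offer = 1.0 - sum(q for q, _, _ in offers)
    best_value, best_buys = None, None
    for buys in product([False, True], repeat=len(offers)):
        outcomes = [(0.0, no_offer)]
        for buy, (q, x, p) in zip(buys, offers):
            if buy:
                outcomes += [(x - cost, q * p), (-cost, q * (1 - p))]
            else:
                outcomes.append((0.0, q))
        value = eps_value(outcomes, eps)
        if best_value is None or value > best_value:
            best_value, best_buys = value, buys
    return best_buys


eps = 0.01
worlds = {
    "W4": [(0.5, 1, 1.5 * eps)],
    "W5": [(0.5, 1, 1.5 * eps), (0.5, 2, 1.5 * eps)],
    "W6": [(0.5, 1, 0.75 * eps), (0.5, 2, 1.5 * eps)],
    "W7": [(0.5, 1, 0.75 * eps), (0.5, 2, 1.05 * eps)],
}
for name, offers in worlds.items():
    print(name, best_policy(offers, eps))
```

Under these assumptions the sketch reproduces A's choices: buy nothing in W4 and W7, buy everything in W5 and W6. A short-sighted B with precision ε' buys a single coupon with win probability p exactly when p > ε', which is where the thresholds 0.75ε, 1.05ε and 1.5ε come from.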

 

Irrelevant information

One nice feature about this approach is that it ignores irrelevant information. Specifically:

  • Theorem: Assume X is a random variable that is irrelevant to the utility function u. If A (before knowing X) has to design successor agents that will exist after X is revealed, then (modulo a few usual assumptions about only decisions mattering, not internal thought processes) it will make these successor agents isomorphic to copies of itself, i.e. ε-precision u-maximising algorithms (potentially with a different way of breaking ties).

These successor agents are not the short-sighted agents of the previous model, but full ultra-choice agents. Their ultra-choice is over all decisions to come, while A's ultra-choice (which is simply a choice) is over all agent designs.

For the proof, I'll assume X is boolean valued (the general proof is similar). Let M be the input-output map A would choose for itself, if it were to make all the decisions itself rather than just designing a subagent. Now, it's possible that M(X) will be different from M(¬X) (here M(X) and M(¬X) are contractions of the input-output map by adding in one of the inputs).

Define the new input-output map M' by defining a new internal variable Y in A (recall that A has access to a source of randomness). Since this variable is new, M is independent of the value of Y. Then M' is defined as M with X and Y permuted. Since both Y and X are equally irrelevant to u, Eε(u|A(M))=Eε(u|A(M')), so M' is an input-output map that fulfils the ε-precision u-maximising. And M'(X)=M'(¬X), so M' is independent of X.

Now consider the subagent that runs the same algorithm as A, and has seen X. Because of the irrelevance of X, M'(X) will still fulfil ε-precision u-maximising (we can express any fact relevant to u in the form of Zs, with P(Z)=P(Z|X), and then the algorithm is the same).

Similarly, a subagent that has seen ¬X will run M'(¬X). Putting these together, the subagent will expect to run M'(X) with probability P(X) and M'(¬X) with probability P(¬X)=1-P(X).

Since M'(X)=M'(¬X), this whole thing is just M'. So if A creates a copy of itself (possibly tweaking the tie-breaking so that M' is selected), then it will achieve its maximum according to ε-precision u-maximising.

Pascal's wager

-11 duckduckMOO 22 April 2013 04:41AM


I started this as a comment on "Being half wrong about pascal's wager is even worse" but it's really long, so I'm posting it in discussion instead.

 

Also, I illustrate here using negative examples (hell and equivalents) for the sake of followability, and I am a little worried about inciting some paranoia, so I am reminding you here that every negative example has an equal and opposite positive partner. For example, pascal's wager has the opposite where accepting sends you to hell; it also has the opposite where refusing sends you to heaven. I haven't mentioned any positive equivalents or opposites below. Also, all of these possibilities are effectively 0, so don't be worrying.

 

"For so long as I can remember, I have rejected Pascal's Wager in all its forms on sheerly practical grounds: anyone who tries to plan out their life by chasing a 1 in 10,000 chance of a huge pay-off is almost certainly doomed in practice.  This kind of clever reasoning never pays off in real life..."

 

Pascal's wager shouldn't be in the reference class of real life. It is a unique situation that would never crop up in real life as you're using it. In the world in which pascal's wager is correct you would still see people who plan out their lives on a 1 in 10000 chance of a huge pay-off fail 9999 times out of 10000. Also, this doesn't work for actually excluding pascal's wager. If pascal's wager starts off excluded from the category real life you've already made up your mind, so this cannot quite be the actual order of events.

 

In this case, 9999 times out of 10000 you waste your Christianity, and 1 time out of 10000 you avoid going to hell for eternity. Going to hell is, at a vast understatement, much more than 10000 times as bad as worshipping god, even counting the sanity it costs to force a change in belief, the damage it does to your psyche to live as a victim of self-inflicted Stockholm syndrome, and any other non-obvious cost. With these premises, choosing to believe in God produces infinitely better consequences on average.

 

Luckily the premises are wrong. 1/10000 is about 1/10000 too high for the relevant probability. Which is:

the probability that the wager or equivalent, (anything whose acceptance would prevent you going to hell is equivalent) is true

MINUS

the probability that its opposite or equivalent, (anything which would send you to hell for accepting is equivalent), is true 

 

1/10000 is also way too high even if you're not accounting for opposite possibilities.

 

 

Equivalence here refers to what behaviours it punishes or rewards. I used hell because it is in the most popular wager but it applies to all wagers. To illustrate: if it's true that there is one god, ANTIPASCAL GOD, and he sends you to hell for accepting any pascal's wager, then that's equivalent to any pascal's wager you hear having an opposite (no more "or equivalent"s will be typed but they still apply) which is true, because if you accept any pascal's wager you go to hell. Conversely, if PASCAL GOD is the only god and he sends you to hell unless you accept any pascal's wager, that's equivalent to any pascal's wager you hear being true.

 

The real trick of pascal's wager is the idea that they're generally no more likely than their opposite. For example, there are lots of good, fun reasons to assign the Christian pascal's wager a lower probability than its opposite, even engaging on a Christian level:

 

Hell is a medieval invention/translation error: the eternal torture thing isn't even in the modern bibles.

The belief or hell rule is hella evil and gains credibility from the same source (Christians, not the bible) who also claim that god is good as a more fundamental belief, which directly contradicts the hell or belief rule.

The bible claims that God hates people eating shellfish, taking his name in vain, and jealousy. Apparently taking his name in vain is the only unforgivable sin. So if they're right about the evil stuff, you're probably going to hell anyway.

It makes no sense that god would care enough about your belief and worship to consign people to eternal torture but not enough to show up once in a while.

It makes no sense to reward people for dishonesty.

The evilness really can't be overstated. Eternal torture as a response to a mistake which is at its worst due to stupidity (but actually not even that: just a stacked deck scenario) outdoes pretty much everyone in terms of evilness. It is worse than pretty much every fucked up thing every other god is reputed to have done put together. The psychopath in the bible doesn't come close to coming close.

 

The problem with the general case of religious pascal's wagers is that people make stuff up (usually unintentionally) and what made up stuff gains traction has nothing to do with what is true. When both Christianity and Hinduism are taken seriously by millions (as were the Roman/Greek gods, and Viking gods, and Aztec gods, and all sorts of other gods at different times, by large percentages of people), mass religious belief is 0 evidence. At most one religion set (e.g. Greek/Roman, Christian/Muslim/Jewish, etc) is even close to right, so at least the rest are popular independently of truth.

 

The existence of a religion does not elevate the possibility that the god they describe exists above the possibility that the opposite exists because there is no evidence that religion has any accuracy in determining the features of a god, should one exist.

 

You might intuitively lean towards religions having better than 0 accuracy if a god exists, but remember there's a lot of fictional evidence out there to generalise from. It is a matter of judgement here. There's no logical proof for 0 or worse accuracy (other than it being default and the lack of evidence) but negative accuracy is a possibility, and you've probably played priest classes in video games or just seen how respected religions are and been primed to overestimate religion's accuracy in that hypothetical. Also, if there is a god it has not shown itself publicly in a very long time, or ever. So it seems to have a preference for not being revealed. Also, humans tend to be somewhat evil and read into others what they see in themselves, and I assume any high tier god (one that had the power to create and maintain a hell, detect disbelief, preserve immortal souls and put people in hell) would not be evil. Being evil or totally unscrupulous has benefits among humans which a god would not get. I think without bad peers or parents there's no reason to be evil. I think people are mostly evil in relation to other people. So I give religions a slight positive accuracy in the scenario where there is a god, but it does not exceed priors against pascal's wager (another one is that they're pettily human) or perhaps even the god's desire to stay hidden.

 

Even if God itself whispered pascal's wager in your ear there is no incentive for it to actually carry out the threat: 

 

There is only one iteration.

AND

These threats aren't being made in person by the deity. They are either second hand or independently discovered so:

The deity has no use for making the threat true, to claim it more believably, as it might if it was an imperfect liar (at a level detectable by humans) that made the threats in person.

The deity has total plausible deniability.

Which adds up to all of the benefits of the threat having already been extracted by the time the punishment is due and no possibility of a rep hit (which wouldn't matter anyway).

 

So, all else being equal, i.e. unless the god is the god of threats or pascal's wagers (whose opposites are equally likely):

 

If God is good (+ev on human happiness -ev on human sadness that sort of thing), actually carrying out the threats has negative value.

If god is scarily-doesn't-give-a-shit-neutral to humans, it still has no incentive to actually carry out the threat and a non zero energy cost.

If god gives the tiniest most infinitesimal shit about humans, its incentive to actually carry out the threat is negative.

 

If God is evil you're fucked anyway:

The threat gains no power by being true, so the only incentive a God can have for following through is that it values human suffering. If it does, why would it not send you to hell if you believed in it? (remember that the god of commitments is as likely as the god of breaking commitments)

 

Despite the increased complexity of a human mind I think the most (not saying it's at all likely, just that all others are obviously wrong) likely motivational system for a god which would make it honour the wager is that that God thinks like a human and therefore would keep its commitment out of spite or gratitude or some other human reason. So here's why I think that one is wrong. It's generalizing from fictional evidence: humans aren't that homogeneous (and one without peers would be less so), and if a god gains likelihood to keep a commitment from humanness it also gains not-designed-to-be-evil-ness that would make it less likely to make evil wagers. It also has no source for spite or gratitude, having no peers. Finally, could you ever feel spite towards a bug? Or gratitude? We are not just ants compared to a god, we're ant-ant-ant-etc-ants.

 

Also there are the reasons that refusing can actually get you in trouble: bullies don't get nicer when their demands are met. It's often not the suffering they're after but the dominance, at which point the suffering becomes an enjoyable illustration of that dominance. As we are ant-ant-etc-ants this probability is lower, but the fact that we aren't all already in hell suggests that if god is evil it is not raw suffering that it values. Hostages are often executed even when the ransom is paid. Even if it is evil, it could be any kind of evil: its preferences cannot have been homogenised by memes and consensus.

 

There's also the rather cool possibility that if human-god is sending people to hell, maybe it's for lack of understanding. If it wants belief it can take it more effectively than this. If it wants to hurt you it will hurt you anyway. Perhaps, peerless, it was never prompted to think through the consequences of making others suffer. Maybe god, in the absence of peers, just needs someone to explain that it's not nice to let people burn in hell for eternity. I for one remember suddenly realising that those other fleshbags hosted people. I figured it out for myself but if I grew up alone as the master of the universe maybe I would have needed someone to explain it to me.

 

[Link] (song parody) The Devil and Blaise Pascal

6 NancyLebovitz 30 August 2011 09:12PM

Just for the fun of it....

The Devil Went Down to Paris

This is a spoiler for a game being run by Scott in Cork, Ireland, so you might not want to read it if you might be in that game.